Topic Modeling For Information Retrieval: Distinguishing Word Senses In Hebrew

Orlian, Shira

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12202/8250

Title:	Topic Modeling For Information Retrieval: Distinguishing Word Senses In Hebrew
Authors:	Waxman, Joshua; Orlian, Shira
Keywords:	modelling Hebrew words, meaning; disambiguating Hebrew text; Latent Dirichlet Allocation (LDA); topic prediction; Text-Fabric
Issue Date:	5-Jun-2022
Publisher:	Yeshiva University
Citation:	Orlian, S. (2022, May 5). Topic Modeling For Information Retrieval: Distinguishing Word Senses In Hebrew. Undergraduate honors thesis, Yeshiva University.
Series/Report no.:	S. Daniel Abraham Honors Student Theses; May 5, 2022
Abstract:	Can a computer identify distinct categories that separate the senses of Hebrew words in the Bible? This thesis aims to define the differing meanings of Biblical Hebrew words by grouping words into topics, for use in an information retrieval system. Biblical words often share the same or similar root letters yet mean different things, so what is the best way to group the results of a Bible search so that a user can focus on the subset of true interest? Topic modeling, as its name suggests, is a computer algorithm that determines the different topics within texts based on word patterns, such as “word frequency and distance between words” (Pascual 2019). Once the topics are grouped and their related keywords obtained, the different meanings of a word can be determined by looking at the different topics in which the word appears as a keyword. Using these meaningful topics, search results (the texts) can then be retrieved based on the likelihood of the search query generating the text’s distribution of topics.¶ Existing approaches to classifying words into topics for disambiguating Hebrew text have not focused on classifying words by their meanings in this way. Existing approaches to classifying Biblical text use Text-Fabric, which is also used in this thesis. In addition, keywords were clustered into topics using the Gensim library. The programs in this thesis were run several times, with minute changes between runs. In every run, a Latent Dirichlet Allocation (LDA) topic model was trained on a bag-of-words representation. The words were reduced to their Hebrew root letters and then assigned to topics. Initial runs produced very poor topic coherence (a quantitative metric of how good the LDA model is). However, some topics were quite cohesive; for example, one topic consisted almost entirely (if not entirely) of Hebrew number words. 
Perhaps the number of topics (20) was too low to yield good coherence scores. A higher number, such as 200, would likely have produced better topic predictions, but the model took a long time to train with that many topics. Another consideration is that the words were grouped into documents by verse; a different grouping, for example by chapter or by aliyah, might have produced better results. Later runs of the code showed more reasonable topic coherence (~0.3), though this is still a low value. Perhaps the documents could have been better grouped, or the lemmatization could be improved. Aside from the topic coherence metric, another way to examine the results would be a graph of the different topics and their associated keywords. If the topic model scored higher, a future step would be to add the topics as a filter, or as an additional signal, to improve search query retrieval.¶ The end purpose of this thesis was to use the topic modeling technique for Information Retrieval in Hebrew texts. Therefore, I first give some background on Information Retrieval (IR) in general, discuss approaches and details of IR algorithms, and then delve briefly into topic modeling. Since this thesis deals primarily with the Hebrew Bible, I also discuss word sense disambiguation with examples of Hebrew words and describe previous work on the Hebrew Bible. Then, I present my approach to the first step of this thesis: topic modeling to distinguish word senses using machine learning. After going through my code and several of its outputs, I discuss the second step, which is left as future work: implementing the LDA topic model in an Information Retrieval system. I conclude with an assessment of the topic model and of the contribution this thesis makes to the field of searching the Hebrew Bible.
Description:	Undergraduate honors thesis / YU only
URI:	https://hdl.handle.net/20.500.12202/8250
Appears in Collections:	S. Daniel Abraham Honors Student Theses
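
Illustrative note: the retrieval step described in the abstract, ranking texts by the likelihood that the search query was generated from each text's distribution of topics, can be sketched as below. This is a minimal sketch of one common LDA-based query-likelihood formulation, not necessarily the method used in the thesis. All names, topics, and probabilities are hypothetical toy values (English glosses stand in for Hebrew roots); in practice the two probability tables would come from a trained LDA model.

```python
import math

# Hypothetical P(word | topic) for two toy topics.
# English glosses stand in for Hebrew root lemmas.
topic_word = {
    "numbers": {"three": 0.5, "seven": 0.4, "speak": 0.1},
    "speech": {"three": 0.05, "seven": 0.15, "speak": 0.8},
}

# Hypothetical P(topic | document), one entry per verse-level document.
doc_topics = {
    "verse_a": {"numbers": 0.9, "speech": 0.1},
    "verse_b": {"numbers": 0.2, "speech": 0.8},
}


def query_log_likelihood(query, topics_of_doc, topic_word, floor=1e-9):
    """Return log P(query | doc) = sum_w log sum_k P(w | k) * P(k | doc)."""
    total = 0.0
    for word in query:
        p_word = sum(
            p_topic * topic_word[topic].get(word, 0.0)
            for topic, p_topic in topics_of_doc.items()
        )
        # The floor avoids log(0) for words unseen in every topic.
        total += math.log(max(p_word, floor))
    return total


def rank(query, doc_topics, topic_word):
    """Sort documents from most to least likely to have generated the query."""
    scored = [
        (doc_id, query_log_likelihood(query, topics, topic_word))
        for doc_id, topics in doc_topics.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in rank(["speak"], doc_topics, topic_word):
        print(doc_id, round(score, 3))
```

With these toy values, a query for "speak" ranks verse_b first because that document is dominated by the "speech" topic, while a query for "three" favors verse_a. The same scores could also serve as a filter on top of an ordinary keyword search, as the abstract suggests for future work.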

Files in This Item:

File	Description	Size	Format
Shira Orlian YU Only Topic_Modeling_for_Information_Retrieval May2022.pdf Restricted Access		600.85 kB	Adobe PDF


This item is licensed under a Creative Commons License