Token Processing (CS429)
Nazli Goharian, nazli@ir.iit.edu
© Goharian, Grossman, Frieder 2002, 2005, 2008
Slides are mostly based on Information Retrieval: Algorithms and Heuristics, Grossman, Frieder

Token Processing
• Identifying document units for indexing:
  – whole document
  – chapter
  – paragraph
  – ...
• Too large a unit. Cons: potential of retrieving more irrelevant documents, and more difficulty for the user in finding the relevant information.
• Too small a unit. Cons: may lose some relevant documents, as the terms are distributed over the small units.

Token Processing
• Documents may belong to various languages. Web: ~60% in English.
• A given document may contain foreign-language terms and phrases.
• The collection must be indexed!

Token Processing
• Identifying the tokens in a document unit for indexing:
  – Parsing
  – Stemming
  – n-grams

Normalization of Tokens (cont'd)
• Case folding: reduces the term index by ~17%, but is a lossy compression.
  – Convert all terms to lower case (most practical), or only some.
• Spelling variations (neighbor vs. neighbour; a foreign name).
• Accents on letters (naïve vs. naive; many foreign-language terms).
• Variant transliteration (Den Haag vs. The Hague).
  – Use phonetic equivalence; the best-known such algorithm: Soundex!
More on normalization under Stemming...
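The case-folding and accent-removal steps above can be sketched in a few lines. This is a minimal illustration, not from the slides; it assumes Unicode NFD decomposition is used to separate base letters from combining accents (spelling-variant and transliteration mappings would need separate lookup tables):

```python
import unicodedata

def normalize_token(token: str) -> str:
    """Case-fold a token and strip diacritics (a lossy normalization)."""
    token = token.lower()                             # case folding
    decomposed = unicodedata.normalize("NFD", token)  # split base letters from accents
    # Drop combining marks (Unicode category "Mn"), i.e. the accents themselves.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(normalize_token("Naïve"))      # -> naive
print(normalize_token("Neighbour"))  # -> neighbour (spelling variants need their own mapping)
```

Note that this normalization conflates "naïve" and "naive" into one index entry, which is exactly the lossy behavior described above.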
Phrase Processing
• Phrase recognition is based on the goal of indexing meaningful phrases such as:
  – "Lincoln Town Car"
  – "San Francisco"
  – "apple pie"
• Doing this uses word order to improve effectiveness; otherwise we are assuming the query and documents are just a "bag of words".
• ~10% of web queries are explicit phrase queries.

Phrase Processing
• Add phrase terms to the query just like other terms.
• This violates term-independence assumptions, but it is done in practice anyway.
• Give phrase terms a different weight than single query terms.

Constructing Phrases
• Start with all 2-word pairs that are not separated by punctuation, stop words, or special characters.
• Only keep those that occur more than x times.
  – Example: New York; apple pie; ...

Constructing Phrases Using Part-of-Speech Tagging
• Can take advantage of NLP techniques:
  – Use part-of-speech tagging to identify the key components of a sentence (S-V-OBJ, ...).
  – Use these to identify phrases:
    – Keep all noun phrases ("Republic of China"), or
    – Keep an adjective followed by a noun ("red carpet").

Constructing Phrases Using Named Entity Tagging
• Finding structured data within an unstructured document:
  – People's names, organizations, locations, amounts, etc.

Stemming Algorithms
• Rule-based
  – Porter (1980)
  – Lovins (1968)
• Dictionary-based
  – K-stem (1989, 1993)
• Co-occurrence-based (1994)
• Others

Porter Stemmer
• An incoming word is cleaned up in the initialization phase; one prefix-trimming phase then takes place, and then five suffix-trimming phases occur.
• Note: the entire algorithm will not be covered; some obscure rules are left out.
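Several of the Porter rules that follow condition on the "measure" of a word. As a minimal sketch (not the full stemmer), assuming Porter's standard word form [C](VC)^m[V], the measure counts vowel-to-consonant switches:

```python
def is_vowel(word: str, i: int) -> bool:
    """a, e, i, o, u are vowels; 'y' counts as a vowel when it follows a consonant."""
    if word[i] in "aeiou":
        return True
    return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem: str) -> int:
    """Porter's measure m: the number of vowel->consonant switches,
    i.e. the m in the form [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        v = is_vowel(stem, i)
        if prev_vowel and not v:
            m += 1       # a VC boundary ends here
        prev_vowel = v
    return m

print(measure("tr"))       # -> 0
print(measure("trouble"))  # -> 1
print(measure("liberat"))  # -> 3  ("liberating" with "ing" removed)
```

A suffix rule such as "(m > 0) tional -> tion" then simply computes the measure of what remains after stripping the suffix before deciding whether to apply.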
Initialization
• First, the word is cleaned up: it is converted to lower case, and only letters and digits are kept.
• F-16 is converted to f16.

Porter Stemming
• Remove prefixes: "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", "pseudo".
  – So megabyte and kilobyte both become "byte".

Porter Step 1
• Replace "ing" with "e" if the number of consonant-vowel switches, called the measure, is greater than 3.
  – liberating --> liberate, facilitating --> facilitate
• Remove "es" from words that end in "sses" or "ies".
  – passes --> pass, cries --> cri
• Remove "s" from words whose next-to-last letter is not an "s".
  – runs --> run, fuss --> fuss
• If the word has a vowel and ends with "eed", remove the "ed".
  – agreed --> agre, freed --> freed
• Remove "ed" and "ing" from words that have another vowel.
  – dreaded --> dread, red --> red, bothering --> bother, bring --> bring
• Remove "d" if the word has a vowel and ends with "ated" or "bled".
  – enabled --> enable, generated --> generate
• Replace a trailing "y" with an "i" if the word has a vowel.
  – satisfy --> satisfi, fly --> fly

Porter Step 2
• With what is left, replace any suffix on the left with the suffix on the right, only if the consonant-vowel measure > 0:
  ...
  tional  --> tion   (conditional --> condition)
  ization --> ize    (nationalization --> nationalize)
  iveness --> ive    (effectiveness --> effective)
  fulness --> ful    (usefulness --> useful)
  ousness --> ous    (nervousness --> nervous)
  ousli   --> ous    (nervously --> nervous)
  entli   --> ent    (fervently --> fervent)
  iveness --> ive    (inventiveness --> inventive)
  biliti  --> ble    (sensibility --> sensible)
  ...

Dictionary-Based Approaches (K-stem)
• Use dictionaries to ensure that the generated stem is a valid word.
  – Develop candidate words by removing the endings.
  – Find the longest word in the dictionary that matches one of the candidates.
• Pro: this eliminates the Porter problem that many stems are not valid words.
• Con: a language-dependent approach.

Term Co-Occurrence
• Use Porter or another stemmer to stem terms.
• Place words in potential classes.
• Measure the frequency of co-occurrence of terms in each class.
• Eliminate words from a class with low co-occurrence.
• The remaining classes form the stemming rules.

Co-Occurrence
• Pro
  – Language-independent (no need for a dictionary)
  – Based on the assumption that terms in a class will co-occur with the other terms: "hippo" will co-occur with "hippos"
  – Improves effectiveness
• Con
  – Computationally expensive to build the co-occurrence matrix (but it only has to be done occasionally)

N-grams
• Noise such as OCR (Optical Character Recognition) errors or misspellings lowers query-processing accuracy in a term-based search.
• The premise:
  – Terms are represented as all their substrings of length n.
  – Substrings of a term may help to find a match in the noisy cases.
• Replace terms with n-grams.
• Language-independent; no stemming or stop-word removal needed.

5-Gram Example
• Q: What technique works on noise and misspelled words?
• D1: N-grams work on noisy mispelled text.
  _work _on_no on_noi n_nois spell pelle elled lled_
• 8 terms are matched.
• No stemming of "work", "noise" is needed.
• Partial match of the misspelled word.

N-gram Summary
• Pro
  – Language-independent
  – Works on garbled text (OCR, etc.)
• Con
  – There can be a LOT of n-grams; the dictionary may no longer fit in memory.
  – Query processing requires more resources.
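The n-gram decomposition and the partial matching described above can be sketched as follows. This is a minimal illustration (not from the slides), assuming "_" is used to pad word boundaries as in the 5-gram example:

```python
def ngrams(term: str, n: int = 5) -> list[str]:
    """Split a term into overlapping character n-grams.
    A leading/trailing '_' marks the word boundaries, as in the slide example."""
    padded = f"_{term}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Matching is then plain set intersection over the n-gram vocabularies, so
# "mispelled" (OCR-style noise) still shares grams with "misspelled".
q = set(ngrams("misspelled"))
d = set(ngrams("mispelled"))
print(sorted(q & d))  # -> ['elled', 'lled_', 'pelle', 'spell']
```

The four shared grams are exactly the "spell pelle elled lled_" partial match shown in the 5-gram example: the misspelling corrupts only the grams that span the missing "s", while the rest still match.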