Token Processing in Information Retrieval: Techniques and Algorithms - Prof. Nazli Goharian, Study notes of Computer Science

Various techniques and algorithms used in token processing during information retrieval. Topics include identifying document units, token identification, stop words, special tokens, normalization of tokens, phrase processing, parser generators, stemming, and co-occurrence. The document also covers the advantages and disadvantages of each approach.

Uploaded on 08/19/2009

Partial preview of the text

Token Processing (CS429)
Nazli Goharian (nazli@ir.iit.edu)
© Goharian, Grossman, Frieder 2002, 2005, 2008
Slides are mostly based on Information Retrieval: Algorithms and Heuristics, Grossman and Frieder.

Token Processing

Identifying document units for indexing:
– whole document
– chapter
– paragraph
– ...

Too large a unit. Con: a greater potential for retrieving irrelevant documents, and it is more difficult for the user to find the relevant information within them.

Too small a unit. Con: some relevant documents may be lost, as their terms are distributed over many small units.

Token Processing

Documents may belong to various languages (Web: ~60% in English), and a given document may contain foreign-language terms and phrases. The whole collection must be indexed.

Token Processing

Identifying the tokens in a document unit for indexing:
– Parsing
– Stemming
– n-grams

Normalization of Tokens (cont'd)

• Case folding reduces the term index by ~17%, but it is a lossy compression.
– Convert all tokens to lower case (most practical), or only some of them.
• Spelling variations (neighbor vs. neighbour; a foreign name)
• Accents on letters (naïve vs. naive; many foreign-language terms)
• Variant transliterations (Den Haag vs. The Hague)
– Use phonetic equivalence; the best-known such algorithm is Soundex.

More on normalization under Stemming...
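As a concrete illustration of phonetic equivalence, here is a minimal sketch of the classic four-character Soundex code in Python. This is not from the slides; it follows the common American Soundex rules (variants of the algorithm differ slightly in detail):

```python
def soundex(word: str) -> str:
    """Classic four-character Soundex code: the first letter, then up to
    three digits encoding the remaining consonant groups."""
    codes = {}
    for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                           ("4", "l"), ("5", "mn"), ("6", "r")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        d = codes.get(ch, "")
        if d and d != prev:
            digits.append(d)
        if ch not in "hw":  # 'h' and 'w' do not separate a consonant group
            prev = d
    # Pad with zeros and truncate to the standard four characters.
    return (first + "".join(digits) + "000")[:4]
```

Spelling variants of the same name collapse to the same code, e.g. "Robert" and "Rupert" both map to R163, so either spelling matches the other at query time.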
Phrase Processing

• Phrase recognition is based on the goal of indexing meaningful phrases, such as:
– "Lincoln Town Car"
– "San Francisco"
– "apple pie"
• Doing this uses word order to improve effectiveness; otherwise we are treating both the query and the documents as just a "bag of words".
• ~10% of web queries are explicit phrase queries.

Phrase Processing

• Add phrase terms to the query just like other terms.
• This really violates the term-independence assumptions, but a lot of people do it anyway.
• Give phrase terms a different weight than single query terms.

Constructing Phrases

• Start with all 2-word pairs that are not separated by punctuation, stop words, or special characters.
• Keep only those pairs that occur more than x times.
– Example: "New York"; "apple pie"; ...

Constructing Phrases Using Part-of-Speech Tagging

• NLP techniques can help here: part-of-speech tagging identifies the key components of a sentence (subject-verb-object, ...).
• Use the tags to identify phrases:
– keep all noun phrases ("Republic of China"), or
– keep any adjective followed by a noun ("red carpet").

Constructing Phrases Using Named-Entity Tagging

• Find structured data within an unstructured document:
– people's names, organizations, locations, amounts, etc.

Stemming Algorithms

• Rule-based
– Porter (1980)
– Lovins (1968)
• Dictionary-based
– K-stem (1989, 1993)
• Co-occurrence-based (1994)
• Others

Porter Stemmer

• An incoming word is cleaned up in an initialization phase; one prefix-trimming phase then takes place, followed by five suffix-trimming phases.
• Note: the entire algorithm will not be covered; we will leave out some obscure rules.
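The frequency-based phrase construction above can be sketched in a few lines. The stop-word list and the threshold (the "x" from the slide) are illustrative assumptions, not values from the slides:

```python
import re
from collections import Counter

# Illustrative stop-word list and threshold; both are assumptions.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is"}
MIN_COUNT = 2  # the "x" from the slide

def candidate_bigrams(text):
    """Yield adjacent word pairs not separated by punctuation,
    special characters, or stop words."""
    # Splitting on punctuation/special characters ensures no pair spans them.
    for segment in re.split(r"[^a-z0-9\s]+", text.lower()):
        words = segment.split()
        for w1, w2 in zip(words, words[1:]):
            if w1 not in STOP_WORDS and w2 not in STOP_WORDS:
                yield (w1, w2)

def frequent_phrases(docs):
    """Keep only the candidate pairs that occur at least MIN_COUNT times."""
    counts = Counter(b for doc in docs for b in candidate_bigrams(doc))
    return {pair for pair, n in counts.items() if n >= MIN_COUNT}
```

For example, over ["I moved to New York.", "New York is expensive."] only ("new", "york") survives the frequency cut; pairs straddling a stop word such as "to New" are never candidates.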
Initialization

• First the word is cleaned up: it is converted to lower case, and only letters and digits are kept.
• Example: F-16 is converted to f16.

Porter Stemming

• Remove the prefixes "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", "pseudo".
– So "megabyte" and "kilobyte" both become "byte".

Porter Step 1

• Replace "ing" with "e" if the number of consonant-vowel switches, called the measure, is greater than 3.
– liberating --> liberate, facilating --> facilate
• Remove "es" from words that end in "sses" or "ies".
– passes --> pass, cries --> cri
• Remove "s" from words whose next-to-last letter is not an "s".
– runs --> run, fuss --> fuss
• If the word has a vowel and ends with "eed", remove the "ed".
– agreed --> agre, freed --> freed
• Remove "ed" and "ing" from words that have another vowel.
– dreaded --> dread, red --> red; bothering --> bother, bring --> bring
• Remove the "d" if the word has a vowel and ends with "ated" or "bled".
– enabled --> enable, generated --> generate
• Replace a trailing "y" with an "i" if the word has a vowel.
– satisfy --> satisfi, fly --> fly

Porter Step 2

• With what is left, replace any suffix on the left with the suffix on the right, but only if the consonant-vowel measure is > 0:
– tional --> tion (conditional --> condition)
– ization --> ize (nationalization --> nationalize)
– iveness --> ive (effectiveness --> effective, inventiveness --> inventive)
– fulness --> ful (usefulness --> useful)
– ousness --> ous (nervousness --> nervous)
– ousli --> ous (nervously --> nervous)
– entli --> ent (fervently --> fervent)
– biliti --> ble (sensibility --> sensible)
– ...

Dictionary-Based Approaches (K-stem)

• Use dictionaries to ensure that the generated stem is a valid word.
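A few of the Step 1 rules above can be sketched directly. This is only an illustration of the suffix-stripping style, not the full Porter algorithm: it omits the measure checks and all later steps, so it will not reproduce every slide example (e.g. "freed"):

```python
VOWELS = set("aeiou")

def has_vowel(stem: str) -> bool:
    return any(ch in VOWELS for ch in stem)

def step1_sketch(word: str) -> str:
    """Illustrative subset of Porter Step 1: rules fire top-down,
    and only the first matching rule is applied."""
    if word.endswith("sses"):
        return word[:-2]                      # passes -> pass
    if word.endswith("ies"):
        return word[:-2]                      # cries -> cri
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                      # runs -> run, but fuss -> fuss
    if word.endswith("ed") and has_vowel(word[:-2]):
        return word[:-2]                      # dreaded -> dread, red -> red
    if word.endswith("ing") and has_vowel(word[:-3]):
        return word[:-3]                      # bothering -> bother, bring -> bring
    return word
```

Note how the vowel check is what keeps "red" and "bring" intact: after stripping the suffix, no vowel would remain, so the rule does not fire.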
– Develop candidate words by removing the endings.
– Find the longest word in the dictionary that matches one of the candidates.
• Pro: this eliminates the Porter problem that many stems are not words.
• Con: a language-dependent approach.

Term Co-Occurrence

• Use Porter or another stemmer to stem terms.
• Place the words in potential classes.
• Measure the frequency of co-occurrence of the terms in each class.
• Eliminate words with low co-occurrence from a class.
• The remaining classes form the stemming rules.

Co-Occurrence

• Pro
– Language independent (no dictionary needed).
– Based on the assumption that terms in a class will co-occur with the other terms: "hippo" will co-occur with "hippos".
– Improves effectiveness.
• Con
– Computationally expensive to build the co-occurrence matrix (but you only do it every now and then).

N-grams

• Noise such as OCR (optical character recognition) errors or misspellings lowers query-processing accuracy in a term-based search.
• The premise:
– terms are all strings of length n;
– substrings of a term may help to find a match in the noisy cases.
• Replace terms with n-grams.
• Language independent: no stemming or stop-word removal is needed.

5-Gram Example

• Q: What technique works on noise and misspelled words?
• D1: N-grams work on noisy mispelled text.
– Matching 5-grams: _work _on_no on_noi n_nois spell pelle elled lled_
• 8 terms are matched.
• No stemming of "work" or "noise" is needed.
• Partial match of the misspelled word.

N-gram Summary

• Pro
– Language independent.
– Works on garbled text (OCR, etc.).
• Con
– There can be a LOT of n-grams; the dictionary may no longer fit in memory.
– Query processing requires more resources.
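The boundary-padded character n-grams from the example above can be generated like this. The "_" padding follows the slide's notation; using raw overlap as a match score is a sketch of the general technique, not the exact scoring from the slides:

```python
def char_ngrams(text, n=5):
    """Character n-grams over a lower-cased string in which '_' marks
    word boundaries, as in the 5-gram example above."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_overlap(query, doc, n=5):
    """Number of shared n-grams: a rough, noise-tolerant match score."""
    return len(char_ngrams(query, n) & char_ngrams(doc, n))
```

Because "misspelled" and the garbled "mispelled" still share the 5-grams spell, pelle, elled, and lled_, the document is partially matched despite the error; no stemmer or dictionary is involved.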