Token Processing (CS429)
Nazli Goharian, nazli@ir.iit.edu
© Goharian, Grossman, Frieder 2002, 2005, 2008
Slides are mostly based on Information Retrieval: Algorithms and Heuristics, Grossman, Frieder

Token Processing
• Identifying document units for indexing:
  – whole document
  – chapter
  – paragraph
  – ...
• Too large a unit. Cons: potential of retrieving more irrelevant documents, and more difficulty for the user in finding the relevant information.
• Too small a unit. Cons: may lose some relevant documents, as the terms are distributed over the small units.

Token Processing
• Documents may belong to various languages. Web: ~60% in English.
• A given document may contain foreign-language terms and phrases.
• The collection must be indexed!

Token Processing
• Identifying the tokens in a document unit for indexing:
  – Parsing
  – Stemming
  – n-grams

Normalization of Tokens (cont'd)
• Case folding: reduces the term index by ~17%, but is a lossy compression.
  – Convert all terms to lower case (most practical), or only some.
• Spelling variations (neighbor vs. neighbour; a foreign name).
• Accents on letters (naïve vs. naive; many foreign-language terms).
• Variant transliteration (Den Haag vs. The Hague).
  – Use phonetic equivalence; the best-known such algorithm: Soundex!
More on normalization under Stemming...
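The case-folding and accent-removal steps above can be sketched in a few lines. This is a minimal illustration, not from the slides; it assumes Unicode NFD decomposition is used to separate base letters from combining accents (spelling-variant and transliteration mappings would need separate lookup tables):

```python
import unicodedata

def normalize_token(token: str) -> str:
    """Case-fold a token and strip diacritics (a lossy normalization)."""
    token = token.lower()                             # case folding
    decomposed = unicodedata.normalize("NFD", token)  # split base letters from accents
    # Drop combining marks (Unicode category "Mn"), i.e. the accents themselves.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(normalize_token("Naïve"))      # -> naive
print(normalize_token("Neighbour"))  # -> neighbour (spelling variants need their own mapping)
```

Note that this normalization conflates "naïve" and "naive" into one index entry, which is exactly the lossy behavior described above.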
Phrase Processing
• Phrase recognition is based on the goal of indexing meaningful phrases such as:
  – "Lincoln Town Car"
  – "San Francisco"
  – "apple pie"
• Doing this uses word order to improve effectiveness; otherwise we are assuming the query and documents are just a "bag of words".
• ~10% of web queries are explicit phrase queries.

Phrase Processing
• Add phrase terms to the query just like other terms.
• This violates term-independence assumptions, but it is done in practice anyway.
• Give phrase terms a different weight than single query terms.

Constructing Phrases
• Start with all 2-word pairs that are not separated by punctuation, stop words, or special characters.
• Only keep those that occur more than x times.
  – Example: New York; apple pie; ...

Constructing Phrases Using Part-of-Speech Tagging
• Can take advantage of NLP techniques:
  – Use part-of-speech tagging to identify the key components of a sentence (S-V-OBJ, ...).
  – Use these to identify phrases:
    – Keep all noun phrases ("Republic of China"), or
    – Keep an adjective followed by a noun ("red carpet").

Constructing Phrases Using Named Entity Tagging
• Finding structured data within an unstructured document:
  – People's names, organizations, locations, amounts, etc.

Stemming Algorithms
• Rule-based
  – Porter (1980)
  – Lovins (1968)
• Dictionary-based
  – K-stem (1989, 1993)
• Co-occurrence-based (1994)
• Others

Porter Stemmer
• An incoming word is cleaned up in the initialization phase; one prefix-trimming phase then takes place, and then five suffix-trimming phases occur.
• Note: the entire algorithm will not be covered; some obscure rules are left out.
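Several of the Porter rules that follow condition on the "measure" of a word. As a minimal sketch (not the full stemmer), assuming Porter's standard word form [C](VC)^m[V], the measure counts vowel-to-consonant switches:

```python
def is_vowel(word: str, i: int) -> bool:
    """a, e, i, o, u are vowels; 'y' counts as a vowel when it follows a consonant."""
    if word[i] in "aeiou":
        return True
    return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem: str) -> int:
    """Porter's measure m: the number of vowel->consonant switches,
    i.e. the m in the form [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        v = is_vowel(stem, i)
        if prev_vowel and not v:
            m += 1       # a VC boundary ends here
        prev_vowel = v
    return m

print(measure("tr"))       # -> 0
print(measure("trouble"))  # -> 1
print(measure("liberat"))  # -> 3  ("liberating" with "ing" removed)
```

A suffix rule such as "(m > 0) tional -> tion" then simply computes the measure of what remains after stripping the suffix before deciding whether to apply.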
Initialization
• First, the word is cleaned up: it is converted to lower case, and only letters and digits are kept.
• F-16 is converted to f16.

Porter Stemming
• Remove prefixes: "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", "pseudo".
  – So megabyte and kilobyte both become "byte".

Porter Step 1
• Replace "ing" with "e" if the number of consonant-vowel switches, called the measure, is greater than 3.
  – liberating --> liberate, facilitating --> facilitate
• Remove "es" from words that end in "sses" or "ies".
  – passes --> pass, cries --> cri
• Remove "s" from words whose next-to-last letter is not an "s".
  – runs --> run, fuss --> fuss
• If the word has a vowel and ends with "eed", remove the "ed".
  – agreed --> agre, freed --> freed
• Remove "ed" and "ing" from words that have another vowel.
  – dreaded --> dread, red --> red, bothering --> bother, bring --> bring
• Remove "d" if the word has a vowel and ends with "ated" or "bled".
  – enabled --> enable, generated --> generate
• Replace a trailing "y" with an "i" if the word has a vowel.
  – satisfy --> satisfi, fly --> fly

Porter Step 2
• With what is left, replace any suffix on the left with the suffix on the right, only if the consonant-vowel measure > 0:
  ...
  tional  --> tion   (conditional --> condition)
  ization --> ize    (nationalization --> nationalize)
  iveness --> ive    (effectiveness --> effective)
  fulness --> ful    (usefulness --> useful)
  ousness --> ous    (nervousness --> nervous)
  ousli   --> ous    (nervously --> nervous)
  entli   --> ent    (fervently --> fervent)
  iveness --> ive    (inventiveness --> inventive)
  biliti  --> ble    (sensibility --> sensible)
  ...

Dictionary-Based Approaches (K-stem)
• Use dictionaries to ensure that the generated stem is a valid word.
  – Develop candidate words by removing the endings.
  – Find the longest word in the dictionary that matches one of the candidates.
• Pro: this eliminates the Porter problem that many stems are not valid words.
• Con: a language-dependent approach.

Term Co-Occurrence
• Use Porter or another stemmer to stem terms.
• Place words in potential classes.
• Measure the frequency of co-occurrence of terms in each class.
• Eliminate words from a class with low co-occurrence.
• The remaining classes form the stemming rules.

Co-Occurrence
• Pro
  – Language-independent (no need for a dictionary)
  – Based on the assumption that terms in a class will co-occur with the other terms: "hippo" will co-occur with "hippos"
  – Improves effectiveness
• Con
  – Computationally expensive to build the co-occurrence matrix (but it only has to be done occasionally)

N-grams
• Noise such as OCR (Optical Character Recognition) errors or misspellings lowers query-processing accuracy in a term-based search.
• The premise:
  – Terms are represented as all their substrings of length n.
  – Substrings of a term may help to find a match in the noisy cases.
• Replace terms with n-grams.
• Language-independent; no stemming or stop-word removal needed.

5-Gram Example
• Q: What technique works on noise and misspelled words?
• D1: N-grams work on noisy mispelled text.
  _work _on_no on_noi n_nois spell pelle elled lled_
• 8 terms are matched.
• No stemming of "work", "noise" is needed.
• Partial match of the misspelled word.

N-gram Summary
• Pro
  – Language-independent
  – Works on garbled text (OCR, etc.)
• Con
  – There can be a LOT of n-grams; the dictionary may no longer fit in memory.
  – Query processing requires more resources.
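The n-gram decomposition and the partial matching described above can be sketched as follows. This is a minimal illustration (not from the slides), assuming "_" is used to pad word boundaries as in the 5-gram example:

```python
def ngrams(term: str, n: int = 5) -> list[str]:
    """Split a term into overlapping character n-grams.
    A leading/trailing '_' marks the word boundaries, as in the slide example."""
    padded = f"_{term}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Matching is then plain set intersection over the n-gram vocabularies, so
# "mispelled" (OCR-style noise) still shares grams with "misspelled".
q = set(ngrams("misspelled"))
d = set(ngrams("mispelled"))
print(sorted(q & d))  # -> ['elled', 'lled_', 'pelle', 'spell']
```

The four shared grams are exactly the "spell pelle elled lled_" partial match shown in the 5-gram example: the misspelling corrupts only the grams that span the missing "s", while the rest still match.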