Introduction to Computational Linguistics: Information Retrieval (study notes)

An introduction to information retrieval (IR), focusing on text retrieval and its applications. It covers the differences between IR and database querying, relevance as similarity, and various retrieval models. The history of IR is also discussed, from its early beginnings to modern vector space retrieval systems.

Introduction to Computational Linguistics: Information Retrieval
Christof Monz

What is Information Retrieval?
• Finding relevant information in large collections of data
• In such a collection you may want to find:
  ◮ ‘Give me information on the history of the Kennedys’: an article about the Kennedys (text retrieval)
  ◮ ‘What does a brain tumor look like on a CT-scan?’: a picture of a brain tumor (image retrieval)
  ◮ ‘It goes like this: hmm hmm hahmmm ...’: a certain song (music retrieval)

Text Retrieval
• Online library catalogs (OPAC)
• Internet search engines, such as AltaVista and Google
• Specialized systems (aka vendors):
  ◮ MEDLINE (medical articles)
  ◮ Lexis-Nexis (legal, business, academic, ...)
  ◮ Westlaw (legal articles)
  ◮ Dialog (business information)

Retrieval vs. Browsing
• Popular web directories: Yahoo!, Open Directory Project (dmoz)
• The user has to ‘guess’ the ‘right’ directories to find the information
  ◮ The user has to adapt to the designers’ conceptualization of the directory
• The goal of information retrieval is to provide immediate random access to the data
  ◮ The user can specify his information need

IR vs. Database Querying
• IR is not the same thing as querying a database
• Database querying assumes that the data is in a standardized format
• Transforming all information (news articles, web sites) into a database format is difficult, and practically impossible for large data collections
• Text retrieval can work with plain, unformatted data

Relevance as Similarity
• A fundamental idea within IR is: ‘A document is relevant to a query if they are similar’
• Similarity can be defined as:
  ◮ string matching/comparison
  ◮ similar vocabulary
  ◮ same meaning of text

The Ubiquity of IR
• Information filtering
  ◮ E-mail routing
  ◮ Text categorization
• Detecting information structure
  ◮ Hyperlink generation
  ◮ Topic/information detection/screening
  ◮ Portal development and maintenance
• Question answering

History of IR
• 1950: Calvin N. Mooers coins the term ‘information retrieval’
• 1959: Luhn describes statistical retrieval
• 1960: Maron and Kuhns define a probabilistic model of IR
• 1966: Cranfield project defines evaluation measures
• 1968: Gerard Salton’s first book about the SMART retrieval system
• 1972: Lockheed introduces DIALOG as a commercial online service
• Late 1980s: first PC systems incorporate retrieval

Automatic Content Representation
• Using natural language understanding?
  ◮ Computationally too expensive in real-world settings
  ◮ Coverage
  ◮ Language dependence
  ◮ The resulting representations may be too explicit to deal with the vagueness of a user’s information need
• Alternative: a document is simply an unstructured set of the words appearing in it: the bag of words

Bag-of-Words Approach
• A document is an unordered list of words; grammatical information is lost
• Tokenization: what is a word? Is ‘White House’ one word or two?
• Case folding: ‘President Bush’ becomes ‘president’, ‘bush’
• Stemming or lemmatization: morphological information is thrown away; ‘agreements’ becomes ‘agreement’ (lemmatization) or even ‘agree’ (stemming)

Example Bag of Words
Document: Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.
Bag of words: a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2×), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on (2×), planet, possible, red, scientists, that, the, to

What is this about?
Bag of words: added, al, an, and, ballots, been, completed, count, county (2×), even, former, gore, ground, had, hand, have (2×), he, if, in (2×), independent, lost, many, miami-dade, might, new, not, of, president, presidential, requested, shows, study, that, the, vice, votes, would
Original document: An independent study shows former Vice President Al Gore would not have added many new votes in Miami-Dade County and might even have lost ground in that county, if the hand count of presidential ballots he requested had been completed.
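A minimal Python sketch of this bag-of-words representation (not part of the original notes; the regular-expression tokenizer is a simplifying assumption, and the document is the Mars sentence from the example above). It case-folds the text, splits it into word tokens, and keeps only the tokens and their counts, discarding word order and grammar.

```python
import re
from collections import Counter

def bag_of_words(text, case_fold=True):
    """Turn a document into an unordered multiset (bag) of word tokens."""
    if case_fold:
        text = text.lower()                      # case folding: 'President Bush' -> 'president bush'
    tokens = re.findall(r"[a-z0-9-]+", text)     # crude tokenization: runs of letters, digits, hyphens
    return Counter(tokens)                       # token -> number of occurrences

doc = ("Scientists have found compelling new evidence of possible ancient "
       "microscopic life on Mars, derived from magnetic crystals in a meteorite "
       "that fell to Earth from the red planet, NASA announced on Monday.")
bag = bag_of_words(doc)
print(bag["from"], bag["on"])   # both occur twice, matching the example above
```

Stemming or lemmatization, as described above, would be applied to each token before counting; it is omitted here to keep the sketch self-contained.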
Boolean Retrieval
• Boolean operators are: AND (NEAR), OR, NOT
• The semantics of the Boolean operators:
  ◮ t1 AND t2 = {d | t1 ∈ r(d)} ∩ {d | t2 ∈ r(d)}: documents whose representation contains both t1 and t2
  ◮ t1 OR t2 = {d | t1 ∈ r(d)} ∪ {d | t2 ∈ r(d)}: documents whose representation contains t1 or t2
  ◮ NOT t1 = {d | t1 ∉ r(d)}: documents whose representation does not contain t1

Boolean Retrieval (example)
• Information need: President Bill Clinton
• Boolean query: clinton AND (bill OR president)
[Venn diagram over the document sets for the terms ‘bill’, ‘clinton’, and ‘president’]

Zipf’s Law
[Plot: number of occurrences against words sorted by frequency]
• Only a few words occur many times
• A lot of words occur only once (hapax legomena)

Searching the Collection
• Finding a word by linear search can be inefficient
• Solution: construct a matrix which indexes documents and words
  ◮ The matrix can be extremely large
  ◮ The matrix will be sparse (Zipf’s law)
• Better: construct an inverted index
  ◮ A word points to the documents in which it occurs
• This is an implementational (not a modeling) issue!

Pros and Cons of Boolean Retrieval
• Pros of Boolean retrieval:
  + Clean and simple formalism
  + Firm grip on query formulation
• Cons of Boolean retrieval:
  − Most non-experts cannot handle Boolean expressions, and query formulation may be time consuming
  − No ranking of retrieved documents
  − Exact matching may lead to too few or too many retrieved documents
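A minimal sketch of an inverted index and Boolean query evaluation over it, in Python (illustrative only and not from the notes; the three toy documents are invented for the example). Each term maps to the set of documents whose representation contains it, so AND, OR, and NOT become set intersection, union, and complement, exactly as in the semantics above.

```python
from collections import defaultdict

docs = {
    "d1": "clinton signs the bill",
    "d2": "president clinton visits europe",
    "d3": "the president signs a new bill",
}

# Inverted index: term -> set of documents whose representation contains it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

all_docs = set(docs)

def postings(term):
    return index.get(term, set())

# The Boolean operators as set operations over posting sets
def AND(a, b): return a & b          # intersection
def OR(a, b):  return a | b          # union
def NOT(a):    return all_docs - a   # complement with respect to the collection

# Boolean query from the example above: clinton AND (bill OR president)
result = AND(postings("clinton"), OR(postings("bill"), postings("president")))
print(sorted(result))                # -> ['d1', 'd2']
```

Note that the result is an unordered set of exact matches; the lack of ranking is the main motivation for the vector space model introduced below.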
Alternatives to Boolean Retrieval
• Vector space retrieval:
  ◮ Users can enter free text
  ◮ Documents are ranked
  ◮ Best match instead of exact match
• Probabilistic retrieval

Vector Space Retrieval
• By far the most common modern retrieval system
• Features:
  ◮ Users can enter free text
  ◮ Documents are ranked
  ◮ Relaxation of the matching criterion
• Key idea: everything (documents, queries, terms) is a vector in a high-dimensional space

Vector Space Representation

          t1   t2   t3   t4   ...
   d1      1    0    0    1   ...
   d2      0    1    0    1   ...
   d3      0    0    1    1   ...
   d4      1    1    1    0   ...
   ...    ...  ...  ...  ...  ...

• Documents are vectors of terms
• Terms are vectors of documents
• Similarly, a query is a vector of terms

tf.idf-score
• tf-score: tf_i,j = frequency of term i in document j
• idf-score: idf_i = log(N / n_i), where
  ◮ N is the size of the collection (number of documents)
  ◮ n_i is the number of documents in which term i occurs
  ◮ the logarithm is used for dampening
• The term weight of term i in document j is then computed as tf_i,j · idf_i
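To show how these pieces fit together, here is a small Python sketch of vector space retrieval with tf.idf weighting (an illustration under simplifying assumptions, not the weighting of any particular system: raw term frequencies, idf_i = log(N / n_i) as above, cosine similarity for ranking, and an invented three-document collection).

```python
import math
from collections import Counter

docs = {
    "d1": "new evidence of ancient life on mars",
    "d2": "nasa announced a new mars meteorite",
    "d3": "the president signs a new bill",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}       # tf_i,j
df = Counter(term for counts in tf.values() for term in counts)   # n_i: document frequency
idf = {term: math.log(N / n) for term, n in df.items()}           # idf_i = log(N / n_i)

def weights(counts):
    """tf.idf weight vector for a bag of words."""
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

doc_vectors = {d: weights(c) for d, c in tf.items()}

def rank(query):
    """Rank all documents by similarity to the free-text query (best match, not exact match)."""
    q_vec = weights(Counter(query.lower().split()))
    return sorted(((cosine(q_vec, v), d) for d, v in doc_vectors.items()), reverse=True)

print(rank("life on mars"))   # d1 scores highest, d3 shares no informative terms and scores 0
```

Real systems additionally dampen the tf component and normalize for document length; choosing between such variants is exactly what the evaluation measures below are for.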
Evaluation
• Retrieval systems contain many parameters:
  ◮ term weights (different ways of weighting)
  ◮ document length normalization (apply yes/no)
  ◮ what to index (stop word removal)
• How do we know the best way to set those parameters?
• Evaluation!

Precision and Recall
[Diagram: within the collection, the set of relevant documents (REL) and the set of retrieved documents (RETR) overlap; their intersection is the retrieved relevant documents]
• Note that the set REL is not known in advance
• Test collections are used where REL is known

Precision and Recall
• Precision is the fraction of the retrieved documents (RETR) which is relevant:
  precision = |RETR ∩ REL| / |RETR|
• Recall is the fraction of the relevant documents (REL) which has been retrieved:
  recall = |RETR ∩ REL| / |REL|

Single Value Summaries
• Harmonic mean (F-score):
  ◮ The harmonic mean combines precision and recall into a single number ranging from 0 to 1
  ◮ F = (2 · precision · recall) / (precision + recall)
• E-measure:
  ◮ The importance of precision and recall can be varied
  ◮ E = 1 − (1 + b²) / (b²/recall + 1/precision)
  ◮ b > 1 emphasizes precision, b < 1 emphasizes recall

Other Measures
• R-precision:
  ◮ If |REL_q| is the number of relevant documents for a query q, compute the precision at rank |REL_q|
  ◮ Varies with each query
• p@n:
  ◮ Compute the precision at a fixed rank n for every query
  ◮ Useful, for instance, when evaluating search engines

Mean Average Precision
• Topic q has |REL_q| relevant documents
• Let rank(d) = n, where d ∈ REL_q; the function rank(d) returns the rank of document d in the ranking returned by a system
• MAP for query q is defined as:
  ( Σ_{d ∈ REL_q} p@rank(d) ) / |REL_q|
• If d ∉ RETR, then rank(d) = 0 and p@0 = 0

Stemming

Language   Word-based (baseline)   Stemmed   % change    Lemmatized   % change
Dutch      0.4482                  0.4535    +1.2%       –
English    0.4460                  0.4639    +4.0%       0.4003       −10.2%
Finnish    0.2545                  0.3308    +30.0%N     –
French     0.4296                  0.4348    +1.2%       0.4116       −4.2%
German     0.3886                  0.4171    +7.3%△      0.4118       +6.0%△
Italian    0.4049                  0.4248    +4.9%       0.4146       +2.4%
Spanish    0.4537                  0.5013    +10.5%N     –
Swedish    0.3203                  0.3256    +1.7%       –

Decompounding

Language   Word-based (baseline)   Split    % change    Stemmed (baseline)   Split+Stem   % change
Dutch      0.4482                  0.4662   +4.0%       0.4535               0.4698       +3.6%
Finnish    0.2545                  0.3020   +18.7%△     0.3308               0.3633       +9.8%
German     0.3886                  0.4360   +12.2%△     0.4171               0.4816       +15.5%N
Swedish    0.3203                  0.3395   +6.0%       0.3256               0.4080       +25.3%N

N-grams

Language   Word-based (baseline)   4-gram (within)    5-gram (within)
Dutch      0.4482                  0.4495 (+0.3%)     0.4401 (−1.8%)
English    0.4460                  0.4793 (+7.3%)     0.4341 (−2.7%)
Finnish    0.2545                  0.3536 (+37.4%)N   0.3762 (+47.8%)N
French     0.4296                  0.4583 (+6.7%)     0.4348 (+1.2%)
German     0.3886                  0.4679 (+20.3%)N   0.4699 (+20.9%)N
Italian    0.4049                  0.4355 (+7.6%)N    0.4140 (+2.3%)
Spanish    0.4537                  0.4605 (+1.5%)     0.4648 (+2.5%)
Swedish    0.3203                  0.4080 (+27.4%)N   0.3854 (+20.3%)△
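As a closing illustration, here is a small Python sketch of the evaluation measures defined above: precision, recall, F-score, and average precision for a single ranked result list (the toy ranking and the relevance judgements are invented for the example; in practice a test collection supplies the REL set).

```python
def precision_recall(retrieved, relevant):
    """Set-based measures: |RETR ∩ REL| / |RETR| and |RETR ∩ REL| / |REL|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def average_precision(ranking, relevant):
    """Average of p@rank(d) over the relevant documents; unretrieved relevant docs contribute 0."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank              # p@rank(d) for this relevant document
    return total / len(relevant)

# Toy example: one system's ranking and the known relevant set for a single query
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = ["d1", "d2", "d5"]

p, r = precision_recall(ranking, relevant)
print(p, r, f_score(p, r))                    # 0.4, 0.666..., F = 0.5
print(average_precision(ranking, relevant))   # (1/2 + 2/4 + 0) / 3 ≈ 0.333
```

Averaging this per-query value over a set of queries yields mean average precision (MAP), a common single-number summary for comparing system configurations.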