Introduction to Computational Linguistics: Information Retrieval
Christof Monz

What is Information Retrieval?
• Finding relevant information in large collections of data
• In such a collection you may want to find:
  ◮ ‘Give me information on the history of the Kennedys’
    — an article about the Kennedys (text retrieval)
  ◮ ‘What does a brain tumor look like on a CT-scan?’
    — a picture of a brain tumor (image retrieval)
  ◮ ‘It goes like this: hmm hmm hahmmm . . . ’
    — a certain song (music retrieval)

Text Retrieval
• Online library catalogs (OPAC)
• Internet search engines, such as AltaVista and Google
• Specialized systems (aka vendors):
  ◮ MEDLINE (medical articles)
  ◮ Lexis-Nexis (legal, business, academic, . . . )
  ◮ Westlaw (legal articles)
  ◮ Dialog (business information)

Retrieval vs. Browsing
• Popular web directories: Yahoo!, Open Directory Project (dmoz)
• The user has to ‘guess’ the ‘right’ directories to find the information
  ◮ The user has to adapt to the designers’ conceptualization of the directory
• The goal of information retrieval is to provide immediate random access to the data
  ◮ The user can specify their information need directly

IR vs. Database Querying
• IR is not the same thing as querying a database
• Database querying assumes that the data is in a standardized format
• Transforming all information (news articles, web sites, . . . ) into a database format is difficult, and impossible for large data collections
• Text retrieval can work with plain, unformatted data

Relevance as Similarity
• A fundamental idea within IR: ‘A document is relevant to a query if they are similar’
• Similarity can be defined as:
  ◮ string matching/comparison
  ◮ similar vocabulary
  ◮ same meaning of text

The Ubiquity of IR
• Information filtering
  ◮ E-mail routing
  ◮ Text categorization
• Detecting information structure
  ◮ Hyperlink generation
  ◮ Topic/information detection and screening
  ◮ Portal development and maintenance
• Question answering

History of IR
• 1950: Calvin N. Mooers coins the term ‘information retrieval’
• 1959: Luhn describes statistical retrieval
• 1960: Maron and Kuhns define a probabilistic model of IR
• 1966: The Cranfield project defines evaluation measures
• 1968: Gerard Salton’s first book about the SMART retrieval system
• 1972: Lockheed introduces DIALOG as a commercial online service
• Late 1980s: first PC systems incorporate retrieval

Automatic Content Representation
• Using natural language understanding?
  ◮ Computationally too expensive in real-world settings
  ◮ Limited coverage
  ◮ Language dependence
  ◮ The resulting representations may be too explicit to deal with the vagueness of a user’s information need
• Alternative: a document is represented simply as the unstructured set of words appearing in it: the bag of words

Bag-of-Words Approach
• A document is an unordered list of words; grammatical information is lost
• Tokenization: what is a word? Is ‘White House’ one word or two?
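As a rough illustration of the bag-of-words representation, here is a minimal sketch in Python (an addition to these notes, not part of the original slides; the regex-based tokenizer and the function name `bag_of_words` are assumptions). It also folds everything to lower case, and it forces a decision on the ‘White House’ question: the name inevitably falls apart into two tokens.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on non-letter characters, and count tokens.

    A deliberately naive tokenizer: punctuation is discarded and
    multi-word units such as 'White House' become two tokens.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

bow = bag_of_words("The President visited the White House.")
# 'the' occurs twice; 'White House' is split into 'white' and 'house'
```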
• Case folding: ‘President Bush’ becomes ‘president’, ‘bush’
• Stemming or lemmatization: morphological information is thrown away
  ◮ ‘agreements’ becomes ‘agreement’ (lemmatization) or even ‘agree’ (stemming)

Example Bag of Words

Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

becomes:

a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2×), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on (2×), planet, possible, red, scientists, that, the, to

What is this about?

added, al, an, and, ballots, been, completed, count, county (2×), even, former, gore, ground, had, hand, have (2×), he, if, in (2×), independent, lost, many, miami-dade, might, new, not, of, president, presidential, requested, shows, study, that, the, vice, votes, would

= An independent study shows former Vice President Al Gore would not have added many new votes in Miami-Dade County and might even have lost ground in that county, if the hand count of presidential ballots he requested had been completed.

Boolean Retrieval
• The Boolean operators are AND (with its proximity variant NEAR), OR, and NOT
• The semantics of the Boolean operators:
  ◮ t1 AND t2 = {d | t1 ∈ r(d)} ∩ {d | t2 ∈ r(d)}
    — documents whose representation contains both t1 and t2
  ◮ t1 OR t2 = {d | t1 ∈ r(d)} ∪ {d | t2 ∈ r(d)}
    — documents whose representation contains t1 or t2
  ◮ NOT t1 = {d | t1 ∉ r(d)}
    — documents whose representation does not contain t1

Boolean Retrieval
• Information need: President Bill Clinton
• Boolean query: clinton AND (bill OR president)
  [Figure: Venn diagram of the document sets matching bill, clinton, and president]

Zipf’s Law
[Figure: number of occurrences plotted against words sorted by decreasing frequency]
• Only a few words occur many times
• A lot of words occur only once (hapax legomena)

Searching the Collection
• Finding a word by linear search is inefficient
• Solution: construct a matrix which indexes documents and words
  ◮ The matrix can be extremely large
  ◮ The matrix will be sparse (Zipf’s law)
• Better: construct an inverted index
  ◮ A word points to the documents in which it occurs
• This is an implementational (not a modeling) issue!

Pros and Cons of Boolean Retrieval
• Pros:
  + Clean and simple formalism
  + Firm grip on query formulation
• Cons:
  − Most non-experts cannot handle Boolean expressions, and query formulation may be time-consuming
  − No ranking of the retrieved documents
  − Exact matching may lead to too few or too many retrieved documents

Alternatives to Boolean Retrieval
• Vector space retrieval:
  ◮ Users can enter free text
  ◮ Documents are ranked
  ◮ Best match instead of exact match
• Probabilistic retrieval

Vector Space Retrieval
• By far the most common modern retrieval model
• Features:
  ◮ Users can enter free text
  ◮ Documents are ranked
  ◮ Relaxation of the matching criterion
• Key idea: everything (documents, queries, terms) is a vector in a high-dimensional space

Vector Space Representation

        t1  t2  t3  t4  ...
  d1     1   0   0   1  ...
  d2     0   1   0   1  ...
  d3     0   0   1   1  ...
  d4     1   1   1   0  ...
  ...

• Documents are vectors of terms
• Terms are vectors of documents
• Similarly, a query is a vector of terms

tf.idf-score
• tf-score: tf_i,j = frequency of term i in document j
• idf-score: idf_i = log(N / n_i), where
  ◮ N is the size of the collection (number of documents)
  ◮ n_i is the number of documents in which term i occurs
  ◮ the logarithm is used for dampening
• The weight of term i in document j is then computed as tf_i,j · idf_i

Evaluation
• Retrieval systems contain many parameters:
  ◮ term weights (different ways of weighting)
  ◮ document length normalization (apply it or not)
  ◮ what to index (stop word removal)
• How do we know the best way to set those parameters?
• Evaluation!

Precision and Recall
[Figure: Venn diagram of the collection, the relevant documents (REL), the retrieved documents (RETR), and their intersection]
• Note that the set REL is not known in advance
• Test collections are used, for which REL is known

Precision and Recall
• Precision is the fraction of the retrieved documents (RETR) that is relevant:
    precision = |RETR ∩ REL| / |RETR|
• Recall is the fraction of the relevant documents (REL) that has been retrieved:
    recall = |RETR ∩ REL| / |REL|

Single Value Summaries
• Harmonic mean (F-score)
  ◮ Combines precision and recall into a single number ranging from 0 to 1:
      F = (2 · prec · rec) / (prec + rec)
• E-measure
  ◮ The relative importance of precision and recall can be varied:
      E = 1 − (1 + b²) / (b²/rec + 1/prec)
  ◮ b > 1 emphasizes recall, b < 1 emphasizes precision

Other Measures
• R-precision
  ◮ If |REL_q| is the number of relevant documents for a query q, compute the precision at rank |REL_q|
  ◮ Varies with each query
• p@n
  ◮ Compute the precision at a fixed rank n for every query
  ◮ Useful, for instance, when evaluating search engines

Mean Average Precision
• Topic q has |REL_q| relevant documents
• Let rank(d) = n, where d ∈ REL_q.
  The function rank(d) returns the rank of document d in the ranking returned by the system
• The average precision for query q is then defined as:
    AP_q = ( Σ_{d ∈ REL_q} p@rank(d) ) / |REL_q|
• If d ∉ RETR, then rank(d) = 0 and p@0 = 0
• Mean average precision (MAP) is the mean of AP_q over all queries

Stemming

  Language   Word-based (baseline)   Stemmed   % change   Lemmatized   % change
  Dutch      0.4482                  0.4535    +1.2%      –            –
  English    0.4460                  0.4639    +4.0%      0.4003       −10.2%
  Finnish    0.2545                  0.3308    +30.0% ▲   –            –
  French     0.4296                  0.4348    +1.2%      0.4116       −4.2%
  German     0.3886                  0.4171    +7.3% △    0.4118       +6.0% △
  Italian    0.4049                  0.4248    +4.9%      0.4146       +2.4%
  Spanish    0.4537                  0.5013    +10.5% ▲   –            –
  Swedish    0.3203                  0.3256    +1.7%      –            –

  (△ and ▲ mark statistically significant improvements over the baseline.)

Decompounding

  Language   Word-based (baseline)   Split    % change    Stemmed (baseline)   Split+Stem   % change
  Dutch      0.4482                  0.4662   +4.0%       0.4535               0.4698       +3.6%
  Finnish    0.2545                  0.3020   +18.7% △    0.3308               0.3633       +9.8%
  German     0.3886                  0.4360   +12.2% △    0.4171               0.4816       +15.5% ▲
  Swedish    0.3203                  0.3395   +6.0%       0.3256               0.4080       +25.3% ▲

N-grams

  Language   Word-based (baseline)   4-gram (within)     5-gram (within)
  Dutch      0.4482                  0.4495 (+0.3%)      0.4401 (−1.8%)
  English    0.4460                  0.4793 (+7.3%)      0.4341 (−2.7%)
  Finnish    0.2545                  0.3536 (+37.4%) ▲   0.3762 (+47.8%) ▲
  French     0.4296                  0.4583 (+6.7%)      0.4348 (+1.2%)
  German     0.3886                  0.4679 (+20.3%) ▲   0.4699 (+20.9%) ▲
  Italian    0.4049                  0.4355 (+7.6%) ▲    0.4140 (+2.3%)
  Spanish    0.4537                  0.4605 (+1.5%)      0.4648 (+2.5%)
  Swedish    0.3203                  0.4080 (+27.4%) ▲   0.3854 (+20.3%) △
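The ‘within-word’ character n-grams compared in the last table can be sketched in a few lines of Python (a hedged illustration, not the original experimental code; the function name and the choice to keep short words whole are assumptions of these notes). N-grams never cross word boundaries, which is what ‘within’ refers to.

```python
def within_word_ngrams(word, n):
    """Decompose a word into all overlapping character n-grams.

    Words of length <= n are kept as a single token; n-grams are
    taken within the word only, never across word boundaries.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# e.g. 4-grams of a German compound:
tokens = within_word_ngrams("aktienkurs", 4)
# ['akti', 'ktie', 'tien', 'ienk', 'enku', 'nkur', 'kurs']
```

Indexing such n-grams instead of (or alongside) full words is one way to get stemming- and decompounding-like matching for free, which is consistent with the large gains shown above for compounding-rich languages such as Finnish, German, and Swedish.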