Information Retrieval
James Allan
University of Massachusetts Amherst
CMPSCI 646, Fall 2007
Indexing a Collection; File Organization

Scoring documents: brute force
• Sketch of approach
  – For each document in the collection
    • Score the document
  – Sort the scores and select the top few
• Unusual in modern IR systems
  – Was common in old IR systems with few documents
  – Still feasible for small (relative to memory) collections
• Not really discussed further

Scoring documents: narrowing
• Two passes: rapidly find plausibly relevant documents, then score only those
• Sketch of approach
  – Select an approximate set of documents (somehow)
  – For each document in the set
    • Score the document
  – Sort the scores and select the top few
• Reasonable approach
  – If the set is good enough, might even skip rescoring
• Not all that common, but it occurs

Scoring documents: as needed
• Don’t score documents that would get the default score
  – For the vector space model, if the score would be zero, don’t consider the document
  – For the language modeling approach, if the score would be just the background probability P(Q|GE), don’t consider it
• Similar to “narrowing,” but in a single pass
• Sketch of approach
  – For each feature that might signal relevance
    • For each document containing that feature
      – Generate and accumulate a partial score
  – Sort the scores and select the top few
• Most common approach
• Implemented with some form of index
  – The inverted file is the most common

Signature files
• Consider document bitmap vectors
  – Bit set for each feature occurring in the document
  – (Cf. bit set for each document containing a feature)
• Most features do not occur in a given document
  – Very wasteful
• So… why not re-use some of the bits?
  – Let elephant and telemetry share the same bit
• Could save much space
• Could result in false alarms
  – Perhaps clean them up on a second pass if needed?
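The shared-bit idea above can be sketched in a few lines of Python. This is a minimal illustration, not from the slides: the 64-bit width and the MD5-based hash are arbitrary assumptions chosen to make collisions (false alarms) likely.

```python
import hashlib

WIDTH = 64  # illustrative: far fewer bit positions than distinct terms

def bit_for(term: str) -> int:
    # Hash each term to one of WIDTH shared bit positions; unrelated terms
    # (e.g. "elephant" and "telemetry") may land on the same bit.
    digest = hashlib.md5(term.encode()).digest()
    return int.from_bytes(digest[:4], "big") % WIDTH

def doc_vector(terms) -> int:
    # Set the (shared) bit for every feature occurring in the document.
    vec = 0
    for t in terms:
        vec |= 1 << bit_for(t)
    return vec

def maybe_contains(vec: int, term: str) -> bool:
    # True means "possibly present" (could be a false alarm);
    # False is always correct, so a second pass can clean up the rest.
    return bool(vec & (1 << bit_for(term)))

doc = doc_vector("the elephant walked slowly".split())
assert maybe_contains(doc, "elephant")  # a real occurrence always matches
# maybe_contains(doc, "telemetry") may also be True: that is a false alarm
```

The one-sided error is the point: a set bit proves nothing, but a clear bit rules the term out, which is what makes the second cleanup pass cheap.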
• Leads to “signature files,” aka “superimposed coding”

Signature file basics
• Bag-of-words only
• For each term, allocate a fixed-size s-bit vector (its signature)
• Define hash function(s)
  – Single function: term → 1..2^s [sets all s bits at once]
  – Multiple functions: term → 1..s [each selects one bit to set]
• Each term has an s-bit signature
  – May not be unique

Signature File Example
• 16-bit signatures [Managing Gigabytes]

Documents in a signature file
• How to represent documents?
  – Bit-wise OR the term signatures to form the document signature
• Long documents are a problem (why?)
  – Usually segment them into smaller pieces
[Managing Gigabytes]

Signature file querying
• At query time:
  – Look up the signature for the query term (how?)
  – If all of its 1-bits are “on” in the document signature, the document probably contains that term
  – How can this be implemented efficiently?
• Vary s to control P(false alarm)
  – Note the space tradeoff
  – The optimal s changes as the collection grows
• Many variations
• Widely studied
• Not widely used

Signature file trivia, setup
• False positive
  – Something identified as true when it is actually false
  – For example, something retrieved that is not relevant
• Also called a “false drop”, but why?
• Some definitions of “false drop” on the web*
  – “document that is retrieved by a search but is not relevant to the searcher’s needs. False drops occur because of words that are written the same but have different meanings (for example, ‘squash’ can refer to a game, a vegetable or an action).”
    members.optusnet.com.au/~webindexing/Webbook2Ed/glossary.htm
  – “A web page retrieved from a search engine or directory which is not relevant to the query used.”
    www.mbgj.org/glossary_se_terms.htm
*Courtesy of Google’s “define” feature

Inverted Files
[Managing Gigabytes]

Word-Level Inverted File
[Managing Gigabytes]

Using indexes for the LM approach
• Assume the query likelihood approach
• Jelinek-Mercer smoothing
• Probably use logs to avoid excessively tiny numbers

Brute force document-based approach
• For each document D in the collection
  – Calculate log P(Q|M_D)
• Sort scores
• Drawbacks
  – Most documents have no query terms
  – Very slow

Use an inverted list to narrow
• Simple approach to using an inverted list
• Use the list to find documents containing any query term
  – All others are assumed to have a low, constant probability
• For each document in that pool
  – Calculate log P(Q|M_D)
• Sort scores
• Better
  – Only plausible documents are considered
  – Still requires accessing each entire document

Better use of inverted lists
• Recall the score being calculated
• It can be computed in parts
  – Do q1 for every document
  – Then q2, then q3, then …
• Keep an array Score[ ] with cumulative scores

Accessing inverted lists
• Given a term, how is a file of inverted lists accessed?
• B-Tree (B+ Tree, B* Tree, etc.)
  – Supports exact-match and range-based lookup
  – O(log n) lookups to find a list
  – Usually easy to expand
• Hash table
  – Supports exact-match lookup
  – O(1) lookups to find a list
  – May be complex to expand
• Will examine efficiency issues later

Supporting wildcards
• X* is probably easy (why? when not?)
• What about *X, *X*, X*Y?
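The cumulative Score[ ] computation described above (process all of q1’s postings, then q2’s, and so on) can be sketched as follows. The toy postings and the simple tf-based partial score are illustrative assumptions, not the lecture’s exact scoring formula:

```python
from collections import defaultdict

# Toy inverted file: term -> list of (doc_id, term_frequency) postings.
index = {
    "cat":  [(1, 2), (3, 1)],
    "fish": [(2, 1), (3, 3)],
}

def term_at_a_time(query_terms, index):
    # Keep an array of cumulative scores (a dict here, one slot per document).
    score = defaultdict(float)
    for q in query_terms:                  # do q1 for every document, then q2, ...
        for doc_id, tf in index.get(q, []):
            score[doc_id] += tf            # generate and accumulate a partial score
    # Sort the scores and select the top few.
    return sorted(score.items(), key=lambda p: p[1], reverse=True)

ranked = term_at_a_time(["cat", "fish"], index)
# ranked == [(3, 4.0), (1, 2.0), (2, 1.0)]
```

Note that only documents appearing in some query term’s inverted list ever get a score slot, which is exactly the “as needed” behavior: documents with the default score are never touched.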
• Permuterm index
  – Prefix each term X with a ╠
  – Rotate each augmented term cyclically (with wraparound) by one character at a time, producing n new forms
  – Append a ╣ to the end of each form
    • Not absolutely needed, but handy for some searches
  – Insert all forms into the dictionary
    • All point to the same inverted list
• Example
  – term indexed as ╠term╣, m╠ter╣, rm╠te╣, erm╠t╣, term╠╣
  – team indexed as ╠team╣, m╠tea╣, am╠te╣, eam╠t╣, team╠╣

Wildcard lookups
• Exactly X: search for ╠X╣
  – team matches ╠team╣ (the only time that ╣ is useful)
• The rest require prefix matching
• X*: search for all forms beginning with ╠X
  – te* uses “╠te…” to match ╠term╣, ╠team╣
• *X: search for all forms beginning with X╠
  – *am uses “am╠…” to match am╠te╣
• *X*: search for all forms beginning with X
  – *r* uses “r…” to match rm╠te╣
• X*Y: search for all forms beginning with Y╠X
  – t*m uses “m╠t…” to match m╠ter╣, m╠tea╣

Building indexes
• For each document to be indexed
  – Tokenize, normalize, stem, weight, …
  – For each token (term)
    • Fetch the list for that term so far
    • Add the new information to the end (sorted by document accession number)
• Numerous efficiency issues
  – Repeated fetching of common lists should be avoided
  – Stage material in memory, dump to disk as needed, merge later
  – …
• Will cover some later

Updating indexes
• Indexes are expensive to update; usually done in batches
• Typical build/update procedure:
  – One or more documents arrive to be added or updated
  – Documents are parsed to generate index modifications
  – Each inverted list is updated for all documents in the batch
• Concurrency control required
  – To synchronize changes to documents and the index
  – To prevent readers and writers from colliding
• Common to split the index into static and dynamic components
  – All updates go to the dynamic component
  – Search both the static and dynamic components
  – Periodically merge the dynamic component into the static one

Summary

  Characteristics                 Inverted Files   Bitmaps   Signature Files
  Ease of update (edit a doc)           -             +            +-
  Query evaluation speed                +             +-           +-
  Uncompressed space efficiency         -             -            +
  Compressed space efficiency           +             +            -
  Index fidelity                        +             +            -
  Can store word positions              +             -            -
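The permuterm construction and prefix lookups from the wildcard slides can be sketched as follows. The ASCII characters “^” and “$” stand in for the slides’ begin and end markers, and a sorted list with binary search stands in for the dictionary’s B-tree; both substitutions are assumptions made for readability.

```python
import bisect

BEGIN, END = "^", "$"   # ASCII stand-ins for the begin/end markers

def permuterm_forms(term):
    # Prefix with the begin marker, produce every cyclic rotation,
    # then append the end marker to each form.
    aug = BEGIN + term
    rotations = [aug[i:] + aug[:i] for i in range(len(aug))]
    return [r + END for r in rotations]

# Every form points back at the same term (i.e., the same inverted list).
vocab = ["term", "team"]
dictionary = sorted((form, t) for t in vocab for form in permuterm_forms(t))

def prefix_lookup(prefix):
    # All forms sharing a prefix are contiguous in sorted order,
    # so binary-search to the first one and scan forward.
    keys = [f for f, _ in dictionary]
    i = bisect.bisect_left(keys, prefix)
    matched = set()
    while i < len(keys) and keys[i].startswith(prefix):
        matched.add(dictionary[i][1])
        i += 1
    return matched

# X*Y: rotate the pattern so the wildcard falls at the end, then prefix-match.
# t*m -> search prefix "m^t" -> matches both "term" and "team".
assert prefix_lookup("m" + BEGIN + "t") == {"term", "team"}
```

The same `prefix_lookup` handles the other cases from the slides: `te*` becomes the prefix `^te`, `*am` becomes `am^`, and `*r*` becomes plain `r`.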