Information Retrieval: Scoring Documents and Data Structures - Prof. James Allan, Study notes of Computer Science

These notes cover several approaches to scoring documents in information retrieval (IR) systems: brute force, narrowing, and as needed. They also cover data structures such as bitmaps, signature files, and inverted lists, along with their advantages and limitations.

Typology: Study notes

Pre 2010

Uploaded on 08/18/2009


Information Retrieval
James Allan
University of Massachusetts Amherst

Indexing a Collection
File Organization
CMPSCI 646 Fall 2007

Scoring documents: brute force
• Sketch of approach
  – For each document in collection
    • Score the document
  – Sort the scores and select the top few
• Unusual in modern IR systems
  – Was common in old IR systems with few documents
  – Still feasible for small (relative to memory) collections
• Not really discussed further

Scoring documents: narrowing
• Two passes
• Rapidly find plausibly relevant documents to score
• Sketch of approach
  – Select approximate set of documents (somehow)
  – For each document in set
    • Score the document
  – Sort the scores and select the top few
• Reasonable approach
• If set is good enough, might even skip rescoring
• Not all that common, but occurs

Scoring documents: as needed
• Don’t score documents that would get default score
  – For VS, if score would be zero, don’t consider
  – For LM, if score would be P(D|GE), don’t consider
• Similar to “narrowing” but in a single pass
• Sketch of approach
  – For each feature that might signal relevance
    • For each document containing that feature
      – Generate and accumulate partial score
  – Sort the scores and select the top few
• Most common approach
• Implemented with some form of index
  – Inverted file is the most common

Signature files
• Consider document bitmap vectors
  – Bit set for each feature occurring in document
  – (Cf. bit set for each document containing feature.)
• Most features do not occur in a given document
  – Very wasteful
• So… why not re-use some of the bits?
  – Let elephant and telemetry share the same bit
• Could save much space
• Could result in false alarms
  – Perhaps clean them up on a second pass if needed?
• Leads to “signature files” aka “superimposed coding”

Signature file basics
• Bag-of-words only
• For each term, allocate fixed-size s-bit vector (signature)
• Define hash function(s)
  – Single function: term → 1..2^s [sets all s bits]
  – Multiple functions: term → 1..s [selects which bits to set]
• Each term has an s-bit signature
  – May not be unique

Signature File Example
[Figure omitted: 16-bit signatures; source: Managing Gigabytes]

Documents in signature file
• How to represent documents?
  – Bit-wise OR the term signatures to form document signature
• Long documents are a problem (why?)
  – Usually segment them into smaller pieces
[Figure omitted; source: Managing Gigabytes]

Signature file querying
• At query time:
  – Look up signature for query (how?)
  – If all corresponding 1-bits are “on” in document signature, document probably contains that term
  – How can this be implemented efficiently?
• Vary s to control P(false alarm)
  – Note space tradeoff
  – Optimal s changes as collection grows
• Many variations
• Widely studied
• Not widely used

Signature file trivia, setup
• False positive
  – Something identified as true when it is actually false
  – For example, something retrieved that is not relevant
• Also called “false drop”, but why?
• Some definitions of “false drop” on the web*
  – “document that is retrieved by a search but is not relevant to the searcher’s needs. False drops occur because of words that are written the same but have different meanings (for example, ‘squash’ can refer to a game, a vegetable or an action).” members.optusnet.com.au/~webindexing/Webbook2Ed/glossary.htm
  – “A web page retrieved from a search engine or directory which is not relevant to the query used.” www.mbgj.org/glossary_se_terms.htm
*Courtesy of Google’s “define” feature

Inverted Files
[Figure omitted; source: Managing Gigabytes]

Word-Level Inverted File
[Figure omitted; source: Managing Gigabytes]

Using indexes for LM approach
• Assume query likelihood approach
• Jelinek-Mercer smoothing
• Probably use logs to avoid excessively tiny numbers

Brute force document-based approach
• For each document D in collection
  – Calculate log P(Q|M_D)
• Sort scores
• Drawbacks
  – Most documents have no query terms
  – Very slow

Use inverted list to narrow
• Simple approach to using inverted list
• Use list to find documents containing any query term
  – All others assumed to have low and constant probability
• For each document in that pool
  – Calculate log P(Q|M_D)
• Sort scores
• Better
  – Only plausible documents considered
  – Still requires accessing entire document

Better use of inverted lists
• Recall score being calculated
• Can be done in parts
  – Do q1 for every document
  – Then q2 then q3 then …
• Keep array Score[ ] with cumulative scores

Accessing inverted lists
• Given term, how is a file of inverted lists accessed?
• B-Tree (B+ Tree, B* Tree, etc.)
  – Supports exact-match and range-based lookup
  – O(log n) lookups to find a list
  – Usually easy to expand
• Hash table
  – Supports exact-match lookup
  – O(1) lookups to find a list
  – May be complex to expand
• Will examine efficiency issues later

Supporting wildcards
• X* is probably easy (why? when not?)
• What about *X, *X*, X*Y?
• Permuterm index
  – Prefix each term X with a ╠
  – Rotate each augmented term cyclically (with wraparound) by one character, to produce n new terms
  – Append a ╣ to the end of each word form
    • Not absolutely needed, but handy for some searches
  – Insert all forms in the dictionary
    • All point to the same inverted list
• Example
  – term indexed as ╠term╣, m╠ter╣, rm╠te╣, erm╠t╣, term╠╣
  – team indexed as ╠team╣, m╠tea╣, am╠te╣, eam╠t╣, team╠╣

Wildcard lookups
• Exactly X: search for ╠X╣
  – term matches ╠term╣ (only time that ╣ is useful)
• Rest require prefix matching
• X*: search for all terms beginning with ╠X
  – te* uses “╠te…” to match ╠term╣, ╠team╣
• *X: search for all terms beginning with X╠
  – *am uses “am╠…” to match am╠te╣
• *X*: search for all terms beginning with X
  – *r* uses “r…” to match rm╠te╣
• X*Y: search for all terms beginning with Y╠X
  – t*m uses “m╠t…” to match m╠ter╣, m╠tea╣

Building indexes
• For each document to be indexed
  – Tokenize, normalize, stem, weight, …
  – For each token (term)
    • Fetch list for term so far
    • Add new information to end (sorted by document accession number)
• Numerous efficiency issues
  – Repeated fetching of common lists should be avoided
  – Stage stuff in memory, dump to disk as needed, merge later
  – …
• Will cover some later

Updating indexes
• Indexes expensive to update; usually done in batches
• Typical build/update procedure:
  – One or more documents arrive to be added / updated
  – Documents parsed to generate index modifications
  – Each inverted list updated for all documents in the batch
• Concurrency control required
  – To synchronize changes to documents and index
  – To prevent readers and writers from colliding
• Common to split index into static / dynamic components
  – All updates to dynamic components
  – Search both static and dynamic components
  – Periodically merge dynamic into static

Summary
Characteristics                  Inverted Files   Bitmaps   Signature Files
Ease of update (edit a doc)            -             +            +-
Query evaluation speed                 +             +-           +-
Uncompressed space efficiency          -             -            +
Compressed space efficiency            +             +            -
Index fidelity                         +             +            -
Can store word positions               +             -            -
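
The signature-file slides (term signatures, bit-wise OR to form a document signature, and the "all 1-bits on" test at query time) can be sketched roughly as follows. This is only an illustration under stated assumptions: using salted SHA-256 to stand in for several hash functions, the value K_HASHES = 3, and all function names are choices made for this example, not taken from the slides or from Managing Gigabytes. Only the 16-bit width comes from the example slide.

```python
import hashlib

S_BITS = 16    # signature width s; the slides' example uses 16-bit signatures
K_HASHES = 3   # number of hash functions; an assumed value for illustration


def term_signature(term, s=S_BITS, k=K_HASHES):
    """Return an s-bit integer with up to k bits set for this term."""
    sig = 0
    for i in range(k):
        # Salting the term with i simulates k independent hash functions.
        digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:8], "big") % s)
    return sig


def document_signature(terms, s=S_BITS, k=K_HASHES):
    """Bit-wise OR the term signatures to form the document signature."""
    sig = 0
    for term in terms:
        sig |= term_signature(term, s, k)
    return sig


def may_contain(doc_sig, term, s=S_BITS, k=K_HASHES):
    """True if every 1-bit of the term signature is on in the document signature.

    True only means "probably contains" (false alarms are possible);
    False means the term is definitely absent.
    """
    t = term_signature(term, s, k)
    return (doc_sig & t) == t


doc = ["cold", "days", "are", "hot"]
sig = document_signature(doc)
print(format(sig, "016b"))
print(all(may_contain(sig, t) for t in doc))
```

A term that was indexed always passes the test, so errors are one-sided. The sketch also hints at why long documents are a problem: each additional term ORs in more bits, and once most of the s bits are on, nearly every query term appears to match.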
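
The "Better use of inverted lists" idea (process q1 for every document on its list, then q2, and so on, accumulating into Score[ ]) might look like the sketch below. The slides do not show the scoring formula, so this assumes the common Jelinek-Mercer form P(q|M_D) = (1 - λ)·tf/|D| + λ·cf/|C|, with λ weighting the collection model; some texts put λ on the document model instead. The index layout (term → list of (doc_id, tf) pairs) and every name here are hypothetical.

```python
import math
from collections import defaultdict


def score_term_at_a_time(query, index, doc_len, coll_freq, coll_len,
                         lam=0.5, top_k=10):
    """Rank documents by log P(Q|M_D) using one pass per query term.

    index:     term -> list of (doc_id, term frequency) pairs (assumed layout)
    doc_len:   doc_id -> document length in tokens
    coll_freq: term -> total occurrences in the collection
    coll_len:  total number of tokens in the collection
    """
    scores = defaultdict(float)  # sparse stand-in for the slides' Score[] array
    background = 0.0             # score of a document containing no query term
    for q in query:
        cf = coll_freq.get(q, 0)
        if cf == 0:
            continue  # term never seen in the collection: skipped in this sketch
        p_coll = cf / coll_len
        default = math.log(lam * p_coll)
        background += default
        # Only documents on q's inverted list get more than the default.
        for doc_id, tf in index.get(q, []):
            p_doc = tf / doc_len[doc_id]
            scores[doc_id] += math.log((1 - lam) * p_doc + lam * p_coll) - default
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, s + background) for doc_id, s in ranked[:top_k]]


# Tiny made-up collection: d1 = "a b a", d2 = "b c", d3 = "c c c"
index = {"a": [(1, 2)], "b": [(1, 1), (2, 1)], "c": [(2, 1), (3, 3)]}
doc_len = {1: 3, 2: 2, 3: 3}
coll_freq = {"a": 2, "b": 2, "c": 4}
print(score_term_at_a_time(["a", "b"], index, doc_len, coll_freq, coll_len=8))
```

Documents that contain no query term never enter `scores`, which is the "don't score documents that would get the default score" behaviour: they would all share the same `background` value. No document text is read at query time, only the inverted lists and the stored lengths.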
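
The permuterm construction and the wildcard lookup rules from the slides can be tried out with a short sketch. It assumes a sorted in-memory list searched with `bisect` as a stand-in for a B-tree dictionary, maps each rotated form back to its term rather than to a real inverted list, and handles only the pattern shapes listed on the slides (X, X*, *X, *X*, X*Y). The function names are made up for this example.

```python
from bisect import bisect_left

START, END = "╠", "╣"


def permuterm_forms(term):
    """All rotations of START+term, each with END appended."""
    augmented = START + term
    return [augmented[i:] + augmented[:i] + END for i in range(len(augmented))]


def build_permuterm(terms):
    """Sorted (form, term) pairs; a real index would point to inverted lists."""
    return sorted((form, t) for t in set(terms) for form in permuterm_forms(t))


def wildcard_key(pattern):
    """Translate a pattern into the prefix to search for in the dictionary."""
    if "*" not in pattern:
        return START + pattern + END          # exactly X  -> ╠X╣
    if pattern.startswith("*") and pattern.endswith("*"):
        return pattern.strip("*")             # *X*        -> X
    head, _, tail = pattern.partition("*")    # assumes a single '*'
    return tail + START + head                # X*Y -> Y╠X, X* -> ╠X, *X -> X╠


def wildcard_lookup(entries, pattern):
    """Return the set of terms whose rotated forms start with the key."""
    prefix = wildcard_key(pattern)
    i = bisect_left(entries, (prefix,))       # first form >= prefix
    found = set()
    while i < len(entries) and entries[i][0].startswith(prefix):
        found.add(entries[i][1])
        i += 1
    return found


entries = build_permuterm(["term", "team"])
for pattern in ["term", "te*", "*am", "*r*", "t*m"]:
    print(pattern, sorted(wildcard_lookup(entries, pattern)))
```

Every wildcard shape becomes a single prefix search, which is exactly what a sorted dictionary or B-tree supports well. The cost is dictionary size: a term of n characters contributes n + 1 forms.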