605.744 Information Retrieval
Spring 2011 – Paul McNamee
Homework #2 (due in 2 weeks)

Inverted Files (70 points)

Inverted files are the primary data structure for efficiently determining which documents contain specified terms. The objective of this assignment is to process the bible-asv corpus from the course web page, much as in the first assignment, but this time you must build an inverted file that contains a postings list for each dictionary term. Your implementation should model a usable real-world indexing system; in particular, this means that your inverted file must be written to disk as a binary file, your dictionary must be written to disk, for each word in the lexicon you must store a file offset to the corresponding postings list, and, finally, you should process the raw text collection only once (many real-world collections are so big that the cost of multiple scans is prohibitive).

Construction (55 points)

Process the collection and create the dictionary and inverted file. You may use more than one 'main' program for this assignment if you prefer, but process the original raw text only once. For example, you might have a program that reads the input file and writes out records like "apple doc1432 2" and "orange doc4 3" to represent the fact that apple occurs twice in document 1432 and orange occurs three times in document 4. These records must be sorted (by term, then by docid), and you might use a separate program just for this. Finally, you have to write out the sorted entries as an inverted file. It may not be easiest to use three programs as in this example; however, real-world systems usually include at least a scan that produces temporary files and a merge of sorted temporary files, because the records for an entire collection typically will not fit in the memory of a single machine. Since our electronic text is very small, it is possible to create the index with a single program, without relying on auxiliary storage. This can be done by following the memory-based inversion algorithm in the notes (Algorithm A) and writing out the postings directly after all documents (i.e., paragraphs) have been read.

For full credit your inverted file should be a binary file.¹ As a baseline, I suggest 4-byte integers for document ids and 4-byte integers for document term frequencies (although, for this collection, 2-byte integers would suffice). I also suggest storing the length of each postings list (i.e., the document frequency, or the number of documents that the term occurs in) with the other information in your dictionary – it will be useful in HW #3.

¹ If using Java, consider the class java.io.RandomAccessFile and the writeInt method in java.io.DataOutputStream.
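By way of illustration only (the exact design is not required), a minimal Java sketch of memory-based inversion followed by a single pass that writes a binary postings file and a dictionary with byte offsets might look like the following. The tokenizer, the file names (bible-asv.txt, postings.bin, dict.txt), the one-document-per-line input, and the record layout (two 4-byte integers per posting) are assumptions made for this example, not part of the assignment.

    import java.io.*;
    import java.util.*;

    // Sketch of memory-based inversion: read documents, accumulate term -> (docid, tf)
    // counts in memory, then write a binary inverted file plus a dictionary that records
    // each term's document frequency and its byte offset into the postings file.
    public class BuildIndex {
        public static void main(String[] args) throws IOException {
            // term -> ordered map of docid -> term frequency
            TreeMap<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();

            // Hypothetical input format: one document (paragraph) per line of bible-asv.txt.
            BufferedReader in = new BufferedReader(new FileReader("bible-asv.txt"));
            String line;
            int docId = 0;
            while ((line = in.readLine()) != null) {
                docId++;
                // Placeholder tokenizer: lowercase, split on non-letters.
                for (String token : line.toLowerCase().split("[^a-z]+")) {
                    if (token.isEmpty()) continue;
                    postings.computeIfAbsent(token, t -> new TreeMap<>())
                            .merge(docId, 1, Integer::sum);
                }
            }
            in.close();

            // Write postings as pairs of 4-byte ints (docid, tf); record offsets in the dictionary.
            DataOutputStream post = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("postings.bin")));
            PrintWriter dict = new PrintWriter(new FileWriter("dict.txt"));
            long offset = 0;
            for (Map.Entry<String, TreeMap<Integer, Integer>> e : postings.entrySet()) {
                TreeMap<Integer, Integer> docs = e.getValue();
                // term, document frequency, byte offset of this term's postings
                dict.println(e.getKey() + " " + docs.size() + " " + offset);
                for (Map.Entry<Integer, Integer> d : docs.entrySet()) {
                    post.writeInt(d.getKey());   // document id
                    post.writeInt(d.getValue()); // term frequency in that document
                }
                offset += 8L * docs.size();      // each posting is two 4-byte ints
            }
            post.close();
            dict.close();
        }
    }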
In addition to submitting a printout of your source code, you should:

(1) Write a brief overview of what you did and explain the format of your lexicon and your inverted file;
(2) Report the number of documents, the size of the vocabulary (i.e., the number of unique terms), and the total number of terms observed in the collection (this information was also required on the first assignment); and
(3) Report the file sizes of your dictionary and your inverted file (in bytes). Is your index smaller than the original text? Which takes up more space, the dictionary or the inverted file?

For HW #3 you will be given queries, with the goal of ranking documents using a similarity metric such as the vector cosine method. To succeed on that assignment, it is crucial that you are able to reload the lexicon from disk and retrieve a postings list from the inverted file for any specified term.

Testing (15 points)

Demonstrate the ability to identify which documents a word occurs in and the number of times the word occurs in each. (For full credit, do this by reading your binary file formats back in.) Print the document frequency and postings list for the terms "anchor", "Amos", and "blameless". Give the document frequency, but do not print the postings, for the words "lovingkindness", "Moses", and "sing" (these postings lists are longer).

Questions (30 points: 10 points each)

[1] Would you recommend trying to reduce the size of the inverted file by compressing it with 'gzip'? Why is this, or why is it not, a good idea?

[2] Express the numbers {8, 13, and 513} three ways: using a 16-bit binary representation, and using the gamma and delta codes discussed in class.
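Along the same lines, a minimal sketch of the kind of read-back the Testing section asks for, again assuming the hypothetical dict.txt / postings.bin layout from the indexing sketch above, could use java.io.RandomAccessFile to seek to a term's offset:

    import java.io.*;
    import java.util.*;

    // Sketch: reload the dictionary, then seek into the binary postings file and
    // print the document frequency and (optionally) the postings for each query term.
    public class LookupTerm {
        public static void main(String[] args) throws IOException {
            // term -> {document frequency, byte offset into postings.bin}
            Map<String, long[]> dict = new HashMap<>();
            BufferedReader in = new BufferedReader(new FileReader("dict.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(" ");
                dict.put(parts[0], new long[]{Long.parseLong(parts[1]), Long.parseLong(parts[2])});
            }
            in.close();

            RandomAccessFile post = new RandomAccessFile("postings.bin", "r");
            for (String term : args) {              // e.g., java LookupTerm anchor amos blameless
                long[] entry = dict.get(term.toLowerCase());
                if (entry == null) {
                    System.out.println(term + ": not in lexicon");
                    continue;
                }
                int df = (int) entry[0];
                System.out.println(term + "  df=" + df);
                post.seek(entry[1]);                // jump to this term's postings
                for (int i = 0; i < df; i++) {
                    int docId = post.readInt();     // 4-byte document id
                    int tf = post.readInt();        // 4-byte term frequency
                    System.out.println("  doc " + docId + "  tf " + tf);
                }
            }
            post.close();
        }
    }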