605.744 Information Retrieval
Spring 2011 – Paul McNamee
Homework #2 (due in 2 weeks)

Inverted Files (70 points)

Inverted files are the primary data structure for efficiently determining which documents contain specified terms. The objective of this assignment is to process the bible-asv corpus from the course web page, much as in the first assignment, but this time you must build an inverted file that contains a postings list for each dictionary term. Your implementation should model a usable real-world indexing system; in particular, this means that your inverted file must be written to disk as a binary file, your dictionary must be written to disk, for each word in the lexicon you must store a file offset to the corresponding postings list, and, finally, you should process the raw text collection only once (many real-world collections are so big that the cost of multiple scans is prohibitive).

Construction (55 points)

Process the collection and create the dictionary and inverted file. You may use more than one 'main' program for this assignment if you prefer, but process the original raw text only once. For example, you might have a program that reads the input file and writes out records like "apple doc1432 2" and "orange doc4 3" to represent the fact that apple occurs twice in document 1432 and orange occurs three times in document 4. These records must be sorted (by term, then by docid), and you might use a separate program just for this. Finally, you have to write out the sorted entries as an inverted file. It may not be easiest to use three programs as in this example; however, real-world systems usually include at least a scan that produces temporary files and a merge of sorted temporary files, because the records for an entire collection typically will not fit in the memory of a single machine. Since our electronic text is very small, it is possible to create the index with a single program, without relying on auxiliary storage. This can be done by following the memory-based inversion algorithm in the notes (Algorithm A) and writing out the postings directly after all documents (i.e., paragraphs) have been read.

For full credit your inverted file should be a binary file.¹ As a baseline, I suggest 4-byte integers for document ids and 4-byte integers for document term frequencies (although, for this collection, 2-byte integers would suffice). I also suggest storing the length of each postings list (i.e., the document frequency, or the number of documents that the term occurs in) with the other information in your dictionary – it will be useful in HW #3.

¹ If using Java, consider the class java.io.RandomAccessFile and the writeInt method in java.io.DataOutputStream.
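By way of illustration only (the exact design is not required), a minimal Java sketch of memory-based inversion followed by a single pass that writes a binary postings file and a dictionary with byte offsets might look like the following. The tokenizer, the file names (bible-asv.txt, postings.bin, dict.txt), the one-document-per-line input, and the record layout (two 4-byte integers per posting) are assumptions made for this example, not part of the assignment.

    import java.io.*;
    import java.util.*;

    // Sketch of memory-based inversion: read documents, accumulate term -> (docid, tf)
    // counts in memory, then write a binary inverted file plus a dictionary that records
    // each term's document frequency and its byte offset into the postings file.
    public class BuildIndex {
        public static void main(String[] args) throws IOException {
            // term -> ordered map of docid -> term frequency
            TreeMap<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();

            // Hypothetical input format: one document (paragraph) per line of bible-asv.txt.
            BufferedReader in = new BufferedReader(new FileReader("bible-asv.txt"));
            String line;
            int docId = 0;
            while ((line = in.readLine()) != null) {
                docId++;
                // Placeholder tokenizer: lowercase, split on non-letters.
                for (String token : line.toLowerCase().split("[^a-z]+")) {
                    if (token.isEmpty()) continue;
                    postings.computeIfAbsent(token, t -> new TreeMap<>())
                            .merge(docId, 1, Integer::sum);
                }
            }
            in.close();

            // Write postings as pairs of 4-byte ints (docid, tf); record offsets in the dictionary.
            DataOutputStream post = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("postings.bin")));
            PrintWriter dict = new PrintWriter(new FileWriter("dict.txt"));
            long offset = 0;
            for (Map.Entry<String, TreeMap<Integer, Integer>> e : postings.entrySet()) {
                TreeMap<Integer, Integer> docs = e.getValue();
                // term, document frequency, byte offset of this term's postings
                dict.println(e.getKey() + " " + docs.size() + " " + offset);
                for (Map.Entry<Integer, Integer> d : docs.entrySet()) {
                    post.writeInt(d.getKey());   // document id
                    post.writeInt(d.getValue()); // term frequency in that document
                }
                offset += 8L * docs.size();      // each posting is two 4-byte ints
            }
            post.close();
            dict.close();
        }
    }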
In addition to submitting a printout of your source code, you should:

(1) Write a brief overview of what you did and explain the format of your lexicon and your inverted file;
(2) Report the number of documents, the size of the vocabulary (i.e., the number of unique terms), and the total number of terms observed in the collection (this information was also required on the first assignment); and
(3) Report the file sizes of your dictionary and your inverted file (in bytes). Is your index smaller than the original text? Which takes up more space, the dictionary or the inverted file?

For HW #3 you will be given queries, with the goal of ranking documents using a similarity metric such as the vector cosine method. To succeed on that assignment, it is crucial that you are able to reload the lexicon from disk and retrieve a postings list from the inverted file for any specified term.

Testing (15 points)

Demonstrate the ability to identify which documents a word occurs in and the number of times the word occurs in each. (For full credit, do this by reading your binary file formats back in.) Print the document frequency and postings list for the terms "anchor", "Amos", and "blameless". Give the document frequency, but do not print the postings, for the words "lovingkindness", "Moses", and "sing" (these postings lists are longer).

Questions (30 points: 10 points each)

[1] Would you recommend trying to reduce the size of the inverted file by compressing it with 'gzip'? Why is this, or why is it not, a good idea?

[2] Express the numbers {8, 13, and 513} three ways: using a 16-bit binary representation, and using the gamma and delta codes discussed in class.
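Along the same lines, a minimal sketch of the kind of read-back the Testing section asks for, again assuming the hypothetical dict.txt / postings.bin layout from the indexing sketch above, could use java.io.RandomAccessFile to seek to a term's offset:

    import java.io.*;
    import java.util.*;

    // Sketch: reload the dictionary, then seek into the binary postings file and
    // print the document frequency and (optionally) the postings for each query term.
    public class LookupTerm {
        public static void main(String[] args) throws IOException {
            // term -> {document frequency, byte offset into postings.bin}
            Map<String, long[]> dict = new HashMap<>();
            BufferedReader in = new BufferedReader(new FileReader("dict.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(" ");
                dict.put(parts[0], new long[]{Long.parseLong(parts[1]), Long.parseLong(parts[2])});
            }
            in.close();

            RandomAccessFile post = new RandomAccessFile("postings.bin", "r");
            for (String term : args) {              // e.g., java LookupTerm anchor amos blameless
                long[] entry = dict.get(term.toLowerCase());
                if (entry == null) {
                    System.out.println(term + ": not in lexicon");
                    continue;
                }
                int df = (int) entry[0];
                System.out.println(term + "  df=" + df);
                post.seek(entry[1]);                // jump to this term's postings
                for (int i = 0; i < df; i++) {
                    int docId = post.readInt();     // 4-byte document id
                    int tf = post.readInt();        // 4-byte term frequency
                    System.out.println("  doc " + docId + "  tf " + tf);
                }
            }
            post.close();
        }
    }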