Information Retrieval 5, Exercises - Computer Science, Exercises of Artificial Intelligence

Johns Hopkins University (JHU), Artificial Intelligence

Prof. Paul McNamee, Information Retrieval, Computer Science, Artificial Intelligence, Johns Hopkins University, Exercises, Web Query Log Analysis

Typology: Exercises

2010/2011


Uploaded on 11/09/2011

stagist 🇺🇸


605.744 Information Retrieval
Spring 2011 – McNamee
Homework #5 (due in 3 weeks)

Questions (40 points)

[1] Explain how detection of exact duplicate documents (web or otherwise) can be performed more efficiently than near duplicate detection, and without directly comparing the full text of each document to all other documents. Directly comparing against an entire collection of documents would be prohibitively expensive. (Hint: the method I am looking for can be applied to images or videos as well as text documents.)

[2] Explain how shingling is used to identify near duplicate documents in a large collection.

[3] Given the following directed graph of webpages, perform two iterations of PageRank computations. The arcs indicate outbound links between webpages. Initially give each page a PageRank score of 0.2. Use a ‘teleport’ (or transition) probability of 0.10. (Put differently, 90% of the time the random surfer follows a link to get to a new page). Show the PageRank scores for all pages after each of the two iterations.

Web Query Log Analysis (60 points)

On the course web site I have put a file containing queries that were submitted to an Internet search engine. (Note: the file is large, over 35 MB even compressed.) The data comes from the Excite search engine from 12/20/1999. This exercise asks you to work with this data.

You can use any tools you want to perform your analysis. You can use: programs developed for earlier exercises; new programs; commercial or public domain tools (e.g., Excel, MySQL, Perl); or, Unix commands (e.g., grep, wc). Basically, use whatever tool(s) you see fit.

The data in the log file is unfiltered and may contain objectionable content. The data are in four tab-separated fields: timestamp; hashed user id; results-starting-point; and the query string (which may contain punctuation and spaces).
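As a starting point for working with the log, the four tab-separated fields described above could be read with a small parser like the following. This is only a sketch: the names `QueryRecord` and `parse_line` are made up for illustration, the results-starting-point is assumed to be an integer, and the sample line in the usage note is invented rather than taken from the real file.

```python
from typing import NamedTuple, Optional


class QueryRecord(NamedTuple):
    """One line of the query log (field names are illustrative)."""
    timestamp: str   # kept as a string; the exact format is not assumed
    user_id: str     # hashed user id
    start: int       # results-starting-point (assumed to be an integer)
    query: str       # query string, may contain punctuation and spaces


def parse_line(line: str) -> Optional[QueryRecord]:
    """Parse one log line; return None if it does not have four usable fields."""
    # Split on the first three tabs only, so that any later tab characters
    # stay inside the query string instead of creating extra fields.
    parts = line.rstrip("\r\n").split("\t", 3)
    if len(parts) != 4:
        return None
    timestamp, user_id, start, query = parts
    try:
        offset = int(start)
    except ValueError:
        # Assumption violated: starting point is not an integer; skip the line.
        return None
    return QueryRecord(timestamp, user_id, offset, query)
```

For example, `parse_line("19991220120000\tA1B2C3\t0\tjohns hopkins\n")` yields a record whose `query` is `"johns hopkins"`, while a line with fewer than four fields yields `None`. Skipping malformed lines rather than raising is a design choice suited to an unfiltered log; counting the skipped lines would be a sensible addition.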
Analyze the data in the query log. I will leave the exact particulars to you, but I would expect you to include the easier items in the list below and some of the more interesting, but harder ones. You may also come up with other ways of looking at the data besides these. Of course I don't expect you to investigate all of the questions below.

o What is the mean number of queries per user id?
o Analyze the variability of query length (i.e., in words or in characters).
o What percentage of queries are mixed case? Upper case? Lower case?
o What percent of the time does a user request only the top 10 results? Top 20 results?
o Count the number of questions (look for patterns such as starting with Wh-words, or ending with a '?' symbol). What percentage of queries do questions make up? What is the most common type of question?
o What are the most common queries issued?
o What percent of queries contain stopwords like ‘and’, ‘the’, ‘of’, ‘in’, ‘for’?
o How often is ‘query’ syntax used, like phrases in quotes, or ‘+’ or ‘-’ signs?
o What are the 10 most common words appearing in queries that contain the word download?
o What are the k most common words appearing in queries (say, for k=20)?
o What percentage of queries were asked by only one user?
o How often is a consecutive query a reformulation of the previous one? (Not the same query to greater depth.)
o What kind of spelling mistakes do users make?
o Which occurs more often: "Al Gore" or "Johns Hopkins"? "Johns Hopkins" or "John Hopkins"?
o What percentage of queries contain a person's name?

[Figure for question 3 not reproduced; surviving node labels: Cuil, Google, Bing, Yahoo, AV]
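A few of the easier items in the list above (mean queries per user, case breakdown, share of questions, most common queries) could be computed in one pass, roughly as sketched below. The function names, the `WH_WORDS` set (which adds "how" as an assumption), and the rule for classifying case are illustrative choices, not part of the assignment; the input is assumed to be an iterable of `(user_id, query)` pairs already extracted from the log.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed question starters; "how" is included although it is not a Wh-word.
WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}


def case_class(query: str) -> str:
    """Classify a query as 'lower', 'upper', 'mixed', or 'no_letters'."""
    if query.islower():
        return "lower"
    if query.isupper():
        return "upper"
    if any(c.islower() or c.isupper() for c in query):
        return "mixed"
    return "no_letters"  # e.g. digits or punctuation only


def is_question(query: str) -> bool:
    """Heuristic: ends with '?' or starts with one of WH_WORDS."""
    words = query.split()
    starts_wh = bool(words) and words[0].lower() in WH_WORDS
    return query.rstrip().endswith("?") or starts_wh


def summarize(records: Iterable[Tuple[str, str]], top_n: int = 10) -> Dict:
    """Compute simple statistics over (user_id, query) pairs."""
    per_user: Counter = Counter()
    queries: Counter = Counter()
    cases: Counter = Counter()
    questions = 0
    total = 0
    for user_id, query in records:
        total += 1
        per_user[user_id] += 1
        queries[query] += 1
        cases[case_class(query)] += 1
        questions += is_question(query)
    return {
        "mean_queries_per_user": total / len(per_user) if per_user else 0.0,
        "case_pct": {k: 100 * v / total for k, v in cases.items()},
        "question_pct": 100 * questions / total if total else 0.0,
        "top_queries": queries.most_common(top_n),
    }
```

Queries are counted exactly as typed here; whether to lowercase or strip whitespace before counting the most common queries is a judgment call that changes the results and is worth stating in any write-up.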
