Information Retrieval 5, Exercises - Computer Science, Exercises of Artificial Intelligence

Prof. Paul McNamee, Information Retrieval, Computer Science, Artificial Intelligence, Johns Hopkins University, Web Query Log Analysis

Typology: Exercises

2010/2011
Uploaded on 11/09/2011

605.744 Information Retrieval
Spring 2011 – McNamee
Homework #5 (due in 3 weeks)

Questions (40 points)

[1] Explain how detection of exact duplicate documents (web or otherwise) can be performed more efficiently than near duplicate detection, and without directly comparing the full text of each document to all other documents. Directly comparing against an entire collection of documents would be prohibitively expensive. (Hint: the method I am looking for can be applied to images or videos as well as text documents.)

[2] Explain how shingling is used to identify near duplicate documents in a large collection.

[3] Given the following directed graph of webpages, perform two iterations of PageRank computations. The arcs indicate outbound links between webpages. Initially give each page a PageRank score of 0.2. Use a ‘teleport’ (or transition) probability of 0.10. (Put differently, 90% of the time the random surfer follows a link to get to a new page.) Show the PageRank scores for all pages after each of the two iterations.

Web Query Log Analysis (60 points)

On the course web site I have put a file containing queries that were submitted to an Internet search engine. (Note: the file is large, over 35 MB even compressed.) The data come from the Excite search engine from 12/20/1999. This exercise asks you to work with this data.

You can use any tools you want to perform your analysis. You can use: programs developed for earlier exercises; new programs; commercial or public domain tools (e.g., Excel, MySQL, Perl); or Unix commands (e.g., grep, wc). Basically, use whatever tool(s) you see fit.

The data in the log file is unfiltered and may contain objectionable content. The data are in four tab-separated fields: timestamp; hashed user id; results-starting-point; and the query string (which may contain punctuation and spaces).
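As a starting point for working with the log, the four tab-separated fields described above could be read roughly as follows. This is only a sketch: the names `QueryRecord` and `parse_line` and the sample line are hypothetical, and the actual timestamp and user id formats in the file may look different.

```python
from typing import NamedTuple, Optional


class QueryRecord(NamedTuple):
    """One log entry; field names are illustrative, not from the assignment."""
    timestamp: str
    user_id: str
    start: str   # results-starting-point, kept as text here
    query: str


def parse_line(line: str) -> Optional[QueryRecord]:
    """Split one log line into its four tab-separated fields.

    Returns None for lines with fewer than four fields.
    """
    # maxsplit=3 keeps any stray tab inside the query string intact.
    parts = line.rstrip("\r\n").split("\t", 3)
    if len(parts) != 4:
        return None
    return QueryRecord(*parts)


# Hypothetical sample line, made up for illustration only.
sample = "991220000000\tA1B2C3\t0\tjohns hopkins university\n"
record = parse_line(sample)
```

Splitting with a maximum of three splits means the query string is taken as everything after the third tab, so punctuation and spaces in the query pass through unchanged.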
Analyze the data in the query log. I will leave the exact particulars to you, but I would expect you to include the easier items in the list below and some of the more interesting, but harder ones. You may also come up with other ways of looking at the data besides these. Of course I don't expect you to investigate all of the questions below.

o What is the mean number of queries per user id?
o Analyze the variability of query length (i.e., in words or in characters).
o What percentage of queries are mixed case? Upper case? Lower case?
o What percent of the time does a user request only the top 10 results? Top 20 results?
o Count the number of questions (look for patterns such as starting with Wh-words, or ending with a '?' symbol). What percentage of queries do questions make up? What is the most common type of question?
o What are the most common queries issued?
o What percent of queries contain stopwords like ‘and’, ‘the’, ‘of’, ‘in’, ‘for’?
o How often is ‘query’ syntax used, like phrases in quotes, or ‘+’ or ‘-’ signs?
o What are the 10 most common words appearing in queries that contain the word "download"?
o What are the k most common words appearing in queries (say, for k=20)?
o What percentage of queries were asked by only one user?
o How often is a consecutive query a reformulation of the previous one? (Not the same query to greater depth.)
o What kind of spelling mistakes do users make?
o Which occurs more often: "Al Gore" or "Johns Hopkins"? "Johns Hopkins" or "John Hopkins"?
o What percentage of queries contain a person's name?

[Figure for Question 3: directed graph of webpages with nodes Cuil, Google, Bing, Yahoo, and AV.]
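A few of the easier items in the list of log-analysis questions above can be approached with simple counting. The sketch below is one possible way, not a prescribed method: the function names are hypothetical, the four sample rows are made up, and it assumes the log has already been reduced to (user id, query) pairs.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def mean_queries_per_user(rows: Iterable[Tuple[str, str]]) -> float:
    """rows are (user_id, query) pairs; returns queries / distinct users."""
    per_user = Counter(user for user, _ in rows)
    return sum(per_user.values()) / len(per_user) if per_user else 0.0


def case_breakdown(queries: Iterable[str]) -> Counter:
    """Count queries that are all lower case, all upper case, or mixed."""
    counts: Counter = Counter()
    for q in queries:
        if q.islower():
            counts["lower"] += 1
        elif q.isupper():
            counts["upper"] += 1
        elif q.lower() != q.upper():
            # Has cased letters, but neither all lower nor all upper.
            counts["mixed"] += 1
        else:
            # No cased letters at all (e.g. digits or punctuation only).
            counts["no_letters"] += 1
    return counts


def top_words(queries: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Return the k most common whitespace-separated words, lower-cased."""
    words: Counter = Counter()
    for q in queries:
        words.update(q.lower().split())
    return words.most_common(k)


# Made-up rows for illustration; these are not taken from the Excite log.
rows = [
    ("u1", "Al Gore"),
    ("u1", "johns hopkins"),
    ("u2", "FREE MP3"),
    ("u3", "johns hopkins hospital"),
]
queries = [q for _, q in rows]
mean = mean_queries_per_user(rows)
cases = case_breakdown(queries)
common = top_words(queries, 2)
```

Splitting on whitespace is a deliberate simplification; punctuation attached to words (quotes, '+' or '-' signs) would need separate handling before counting.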