University of Strathclyde at TREC HARD

Mark Baillie, David Elsweiler, Emma Nicol, Ian Ruthven, Simon Sweeney, Murat Yakici, Fabio Crestani, and Monica Landoni

i-lab group, Department of Computer and Information Science, University of Strathclyde, Glasgow, UK
{mb, dce, emma, ir, simon, murat, fabioc, monica}@cis.strath.ac.uk

1 Motivation

The motivation behind the University of Strathclyde's approach to this year's HARD track was inspired by the previous experiences of other participants, in particular the research of [1], [3] and [4]. A running theme throughout these papers is the underlying hypothesis that a user's familiarity with a topic (i.e. their previous experience of searching a subject) forms the basis for what type or style of document they will perceive as relevant. In other words, the user's context, with regard to their previous search experience, will determine what type of document(s) they wish to retrieve.

1.1 Previous Research

Belkin et al. stated that searchers "who are familiar with a topic will want to see documents that are detailed and terminologically specific, and people who are unfamiliar with a topic will want to see general and relatively simple documents" [1]. Documents in the corpus were assessed by how "readable" they were, using a standard measure called the Flesch readability score [2]. The Flesch score for a document is derived from the mean number of syllables per word and the mean number of words per sentence. For each topic, a document's Flesch score was combined with the corresponding Retrieval Status Value (RSV) estimated from the initial document ranking, also known as the baseline. It was discovered that this combination (of the normalised Flesch readability and estimated relevance scores) gave greater weight to readable documents in the ranking, aiding those users with low topic familiarity.

In a similar vein, Harper et al. hypothesised that users familiar with a topic will prefer documents in which highly discriminating terms occur, and users unfamiliar with a topic will prefer documents in which highly representative terms occur [3]. In other words, by identifying terms very specific to a topic, documents with detailed information (e.g. highly technical documents) were pushed up the original document ranking. Conversely, for users not familiar with the subject, expanding the original query with terms very general to the topic boosts those documents that provide an overview. Depending on the user's context, the original baseline ranking can be reordered to give more importance to documents with a high proportion of either representative or discriminative terms, respectively. Analysis of the performance found that Discriminative queries were (on average) effective at improving the original baseline document ranking, particularly when a user had previous knowledge of the topic they were searching.

Kelly et al. measured user familiarity, taken from the meta-data of the 2004 topics, against the number of times the person had searched for information about the topic in the past [4]. As expected, they found a degree of agreement between the number of times a user had previously searched on a topic and their stated topic familiarity. In other words, users familiar with a TREC topic had, on average, searched more often on that subject than those who stated unfamiliarity.
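The readability-based re-ranking described above combines a document's Flesch score with its retrieval score. The exact normalisation and interpolation are not reproduced in this preview, so the following is a minimal sketch under stated assumptions: min-max normalisation of both scores, a crude vowel-group syllable counter, and an illustrative interpolation weight alpha that plays the same role as the readability weight α discussed in Section 3.1 (the actual combination formula is an assumption, not taken from [1] or from our runs).

```python
import re

def flesch_reading_ease(text):
    """Standard Flesch reading-ease formula (higher scores = more readable)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Crude syllable estimate: count vowel groups per word (an approximation).
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

def min_max(scores):
    """Min-max normalise a list of scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def rerank_by_readability(docs, rsvs, alpha=0.5):
    """Combine normalised RSV and Flesch scores; alpha is illustrative only."""
    flesch = min_max([flesch_reading_ease(d) for d in docs])
    rsv_n = min_max(rsvs)
    combined = [(1 - alpha) * r + alpha * f for r, f in zip(rsv_n, flesch)]
    # Return document indices ordered by the combined score, best first.
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
```

A larger alpha gives readable documents more influence on the final ranking, which is the intended effect for users with low topic familiarity.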
A clarification form was designed to obtain information both on user topic familiarity and on other information that could be used to improve the original baseline ranking. Questions included what the user already knew about the topic and what they wanted to know, and a large text box allowed the user to add any further keywords that they felt described the topic. This information was then used both to determine a user's topic familiarity, via a new measure derived from the length of answers and the other responses to the clarification form, and to expand the original queries. The original queries submitted for each topic were expanded using different combinations of feedback from the clarification form, with varying degrees of success.

1.2 Summary

The common theme running through the above works can be summarised in the following hypothesis:

– H1: A user's familiarity with a topic will have an impact on what type of documents they will find relevant.

A user with very little background knowledge of a topic will, potentially, find an overview document more helpful initially than a document with very specific (possibly technical) content. As a consequence, documents that are in some way general to the topic (e.g. little technical detail, simple introductory pieces, etc.) are more likely to be judged relevant by that user. In comparison, a user with a high degree of familiarity or background knowledge will (potentially) prefer documents that contain detailed comment on the topic.

To expand our motivation further, for HARD we examined another aspect of the user's context. The users in the HARD track are not typical searchers, but TREC assessors. In some regards, these assessors own neither the topic nor the original query submitted to the Information Retrieval system. Potentially, an assessor may have both little knowledge of the topic and, importantly, little interest in researching that topic. We believe that this factor (a user's interest) will also have an impact on the type and style of document they wish to retrieve. We posit that a user with little interest in, and little background knowledge of, a subject would prefer documents with little technical "jargon" that are accessible to read, short in length, and stylistically pleasing (with a high number of motivating terms and phrases). Summarising our motivation, we assume that a user (in this case a TREC assessor) will often be assigned a search task in which they have little interest and/or of which they have little previous knowledge. We therefore formulate a second hypothesis:

– H2: A user's interest in a topic will have an impact on what type of documents they will find relevant.

A user with little interest in reading about a topic will prefer documents that contain little technical material and are stylistically pleasing to read.

In order to investigate both hypotheses (H1 and H2), we compared a number of approaches, including Pseudo-Relevance Feedback, Flesch readability scores, Representative and Discriminative queries, and a new approach that expands the original query with Motivating terms. In the following sections we introduce these techniques, in particular the concept of expanding the original query with motivating terms. We then report the results and findings of each technique, before concluding our first attempt at both the HARD track and TREC.

2 Methodology

In this section we introduce the different approaches that we investigated.
All algorithms were implemented using the Lemur Information Retrieval framework [6]. Also, for the HARD track evaluation, the submitted runs from all groups were compared against a baseline run declared by each participating group.

The Kullback-Leibler (KL) divergence measure is then used to determine each term's contribution to the topic within the corpus vocabulary. KL is typically used for measuring the difference between two probability distributions [5]. When applied to the problem of measuring the distance between two term distributions (language models), KL estimates the relative entropy between the probability of a term t occurring in the actual collection Θa (i.e. p(t|Θa)) and the probability of the term t occurring in the estimated topic language model (LM) Θe (i.e. p(t|Θe)). KL is defined as

KL(\Theta_e \,\|\, \Theta_a) = \sum_{t \in V} p(t|\Theta_e) \log \frac{p(t|\Theta_e)}{p(t|\Theta_a)}    (4)

where

p(t|\Theta_a) = \frac{n(t, \Theta_a)}{\sum_{t \in \Theta_a} n(t, \Theta_a)}    (5)

and

p(t|\Theta_e) = \frac{\sum_{d \in \Theta_e} n(t, d) + \alpha}{\sum_{t} \left( \sum_{d \in \Theta_e} n(t, d) + \alpha \right)}    (6)

where n(t, d) is the number of times t occurs in a document d and α is a small non-zero constant (Laplace smoothing). The smaller the KL divergence, the closer the topic is to the actual collection, with a zero KL score indicating two identical distributions. To account for the sparsity within Θe, Laplace smoothing was applied to alleviate the zero-probability problem [7].

Instead of determining the difference between two term distributions (i.e. the collection and topic LMs), we are interested in the individual term contribution to the topic LM, a term's contribution being its KL score. The greater the contribution to the topic model, the higher the KL score. Therefore, for each term t the contribution was calculated by

KL(t) = p(t|\Theta_e) \log \frac{p(t|\Theta_e)}{p(t|\Theta_a)}    (7)

The top C ranked terms in a topic model are then ranked further according to each term's "representative" and "discriminative" properties. To rank a term's discriminative property (i.e. how specific the term is to the topic), the KL discriminative score for term t is calculated by

KL_d(t) = \log \frac{p(t|\Theta_e)}{p(t|\Theta_a)}    (8)

To measure how general a term is to the topic LM, the KL representative score is used, calculated by

KL_r(t) = p(t|\Theta_e)    (9)

For each topic, the top Q ranked terms according to either KL-discrimination or KL-representation (equations 8 and 9 respectively) are then used to expand the query. For those users with low familiarity and/or topic interest, the top Q ranked representative terms are applied.

2.5 Query Expansion using Motivating Terms

One of the main assumptions stated earlier is that a user's interest in a topic has a bearing on the types of documents they will find relevant. For example, a user searching a topic in which they have little interest may find stylistically pleasing documents preferable to very verbose, technically specific ones. A particular stylistic technique for drawing a reader's attention is to use motivating terms and phrases within the text. We therefore assume that documents containing a high number of motivating terms may provide a more suitable entry point into the topic for users with little interest in searching on the subject.

In order to determine which motivating terms to include in the expanded query, a list of typical motivating terms and phrases was collated, compiled manually from words that indicate or portray emotion.
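To make the term scoring of equations (7) to (9) concrete, the sketch below estimates a Laplace-smoothed topic language model from the top ranked documents and scores every term in the vocabulary; the same per-term KL contribution is reused below to rank the motivating-term list. The whitespace tokenisation, dictionary data structures, and the small fallback probability for terms unseen in the collection sample are illustrative assumptions; the actual runs were implemented with the Lemur toolkit.

```python
import math
from collections import Counter

def term_counts(doc):
    # Illustrative whitespace tokenisation; the actual runs used the Lemur toolkit.
    return Counter(doc.lower().split())

def kl_term_scores(topic_docs, collection_docs, alpha=0.01):
    """Per-term KL contribution (eq. 7), discriminative score (eq. 8) and
    representative score (eq. 9) for a topic LM versus the collection LM."""
    coll = Counter()
    for d in collection_docs:
        coll.update(term_counts(d))
    topic = Counter()
    for d in topic_docs:
        topic.update(term_counts(d))

    vocab = set(coll) | set(topic)
    coll_total = sum(coll.values())
    # Laplace-smoothed topic LM (eq. 6): add alpha to every count in the vocabulary.
    topic_total = sum(topic.values()) + alpha * len(vocab)

    scores = {}
    for t in vocab:
        p_e = (topic[t] + alpha) / topic_total   # p(t | Theta_e), eq. (6)
        # Collection LM by maximum likelihood (eq. 5); the fallback for terms
        # unseen in the collection sample is an assumption, not from the paper.
        p_a = (coll[t] or alpha) / coll_total
        log_ratio = math.log(p_e / p_a)
        scores[t] = {
            "kl": p_e * log_ratio,   # eq. (7): term contribution to the topic LM
            "kl_d": log_ratio,       # eq. (8): discriminative ("how specific") score
            "kl_r": p_e,             # eq. (9): representative ("how general") score
        }
    return scores
```

A query for a user with high topic familiarity would then be expanded with the top Q terms ranked by "kl_d", and a query for a user with low familiarity and/or interest with the top Q terms ranked by "kl_r".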
This list of motivating terms and phrases was then ranked according to each term's contribution to the topic model, which was formed from the top N ranked documents for each query. This approach is similar to that outlined in Section 2.4, except that only the motivating terms are ranked using the KL score (see equation 7). The top ranked motivating terms for each topic were then used to expand the original query. The approach we adopted is outlined below:

1. Form a topic LM with the top N ranked documents.
2. Smooth the topic LM with the reference collection using Laplace smoothing.
3. For each term t′ in the predefined motivating-term list, calculate the KL score (see equation 7).
4. Rank all terms t′ with respect to the KL score (highest to lowest).
5. At this stage two strategies were implemented:
   – Expand the original query with those terms with a positive contribution, i.e. KL(t′) > 0.
   – Expand the query with the top Q ranked terms for each topic.

We now illustrate some examples of an original query being expanded with motivating terms. Figure 1 gives a summarised description of two topics from this year's HARD track. For the first topic (number 322), the original query submitted to the IR system was "International Art Crime". A topic model was formed from the top 10 ranked documents, and the list of motivating terms was then ranked based on their contribution to the topic. Table 2 presents the top three ranked motivating terms for this topic; each term could be considered related to the subject of crime. We believe that by expanding the original query with these terms we push up those documents that may be of more interest to the user, thus providing a higher likelihood of relevance. For a different topic (number 336), the title query submitted was "Black Bear Attacks". The top three ranked motivating terms (see Table 2) were "wild", "stirring" and "dangerous"; all three could be associated with descriptions of aggressive animal behaviour or characteristics.

Fig. 1. A summarised description of TREC topics 322 and 336.

  <num> Number: 322
  <title> International Art Crime
  <narr> Narrative: A relevant document is any report that identifies an instance of fraud or embezzlement in the international buying or selling of art objects....

  <num> Number: 336
  <title> Black Bear Attacks
  <narr> Narrative: It has been reported that food or cosmetics sometimes attract hungry black bears, causing them to viciously attack humans....

Table 2. Top ranked motivating terms for TREC topic numbers 322 and 336

  Topic 322                  Topic 336
  Term         KL score      Term        KL score
  dangerous    0.00175       wild        0.0018
  suspicious   0.0012        stirring    0.0014
  significant  0.0004        dangerous   0.00068

3 Evaluation Results

In this section, we discuss the results from the official runs submitted for HARD.

3.1 Submitted Runs

A summary of the runs submitted for HARD can be found in Table 3. STRA1 was our baseline submission, which used the Okapi retrieval method [6]. All other submissions were compared against this baseline. For each submitted run, we fixed the parameters to provide a fair comparison across the different techniques. However, after the official results were released, we re-examined a number of new runs over a wider range of parameter settings; the results of those runs will be released as a technical report once the analysis has been completed. For query expansion using Motivating terms, two runs were submitted; an illustrative sketch of the two expansion strategies follows.
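Reusing kl_term_scores from the earlier sketch, the function below ranks a predefined motivating-term list by its KL contribution and appends either the top Q terms or all positively contributing terms to the original query. The function name, the commented-out usage and the placeholder document collections are illustrative assumptions, not the submitted implementation.

```python
def expand_with_motivating_terms(query, motivating_terms, topic_docs,
                                 collection_docs, strategy="top_q", q=6):
    """Rank a predefined motivating-term list by KL contribution (eq. 7) and
    append the selected terms to the original query."""
    scores = kl_term_scores(topic_docs, collection_docs)  # from the earlier sketch
    ranked = sorted(((t, scores[t]["kl"]) for t in motivating_terms if t in scores),
                    key=lambda pair: pair[1], reverse=True)

    if strategy == "top_q":                  # as in run STRAxmta (Q = 6)
        selected = [t for t, _ in ranked[:q]]
    else:                                    # as in run STRAxmtg: all terms with KL > 0
        selected = [t for t, score in ranked if score > 0]

    return query + " " + " ".join(selected)

# Hypothetical usage for topic 322 ("International Art Crime"), with placeholder
# document collections:
# expanded = expand_with_motivating_terms("international art crime",
#                                         motivating_term_list,
#                                         top_10_baseline_docs, collection_docs)
```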
As described in Section 2.1, the assessor for each topic was placed into one of four groups depending on their responses in the clarification form (see Table 1). We would posit that the performance of both runs would be better for those topics placed in groups G3-G4, whose assessors stated little interest in reading about the topic, than for the other two groups. To form the topic model prior to ranking the motivating terms, the top ten (N = 10) documents ranked by the baseline run were used. For the first submission (STRAxmta), each original query was expanded using the same number of terms (the top Q = 6 ranked terms). For the second run (STRAxmtg), the query was expanded with all top-ranked motivating terms that recorded a KL score greater than zero, i.e. KL(t′) > 0.

For the Pseudo-Relevance Feedback submission (STRAxprfb), the top N documents were also used to re-rank the first 1000 ranked documents. For consistency in our comparisons with the other approaches, we again fixed N at 10.

For comparing the Discriminative and Representative queries (STRAxqedt and STRAxqert respectively), we submitted one run each that expanded the original query with the top six ranked Discriminative or Representative terms. For both runs, we would expect improved performance for groups G1 and G3 (those assessors familiar with the topic) using Discriminative queries, and for groups G2 and G4 using Representative queries. Such a result would indicate that Representative queries rank general overview documents higher for those users with low familiarity, while Discriminative terms push up documents very specific to the topic. However, by submitting both sets of queries for all users, we can also examine the effect of using Discriminative queries for users with low familiarity, and vice versa. Again, N was set to 10 for ranking the topic terms, and the original query was expanded with the top 6 Representative or Discriminative terms.

We also submitted two runs that combined the document RSV values estimated during the baseline run with their readability scores. For submission STRAxreada, the same value of the weight α was used for all groups; this helps evaluate the effect of using the readability score across all groups. For the second run, STRAxreadg, we varied α, giving more weight to readability for those groups with low topic familiarity and interest.

3.2 Results

Table 4 provides an overview of the performance of each of the official submissions. In the table, we also include the proportion of topics where there was an increase over the baseline R-precision (a success), as well as the proportion of topics where an approach harmed the baseline R-precision (a fail).