Information Retrieval
James Allan, University of Massachusetts Amherst
Question Answering
CMPSCI 646, Fall 2007
All slides copyright © James Allan

Question answering motivation
• IR typically retrieves or works with documents
  – Find documents that are relevant
  – Group documents on the same topic
• People often want a sentence fragment or phrase as the answer to their question
  – Who was the first man to set foot on the moon?
  – What is the moon made of?
  – How many members are in the U.S. Congress?
  – What is the dark side of the moon?
• Move IR from document retrieval to answer retrieval
  – Document retrieval is still valuable
  – Extends the breadth of active IR research

Some TREC History
• QA began in TREC-8 ('99) and was similar in 2000
• First focused on "factoid" questions from an unrestricted domain
  – Now includes other classes of questions (definitions, lists, …)
• Run against a large collection of newswire
• Guaranteed that an answer exists in the collection
• Return a short text passage that contains and supports the answer
  – 250- or 50-byte passages
• Return 5 "answers" (passages) ranked by chance of having the answer
• Evaluation based on mean reciprocal rank of the first correct answer (see the code sketch after the Main task slide below)

Judgment issues
• Correctness of an answer is not always obvious
• Applied several rules to simplify the problem
• Lists of possible answers ("answer stuffing")
  – Not considered correct even if the correct answer is in there
• Answer had to be "responsive"
  – If "$500" was the correct answer, then "500" was incorrect
  – If "5.5 billion" was correct, then "5 5 billion" was not
• Ambiguous references refer to the famous one
  – "What is the height of the Matterhorn?" means the one in the Alps
  – "What is the height of the Matterhorn at Disneyland?" is the other

Main task (TREC 2002)
• 500 questions
  – No "definition" questions (needed a pilot study first)
  – No answers required (49 of 500 ended up with no answer)
  – Taken from MSNsearch and AskJeeves logs donated in 2001
  – Some spelling errors in questions corrected, but not all
    • When to stop: Is a misplaced apostrophe a spelling error?
• Requirements on answers
  – Precisely one exact answer required (not five like before)
  – System must indicate its confidence in the answer
  – Could optionally submit a justification string
• Evaluation is confidence-weighted average precision
  – Rank answers to all questions by confidence
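To make the two evaluation measures above concrete, here is a minimal sketch (not the official TREC scoring code) of mean reciprocal rank over ranked passage lists and of the confidence-weighted score used for the exact-answer task; it assumes correctness judgments are already available as booleans.

```python
def mean_reciprocal_rank(ranked_judgments):
    """MRR over questions: 1/rank of the first correct passage in each
    ranked list (up to 5 passages), 0 if none is correct."""
    total = 0.0
    for judgments in ranked_judgments:
        for rank, correct in enumerate(judgments, start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(ranked_judgments)


def confidence_weighted_score(answers):
    """TREC 2002-style confidence-weighted score.

    `answers` holds one (confidence, correct) pair per question; answers are
    sorted by confidence, and the score is the average, over positions i, of
    the fraction of correct answers among the i most confident ones."""
    ordered = sorted(answers, key=lambda a: a[0], reverse=True)
    num_correct = 0
    total = 0.0
    for i, (_, correct) in enumerate(ordered, start=1):
        if correct:
            num_correct += 1
        total += num_correct / i
    return total / len(ordered)


# Three toy questions for each measure
print(mean_reciprocal_rank([[False, True], [True], [False, False]]))        # 0.5
print(confidence_weighted_score([(0.9, True), (0.7, False), (0.4, True)]))  # ~0.72
```

Note how the confidence-weighted score rewards systems that put the answers they are most sure of at the top: a wrong answer placed first drags down every later position.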
TREC 2003 QA tasks
• Main task ("factoid" question answering)
  – 413 questions posed against the AQUAINT corpus
  – 54 runs from 25 groups (these runs also did the next two types)
  – Scored by the fraction of responses that were correct (accuracy)
• List task
  – 37 questions with no specification of how many answers are in the list
    • List the names of chewing gums
    • What Chinese provinces have a McDonald's restaurant?
  – Scored by instance recall/precision and F1 measure
• Definition task
  – 50 questions
  – Facet-based recall measure, length-based precision measure
• Passages task
  – 250-byte extract containing the answer, or NIL if none exists
  – 21 runs from 11 groups

General QA approach
• Close to traditional IR
• Pipeline (diagram on slide): Question → Determine question type → Find candidate passages in the text collection → Extract possible answers → Rank answers → Answer(s)

Key points for success
• Good passage retrieval
  – QA included evaluation specifically on passage retrieval, too
• Recognizing the question type is critical
  – Requires the ability to recognize those entities
• Some sample entities that a system might find
  – person, organization, location, time, date, money, percent
  – duration, frequency, age, number, decimal, ordinal, equation
  – weight, length, temperature, angle, area, capacity, speed, rate
  – product, software, address, email, phone, fax, telex, www
  – subtypes (company, government agency, school)
• Better performing systems almost always have better entity recognizers and larger numbers of entity types

Passage retrieval
• Not every system depends on this, but most do
• Given a query, find passages likely to contain the answer
• Most successful approaches use question patterns to find alternative ways to phrase things
  – To greatly increase recall
• Start with a question and a known answer
  – When was Bill Clinton elected President? 1992
• Look for all occurrences of that answer and the declarative form of the question throughout the text
  – Bill Clinton was elected president in 1992
  – The election was won by Bill Clinton in 1992
  – Clinton defeated Bush in 1992
  – Clinton won the electoral college in 1992
• Extract patterns that occur frequently (see the first code sketch at the end of this section)
• Now more likely to be able to answer similar questions
  – When did George Bush become president?

Query expansion?
• Question expansion
  – Process that adds related words to a query
  – Improves recall
  – Finds relevant documents that use slightly different vocabulary
• Seems appropriate here, and it does work
• Difficulty is the need for answer justification

Putting those all together
• Want to estimate P(correct | Q, A)
• They did this with a mixture model
• Easy to look up values in tables built from training data (see the second code sketch at the end of this section)

BBN's use of the Web (TREC 2002 and 2003)
• Several systems used the Web to help
  – Huge source of text that might answer the question
• BBN formed two queries
  – One rewrites the question into a declarative sentence
  – Another just uses the content words
• Mine the returned snippets (rather than full pages, for efficiency) for candidate answers
  – Must be of the correct type
• Select the best answer (next slide)
• To get a justification, find a TREC document that contains the selected answer

Using the Web (cont.)
• First approach just uses Web results and the question type
• Second approach boosts scores of candidates that were also retrieved by the non-Web approach in the TREC corpus
  – P(correct | F, in-trec)
  – Clear from the training data that having the answer in the TREC corpus provides useful information (table on slide contrasts in-trec true vs. in-trec false)
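Here is a minimal, hypothetical sketch of the pattern-extraction idea from the Passage retrieval slide: given seed question/answer pairs, find sentences that contain the answer plus the question's focus term, abstract both away, and keep the surface patterns that occur most often. The seed data, placeholder tokens, and threshold are illustrative and not any group's actual implementation.

```python
import re
from collections import Counter

# Seed (focus term, known answer) pairs, e.g. from
# "When was Bill Clinton elected President?" -> 1992
seeds = [("Bill Clinton", "1992"), ("George Bush", "1988")]

corpus = [
    "Bill Clinton was elected president in 1992.",
    "The election was won by Bill Clinton in 1992.",
    "Clinton defeated Bush in 1992.",
    "George Bush was elected president in 1988.",
]

def extract_patterns(seed_pairs, sentences):
    """Replace the focus term and the answer with placeholders and count
    the resulting surface strings; frequent ones become reusable patterns."""
    counts = Counter()
    for focus, answer in seed_pairs:
        for sent in sentences:
            if focus in sent and answer in sent:
                pattern = sent.replace(focus, "<FOCUS>").replace(answer, "<ANSWER>")
                pattern = re.sub(r"\s+", " ", pattern).strip(" .")  # normalize
                counts[pattern] += 1
    return counts

patterns = extract_patterns(seeds, corpus)
for pattern, freq in patterns.most_common():
    if freq >= 2:  # keep patterns that occur at least twice
        print(freq, pattern)
# "<FOCUS> was elected president in <ANSWER>" survives and can later be matched
# against new text to answer "When did George Bush become president?"
```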
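And a second, equally hypothetical sketch of the kind of scoring the last two slides describe: each candidate answer gets an estimate of P(correct) by looking up per-feature values in tables built from training data and mixing them, with one feature recording whether the candidate was also found in the TREC corpus. The feature names, table values, and mixture weights are made up for illustration; BBN's actual mixture model is not specified here.

```python
# Toy lookup tables: P(correct | feature value), as if estimated from training data.
prob_tables = {
    "qtype_matches_answer_type": {True: 0.55, False: 0.05},
    "web_frequency_bucket":      {"high": 0.40, "medium": 0.20, "low": 0.08},
    "in_trec":                   {True: 0.45, False: 0.15},
}

# Mixture weights for combining the per-feature estimates (sum to 1).
weights = {
    "qtype_matches_answer_type": 0.4,
    "web_frequency_bucket":      0.3,
    "in_trec":                   0.3,
}

def score_candidate(features):
    """Mixture-style estimate of P(correct | features) by table lookup."""
    return sum(weights[name] * prob_tables[name][value]
               for name, value in features.items())

candidates = [
    {"answer": "1992", "features": {"qtype_matches_answer_type": True,
                                    "web_frequency_bucket": "high",
                                    "in_trec": True}},
    {"answer": "1996", "features": {"qtype_matches_answer_type": True,
                                    "web_frequency_bucket": "low",
                                    "in_trec": False}},
]

best = max(candidates, key=lambda c: score_candidate(c["features"]))
print(best["answer"], round(score_candidate(best["features"]), 3))  # 1992 0.475
```

The in_trec feature implements the boost described above: a candidate mined from the Web scores noticeably higher when it can also be justified in the TREC corpus.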
How well did it all work?
• Decent performance (middle of the pack)
• Confidence scores are fairly good
  – Upper bound shows the impact of perfect estimates
• Using the Web made a huge difference
• Validating in the TREC corpus helped some

Some systems are more complex
• U. Waterloo (Canada) incorporates much more (TREC 2002)
  – Stores known facts in a database
  – Includes a corpus of trivia
  – Uses the Web to find candidate answers
• Provides numerous sources of evidence
  – Early answers require justification in the corpus
• Combines candidate answers

Waterloo's AnswerDB
• Collection of tables with information on a bunch of topics (example tables shown on slide)

Use of the Cyc knowledge base
• Used for answer "sanity" checking
• Have the system generate an answer and ask Cyc if the answer seems reasonable
• If the answer is within ±10% of Cyc's best guess, then it is "sane"
• Only helped once
  – What is the population of Maryland?
  – Answer: 50,000
  – Justification: "Maryland's population is 50,000 and growing rapidly."
  – Valid on the surface, except that it had to do with something called nutria (a rodent raised for its fur), not people
  – Cyc knew the answer was about 5.1 million, so the second-best (though lower-ranked) answer was accepted because it was "sane"
• Follow-up work has made better use of Cyc
  – Didn't help at all in TREC 2003, though

Top performing systems at TREC 2002
• Chart on slide shows the impact of confidence weighting

Ability of systems to estimate confidence
• Chart on slide: bounds with all right answers first vs. all wrong answers first [Voorhees, TREC 2002]

Definition task (TREC 2003)
• Sample questions
  – Who is Colin Powell?
  – What is mold?
• Drawn from search engine logs, so they're "realistic"
  – 50 questions
  – 30 had a "person" as the target (Vlad the Impaler, Ben Hur)
  – 10 had an organization (Freddie Mac, Bausch & Lomb)
  – 10 had something else (golden parachute, feng shui, TB)
• The answer to a definition has an implicit context
  – Adult, native speaker of English, "average" reader of US news
  – Has come across a term they want more information about
  – Has some basic ideas already (e.g., Grant was a president)
  – Not looking for esoteric details

Judging definitions
• Phase one: creating truth
  – Assessor created a list of information "nuggets"
  – Used their own question research
  – Combined with judgments of submitted answers
  – Vital nuggets (those that must appear) were selected
• Phase two: judging
  – Look at each system response
  – Note where each nugget appeared
  – If a nugget is returned more than once, only one instance is counted

Example judging
• What is a golden parachute? (judged example shown on slide)

Results for definitions
• Table on slide shows results of definitions for β=5
• Also shows what different values of β do (see the scoring sketch below)
• Note how well the sentence baseline does
  – Return all sentences that mention the target (e.g., "golden parachute")
  – But reduce it slightly by eliminating sentences that overlap too much
  – Provided by BBN
• Does best when recall is heavily weighted
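A rough sketch of the nugget-style scoring behind these tables, assuming the TREC 2003 conventions: recall is computed over vital nuggets only, precision is a length-based allowance of 100 non-whitespace characters per matched nugget (vital or okay), and the two are combined with an F measure that weights recall β times as much as precision (β=5 in the official results). Treat the constants as assumptions rather than a faithful re-implementation of the evaluation script.

```python
def definition_f_score(vital_matched, vital_total, okay_matched,
                       response_length, beta=5.0, allowance_per_nugget=100):
    """Nugget-style F measure for a single definition question.

    vital_matched / vital_total: vital nuggets returned vs. in the answer key
    okay_matched: non-vital ("okay") nuggets returned
    response_length: length of the system response in non-whitespace characters
    """
    recall = vital_matched / vital_total if vital_total else 0.0

    # Length-based precision: full credit up to an allowance proportional to
    # the number of matched nuggets, then linearly penalized for verbosity.
    allowance = allowance_per_nugget * (vital_matched + okay_matched)
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length

    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Example: 3 of 4 vital nuggets and 2 okay nuggets in an 800-character response
print(round(definition_f_score(3, 4, 2, 800), 3))  # ~0.744
# With beta=5, recall dominates; the same nuggets in a shorter response
# would score only slightly higher (~0.757).
```

This also explains why the sentence baseline does so well at β=5: returning every sentence that mentions the target costs precision, but the heavy weight on recall makes that a good trade.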
BBN's results
• Did okay except for about 10 questions
• Several were the result of a faulty assumption about the target
  – What is Ph in Biology?
    • Assumed "Ph in Biology" was an object
  – Who is Akbar the Great?
    • Assumed "Great" was his last name
• Some errors caused by redundancy checking
  – Ari Fleischer, Dole's former spokesman who now works for Bush
  – Ari Fleischer, a Bush spokesman
  – The second was judged redundant because of the previous kernel

Final scores of systems (TREC 2003)
• Three types: factoid, list, and definition
• Final score is a linear combination: ½ factoid score + ¼ list score + ¼ definition score
• Doesn't match the balance of questions
  – 413 factoid
  – 37 list
  – 50 definitions
• Reflects a desire to force work on lists and definitions
  – But to keep factoid questions important

Scores of top 15 systems (TREC 2003)
• Chart shown on slide

What about that top-performing system?
• LCC (Language Computer Corporation) does a consistently great job at this task
• Very complex system with lots of AI-like technology
  – Attempts to prove candidate answers from text
  – Lots of feedback loops
  – Lots of sanity checking that can reject answers or require additional checking
• Attempts to replicate the results have failed
  – The system is so complex it's hard to know where to start
  – LCC is a company and probably isn't telling us everything
• Until their high-quality results are understood, they remain an outlier (albeit a really good outlier)

Summary
• Question answering is a hot area right now
• Has been explored numerous times in the past
  – Perhaps the time was ripe?
• So far the focus has been on simpler questions
  – "Factoid" questions, lists, definitions
  – TREC tries to make things more difficult each year
• Part of the AQUAINT program looking at the problem
  – Much more complex types of questions being explored in the research program
    • Dialogue situations, cross-language, against rapidly changing data (so the answer might change)
  – Some efforts require heavy knowledge bases (e.g., Cyc)
• Exciting and active area of research at the moment