Computational Linguistics I: Parts-of-Speech Tagging - Lecture 4 - Prof. Saif Mohammad, Study notes of Computer Science

Lecture notes from a Computational Linguistics I course, focusing on parts-of-speech tagging. Covers the basics of parts of speech, closed-class and open-class POS, tagsets, reasons for performing POS tagging, and rule-based and transformation-based POS tagging methods.

CMSC 723/LING 723 Computational Linguistics I
Parts-of-Speech Tagging
Lecture 4, September 24th, 2008

Parts-of-Speech
• Schachter (1985) provides more details
• These classes occur in almost every language
• Defined primarily in terms of syntactic and morphological criteria (not semantic):
  • Syntactic distribution: what occurs nearby?
  • Morphological properties: what affixes does it take?
  • Syntactic function: what does it act as?
• Semantic cohesion is incidental, not guaranteed [nouns: people/places, adjectives: properties]
• Note: Think back to the comic (the "verb" is actually a noun)

Schachter, P. (1985). Parts-of-speech systems. In Language Typology and Syntactic Description.

Parts-of-Speech
• Two broad categories
• Closed Class:
  • Relatively fixed membership
  • Generally function words (of, to, as, since, ...)
  • Short and used primarily for structuring
• Open Class:
  • Frequent neologisms (borrowed/coined)

Closed Class POS
• Idiosyncratic
• Not all languages have the same classes
• English:
  • Prepositions: on, under, over, near, ...
  • Conjunctions: and, but, or, if, ...
  • Particles: up, down, off, in, ...
  • Auxiliaries: can, may, should, are, ...
  • Determiners: a, an, the, ...
  • Pronouns: she, who, I, ...
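The closed/open split can be made concrete with a simple lookup table of function words. A minimal sketch, where the word lists are illustrative samples from the slide, not exhaustive inventories:

```python
# Sketch: classify tokens as closed-class (function words) or open-class.
# The word lists are illustrative samples only, not complete inventories.
CLOSED_CLASS = {
    "preposition": {"on", "under", "over", "near", "by", "at", "from"},
    "conjunction": {"and", "but", "or", "if"},
    "particle":    {"up", "down", "off", "in"},
    "auxiliary":   {"can", "may", "should", "are"},
    "determiner":  {"a", "an", "the"},
    "pronoun":     {"she", "who", "i", "it", "they"},
}

def lookup_class(word):
    """Return the closed-class label for a word, or 'open-class' otherwise."""
    w = word.lower()
    for label, members in CLOSED_CLASS.items():
        if w in members:
            return label
    # Nouns, verbs, adjectives, adverbs, and neologisms fall through here.
    return "open-class"

print(lookup_class("The"))     # determiner
print(lookup_class("kissed"))  # open-class
```

The fixed membership of the closed classes is exactly what makes a static table like this workable; no such table could keep up with open-class neologisms.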
Closed Class POS: Particles vs Prepositions
• He came by the office in a hurry (by = preposition)
• He came by his fortune honestly (by = particle)
• We ran up the phone bill (up = particle)
• We ran up the small hill (up = preposition)
• He lived down the block (down = preposition)
• He never lived down the nicknames (down = particle)
Very difficult to differentiate!

Closed Class POS: Prepositions & Particles from CELEX (top of the frequency table)

of 540,085 | in 331,235 | for 142,421 | to 125,691 | with 124,965 | on 109,129
at 100,169 | by 77,794 | from 74,843 | about 38,428 | than 20,210 | over 18,071
through 14,964 | after 13,670 | between 13,275 | under 9,525 | per 6,515 | among 5,090
within 5,030 | towards 4,700 | above 3,056 | near 2,026 | off 1,695 | past 1,575
(the table continues down to rare items such as worth, toward, plus, till, amongst, via, amid, underneath, versus, amidst, sans, circa, pace, nigh, o'er, ere, midst, thru, and vice)

Closed Class POS: Pronouns (personal, possessive & wh-) from CELEX (top of the frequency table)

it 199,920 | I 198,139 | he 158,366 | you 128,688 | his 99,820 | they 88,416
this 84,927 | that 82,603 | she 73,966 | her 69,004 | we 64,846 | all 61,767
which 61,399 | their 51,922 | what 50,116 | my 46,791 | him 45,024 | me 43,071
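Frequency tables like the CELEX counts above can be approximated on any corpus with a simple counter. A sketch on an invented mini-corpus (the sentence and the small preposition list are stand-ins for a real corpus and lexicon):

```python
from collections import Counter

# Hypothetical mini-corpus; the CELEX counts above come from a much
# larger lexical database.
corpus = "he came by the office in a hurry and we ran up the hill by noon".split()

# A sample of the table's top prepositions (not exhaustive).
PREPOSITIONS = {"of", "in", "for", "to", "with", "on", "at", "by", "from"}

prep_counts = Counter(w for w in corpus if w in PREPOSITIONS)
for word, count in prep_counts.most_common():
    print(word, count)
```

Note that a raw word counter cannot settle the particle-vs-preposition question above: "by" gets one count here whichever class each occurrence belongs to.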
(the pronoun table continues with lower-frequency forms, down to rarities such as whosoever, wheresoever, and you-all)

Closed Class POS
• Modal verbs (part of the Auxiliaries class)

Open Class POS
• Nouns
• Verbs
• Adjectives
• Adverbs
• All languages have nouns and verbs, but may not have the other two

Open Class POS
• Adjectives
  • Words referring to properties/qualities
  • Not present in all languages (e.g., Korean)
• Adverbs
  • A semantic and formal potpourri
  • Usually modify verbs
  • Actually, John walked home extremely slowly yesterday

Tagsets
• Several English tagsets have been developed; they vary in the number of tags
  • Penn Treebank (45)
  • Brown Tagset (87)
• Language specific: simple morphology = more ambiguity = smaller tagset
• Size depends on language and purpose

Penn Treebank Tagset
• Developed at UPenn
• Culled from the Brown Tagset
• Leaves out some information, e.g., information one can get from the word itself or from the parse tree
• Applied to the Brown Corpus, WSJ, Switchboard

POS Tagging
"The process of assigning "one" POS or other lexical class marker to each word in a corpus" (Jurafsky & Martin)

the/DT girl/NN kissed/VBD the/DT boy/NN on/IN the/DT cheek/NN

Why do POS tagging?
• Corpus-based linguistic analysis & lexicography
• Information retrieval & question answering
• Automatic speech synthesis
• Word sense disambiguation
• Shallow syntactic parsing
• Machine translation

Why is it hard?
• Not really a lexical problem: it is a sequence labeling problem
• Treating it as a lexical problem runs us smack into the wall of ambiguity:
  • I thought that you ... (that: CS)
  • That day was nice (that: DT)
  • You can go that far (that: RB)

Rule-based POS Tagging (Klein & Simmons, 1963)
• One of the first rule-based taggers (a "grammar coder")
• Two-stage architecture:
  • Use a dictionary to tag function words directly OR find which tests to run in the second stage
  • Run the chosen handwritten tests to find the "right" candidate

Rule-based POS Tagging (Klein & Simmons, 1963)
• Tagset size: 30
• Fits in ~15,000 IBM 7090 machine words (gasp!)
• Dictionary: about 2,000 English words
• Tests: capitalization, suffixes, numerals
• Final answer: intersection of all test answers
• Evaluated manually on "several pages" of text; 90% accuracy (half via dictionary)

Klein, S. and Simmons, R. F. 1963. A Computational Approach to Grammatical Coding of English Words. J. ACM 10(3), 334-347.

Rule-based POS Tagging: Constraint Grammar Approach
• More recent rule-based method
• Similar two-stage architecture
• Vastly larger dictionaries and rulesets
• Most popular implementation: EngCG

Karlsson, F. et al. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text.
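Constraint-grammar rules disambiguate by pruning a token's candidate tag set based on context. A minimal sketch of one such constraint for "that", with the context checks heavily simplified and not in EngCG's actual rule syntax:

```python
# Sketch of an EngCG-style disambiguation constraint (simplified):
# decide whether 'that' is adverbial from the two following tokens' tags.
def adverbial_that(candidates, next1_tag, next2_tag):
    """Prune the candidate tag set for 'that' using right context."""
    if next1_tag in {"A", "ADV", "QUANT"} and next2_tag == "SENT-LIM":
        # e.g. "You can go that far." -> keep only the adverb reading
        return {t for t in candidates if t == "ADV"}
    # e.g. "I thought that you ..." -> eliminate the adverb reading
    return {t for t in candidates if t != "ADV"}

print(adverbial_that({"ADV", "DET", "CS"}, "A", "SENT-LIM"))  # {'ADV'}
print(adverbial_that({"ADV", "DET", "CS"}, "PRON", "VFIN"))
```

Real EngCG constraints also test left context (e.g. excluding certain preceding verb types), which this sketch omits.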
Rule-based POS Tagging: Sample EngCG Lexicon

Word | POS | Additional POS features
smaller | ADJ | COMPARATIVE
entire | ADJ | ABSOLUTE ATTRIBUTIVE
fast | ADV | SUPERLATIVE
that | DET | CENTRAL DEMONSTRATIVE SG
all | DET | PREDETERMINER SG/PL QUANTIFIER
dog's | N | GENITIVE SG
furniture | N | NOMINATIVE SG NOINDEFDETERMINER
one-third | NUM | SG
she | PRON | PERSONAL FEMININE NOMINATIVE SG3
show | V | PRESENT -SG3 VFIN
show | N | NOMINATIVE SG
shown | PCP2 | SVOO SVO SV
occurred | PCP2 | SV
occurred | V | PAST VFIN SV

Rule-based POS Tagging: Example

Sentence: Newman had originally practiced that ...

Overgenerated taggings:
• Newman: NEWMAN N NOM SG PROPER
• had: HAVE <SVO> V PAST VFIN / HAVE <SVO> PCP2
• originally: ORIGINAL ADV
• practiced: PRACTICE <SVO> <SV> V PAST VFIN / PRACTICE <SVO> <SV> PCP2
• that: ADV / PRON DEM SG / DET CENTRAL DEM SG / CS

One possible disambiguation constraint (the ADVERBIAL-THAT rule):

Given input: that
if
    (+1 A/ADV/QUANT);
    (+2 SENT-LIM);
    (NOT -1 SVOC/A);
then eliminate non-ADV tags
else eliminate ADV tag

Rule-based POS Tagging
• Accuracy about 96% (very good at the time)
• A lot of effort to write the rules and create the lexicon
• Probably not worth it today, given how easy it is to bootstrap stochastic methods
• Could try to learn rules automatically
• Moving on!

TBL Illustration: Training (on the painting analogy)
• Most common color: BLUE
• Initial step: apply the broadest transformation (paint everything BLUE). Error: 100%
• Repeatedly apply the transformation that most reduces error, e.g. "change B to G if touching A"
• Error falls with each step: 100% → 44% → 11% → 0%. Finished!
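The training loop illustrated by the painting analogy, greedily applying the error-minimizing transformation until nothing improves, can be sketched directly on tags. The sentence, gold tags, and candidate rules below are toy stand-ins for a real corpus and template instantiations:

```python
# Miniature TBL training sketch: greedily pick the rule that most
# reduces error against the gold tagging. Toy data, not a real corpus.
words = ["the", "race", "to", "race", "track"]
gold = ["DT", "NN", "TO", "VB", "NN"]

def apply_rule(tags, rule):
    """Rule (from_tag, to_tag, prev_tag): change from_tag to to_tag
    when the previous word is tagged prev_tag."""
    from_t, to_t, prev_t = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_t and out[i - 1] == prev_t:
            out[i] = to_t
    return out

def errors(tags):
    return sum(t != g for t, g in zip(tags, gold))

# Initial step: broadest transformation = most common tag per word
# ('race' is most often a noun).
tags = ["DT", "NN", "TO", "NN", "NN"]

rules = [("NN", "VB", "TO"), ("NN", "DT", "NN"), ("TO", "IN", "NN")]
while True:
    best = min(rules, key=lambda r: errors(apply_rule(tags, r)))
    if errors(apply_rule(tags, best)) >= errors(tags):
        break  # improvement below threshold: stop
    tags = apply_rule(tags, best)

print(tags)  # ['DT', 'NN', 'TO', 'VB', 'NN']
```

Here one pass suffices: the rule "change NN to VB after TO" fixes the second "race" and drives the error to zero, after which no rule improves further.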
TBL Illustration: Testing
• Ordered transformations learned in training:
  • Initial: make all B
  • change B to G if touching A
  • change B to R if shape is ...
• Apply them, in order, to the test canvas

TBL Painting Algorithm

function TBL-Paint (given: empty canvas with goal painting)
begin
  apply initial transformation to canvas
  repeat
    try all color transformation rules
    find the transformation rule that would yield the most improved painting
    apply that color transformation rule to the canvas
  until improvement below some threshold
end

Now substitute: 'tag' for 'color', 'corpus' for 'canvas', 'untagged' for 'empty', 'tagging' for 'painting'

TBL Tagging Algorithm

function TBL-Tag (given: untagged corpus with goal tagging)
begin
  apply initial transformation to corpus
  repeat
    try all tag transformation rules   ← Impossible!
    find the transformation rule that would yield the most improved tagging
    apply that tag transformation rule to the corpus
  until improvement below some threshold
end
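The "impossible" step becomes feasible once rules are restricted to instances of a finite set of templates (described next). Counting the instances of a single non-lexicalized template over a tiny illustrative tagset shows how small the search space is:

```python
from itertools import permutations

# Sketch: instantiate the non-lexicalized TBL template
#   "change tag t1 to tag t2 when w-1 is tagged t3"
# over a small illustrative tagset. Only these finitely many rule
# instances are ever tried during training.
tagset = ["NN", "VB", "TO", "IN", "RB"]

instances = [(t1, t2, t3)
             for t1, t2 in permutations(tagset, 2)  # t1 != t2
             for t3 in tagset]

print(len(instances))  # 20 (t1, t2) pairs x 5 contexts = 100
```

Even with the full Penn Treebank tagset (45 tags) and all the templates, the rule space stays in the millions at most, which a greedy learner can search exhaustively on each pass.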
TBL Templates

Non-lexicalized — change tag t1 to tag t2 when:
• w-1 (w+1) is tagged t3
• w-2 (w+2) is tagged t3
• w-1 is tagged t3 and w+1 is tagged t4
• w-1 is tagged t3 and w+2 is tagged t4

Lexicalized — change tag t1 to tag t2 when:
• w-1 (w+1) is foo
• w-2 (w+2) is bar
• w is foo and w-1 is bar
• w is foo, w-2 is bar and w+1 is baz

Only instances of these templates (and their combinations) are ever tried.

TBL Example Rules

He/PRP is/VBZ as/RB tall/JJ as/IN her/PRP$
→ Change from IN to RB if w+2 is "as"

He/PRP is/VBZ expected/VBN to/TO race/NN today/NN
→ Change from NN to VB if w-1 is tagged TO

HMM Teaser
• Supervised learning requires tagged data
• Wouldn't it be great if we could:
  • Learn from untagged data (unsupervised)
  • Get the best tag sequence
  • Also benefit from tagged data, if available
  • Achieve accuracies >95%
• We can! Read Ch. 6 and show up next week!
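The second example rule can be traced in code. The tagged sentence and the rule come from the slide; the apply function itself is a sketch:

```python
# Sketch: apply the TBL rule "change NN to VB if w-1 is tagged TO"
# to the slide's example sentence.
tagged = [("He", "PRP"), ("is", "VBZ"), ("expected", "VBN"),
          ("to", "TO"), ("race", "NN"), ("today", "NN")]

def change_nn_to_vb_after_to(pairs):
    out = list(pairs)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "NN" and out[i - 1][1] == "TO":
            out[i] = (word, "VB")  # race/NN -> race/VB
    return out

print(change_nn_to_vb_after_to(tagged))
```

Only "race" is retagged: "today" also carries NN, but its left neighbor is not tagged TO, so the rule's context fails there.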