
Statistics and Linguistic Applications

Hale

April 22, 2008

Proportion test

In Chapter 3 of his book, Baayen reports word frequencies from CELEX. These are based on a corpus sample of 18,580,121 tokens.

    word            frequency    relative frequency
    the             1093547      0.0588558
    president       2469         0.000132884
    hare            153          0.00000823
    harpsichord     15           0.00000081

Even though we know better, let's for the moment conceptualize the observation of a word like "president" as a SUCCESS and the observation of any other word as a FAILURE. Our probability model is thus a Binomial with parameter p = 0.000133. We have seen (e.g. Vasishth, p. 27) that when the expected counts np and nq are not too close to zero, the Binomial distribution is closely approximated by a Normal distribution with mean np and variance npq, where the failure probability q = (1 - p). (A derivation of this Normal approximation to the Binomial is given on the MathWorld Binomial page.)

We can now view other corpus samples as results from a kind of language-production experiment. From such a sample we can compute a statistic, the sample proportion, and look up how probable this statistic's value is under the assumed parameterization. Is the 1-million word Brown corpus, with 382 attestations of "president", a wacky or a run-of-the-mill sample? Across many, many corpora, what fraction would attest "president" that many times if the parameter were really p = 0.000133? Let us compute the standardized score and make a judgment. A standardized score looks like this:

    Z = \frac{x - \mu}{\sigma}

where x is the value to be standardized, \sigma is the (population) standard deviation, and \mu the (population) mean. In this case x is our observed proportion, \hat{p}, and we know \sigma and \mu in virtue of approximating the Binomial with the Normal. The proportional standard deviation \sigma_p of the Binomial is \sqrt{pq/n}. This kind of decreasing proportional variability as sample size goes up is suggested in the diagram labeled "Aha" on page 17 in section 2.4 of the Vasishth notes.

    Z = \frac{\hat{p} - p}{\sqrt{pq/n}}

Multiplying by a form of 1 translates the proportion-based Z score into one based on an absolute number of successes. Define the success count x in terms of the proportion of successes such that \hat{p} = x/n.

    Z = \frac{\hat{p} - p}{\sqrt{pq/n}}
      = \frac{\hat{p} - p}{\sqrt{pq/n}} \times \frac{n}{n}
      = \frac{x - np}{\sqrt{pq/n} \cdot n}
      = \frac{x - np}{(\sqrt{pq}/\sqrt{n}) \cdot (\sqrt{n} \cdot \sqrt{n})}
      = \frac{x - np}{\sqrt{pq} \cdot \sqrt{n}}
      = \frac{x - np}{\sqrt{npq}}    (1)

The denominator now shows exactly the standard deviation of the Normal approximation to the Binomial. In our Brown corpus example,

    Z = \frac{382 - (0.000133 \times 1{,}000{,}000)}{\sqrt{1{,}000{,}000 \times 0.000133 \times (1 - 0.000133)}}

    > success <- 0.000133
    > failure <- 1 - success
    > n <- 1e+06
    > (382 - (success * n))/sqrt(n * success * failure)
    [1] 21.59247

Wow! A Z-score of 21.59! That's twenty-one standard deviations above the mean. What's the probability of getting a sample that extreme or more? Let's "look it up in our table."

    > 1 - pnorm(21.59)
    [1] 0

This sample is highly unlikely under the null hypothesis that "president" appears 0.000133 of the time. Baayen dryly remarks, "The resulting probability is indistinguishable from zero given machine precision and provides ample reason for surprise." He gets to this same conclusion via a different route, calculating the Binomial probabilities directly with pbinom. In fact, the Normal was originally introduced by de Moivre as a way of approximating the Binomial (computers were expensive in 1733).
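Baayen's direct Binomial route can be placed alongside the Normal approximation used above. A minimal sketch of both upper-tail calculations (the exact calls are my reconstruction, not Baayen's code):

    ## exact Binomial: probability of 382 or more attestations of
    ## "president" in a million-token sample
    p <- 0.000133
    n <- 1e+06
    pbinom(381, size = n, prob = p, lower.tail = FALSE)
    ## the Normal approximation to the same upper-tail probability
    pnorm(382, mean = n * p, sd = sqrt(n * p * (1 - p)), lower.tail = FALSE)

Both calls return values indistinguishable from zero at machine precision, just as the quote says.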
Moreover, the Normal approximation leads us far beyond mere success/failure proportions, as we shall see.

The multinomial

In the "president" case there were only two possible outcomes, identified with SUCCESS and FAILURE to produce "president", respectively. In a particular corpus sample, x = n\hat{p} is often called the observed frequency of success, as opposed to failure. The expected frequency (of success) is np.

The more general Multinomial distribution describes k different categories of events A_1, A_2, \ldots, A_k with probabilities p_1, p_2, \ldots, p_k. But the notions of "observed" and "expected" are the same. If we draw a sample of size n from a Multinomial population, the observed frequencies for the events A_1, \ldots, A_k can be described by random variables X_1, \ldots, X_k, whose specific values x_1, x_2, \ldots, x_k would be the observed frequencies in the sample. The expected frequencies would just be np_1, np_2, \ldots, np_k.

    Event                 A_1     A_2     ...    A_k
    Observed Frequency    x_1     x_2     ...    x_k
    Expected Frequency    np_1    np_2    ...    np_k

    Table 1: Multinomial assigns probability to k kinds of events

As an example of a Multinomial, consider bags of M&Ms. How many of each color (blue, brown, green, ...) are there in a bag of 30? The count of one color affects the others: if 28 are red, then none of the other colors can claim more than 2 of the candies for their own color. (An R sketch of this distribution appears at the end of the next subsection.)

    P(X_1 = x_1, \ldots, X_k = x_k) = \binom{n}{x_1} \binom{n - x_1}{x_2} \cdots \binom{n - x_1 - x_2 - \cdots - x_{k-1}}{x_k} \, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}

In the one-proportion Z test statistic (equation 1), there are exactly two outcomes. Viewed as a special case of the Multinomial, we might think of them as just the first two of potentially many more outcomes.

The square of the Z-score

What if we wanted to generalize beyond SUCCESS and FAILURE? Consider the square of the Z score,

    Z^2 = \frac{(x - np)^2}{npq}

Write x_1 = x and x_2 = n - x for the success and failure counts, and p_1 = p and p_2 = q for their probabilities. Since (x_2 - np_2)^2 = (n - x - nq)^2 = (x - np)^2 and 1/(np) + 1/(nq) = 1/(npq), the squared Z splits into one term per outcome:

    Z^2 = \frac{(x_1 - np_1)^2}{np_1} + \frac{(x_2 - np_2)^2}{np_2}    (2)

Nothing on the right-hand side is special about having exactly two categories. Summing one such term per category over k kinds of events gives

    \chi^2 = \sum_{i=1}^{k} \frac{(x_i - np_i)^2}{np_i}    (3)

which for large n approximately follows a chi-square distribution with k - 1 degrees of freedom.
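Equation (2) can be checked numerically by reusing the Brown corpus "president" figures from the proportion test; a quick sketch (the variable names are mine):

    ## squared one-proportion Z versus the two-cell chi-square of equation (2)
    p <- 0.000133; q <- 1 - p
    n <- 1e+06;   x <- 382
    Z2   <- ((x - n * p) / sqrt(n * p * q))^2
    chi2 <- (x - n * p)^2 / (n * p) + ((n - x) - n * q)^2 / (n * q)
    c(Z2, chi2)  # both come out to about 466.23, i.e. 21.59247^2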
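And here is the promised Multinomial sketch for the M&M example. The color probabilities below are invented for illustration, since the text does not supply any:

    ## made-up color probabilities summing to 1 (purely illustrative)
    probs <- c(blue = 0.24, brown = 0.13, green = 0.16,
               orange = 0.20, red = 0.13, yellow = 0.14)
    ## Multinomial probability of one particular bag of 30 candies
    dmultinom(c(7, 4, 5, 6, 4, 4), size = 30, prob = probs)
    ## three simulated bags: a large count for one color forces small
    ## counts for the others, the negative dependence described above
    rmultinom(3, size = 30, prob = probs)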
Contingency Tables

From the perspective of a Multinomial over k different event types, everything is a 1 × k ("one-by-k") table. If we extend into the second dimension, we have an n_r × n_c contingency table. Frequently in linguistics we cross-classify attestations of a certain sound, word, etc. in two or more ways; these are contingency tables. Even though these data are arranged in a two-dimensional table, we can still ask whether the table as a whole has a large discrepancy as compared to some expected values. To do this, compute equation 3 over the n_r n_c cells in the table and compare the obtained χ² statistic to a chi-square distribution with the appropriate degrees of freedom:

    (n_r - 1)(n_c - 1)        if the expected frequencies can be computed without having to estimate population parameters from sample statistics

    (n_r - 1)(n_c - 1) - m    if the expected frequencies can be computed only by estimating m population parameters from sample statistics

One fascinating hypothesis is that the column variable is probabilistically independent of the row variable. Remember, if two random variables are independent, then their joint distribution is the product of their individual distributions.

    H_0: p_{ij} = p_{r_i} p_{c_j}
    H_1: the p_{ij} are not independent

For example, Cooper and Hale (2004) examined, ah, you know, disfluencies in the Switchboard corpus, a sample of spoken English. Looking at pairs of conjoined constituents affectionately known as "lobes", they tabulated whether or not each one contains any disfluency.

                               Lobe 2
                        Disfluent    Fluent
    Lobe 1  Disfluent      150         126
            (Expected)    (99.2)     (176.8)
            % of total    18.3%       15.3%

            Fluent         145         400
            (Expected)   (195.8)     (349.2)
            % of total    17.7%       48.7%

    N = 821, df = 1, χ² = 61.25, p < .001

    Table 2: Disfluency status of conjoined lobes, obtained with tgrep2. A significant association.

To work out what we expect under the null hypothesis, consider the marginals. Ignoring Lobe 2 for a moment, a proportion (150 + 126)/N of the observations have a disfluent Lobe 1; call this p_{1disf} = 0.336. Of course, avoiding disfluency is all there is to fluency, so p_{1fluent} = 1 - p_{1disf}. Likewise we have p_{2disf} = (145 + 150)/821 = 0.359. Under H_0, we should see N p_{1disf} p_{2disf} = 99.17 in the upper-left cell of the contingency table (these expected values are parenthesized for you in Table 2). But in fact the squared deviations divided by the expected frequencies add up to over sixty. It is highly improbable that Cooper and Hale would have observed this pattern if disfluency in the first constituent had no influence on disfluency in the second.

Arranging for R to calculate your chi-squared test of independence

Let's borrow an example from D.G. Altman's "Practical Statistics for Medical Research". As quoted in Dalgaard, these data concern daily caffeine consumption (in mg) among women giving birth. The women are classified by marital status.

    > caff.marital <- matrix(c(652, 1537, 598, 242, 36, 46, 38, 21, 218, 327, 106, 67),
    +     nrow = 3, byrow = T)
    > colnames(caff.marital) <- c("0", "1-150", "151-300", ">300")
    > rownames(caff.marital) <- c("Married", "Prev.married", "Single")
    > caff.marital
                   0 1-150 151-300 >300
    Married      652  1537     598  242
    Prev.married  36    46      38   21
    Single       218   327     106   67
    > chisq.test(caff.marital)

            Pearson's Chi-squared test

    data:  caff.marital
    X-squared = 51.6556, df = 6, p-value = 2.187e-09

Marital status and caffeine consumption are not independent! But in what ways do they deviate from independence? We can work this out using some handy information that chisq.test hands back.

    > cm <- chisq.test(caff.marital)
    > E <- cm$expected
    > O <- cm$observed
    > (O - E)^2/E
                          0    1-150   151-300      >300
    Married       4.1055981 1.612783 0.6874502 0.8858331
    Prev.married  0.3007537 7.815444 4.5713926 6.8171090
    Single       15.3563704 1.875645 7.0249243 0.6023355

The result shows the contribution of each cell to the overall χ². A lot of single mothers just don't consume any caffeine. The previously-married are shifted in the direction of greater consumption.
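The same machinery reproduces the disfluency analysis of Table 2. Here is a minimal sketch (the matrix construction is mine; correct = FALSE switches off R's default Yates continuity correction so that the uncorrected χ² = 61.25 reported in the table comes back):

    ## Table 2 as an R matrix (counts from the table, layout mine)
    lobes <- matrix(c(150, 126,
                      145, 400),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Lobe1 = c("Disfluent", "Fluent"),
                                    Lobe2 = c("Disfluent", "Fluent")))
    ## Pearson chi-square without the Yates correction: X-squared = 61.25, df = 1
    chisq.test(lobes, correct = FALSE)
    ## expected counts under independence match the parenthesized values in Table 2
    chisq.test(lobes, correct = FALSE)$expected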
Final Projects

A written report on your final project is due at 5pm on Friday, May 2nd. This document will be graded on the basis of statistical correctness and scientific quality. The Evaluation policy makes clear that the purpose of the final project is to encourage more statistically-sophisticated use of theoretical models or empirical studies in students' research.

    empirical study: detail and justify the analysis of an experiment to be run or observations to be collected.

    theoretical model: demonstrate or disprove a property of some stochastic theory in linguistics.

The best project reports have a clear structure. They concisely address points such as:

    1. what scientific question the project is about, and why it is interesting (to linguists)
    2. what method the project uses to make progress on the question, and why this method is appropriate
    3. the logic behind the application of the method to the problem at hand; how did you apply it?
    4. what were the results
    5. what are the implications of these results, and what is left unfinished or undecided

Make sure to write out your statistical reasoning. That is, write out your mathematical argument in greater detail than you would when writing for a journal or other scientific venue whose readership would be bored by repetition of arguments that have become standard.

Hand in a paper copy of your report to Hale's mailbox. An electronic copy would also be useful, but I have to have something on paper by the due date. Thanks for a great class!