
Notes on Goodness-of-Fit and Contingency Analysis, Study notes of Statistics

Topic 4 - Goodness-of-Fit and Contingency Analysis (2011) Material Type: Notes; Class: Statistics 2 - Intermediate; Subject: Statistics; University: Carleton University; Term: Forever 1989;

Typology: Study notes

2010/2011

Uploaded on 04/29/2011

rollercoaster-101

TOPIC 4: GOODNESS-OF-FIT TESTS & CONTINGENCY ANALYSIS

OBJECTIVES: This lecture introduces two tests. The first, the chi-squared Goodness-of-fit Test, allows you to test whether a sample is drawn from a particular, hypothesized distribution. The second, the chi-squared Contingency Analysis test, allows you to test for independence of observations. Together, these two tests help you determine whether the normal distribution and independence assumptions required for ANOVA are valid.

CONTENT:
1. Chi-squared Contingency Analysis
2. Chi-squared Goodness-of-fit
3. Common Errors
4. Definitions
5. Problems

ECON 2202, Topic 4 – © S. Dubey, 2010

1. CHI-SQUARED CONTINGENCY ANALYSIS

Contingency Analysis determines whether two variables are independent, using a Contingency Table (also called a cross-tabulation table). This type of table is always a two-way table, with r rows and c columns. The test compares observed frequencies, Oij, with expected (hypothesized) frequencies, Eij, for each cell of the table. To be valid, the test needs a sample of at least 30 observations in total, with each expected frequency at least 5 (Eij ≥ 5). If the test statistic is "too large", then the data don't fit the hypothesized distribution:

$$\chi^2_{(r-1)(c-1)} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

where Oij = observed and Eij = expected cell frequency for cell (i, j). There are r rows and c columns. Eij = (ith row total) × (jth column total) / N, where N = total sample size.

A sample size of 30 is sufficient as long as none of the rc expected frequencies is too small (each should be at least 5). Otherwise the test statistic may be inflated, which is a problem only if you reject the null hypothesis.

THEORY: The hypothesized distribution is in the null hypothesis (it is an equality).

EXAMPLE 4.1.
200 people in a company respond to a smoking questionnaire: 11 smokers care about smoking in the office while 29 smokers don't; 139 non-smokers care about smoking in the office while 21 don't. Are attitudes about office smoking independent of smoking status?

1. Givens: Establish the contingency table, determine c and r, and indicate the test statistic:

$$\chi^2_0 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

Observed frequencies, Oij:

| Oij | Smoker | Non-Smoker | Row Sum |
|---|---|---|---|
| Cares | 11 | 139 | 150 |
| Doesn't Care | 29 | 21 | 50 |
| Column Sum | 40 | 160 | 200 |

Expected frequencies, Eij:

| Eij | Smoker | Non-Smoker |
|---|---|---|
| Cares | 40*150/200 = 30 | 160*150/200 = 120 |
| Doesn't Care | 40*50/200 = 10 | 160*50/200 = 40 |

Cell contributions, (Eij - Oij)²/Eij:

| (Eij - Oij)²/Eij | Smoker | Non-Smoker |
|---|---|---|
| Cares | 12.0333 | 3.0083 |
| Doesn't Care | 36.1000 | 9.0250 |

c = 2, r = 2.

2. H0: observations are independent; Ha: observations are not independent. Here, H0: attitude is independent of smoking status; Ha: attitude is not independent of smoking status.

3. Draw the Chi-squared distribution, find the critical value χ²α,(r-1)(c-1), shade and label the rejection region, and state the decision rule: Reject H0 if χ²0 > χ²α,(r-1)(c-1). Here, draw a Chi-squared distribution with α = 0.05 in the right tail, to the right of the critical value χ²α,(r-1)(c-1) = χ²0.05,(2-1)(2-1) = χ²0.05,1 = 3.8415. Reject H0 if χ²0 > χ²α = 3.8415.

4. Calculate the test statistic:

$$\chi^2_0 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}} = 60.1667$$

5. Conclude: Reject H0 since χ²0 = 60.1667 > 3.8415 = χ²0.05,1. So attitude to office smoking is not independent of smoking status.

If you want to use the p-value approach, steps 3 to 5 are:

[...]

Note: If you want to use the p-value approach, the solution for steps 3 to 5 is:
3. Draw a Chi-squared distribution and shade α = 0.05 in the right tail. Reject H0 if p-value < α = 0.05.
4. p-value = P(χ² > χ²0) = P(χ² > 14.05) = CHIDIST(χ²0, df) = CHIDIST(14.05, 4) = 0.0071
5. Reject H0 since p-value = 0.0071 < 0.05.
Thus, students are not uniformly enrolled across programs.

EXAMPLE 4.3: A clothes designer hires S. Marty Pants, a statistical analyst, to determine the "best" (lowest cost) manufacturer among five that offer the same quality of work. S. Marty decides to use ANOVA, and tests and accepts the assumptions of equality of population variances and independence of observations. She also accepts normality for all but the last manufacturer. To conduct the final test of normality at the 5% level, assuming a mean of $140 and a standard deviation of $10, S. Marty uses the following data:

| Prices | Oi |
|---|---|
| Less than $120 | 6 |
| $120 to under $130 | 40 |
| $130 to under $140 | 75 |
| $140 to under $150 | 55 |
| $150 to under $160 | 20 |
| $160 and over | 4 |

SOLUTION:

1. Given: N = 200, and we assume X = price ~ N(μ, σ) where μ = 140, σ = 10, so
E1 = 200 * P(X < 120) = 200 * 0.0228 = 4.56 = E6 due to symmetry
E2 = 200 * P(120 < X < 130) = 200 * 0.1359 = 27.18 = E5 due to symmetry
E3 = 200 * P(130 < X < 140) = 200 * 0.3413 = 68.26 = E4 due to symmetry

where the probabilities come from the standard normal table:

$$P(X<120) = P\left(\frac{X-\mu}{\sigma} < \frac{120-140}{10}\right) = P(Z<-2) = P(Z>2) = 0.5 - P(0<Z<2) = 0.5 - 0.4772 = 0.0228$$

$$P(120<X<130) = P\left(\frac{120-140}{10} < Z < \frac{130-140}{10}\right) = P(-2<Z<-1) = P(1<Z<2) = P(0<Z<2) - P(0<Z<1) = 0.4772 - 0.3413 = 0.1359$$

$$P(130<X<140) = P\left(\frac{130-140}{10} < Z < \frac{140-140}{10}\right) = P(-1<Z<0) = P(0<Z<1) = 0.3413$$

| Prices | Oi | Ei | (Oi-Ei)² | (Oi-Ei)²/Ei |
|---|---|---|---|---|
| Less than $120 | 6 | 4.56 | 2.0736 | 0.4547 |
| $120 to under $130 | 40 | 27.18 | 164.3524 | 6.0468 |
| $130 to under $140 | 75 | 68.26 | 45.4276 | 0.6655 |
| $140 to under $150 | 55 | 68.26 | 175.8276 | 2.5759 |
| $150 to under $160 | 20 | 27.18 | 51.5524 | 1.8967 |
| $160 and over | 4 | 4.56 | 0.3136 | 0.0688 |
| TOTAL | 200 | 200 | | 11.7080 |

NOTE: If you want to use EXCEL, you can also find cumulative normal values using the formula P(X < value) = NORMDIST(value, μ, σ, 1); so
P(X < 120) = NORMDIST(120, 140, 10, 1)
P(120 < X < 130) = P(X < 130) – P(X < 120) = NORMDIST(130, 140, 10, 1) – NORMDIST(120, 140, 10, 1)
P(130 < X < 140) = P(X < 140) – P(X < 130) = NORMDIST(140, 140, 10, 1) – NORMDIST(130, 140, 10, 1)

2. H0: population is normal; Ha: population is not normal

3.
Draw a Chi-squared distribution with α = 0.05 shaded in the right tail, to the right of χ²α,(k-1) = χ²0.05,(6-1) = 11.0705. Reject H0 if χ²0 > 11.0705.

4. Calculate the test statistic:

$$\chi^2_0 = \sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i} = 11.7080$$

5. Reject H0 since χ²0 > χ²α (11.708 > 11.0705). Thus, the distribution is not normal. This means that S. Marty cannot use ANOVA to determine the best manufacturer.

Using the p-value approach for steps 3 to 5:
3. Draw a Chi-squared distribution with α = 0.05 shaded in the right tail. Reject H0 if p-value < α = 0.05.
4. p-value = P(χ² > 11.7080) = CHIDIST(value, df) = CHIDIST(11.7080, 5) = 0.039015
5. Reject H0 since p-value = 0.0390 < 0.05. Thus, the distribution is not normal. This means that S. Marty cannot use ANOVA to determine the best manufacturer.

NOTE: To calculate Ei, use the sample mean and standard deviation if you don't have population values.

3. COMMON ERRORS

The most common error made by students is not practicing enough questions. The two tests in this lecture are relatively simple, but they do require a little practice to truly understand. Also keep in mind:
1. Both tests are right-tailed tests, so the entire rejection region, α, is in the right tail.
2. Degrees of freedom: Both tests use a single degrees-of-freedom value, since both are Chi-squared tests: (r – 1)(c – 1) for contingency analysis and k – 1 for goodness-of-fit.
3. Expected values: For both tests, expected values are calculated from the total number of observations and the probability of being in a particular cell.

4.
DEFINITIONS

χ² (Chi-squared) Contingency Test Statistic (to test for independence between two variables/factors)

$$\chi^2_{(r-1)(c-1)} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}; \quad df = (r-1)(c-1)$$

where
Oij = Observed cell frequency in cell (i, j)
Eij = Expected cell frequency in cell (i, j) = (row i total) (column j total) / (Grand total) = (row i total) (column j total) / N
r = number of rows
c = number of columns
N = total number of observations.

NOTE: This is a reasonable test when each expected cell frequency is at least 5. When an expected cell frequency is less than 5, the probability of a Type I error increases beyond the stated significance level, α. This is not a problem when you do not reject H0, but it is when you reject H0. To compensate, increase the sample size or combine categories (use sound logic to determine how you combine your categories).

χ² (Chi-squared) Goodness-of-fit Test (to test a population distribution assumption)

$$\chi^2_{k-1} = \sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}; \quad df = k-1$$

where
Oi = Observed cell frequency for category i
Ei = Expected cell frequency for category i (calculated based on the hypothesized distribution) = N * P(of being in a cell)
k = number of categories

Assumptions: Observations are independent.

NOTE: This statistic is distributed approximately as a chi-square only if the sample size is large (use n ≥ 30). Recall from Econ 2201 that a frequency is a count of the number of observations.

[...]

7. Suppose a researcher wants to use the data below and the chi-square test of independence to determine if variable one is independent of variable two.

| Variable Two \ Variable One | A | B | C |
|---|---|---|---|
| D | 25 | 40 | 60 |
| E | 10 | 45 | 20 |

a. What is the expected value for the cell of D and B?
b. What are the degrees of freedom for this test?
c. What is the critical chi-square value for α = 0.05?

8.
A firm received 250 job applications from Carleton graduates, and classified applicants by: (1) whether or not they got a job offer; and (2) their immigration status. Results of the classification are in the table below:

| | Immigrant | Canadian Born |
|---|---|---|
| Received Job Offer | 10 | 80 |
| Did Not Receive Job Offer | 40 | 120 |

a. If immigration status were independent of receiving a job offer, how many immigrants would you expect to receive a job offer?
b. Test if receiving a job offer (or not) is independent of immigration status. Use α = 0.05.

9. A study of the effects of exercise by women included the following results. Exercise values are in kilocalories of physical activity per week. Use a 0.05 significance level to test for the independence of the level of smoking and the level of exercise.

| Observed Frequency (Oij) | Below 200 | 200-599 | 600-1499 | 1500+ | Total |
|---|---|---|---|---|---|
| Never smoked | 4,997 | 5,205 | 5,784 | 4,155 | 20,141 |
| Smoke < 15 cigarettes/day | 604 | 484 | 447 | 359 | 1,894 |
| Smoke 15 or more cigarettes/day | 1,403 | 830 | 644 | 350 | 3,227 |
| Total | 7,004 | 6,519 | 6,875 | 4,864 | 25,262 |

10. Employees are compensated for overtime by receiving extra pay, time off, or other perks. A random sample of 50 employees is taken, and the results are provided in the table below, showing employee preferences for those with children and those without children. Determine if the type of compensation preferred is independent of whether or not an employee has children. Test using a 1% significance level.

| Oij | Pay | Time Off | Perks | Row Total |
|---|---|---|---|---|
| With children | 19 | 8 | 3 | 30 |
| No children | 6 | 7 | 7 | 20 |
| Column Totals | 25 | 15 | 10 | 50 |
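For checking hand calculations on problems like these, the two test statistics defined in these notes can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original course material: the function names `contingency_chi2` and `goodness_of_fit_chi2` are made up for this example, and the critical values are the ones quoted in the notes rather than computed. The data reproduce Example 4.1 and Example 4.3.

```python
def contingency_chi2(observed):
    """Chi-squared statistic for a two-way table of counts (list of rows).

    Returns (statistic, degrees of freedom, expected frequencies), using
    E_ij = (row i total) * (column j total) / N.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


def goodness_of_fit_chi2(observed, expected):
    """Chi-squared goodness-of-fit statistic with df = k - 1.

    Assumes the expected frequencies were already computed as
    N * P(category) from a fully specified hypothesized distribution.
    """
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Example 4.1: rows are Cares / Doesn't Care, columns are Smoker / Non-Smoker.
stat, df, expected = contingency_chi2([[11, 139], [29, 21]])
print(f"Example 4.1: chi2 = {stat:.4f}, df = {df}")  # about 60.17 with df = 1
print("Reject H0" if stat > 3.8415 else "Do not reject H0")  # critical value from the notes

# Example 4.3: observed prices vs. expected counts under N(140, 10).
gof, gof_df = goodness_of_fit_chi2(
    [6, 40, 75, 55, 20, 4],
    [4.56, 27.18, 68.26, 68.26, 27.18, 4.56],
)
print(f"Example 4.3: chi2 = {gof:.3f}, df = {gof_df}")  # about 11.71 with df = 5
print("Reject H0" if gof > 11.0705 else "Do not reject H0")  # critical value from the notes
```

The same `contingency_chi2` call should work for the tables in Problems 7 to 10 once the observed counts are entered row by row; the critical value for the relevant α and df still has to be looked up separately, and the expected frequencies it returns can be used to check the Eij ≥ 5 condition.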