Chi-Square and ANOVA Tests: Hypothesis Analysis in Statistics, Lecture notes of Economics

Probability Theory, Data Analysis, Statistical Inference

An overview of Chi-Square and Analysis of Variance (ANOVA) tests, including calculations, examples, and instructions on how to perform these tests using technology. Chi-Square tests are used to determine if observed frequencies fit the expected frequencies, while ANOVA tests compare the means of multiple groups.

What you will learn

  • How do you calculate the expected frequencies for a Chi-Square test?
  • What is the null hypothesis in a Chi-Square test?
  • What is the difference between the Chi-Square test and the Analysis of Variance (ANOVA) test?
  • How do you perform a Chi-Square test using technology?
  • What is the Chi-Square test used for in statistics?

Typology: Lecture notes

2021/2022

Uploaded on 09/12/2022

bartolix 🇬🇧

Partial preview of the text

Chapter 11: Chi-Square and ANOVA Tests

This chapter presents material on three more hypothesis tests. One is used to determine whether there is a significant relationship between two qualitative variables, the second is used to determine whether the sample data follow a particular distribution, and the last is used to determine whether there are significant differences among the means of three or more samples.

Section 11.1: Chi-Square Test for Independence

Remember, qualitative data is where you collect data on individuals that are categories or names. Then you count how many of the individuals have particular qualities. An example is the theory that there is a relationship between breastfeeding and autism. To determine if there is a relationship, researchers could collect the time period that a mother breastfed her child and whether that child was diagnosed with autism. Then you would have a table containing this information. Now you want to know if each cell is independent of each other cell. Remember, independence says that one event does not affect another event. Here it means that having autism is independent of being breastfed. What you really want is to see if they are not independent. In other words, does one affect the other? If you were to do a hypothesis test, this is your alternative hypothesis, and the null hypothesis is that they are independent. There is a hypothesis test for this and it is called the Chi-Square Test for Independence. Technically it should be called the Chi-Square Test for Dependence, but for historical reasons it is known as the test for independence. Just as with previous hypothesis tests, all the steps are the same except for the assumptions and the test statistic.

Hypothesis Test for Chi-Square Test

1. State the null and alternative hypotheses and the level of significance
   H0: the two variables are independent (this means that the one variable is not affected by the other)
   HA: the two variables are dependent (this means that the one variable is affected by the other)
   Also, state your α level here.

2. State and check the assumptions for the hypothesis test
   a. A random sample is taken.
   b. Expected frequencies for each cell are greater than or equal to 5. (The expected frequencies, E, will be calculated later, and this assumption means E ≥ 5.)

3. Find the test statistic and p-value
   Finding the test statistic involves several steps. First the data is collected and counted, and then it is organized into a table (in a table each entry is called a cell). These values are known as the observed frequencies, and the symbol for an observed frequency is O. Each table is made up of rows and columns. Then each row is totaled to give a row total and each column is totaled to give a column total.

   The null hypothesis is that the variables are independent. Using the multiplication rule for independent events, you can calculate the probability of being one value of the first variable, A, and one value of the second variable, B (the probability of a particular cell, P(A and B)). Remember, in a hypothesis test you assume that H0 is true, so the two variables are assumed to be independent.

   $$P(A \text{ and } B) = P(A) \cdot P(B) \quad \text{if } A \text{ and } B \text{ are independent}$$

   $$= \frac{\text{number of ways } A \text{ can happen}}{\text{total number of individuals}} \cdot \frac{\text{number of ways } B \text{ can happen}}{\text{total number of individuals}} = \frac{\text{row total}}{n} \cdot \frac{\text{column total}}{n}$$

   Now you want to find out how many individuals you expect to be in a certain cell. To find the expected frequencies, you just need to multiply the probability of that cell by the total number of individuals. Do not round the expected frequencies.
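The expected-frequency rule just described (probability of the cell under independence, times the total number of individuals) can be sketched in R. This is a minimal sketch rather than part of the original notes: the counts are the breastfeeding and autism counts that appear in the O column of table #11.1.2, and the variable names `observed`, `n`, and `expected` are illustrative assumptions.

```r
# Observed counts as listed in the O column of table #11.1.2
# (first four values form row 1, last four form row 2;
#  the row and column labels are not shown in this preview).
observed <- matrix(c(241, 198, 164, 215,
                      20,  25,  27,  44),
                   nrow = 2, byrow = TRUE)

n <- sum(observed)  # total number of individuals

# E = (row total * column total) / n for every cell.
# outer() builds the table of all row-total x column-total products.
expected <- outer(rowSums(observed), colSums(observed)) / n

print(round(expected, 3))

# Check the assumption that every expected frequency is at least 5.
print(all(expected >= 5))
```

The expected frequencies are kept unrounded in `expected`; rounding is applied only when printing, in line with the advice not to round them during the calculation.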
The expected frequency for the cell (A and B) is

$$E(A \text{ and } B) = n\left(\frac{\text{row total}}{n} \cdot \frac{\text{column total}}{n}\right) = \frac{\text{row total} \cdot \text{column total}}{n}$$

If the variables are independent, the expected frequencies and the observed frequencies should be close to each other. The test statistic here involves looking at the difference between the expected frequency and the observed frequency for each cell. Then you want to find the “total difference” of all of these differences. The larger the total, the smaller the chance that you could find that test statistic given that the assumption of independence is true. A large total is therefore evidence that the assumption of independence is not true. How do you find the test statistic? First find the differences between the observed and expected frequencies. Because some of these differences will be positive and some will be negative, you need to square these differences. These squares could be large just because the frequencies are large, so you need to divide by the expected frequencies to scale them. Then finally add up all of these fractional values. This is the test statistic.

Test Statistic: The symbol for Chi-Square is χ².

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency.

Table #11.1.2: Calculations for Chi-Square Test Statistic

| O | E | O − E | (O − E)² | (O − E)²/E |
|---|---|---|---|---|
| 241 | 228.585 | 12.415 | 154.132225 | 0.674288448 |
| 198 | 195.304 | 2.696 | 7.268416 | 0.03721591 |
| 164 | 167.278 | -3.278 | 10.745284 | 0.064236086 |
| 215 | 226.833 | -11.833 | 140.019889 | 0.617281828 |
| 20 | 32.4154 | -12.4154 | 154.1421572 | 4.755213792 |
| 25 | 27.6959 | -2.6959 | 7.26787681 | 0.262417066 |
| 27 | 23.7216 | 3.2784 | 10.74790656 | 0.453085229 |
| 44 | 32.167 | 11.833 | 140.019889 | 4.352904809 |
| Total | | 0.0001 | | χ² = 11.2166432 |

The test statistic formula is χ² = Σ (O − E)²/E, which is the total of the last column in table #11.1.2.

p-value: df = (2 − 1) * (4 − 1) = 3
Using TI-83/84: χ²cdf(11.2166432, 1E99, 3) ≈ 0.01061
Using R: 1 − pchisq(11.2166432, 3) ≈ 0.01061566

4. Conclusion
   Fail to reject H0 since the p-value is more than 0.01.

5. Interpretation
   There is not enough evidence to show that breastfeeding and autism are dependent. This means that you cannot say that whether a child is breastfed or not will indicate whether that child will be diagnosed with autism.

Example #11.1.2: Hypothesis Test with Chi-Square Test Using Technology

Is there a relationship between autism and breastfeeding? To determine if there is, a researcher asked mothers of autistic and non-autistic children to say what time period they breastfed their children. The data is in table #11.1.1 (Schultz, Klonoff-Cohen, Wingard, Askhoomoff, Macera, Ji & Bacher, 2006). Do the data provide enough evidence to show that breastfeeding and autism are independent? Test at the 1% level.

Solution:

1. State the null and alternative hypotheses and the level of significance
   H0: Breastfeeding and autism are independent
   HA: Breastfeeding and autism are dependent
   α = 0.01

2. State and check the assumptions for the hypothesis test
   a. A random sample of breastfeeding time frames and autism incidence was taken.
   b. Expected frequencies for each cell are greater than or equal to 5 (i.e. E ≥ 5). See step 3. All expected frequencies are more than 5.

3. Find the test statistic and p-value
   Test statistic: To use the TI-83/84 calculator to compute the test statistic, you must first put the data into the calculator. However, this process is different than for other hypothesis tests. You need to put the data in as a matrix instead of in a list. Go into the MATRX menu, then move over to EDIT and choose 1:[A]. This will allow you to type the table into the calculator. Figure #11.1.2 shows what you will see on your calculator when you choose 1:[A] from the EDIT menu.

   Figure #11.1.2: Matrix Edit Menu on TI-83/84

   The table has 2 rows and 4 columns (don’t include the row total column and the column total row in your count).
   You need to tell the calculator that you have a 2 by 4 matrix. The 1 X1 on the calculator is the size of the matrix (you might have another size in your matrix, but it doesn’t matter because you will change it). So type 2 ENTER and 4 ENTER and the calculator will make a matrix of the correct size. See figure #11.1.3.

   Figure #11.1.3: Matrix Setup for Table

   Now type the table in by pressing ENTER after each cell value. Figure #11.1.4 contains the complete table typed in. Once you have the data in, press QUIT.

   Figure #11.1.4: Data Typed into Matrix

   To run the test on the calculator, go into STAT, then move over to TEST and choose χ²-Test from the list. The setup for the test is in figure #11.1.5.

   Figure #11.1.5: Setup for Chi-Square Test on TI-83/84

   Once you press ENTER on Calculate you will see the results in figure #11.1.6.

   Figure #11.1.6: Results for Chi-Square Test on TI-83/84

   The test statistic is χ² ≈ 11.2167 and the p-value is p ≈ 0.01061. Notice that the calculator calculates the expected values for you and places them in matrix B. To

Table #11.1.3: Number of Leprosy Cases

| WHO Region | High Income | Upper Middle Income | Lower Middle Income | Low Income | Row Total |
|---|---|---|---|---|---|
| Americas | 174 | 36028 | 615 | 0 | 36817 |
| Eastern Mediterranean | 54 | 6 | 1883 | 604 | 2547 |
| Europe | 10 | 0 | 0 | 0 | 10 |
| Western Pacific | 26 | 216 | 3689 | 1155 | 5086 |
| Africa | 0 | 39 | 1986 | 15928 | 17953 |
| South-East Asia | 0 | 0 | 149896 | 10236 | 160132 |
| Column Total | 264 | 36289 | 158069 | 27923 | 222545 |

(The four income columns are the World Bank Income Groups.)

Solution:

1. State the null and alternative hypotheses and the level of significance
   H0: WHO region and Income Level when dealing with the disease of leprosy are independent
   HA: WHO region and Income Level when dealing with the disease of leprosy are dependent
   α = 0.05

2. State and check the assumptions for the hypothesis test
   a. A random sample of incidence of leprosy was taken from different countries and the income level and WHO region was taken.
   b. Expected frequencies for each cell are greater than or equal to 5 (i.e. E ≥ 5). See step 3. There are actually 4 expected frequencies that are less than 5, and the results of the test may not be valid. If you look at the expected frequencies you will notice that three of them are in Europe (the fourth is Eastern Mediterranean and High Income). This is because Europe didn’t have many cases in 2011.

3. Find the test statistic and p-value
   Test statistic: First find the expected frequencies for each cell.

   E(Americas and High Income) = 36817 * 264 / 222545 ≈ 43.675
   E(Americas and Upper Middle Income) = 36817 * 36289 / 222545 ≈ 6003.514
   E(Americas and Lower Middle Income) = 36817 * 158069 / 222545 ≈ 26150.335
   E(Americas and Low Income) = 36817 * 27923 / 222545 ≈ 4619.475

   Others are done similarly. It is easier to do the calculations for the test statistic with a table, and the others are in table #11.1.4 along with the calculation for the test statistic.

Table #11.1.4: Calculations for Chi-Square Test Statistic

| O | E | O − E | (O − E)² | (O − E)²/E |
|---|---|---|---|---|
| 174 | 43.675 | 130.325 | 16984.564 | 388.8838719 |
| 54 | 3.021 | 50.979 | 2598.813 | 860.1218328 |
| 10 | 0.012 | 9.988 | 99.763 | 8409.746711 |
| 26 | 6.033 | 19.967 | 398.665 | 66.07628214 |
| 0 | 21.297 | -21.297 | 453.572 | 21.29722977 |
| 0 | 189.961 | -189.961 | 36085.143 | 189.9608978 |
| 36028 | 6003.514 | 30024.486 | 901469735.315 | 150157.0038 |
| 6 | 415.323 | -409.323 | 167545.414 | 403.4097962 |
| 0 | 1.631 | -1.631 | 2.659 | 1.6306365 |
| 216 | 829.342 | -613.342 | 376188.071 | 453.5983897 |
| 39 | 2927.482 | -2888.482 | 8343326.585 | 2850.001268 |
| 0 | 26111.708 | -26111.708 | 681821316.065 | 26111.70841 |
| 615 | 26150.335 | -25535.335 | 652053349.724 | 24934.7988 |
| 1883 | 1809.080 | 73.920 | 5464.144 | 3.020398811 |
| 0 | 7.103 | -7.103 | 50.450 | 7.1027882 |
| 3689 | 3612.478 | 76.522 | 5855.604 | 1.620938405 |
| 1986 | 12751.636 | -10765.636 | 115898911.071 | 9088.944681 |
| 149896 | 113738.368 | 36157.632 | 1307374351.380 | 11494.57632 |
| 0 | 4619.475 | -4619.475 | 21339550.402 | 4619.475122 |
| 604 | 319.575 | 284.425 | 80897.421 | 253.1404187 |
| 0 | 1.255 | -1.255 | 1.574 | 1.25471253 |
| 1155 | 638.147 | 516.853 | 267137.238 | 418.6140882 |
| 15928 | 2252.585 | 13675.415 | 187016964.340 | 83023.25138 |
| 10236 | 20091.963 | -9855.963 | 97140000.472 | 4834.769106 |
| Total | | 0.000 | | χ² = 328594.008 |

   The test statistic formula is χ² = Σ (O − E)²/E, which is the total of the last column in table #11.1.4.

   p-value: df = (6 − 1) * (4 − 1) = 15
   Using the TI-83/84: χ²cdf(328594.008, 1E99, 15) ≈ 0
   Using R: 1 − pchisq(328594.008, 15) ≈ 0

4. Conclusion
   Reject H0 since the p-value is less than 0.05.

5. Interpretation
   There is enough evidence to show that WHO region and income level are dependent when dealing with the disease of leprosy. WHO can decide how to focus their efforts based on region and income level. Do remember though that the results may not be valid because the expected frequencies are not all at least 5.

Example #11.1.4: Hypothesis Test with Chi-Square Test Using Technology

The World Health Organization (WHO) keeps track of how many incidents of leprosy there are in the world. Using the WHO regions and the World Bank’s income groups, one can ask if an income level and a WHO region are dependent on each other in terms of predicting where the disease is. Data on leprosy cases in different countries was collected for the year 2011 and a summary is presented in table #11.1.3 ("Leprosy: Number of," 2013). Is there evidence to show that income level and WHO region are independent when dealing with the disease of leprosy? Test at the 5% level.

Solution:

1. State the null and alternative hypotheses and the level of significance
   H0: WHO region and Income Level when dealing with the disease of leprosy are independent
   HA: WHO region and Income Level when dealing with the disease of leprosy are dependent
   α = 0.05

2. State and check the assumptions for the hypothesis test
   a. A random sample of incidence of leprosy was taken from different countries and the income level and WHO region was taken.
   b. Expected frequencies for each cell are greater than or equal to 5 (i.e. E ≥ 5). See step 3.
   There are actually 4 expected frequencies that are less than 5, and the results of the test may not be valid. If you look at the expected frequencies you will notice that three of them are in Europe (the fourth is Eastern Mediterranean and High Income). This is because Europe didn’t have many cases in 2011.

3. Find the test statistic and p-value
   Test statistic: Using the TI-83/84. See example #11.1.2 for the process of doing the test on the calculator. Remember, you need to put the data in as a matrix instead of in a list.

Section 11.1: Homework

In each problem show all steps of the hypothesis test. If some of the assumptions are not met, note that the results of the test may not be correct and then continue the process of the hypothesis test.

1.) The number of people who survived the Titanic based on class and sex is in table #11.1.5 ("Encyclopedia Titanica," 2013). Is there enough evidence to show that the class and the sex of a person who survived the Titanic are independent? Test at the 5% level.

Table #11.1.5: Surviving the Titanic

| Class | Female | Male | Total |
|---|---|---|---|
| 1st | 134 | 59 | 193 |
| 2nd | 94 | 25 | 119 |
| 3rd | 80 | 58 | 138 |
| Total | 308 | 142 | 450 |

2.) Researchers watched groups of dolphins off the coast of Ireland in 1998 to determine what activities the dolphins partake in at certain times of the day ("Activities of dolphin," 2013). The numbers in table #11.1.6 represent the number of groups of dolphins that were partaking in an activity at certain times of day. Is there enough evidence to show that the activity and the time period are independent for dolphins? Test at the 1% level.

Table #11.1.6: Dolphin Activity

| Activity | Morning | Noon | Afternoon | Evening | Row Total |
|---|---|---|---|---|---|
| Travel | 6 | 6 | 14 | 13 | 39 |
| Feed | 28 | 4 | 0 | 56 | 88 |
| Social | 38 | 5 | 9 | 10 | 62 |
| Column Total | 72 | 15 | 23 | 79 | 189 |

3.) Is there a relationship between autism and what an infant is fed? To determine if there is, a researcher asked mothers of autistic and non-autistic children to say what they fed their infant.
The data is in table #11.1.7 (Schultz, Klonoff-Cohen, Wingard, Askhoomoff, Macera, Ji & Bacher, 2006). Do the data provide enough evidence to show that what an infant is fed and autism are independent? Test at the 1% level.

Table #11.1.7: Autism Versus Breastfeeding

| Autism | Breast-feeding | Formula with DHA/ARA | Formula without DHA/ARA | Row Total |
|---|---|---|---|---|
| Yes | 12 | 39 | 65 | 116 |
| No | 6 | 22 | 10 | 38 |
| Column Total | 18 | 61 | 75 | 154 |

4.) A person’s educational attainment and age group were collected by the U.S. Census Bureau in 1984 to see if age group and educational attainment are related. The counts in thousands are in table #11.1.8 ("Education by age," 2013). Do the data show that educational attainment and age are independent? Test at the 5% level.

Table #11.1.8: Educational Attainment and Age Group

| Education | 25-34 | 35-44 | 45-54 | 55-64 | >64 | Row Total |
|---|---|---|---|---|---|---|
| Did not complete HS | 5416 | 5030 | 5777 | 7606 | 13746 | 37575 |
| Completed HS | 16431 | 1855 | 9435 | 8795 | 7558 | 44074 |
| College 1-3 years | 8555 | 5576 | 3124 | 2524 | 2503 | 22282 |
| College 4 or more years | 9771 | 7596 | 3904 | 3109 | 2483 | 26863 |
| Column Total | 40173 | 20057 | 22240 | 22034 | 26290 | 130794 |

5.) Students at multiple grade schools were asked what their personal goal (get good grades, be popular, be good at sports) was and how important good grades were to them (1 very important and 4 least important). The data is in table #11.1.9 ("Popular kids datafile," 2013). Do the data provide enough evidence to show that goal attainment and importance of grades are independent? Test at the 5% level.

Table #11.1.9: Personal Goal and Importance of Grades

| Goal | Rating 1 | Rating 2 | Rating 3 | Rating 4 | Row Total |
|---|---|---|---|---|---|
| Grades | 70 | 66 | 55 | 56 | 247 |
| Popular | 14 | 33 | 45 | 49 | 141 |
| Sports | 10 | 24 | 33 | 23 | 90 |
| Column Total | 94 | 123 | 133 | 128 | 478 |

6.) Students at multiple grade schools were asked what their personal goal (get good grades, be popular, be good at sports) was and how important being good at sports was to them (1 very important and 4 least important).
The data is in table #11.1.10 ("Popular kids datafile," 2013). Do the data provide enough evidence to show that goal attainment and importance of sports are independent? Test at the 5% level.

Table #11.1.10: Personal Goal and Importance of Sports

| Goal | Rating 1 | Rating 2 | Rating 3 | Rating 4 | Row Total |
|---|---|---|---|---|---|
| Grades | 83 | 81 | 55 | 28 | 247 |
| Popular | 32 | 49 | 43 | 17 | 141 |
| Sports | 50 | 24 | 14 | 2 | 90 |
| Column Total | 165 | 154 | 112 | 47 | 478 |

7.) Students at multiple grade schools were asked what their personal goal (get good grades, be popular, be good at sports) was and how important having good looks was to them (1 very important and 4 least important). The data is in table #11.1.11 ("Popular kids datafile," 2013). Do the data provide enough evidence to show that goal attainment and importance of looks are independent? Test at the 5% level.

Table #11.1.11: Personal Goal and Importance of Looks

| Goal | Rating 1 | Rating 2 | Rating 3 | Rating 4 | Row Total |
|---|---|---|---|---|---|
| Grades | 80 | 66 | 66 | 35 | 247 |
| Popular | 81 | 30 | 18 | 12 | 141 |
| Sports | 24 | 30 | 17 | 19 | 90 |
| Column Total | 185 | 126 | 101 | 66 | 478 |

4. Conclusion
   This is where you write reject H0 or fail to reject H0. The rule is: if the p-value < α, then reject H0. If the p-value ≥ α, then fail to reject H0.

5. Interpretation
   This is where you interpret in real world terms the conclusion to the test. The conclusion for a hypothesis test is that you either have enough evidence to show HA is true, or you do not have enough evidence to show HA is true.

Example #11.2.1: Goodness of Fit Test Using the Formula

Suppose you have a die and you are curious whether it is fair or not. If it is fair then the proportion for each value should be the same. You need to find the observed frequencies, and to accomplish this you roll the die 500 times and count how often each side comes up. The data is in table #11.2.1. Do the data show that the die is fair? Test at the 5% level.
Table #11.2.1: Observed Frequencies of Die

| Die values | 1 | 2 | 3 | 4 | 5 | 6 | Total |
|---|---|---|---|---|---|---|---|
| Observed Frequency | 78 | 87 | 87 | 76 | 85 | 87 | 500 |

Solution:

1. State the null and alternative hypotheses and the level of significance
   H0: The observed frequencies are consistent with the distribution for a fair die (the die is fair)
   HA: The observed frequencies are not consistent with the distribution for a fair die (the die is not fair)
   α = 0.05

2. State and check the assumptions for the hypothesis test
   a. A random sample is taken since each throw of a die is a random event.
   b. Expected frequencies for each cell are greater than or equal to 5. See step 3.

3. Find the test statistic and p-value
   First you need to find the probability of rolling each side of the die. The sample space for rolling a die is {1, 2, 3, 4, 5, 6}. Since you are assuming that the die is fair, P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

   Now you can find the expected frequency for each side of the die. Since all the probabilities are the same, each expected frequency is the same.

   Expected frequency = E = n * p = 500 * (1/6) ≈ 83.33

   Test Statistic: It is easier to calculate the test statistic using a table.

Table #11.2.2: Calculation of the Chi-Square Test Statistic

| O | E | O − E | (O − E)² | (O − E)²/E |
|---|---|---|---|---|
| 78 | 83.33 | -5.33 | 28.4089 | 0.340920437 |
| 87 | 83.33 | 3.67 | 13.4689 | 0.161633265 |
| 87 | 83.33 | 3.67 | 13.4689 | 0.161633265 |
| 76 | 83.33 | -7.33 | 53.7289 | 0.644772591 |
| 85 | 83.33 | 1.67 | 2.7889 | 0.033468139 |
| 87 | 83.33 | 3.67 | 13.4689 | 0.161633265 |
| Total | | 0.02 | | χ² ≈ 1.504060962 |

   The test statistic is χ² ≈ 1.504060962.
   The degrees of freedom are df = k − 1 = 6 − 1 = 5.
   Using TI-83/84: p-value = χ²cdf(1.50406096, 1E99, 5) ≈ 0.913
   Using R: p-value = 1 − pchisq(1.50406096, 5) ≈ 0.9126007

4. Conclusion
   Fail to reject H0 since the p-value is greater than 0.05.

5. Interpretation
   There is not enough evidence to show that the die is not consistent with the distribution for a fair die.
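The goodness-of-fit calculation in this example can be cross-checked in R. The following is a minimal sketch, not taken from the notes: it recomputes the statistic by hand from the six observed counts and then compares it with R's built-in `chisq.test`, whose `p` argument holds the hypothesized proportions. The variable names are illustrative assumptions, and because the expected frequency is not rounded to 83.33 here, the statistic differs slightly from the table value in the later decimal places.

```r
# Observed counts for die faces 1 through 6 (500 rolls in total)
observed <- c(78, 87, 87, 76, 85, 87)
n <- sum(observed)

# Under H0 (fair die) every face has probability 1/6
p_fair <- rep(1 / 6, 6)
expected <- n * p_fair

# Chi-square statistic: sum of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus 1
df <- length(observed) - 1

# Right-tail probability of the chi-square distribution
p_value <- 1 - pchisq(chi_sq, df)

# The same test using the built-in function
fit <- chisq.test(observed, p = p_fair)

print(c(statistic = chi_sq, df = df, p_value = p_value))
print(fit)
```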
There is not enough evidence to show that the die is not fair.

Example #11.2.2: Goodness of Fit Test Using Technology

Suppose you have a die and you are curious whether it is fair or not. If it is fair then the proportion for each value should be the same. You need to find the observed frequencies, and to accomplish this you roll the die 500 times and count how often each side comes up. The data is in table #11.2.1. Do the data show that the die is fair? Test at the 5% level.

Solution:

1. State the null and alternative hypotheses and the level of significance
   H0: The observed frequencies are consistent with the distribution for a fair die (the die is fair)
   HA: The observed frequencies are not consistent with the distribution for a fair die (the die is not fair)
   α = 0.05

2. State and check the assumptions for the hypothesis test
   a. A random sample is taken since each throw of a die is a random event.
   b. Expected frequencies for each cell are greater than or equal to 5. See step 3.

3. Find the test statistic and p-value
   Using the TI-83/84 calculator: To compute the test statistic, you must first put the data into the calculator. Type the observed frequencies into L1 and the expected frequencies into L2. Then go to L3, arrow up onto the name, and type in (L1 − L2)^2 / L2. Now use 1-Var Stats L3 to find the total. See figure #11.2.1 for the initial setup, figure #11.2.2 for the results of that calculation, and figure #11.2.3 for the result of the 1-Var Stats L3.

   Figure #11.2.1: Input into TI-83
   Figure #11.2.2: Result for L3 on TI-83
   Figure #11.2.3: 1-Var Stats L3 Result on TI-83

2.) Eyeglassomatic manufactures eyeglasses for different retailers. They test to see how many defective lenses they made during the time period of January 1 to March 31. Table #11.2.4 gives the defect and the number of defects.
Table #11.2.4: Number of Defective Lenses

| Defect type | Number of defects |
|---|---|
| Scratch | 5865 |
| Right shaped – small | 4613 |
| Flaked | 1992 |
| Wrong axis | 1838 |
| Chamfer wrong | 1596 |
| Crazing, cracks | 1546 |
| Wrong shape | 1485 |
| Wrong PD | 1398 |
| Spots and bubbles | 1371 |
| Wrong height | 1130 |
| Right shape – big | 1105 |
| Lost in lab | 976 |
| Spots/bubble – intern | 976 |

Do the data support the notion that each defect type occurs in the same proportion? Test at the 10% level.

3.) On occasion, medical studies need to model the proportion of the population that has a disease and compare that to observed frequencies of the disease actually occurring. Suppose data on end-stage renal failure in south-west Wales was collected for different age groups. Do the data in table #11.2.5 show that the observed frequencies are in agreement with the proportion of people in each age group (Boyle, Flowerdew & Williams, 1997)? Test at the 1% level.

Table #11.2.5: Renal Failure Frequencies

| Age Group | 16-29 | 30-44 | 45-59 | 60-75 | 75+ | Total |
|---|---|---|---|---|---|---|
| Observed Frequency | 32 | 66 | 132 | 218 | 91 | 539 |
| Expected Proportion | 0.23 | 0.25 | 0.22 | 0.21 | 0.09 | |

4.) In Africa in 2011, the number of deaths of females from cardiovascular disease for different age groups is in table #11.2.6 ("Global health observatory," 2013). In addition, the proportion of deaths of females from all causes for the same age groups is also in table #11.2.6. Do the data show that deaths from cardiovascular disease are in the same proportion as all deaths for the different age groups? Test at the 5% level.

Table #11.2.6: Deaths of Females for Different Age Groups

| Age | 5-14 | 15-29 | 30-49 | 50-69 | Total |
|---|---|---|---|---|---|
| Cardiovascular Frequency | 8 | 16 | 56 | 433 | 513 |
| All Cause Proportion | 0.10 | 0.12 | 0.26 | 0.52 | |

5.) In Australia in 1995, there was a question of whether indigenous people are more likely to die in prison than non-indigenous people. To find out, the data in table #11.2.7 was collected ("Aboriginal deaths in," 2013).
Do the data show that indigenous people die in the same proportion as non-indigenous people? Test at the 1% level.

Table #11.2.7: Death of Prisoners

| | Prisoner Dies | Prisoner Did Not Die | Total |
|---|---|---|---|
| Indigenous Prisoner Frequency | 17 | 2890 | 2907 |
| Frequency of Non-Indigenous Prisoner | 42 | 14459 | 14501 |

6.) A project conducted by the Australian Federal Office of Road Safety asked people many questions about their cars. One question was the reason that a person chooses a given car, and that data is in table #11.2.8 ("Car preferences," 2013).

Table #11.2.8: Reason for Choosing a Car

| Safety | Reliability | Cost | Performance | Comfort | Looks |
|---|---|---|---|---|---|
| 84 | 62 | 46 | 34 | 47 | 27 |

Do the data show that the frequencies observed substantiate the claim that the reasons for choosing a car are equally likely? Test at the 5% level.

Section 11.3: Analysis of Variance (ANOVA)

There are times when you want to compare three or more population means. One idea is to just test different combinations of two means. The problem with that is that your chance of a type I error increases. Instead you need a process for analyzing all of them at the same time. This process is known as analysis of variance (ANOVA). The test statistic for the ANOVA is fairly complicated, so you will want to use technology to find the test statistic and p-value. The test statistic follows an F-distribution, which is skewed right and depends on degrees of freedom. Since you will use technology to find these, the distribution and the test statistic will not be presented. Remember, all hypothesis tests follow the same process. Note that to obtain a statistically significant result there need only be a difference between any two of the k means.

Before conducting the hypothesis test, it is helpful to look at the means and standard deviations for each data set. If the sample means, with consideration of the sample standard deviations, are different, it may mean that some of the population means are different.
However, do realize that if they are different, it doesn’t provide enough evidence to show the population means are different. Calculating the sample statistics just gives you an idea of whether conducting the hypothesis test is a good idea.

Hypothesis test using ANOVA to compare k means

1. State the random variables and the parameters in words
   x1 = random variable 1
   x2 = random variable 2
   ⋮
   xk = random variable k
   µ1 = mean of random variable 1
   µ2 = mean of random variable 2
   ⋮
   µk = mean of random variable k

2. State the null and alternative hypotheses and the level of significance
   H0: µ1 = µ2 = µ3 = ⋯ = µk
   HA: at least two of the means are not equal
   Also, state your α level here.

3. State and check the assumptions for the hypothesis test
   a. A random sample of size ni is taken from each population.
   b. All the samples are independent of each other.
   c. Each population is normally distributed. The ANOVA test is fairly robust to violations of this assumption, especially if the sample sizes are fairly close to each other, so unless the populations are far from normally distributed this is a loose assumption.

Notice the sample sizes are not the same. The sample sizes are n1 = 13, n2 = 17, n3 = 17, n4 = 6, n5 = 11.

2. State the null and alternative hypotheses and the level of significance
   H0: µ1 = µ2 = µ3 = µ4 = µ5
   HA: at least two of the means are not equal
   α = 0.01

3. State and check the assumptions for the hypothesis test
   a. A random sample of 13 survival times from stomach cancer was taken. A random sample of 17 survival times from bronchus cancer was taken. A random sample of 17 survival times from colon cancer was taken. A random sample of 6 survival times from ovarian cancer was taken. A random sample of 11 survival times from breast cancer was taken. These statements may not be true. No information was shared as to whether the samples were random or not, but it may be safe to assume that they were.
   b. Since the individuals have different cancers, the samples are independent.
   c. Population of all survival times from stomach cancer is normally distributed. Population of all survival times from bronchus cancer is normally distributed. Population of all survival times from colon cancer is normally distributed. Population of all survival times from ovarian cancer is normally distributed. Population of all survival times from breast cancer is normally distributed. Looking at the histograms, box plots and normal quantile plots for each sample, it appears that none of the populations are normally distributed. The sample sizes are somewhat different for the problem. This assumption may not be true.
   d. The population variances are all equal. The sample standard deviations are approximately 346.3, 209.9, 427.2, 1098.6, and 1239.0 respectively. This assumption does not appear to be met, since the sample standard deviations are very different. The sample sizes are somewhat different for the problem. This assumption may not be true.

4. Find the test statistic and p-value
   To find the test statistic and p-value using the TI-83/84, type each data set into L1 through L5. Then go into STAT and over to TESTS and choose ANOVA(. Then type in L1,L2,L3,L4,L5 and press ENTER. You will get the results of the ANOVA test.

   Figure #11.3.1: Setup for ANOVA on TI-83/84
   Figure #11.3.2: Results of ANOVA on TI-83/84

   The test statistic is F ≈ 6.433 and the p-value ≈ 2.29 × 10⁻⁴. Just so you know, the Factor information is between the groups and the Error is within the groups. So MSB ≈ 2883940.13, SSB ≈ 11535760.5, and dfB = 4, and MSW ≈ 448273.635, SSW ≈ 26448144, and dfW = 59.
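The notes deliberately leave out the formula for the ANOVA test statistic, but the calculator output above is enough to see how the pieces fit together. As a hedged sketch under the standard one-way ANOVA definitions (each mean square is a sum of squares divided by its degrees of freedom, and F = MSB / MSW), the reported values can be reproduced in R. The variable names are illustrative assumptions, and the input numbers are the rounded values quoted from the calculator.

```r
# Values quoted from the TI-83/84 ANOVA output for this example
ss_between <- 11535760.5   # SSB, "Factor" sum of squares
df_between <- 4            # dfB = k - 1 = 5 groups - 1
ms_within  <- 448273.635   # MSW, "Error" mean square
df_within  <- 59           # dfW = N - k = 64 observations - 5 groups

# A mean square is a sum of squares divided by its degrees of freedom
ms_between <- ss_between / df_between

# The within-group sum of squares can be recovered the same way
ss_within <- ms_within * df_within

# F statistic: between-group variation relative to within-group variation
f_stat <- ms_between / ms_within

# p-value: right-tail area of the F distribution with (dfB, dfW) df
p_value <- 1 - pf(f_stat, df_between, df_within)

print(c(MSB = ms_between, SSW = ss_within, F = f_stat, p_value = p_value))
```

Computing the p-value with `pf` mirrors the way `pchisq` is used for the chi-square tests earlier in the chapter.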
   To find the test statistic and p-value in R, the commands would be:

   variable = c(type in all data values with commas in between)
      This is the response variable.
   factor = c(rep("factor 1", number of data values for factor 1), rep("factor 2", number of data values for factor 2), etc)
      This separates the data into the different factors that the measurements were based on.
   data_name = data.frame(variable, factor)
      This puts the data into one variable. data_name is the name you give this variable.
   aov(variable ~ factor, data = data_name)
      This runs the ANOVA analysis.

   For this example, the commands would be:

   time = c(124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396, 81, 461, 20, 450, 246, 166, 63, 64, 155, 859, 151, 166, 37, 223, 138, 72, 245, 248, 377, 189, 1843, 180, 537, 519, 455, 406, 365, 942, 776, 372, 163, 101, 20, 283, 1234, 89, 201, 356, 2970, 456, 1235, 24, 1581, 1166, 40, 727, 3808, 791, 1804, 3460, 719)
   factor = c(rep("Stomach", 13), rep("Bronchus", 17), rep("Colon", 17), rep("Ovary", 6), rep("Breast", 11))
   survival = data.frame(time, factor)
   results = aov(time ~ factor, data = survival)
   summary(results)

               Df   Sum Sq Mean Sq F value   Pr(>F)
   factor       4 11535761 2883940   6.433 0.000229 ***
   Residuals   59 26448144  448274
   ---
   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

   The test statistic is F = 6.433 and the p-value = 0.000229.

5. Conclusion
   Reject Ho since the p-value is less than 0.01.

6. Interpretation
   There is evidence to show that at least two of the mean survival times from different cancers are not equal. By examination of the means, it appears that the mean survival time for breast cancer is different from the mean survival times for both stomach and bronchus cancers. It may also be different from the mean survival time for colon cancer. The others may not be different enough to say for sure.

3.) Several magazines were grouped into three categories based on the education level of the readers the magazines are geared towards: high, medium, or low. Then random samples of the magazines were selected to determine the number of three-plus-syllable words in the advertising copy, and the data is in table #11.3.4 ("Magazine ads readability," 2013). Is there enough evidence to show that the mean number of three-plus-syllable words in advertising copy is different for at least two of the education levels? Test at the 5% level.

Table #11.3.4: Number of Three Plus Syllable Words in Advertising Copy

   High Education   Medium Education   Low Education
         34                13                7
         21                22                7
         37                25                7
         31                 3                7
         10                 5                7
         24                 2                7
         39                 9                8
         10                 3                8
         17                 0                8
         18                 4                8
         32                29                8
         17                26                8
          3                 5                9
         10                 5                9
          6                24                9
          5                15                9
          6                 3                9
          6                 8                9

4.) A study was undertaken to see how accurate the calorie labeling is on food that is considered reduced calorie. The group measured the amount of calories for each item of food and then found the percent difference between the measured and labeled calories, (measured − labeled) / labeled × 100%. The group also looked at whether the food was nationally advertised, regionally distributed, or locally prepared. The data is in table #11.3.5 ("Calories datafile," 2013). Do the data indicate that at least two of the mean percent differences between the three groups are different? Test at the 10% level.

Table #11.3.5: Percent Differences Between Measured and Labeled Food

   National Advertised   Regionally Distributed   Locally Prepared
            2                      41                     15
          -28                      46                     60
           -6                       2                    250
            8                      25                    145
            6                      39                      6
           -1                      16.5                   80
           10                      17                     95
           13                      28                      3
           15                      -3
           -4                      14
           -4                      34
          -18                      42
           10
            5
            3
           -7
            3
           -0.5
          -10
            6

5.) The amount of sodium (in mg) in different types of hotdogs is in table #11.3.6 ("Hot dogs story," 2013). Is there sufficient evidence to show that the mean amounts of sodium in the types of hotdogs are not all equal? Test at the 5% level.
Table #11.3.6: Amount of Sodium (in mg) in Beef, Meat, and Poultry Hotdogs

   Beef   Meat   Poultry
   495    458    430
   477    506    375
   425    473    396
   322    545    383
   482    496    387
   587    360    542
   370    387    359
   322    386    357
   479    507    528
   375    393    513
   330    405    426
   300    372    513
   386    144    358
   401    511    581
   645    405    588
   440    428    522
   317    339    545
   319
   298
   253
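When the groups have different sizes, as they do here, the data can still be entered in R the same way as in the worked example. The following is a minimal sketch of the setup only, not a worked solution. It assumes the values in table #11.3.6 run across the rows in the order Beef, Meat, Poultry, with the last three values belonging to Beef, which gives 20 beef, 17 meat, and 17 poultry hotdogs. The names sodium, type, and hotdogs are made up for illustration.

```r
# Sodium (mg) for each type of hotdog (column assignment is an assumption)
beef <- c(495, 477, 425, 322, 482, 587, 370, 322, 479, 375,
          330, 300, 386, 401, 645, 440, 317, 319, 298, 253)
meat <- c(458, 506, 473, 545, 496, 360, 387, 386, 507, 393,
          405, 372, 144, 511, 405, 428, 339)
poultry <- c(430, 375, 396, 383, 387, 542, 359, 357, 528, 513,
             426, 513, 358, 581, 588, 522, 545)

# Response variable and the factor that labels each measurement
sodium <- c(beef, meat, poultry)
type <- factor(rep(c("Beef", "Meat", "Poultry"),
                   times = c(length(beef), length(meat), length(poultry))))
hotdogs <- data.frame(sodium, type)

# Sample sizes, means, and standard deviations, for checking the assumptions
group_n    <- tapply(hotdogs$sodium, hotdogs$type, length)
group_mean <- tapply(hotdogs$sodium, hotdogs$type, mean)
group_sd   <- tapply(hotdogs$sodium, hotdogs$type, sd)
print(rbind(n = group_n, mean = group_mean, sd = group_sd))

# Run the ANOVA and show the table with the F statistic and p-value
results <- aov(sodium ~ type, data = hotdogs)
print(summary(results))
```

The sample standard deviations printed before the ANOVA table are the ones to compare when checking the equal variances assumption in step 3.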