Analysis of Variance (ANOVA) for Comparing Means of K Groups - Prof. R. Strawderman, Study notes of Data Analysis & Statistical Methods

An overview of analysis of variance (ANOVA), a statistical method used to compare the means of K groups. The notes cover observational studies, controlled experiments, sampling from K populations, and the differences between fixed and random effects ANOVA. They also include key facts, formulas, and results related to ANOVA.

Typology: Study notes

2009/2010

Uploaded on 12/09/2010

wk2151

Partial preview of the text

Inference about More Than Two Location Parameters I
One-Way ANOVA: FCSM 8.1-8.4
BTRY 6010 & ILRST 6100

[Slide 2] Quick Review: Two Independent Samples
- Assumptions: normality (or large sample sizes) and independent observations, both within and between samples.
- We discussed testing and estimation for two cases: (1) equal variances (pooled), (2) unequal variances (Satterthwaite df).
- For sample sizes that are not large and highly nonnormal data: consider a nonparametric test (Wilcoxon rank sum test).

Next: inference for several population means ($K \ge 2$ groups).
- Basic ANOVA (ANalysis Of VAriance):
  - Assumes independence, normality, and equal variances.
  - Generalizes the 2-sample t-test with pooled variance to settings involving 3 or more populations / groups.
  - Looks different from this t-test, but turns out to be equivalent.

[Slide 5] Data Structure: Observational Study (K = 3)
Adults sampled, weekly alcohol consumption assessed. A simple random sample is taken from each group; the response is HDL cholesterol level.
- Nondrinker: $y_{11}, y_{12}, \ldots, y_{1n_1}$, with $Y_{1j} \sim N(\mu_1, \sigma^2)$
- 1-3 drinks per week: $y_{21}, y_{22}, \ldots, y_{2n_2}$, with $Y_{2j} \sim N(\mu_2, \sigma^2)$
- ≥ 4 drinks per week: $y_{31}, y_{32}, \ldots, y_{3n_3}$, with $Y_{3j} \sim N(\mu_3, \sigma^2)$

[Slide 6] Controlled Experiment
- Subjects (experimental units) are randomly and independently drawn from some population (or possibly a set of independent populations).
- $K \ge 2$ populations (groups) are created by randomly applying one of K treatments to each experimental unit. Each treatment may change the distribution of the response(s) (ideally: only the mean value).
- Objective: to compare the K treatment means to determine whether significant differences exist.
- Note: if all samples are drawn from the same population before applying treatment, the assumption of equal variances may be more plausible.
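To make the controlled-experiment setup concrete, the following minimal sketch randomly assigns experimental units to K treatments and then draws each response from $N(\mu_i, \sigma^2)$ with a common $\sigma$. The treatment means, the value of $\sigma$, the number of units, and the helper name `simulate_experiment` are illustrative assumptions, not values taken from the slides.

```python
import random

def simulate_experiment(n_units, mus, sigma, seed=0):
    """Randomly assign n_units to len(mus) treatments and simulate responses.

    Hypothetical helper: treatment i shifts the mean to mus[i]; the error
    standard deviation sigma is common to all groups (equal variances).
    """
    rng = random.Random(seed)
    k = len(mus)
    # Balanced random assignment: shuffle the unit labels, then deal them out.
    units = list(range(n_units))
    rng.shuffle(units)
    assignment = {unit: pos % k for pos, unit in enumerate(units)}
    # Response = treatment mean + N(0, sigma^2) random error.
    groups = [[] for _ in range(k)]
    for unit in range(n_units):
        i = assignment[unit]
        groups[i].append(rng.gauss(mus[i], sigma))
    return assignment, groups

# Assumed values for illustration only (15 units, 3 treatments).
assignment, groups = simulate_experiment(15, mus=[30.0, 32.0, 35.0], sigma=2.0)
for i, ys in enumerate(groups, start=1):
    print(f"treatment {i}: n = {len(ys)}, sample mean = {sum(ys) / len(ys):.2f}")
```

Because every unit comes from the same pool before treatment is applied, the only thing that differs between the simulated groups is the mean, which is the situation in which the equal-variance assumption is most plausible.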
[Slide 7] Data Structure: Controlled Experiment (K = 3)
Dairy cows randomly assigned to three diets. A simple random sample of cows receives each diet; the response is milk production.
- Cow Diet #1: $y_{11}, y_{12}, \ldots, y_{1n_1}$, with $Y_{1j} \sim N(\mu_1, \sigma^2)$
- Cow Diet #2: $y_{21}, y_{22}, \ldots, y_{2n_2}$, with $Y_{2j} \sim N(\mu_2, \sigma^2)$
- Cow Diet #3: $y_{31}, y_{32}, \ldots, y_{3n_3}$, with $Y_{3j} \sim N(\mu_3, \sigma^2)$

[Slide 10]
Note: if K = 2 (i.e., two groups), we have:
- $E(Y_{1j}) = \mu_1$, $E(Y_{2j}) = \mu_2$, and $\mathrm{Var}(Y_{ij}) = \sigma^2$ for $i = 1, 2$ and $j = 1, \ldots, n_i$.
- That is: responses are assumed to come from two independent normal populations with possibly different means but common variance.
- This sampling model (i.e., assuming a common variance) was the starting point for developing the (pooled) 2-sample t-test and CI procedures. In fact, we'll see that ANOVA with K = 2 is equivalent to the pooled two-sample t-test.

[Slide 11] Basic ANOVA: Model II
- Same assumptions, and equivalent linear model:
  $$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \quad \text{for } i = 1, \ldots, K, \; j = 1, \ldots, n_i$$
  Response = Overall Mean + Effect of Population i + Random Error
- This just re-expresses $\mu_i$ as $\mu + \alpha_i$, where $\alpha_i$ gives the value of $\mu_i$ relative to the overall mean $\mu$.
- Main focus: testing whether $\alpha_1 = \alpha_2 = \cdots = \alpha_K = 0$; also estimating $\mu$ and especially $\alpha_i$ for $i = 1, \ldots, K$.

[Slide 12]
- $Y_{ij}$ = response of the jth unit from the ith group
- $\mu$ = overall mean response value
- $\alpha_i$ = effect on mean response due to the ith group ($\alpha_i = \mu_i - \mu$)
- $\varepsilon_{ij}$ = random error for the jth unit from the ith group

Assumptions: $\varepsilon_{ij} \sim$ independent $N(0, \sigma^2)$ (common variance!), and $\sum_{i=1}^{K} \alpha_i = 0$ (needed so we can interpret $\mu$ as the overall mean).

$\varepsilon_{ij}$ retains its interpretation as random error from the group-specific mean. Because random error has mean 0, the ith group mean equals $\mu_i = \mu + \alpha_i$. Formally: the model implies $E(Y_{ij}) = \mu + \alpha_i$ and $\mathrm{Var}(Y_{ij}) = \sigma^2$ for $i = 1, \ldots, K$, $j = 1, \ldots, n_i$.
If $\alpha_1 = \alpha_2 = \cdots = \alpha_K = 0$, then the model becomes $Y_{ij} = \mu + \varepsilon_{ij}$, implying as before that $E(Y_{ij}) = \mu$ and $\mathrm{Var}(Y_{ij}) = \sigma^2$.

[Slide 15] Random effects ANOVA
- Assume the K groups are not of intrinsic interest, but rather have been chosen to represent a larger population of potential groups. Interest lies in whether group effects (differences among the $\mu_i$s) in the population contribute to total variability.
- In this case, fixed-effects ANOVA might not be the best approach; it does not address this question.
- If the K groups can be regarded as a random sample, a one-way random effects model (the $\alpha_i$s are random effects) may be better. Testing is formulated differently (variance components).

[Slide 16] Some examples:
- A large company's personnel manager wants to assess whether employee interviews are influenced by choice of interviewer. An SRS of 5 interviewers out of 125 is chosen; each interviews 4 applicants. [Random effect: interviewer]
- To see whether sodium content in beer differs by brand, an SRS of 6 brands out of 200 is selected, and sodium content is measured in 8 bottles of each. [Random effect: brand]
- To examine the role of keyboards in spreading infection, a university takes an SRS of 25 keyboards out of 300 in a large public computing lab, and performs an aerobic colony count on 5 keys per board. [Random effect: keyboard]

A random effects ANOVA compares means of groups (interviewers, brands, keyboards), assuming that these groups are an SRS from a larger population of interest.

[Slide 17] Hypotheses of Interest (Fixed Effects)
- Model I: $H_0: \mu_1 = \mu_2 = \cdots = \mu_K$ versus $H_a$: not all means are equal
- Model II: $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_K = 0$ versus $H_a$: not all of the $\alpha_i$s are equal to 0

These are equivalent; both test whether or not all K groups have the same mean.

[Slide 20]
- If each of the 10 tests is performed at $\alpha = 0.005$
(i.e., the overall level 0.05 divided by the number of tests), then Bonferroni's inequality also implies that the overall P(Type I error) for $H_0$ can be no larger than 10 (0.005) = 0.05.
- However, the actual P(Type I error) level can also be much smaller than 0.05; in fact, this Bonferroni correction for multiple testing often yields a very conservative test procedure.
- But: each individual 2-sample t-test uses only part of the information available to estimate the common variance $\sigma^2$. This is very inefficient; it turns out that we can do much better!

[Slide 21] Principal Idea behind ANOVA
ANOVA compares the variability between the K estimated group-specific means to a pooled estimate of "within group" variability, obtained using all observations in all groups.
[Figure: density curves for populations 1, 2, and 3, each plotted against x from -4 to 4; labels mark the within group variance and the between groups variance, i.e., the variation in means.]

[Slide 22]
[Figure: three density curves, x from -4 to 4.] Within group variance is small compared to between groups variance. Why? Populations are located far apart, with clear separation of means.
[Figure: three density curves, x from -4 to 4.] Within group variance is large compared to between groups variance. Why? Populations are located closer together, with less clear separation of means.

[Slide 25] Pooled Error (Within Group) Variance
We just saw how to estimate group means.
To estimate the common variance, first recall the pooled two-sample t-test; there, we used a pooled estimate of sample variance:

$$t = \frac{\bar y_1 - \bar y_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \quad \text{for} \quad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Extend the pooled variance concept to K groups as follows:

$$s^2 \stackrel{def}{=} \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_K - 1)s_K^2}{(n_1 - 1) + (n_2 - 1) + \cdots + (n_K - 1)} = \frac{SSE}{n_{tot} - K}$$

[Slide 26] Variation of Group Means (Between Groups)
ANOVA simply compares "between groups" variability (i.e., between the estimated group-specific means) to a (pooled) estimate of the "within group" variability. We have the estimated group means, the estimated overall mean, and a pooled estimate of error (or "within group") variability. One possible measure of "between groups" variation is given by the following formula:

$$\frac{SSB}{K - 1} \stackrel{def}{=} \frac{1}{K - 1} \sum_{i=1}^{K} n_i (\bar y_i - \bar y)^2$$

[Slide 27] Key facts: under the ANOVA model,

$$E\left(\frac{SSE}{n_{tot} - K}\right) = \sigma^2$$

$$E\left(\frac{SSB}{K - 1}\right) = \sigma^2 + \frac{n_{tot}}{K - 1}\left[\sum_{i=1}^{K} w_i \alpha_i^2 - \left(\sum_{i=1}^{K} w_i \alpha_i\right)^2\right] \quad \text{for } w_i = \frac{n_i}{n_{tot}}$$

Thus, under $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_K = 0$:

$$E\left(\frac{SSE}{n_{tot} - K}\right) = E\left(\frac{SSB}{K - 1}\right) = \sigma^2$$

That is: under $H_0$, the two measures of variability have the same average value $\sigma^2$.
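The two variability measures defined above can be computed directly from grouped data. The following is a minimal sketch on small made-up samples; the data and the function names `pooled_variance` and `between_groups_mean_square` are hypothetical and only illustrate the formulas.

```python
from statistics import mean, variance

def pooled_variance(groups):
    """SSE / (n_tot - K): the (n_i - 1)-weighted average of the sample variances."""
    sse = sum((len(g) - 1) * variance(g) for g in groups)
    n_tot = sum(len(g) for g in groups)
    return sse / (n_tot - len(groups))

def between_groups_mean_square(groups):
    """SSB / (K - 1), with SSB = sum over groups of n_i * (group mean - overall mean)^2."""
    overall = mean(y for g in groups for y in g)
    ssb = sum(len(g) * (mean(g) - overall) ** 2 for g in groups)
    return ssb / (len(groups) - 1)

# Made-up data for illustration: K = 3 groups of unequal size.
groups = [[4.0, 6.0, 5.0], [7.0, 9.0, 8.0, 8.0], [5.0, 7.0]]
print("pooled (within group) variance:", pooled_variance(groups))
print("between groups mean square:    ", between_groups_mean_square(groups))
```

With only two groups, `pooled_variance` reduces to the familiar $s_p^2$ of the pooled two-sample t-test, which is the sense in which the K-group formula extends it.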
[Slide 30] Results from Slide 27 imply the following: in general,

$$E(SSE) = (n_{tot} - K)\,\sigma^2$$

$$E(SSB) = (K - 1)\,\sigma^2 + n_{tot}\left[\sum_{i=1}^{K} w_i \alpha_i^2 - \left(\sum_{i=1}^{K} w_i \alpha_i\right)^2\right]$$

Thus, the TSS decomposition implies:

$$E(TSS) = (n_{tot} - 1)\,\sigma^2 + n_{tot}\left[\sum_{i=1}^{K} w_i \alpha_i^2 - \left(\sum_{i=1}^{K} w_i \alpha_i\right)^2\right]$$

Under $H_0$, we also have:

$$E(TSS) = (n_{tot} - K)\,\sigma^2 + (K - 1)\,\sigma^2 = (n_{tot} - 1)\,\sigma^2$$

Taken together, the "decomposition" of TSS and its df form the basis for the ANOVA Table…

[Slide 31] The ANOVA Table

Source of variation   Degrees of freedom   Sum of squares   Mean square                  F statistic
Between               $K - 1$              SSB              MSB = SSB / $(K - 1)$        MSB / MSE
Error (Within)        $n_{tot} - K$        SSE              MSE = SSE / $(n_{tot} - K)$
Total                 $n_{tot} - 1$        TSS

The degrees of freedom column shows the partition of the degrees of freedom; the sum of squares column shows the partition of the sums of squares. The F statistic is coming up next!

[Slide 32] The F statistic
The ANOVA table decomposes total variation (TSS) into within group (SSE) and between groups (SSB) components. If the group means are actually different, then previous results imply that MSB should exceed MSE (on average). Consider the F statistic:

$$F = \frac{MSB}{MSE} = \frac{SSB / (K - 1)}{SSE / (n_{tot} - K)}$$

Under $H_0$, the computed value of F should tend to be close to 1. Under $H_a$, it should exceed 1, by an amount depending on both $n_{tot}$ and the degree of variability in the $\alpha_i$s.

[Slide 35] Example: Sand in Concrete
A manufacturer of concrete bridge supports is interested in determining the effect of varying the sand content on the strength of the supports. Five supports are made for each of five different amounts of sand in the concrete mix, and each is tested for compression resistance.
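The ANOVA table entries and the F statistic described above can be computed directly from grouped data. The sketch below is a minimal illustration that uses the compression-resistance measurements of the sand example; each list is assumed to be one percent-sand column of the data table that follows, and the function name `one_way_anova` and its dictionary layout are hypothetical.

```python
from statistics import mean

def one_way_anova(groups):
    """Compute one-way fixed effects ANOVA quantities for a list of samples."""
    k = len(groups)
    n_tot = sum(len(g) for g in groups)
    overall = mean(y for g in groups for y in g)
    means = [mean(g) for g in groups]
    # Between groups, within group (error), and total sums of squares.
    ssb = sum(len(g) * (m - overall) ** 2 for g, m in zip(groups, means))
    sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    tss = sum((y - overall) ** 2 for g in groups for y in g)
    msb = ssb / (k - 1)
    mse = sse / (n_tot - k)
    return {"overall": overall, "means": means,
            "effects": [m - overall for m in means],
            "df_between": k - 1, "df_error": n_tot - k,
            "SSB": ssb, "SSE": sse, "TSS": tss,
            "MSB": msb, "MSE": mse, "F": msb / mse}

# Compression resistance by percent sand (column reading is an assumption).
sand = [
    [7, 7, 10, 15, 9],      # 15% sand
    [17, 12, 11, 18, 19],   # 20% sand
    [14, 18, 18, 19, 19],   # 25% sand
    [20, 24, 22, 19, 23],   # 30% sand
    [7, 10, 11, 15, 11],    # 35% sand
]
table = one_way_anova(sand)
print(f"Between: df = {table['df_between']}, SS = {table['SSB']:.1f}, MS = {table['MSB']:.2f}")
print(f"Error:   df = {table['df_error']}, SS = {table['SSE']:.1f}, MS = {table['MSE']:.2f}")
print(f"Total:   SS = {table['TSS']:.1f}")
print(f"F = {table['F']:.4f}")
```

If the columns are read as assumed, the group means, estimated effects, and F statistic printed here should agree with the values reported for this example in the slides that follow.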
Percent Sand
 15%    20%    25%    30%    35%
   7     17     14     20      7
   7     12     18     24     10
  10     11     18     22     11
  15     18     19     19     15
   9     19     19     23     11

[Slide 36] Estimating $\mu$, $\mu_1, \ldots, \mu_5$, and $\alpha_1, \ldots, \alpha_5$:

Percent Sand                               15%    20%    25%    30%    35%    Overall
$\hat\mu_i = \bar y_i$                     9.6   15.4   17.6   21.6   10.8    $\hat\mu = \bar y = 15$
$\hat\alpha_i = \bar y_i - \bar y$        -5.4    0.4    2.6    6.6   -4.2    0

[Slide 37]
[Figure: Compression Resistance; y-axis: Resistance (10,000 psi), x-axis: Percent Sand.]

[Slide 40]
F = 14.8655 > 7.10, so p < 0.001.

[Slide 41] Check normality using residuals
- Click on red triangle > Save > Save Residuals
- Make a normal quantile plot of residuals using Fit Y by X > Distribution

Residuals: $\hat\varepsilon_{ij} = y_{ij} - \bar y_i$

[Slide 42]
[JMP screenshot, annotated with the name of the column containing residuals: selected output obtained from using the command Fit Distribution > Continuous Fit > Normal.]
No glaring deviations suggested by graphical and test diagnostics.
Note: we'll also assume that the independence assumption holds here; not unreasonable given the nature of this experiment.

[Slide 45] Example: CHD & Cholesterol
Data courtesy of: Mandana Arabi and Farbod Rais Zadeh
- Cardiovascular heart disease (CHD) data
- Study area: cardiology hospital in Tehran, Iran, 1999
- Response variable: total serum cholesterol
- Four groups of patients:
  - diabetic patients with cardiovascular disease (group 1)
  - non-diabetic patients with cardiovascular disease (group 2)
  - diabetic patients without cardiovascular disease (group 3)
  - control subjects (no diabetes or CHD; group 4)
- Number of subjects per group = 50
- Main research question: Are there differences in mean total cholesterol between the 4 groups of subjects?

[Slide 46]
[JMP output: Oneway Analysis of Cholesterol By Patient Group, including Tests that the Variances are Equal, Oneway Anova, and a fitted normal distribution and normal quantile plot for cholesterol centered by patient group. Values that remain legible in the extracted text:]

Level   Count   Std Dev    MeanAbsDif to Mean   MeanAbsDif to Median
1       50      36.74728   28.47820             28.3000
2       50      38.80793   31.30720             31.16000
3       50      41.25103   33.89280             (illegible)
4       50      44.87434   36.28960             36.26000

- Analysis of Variance: Patient Group DF = 3, F Ratio = 13.2002, Prob > F < .0001; C. Total DF = 199.
- Tests that the variances are equal: O'Brien[.5] F = 0.8628 (Prob > F = 0.4608); Brown-Forsythe F = 0.9838 (Prob > F = 0.4014); Levene F = 1.0149.
- Fitted Normal: location approximately 0 (-1.44e-15), dispersion 40.226547.
- Goodness-of-fit (Shapiro-Wilk W test): W = 0.990632, Prob < W = 0.2206. Note: $H_0$ = the data are from the Normal distribution; small p-values reject $H_0$.

[Slide 47] K = 2: ANOVA F = (Pooled t statistic)$^2$

$$F = \frac{MSB}{MSE} = \frac{SSB / (K - 1)}{SSE / (n_{tot} - K)} = \frac{SSB}{SSE / (n_1 + n_2 - 2)}$$

For K = 2:

$$\frac{SSE}{n_1 + n_2 - 2} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = s_p^2 \quad \text{(Slide 25)}$$

$$SSB = n_1(\bar y_1 - \bar y)^2 + n_2(\bar y_2 - \bar y)^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar y_1 - \bar y_2)^2 = \frac{(\bar y_1 - \bar y_2)^2}{\frac{1}{n_1} + \frac{1}{n_2}} \quad \text{(Slide 26)}$$

Dividing the second expression by the first gives $F = \dfrac{(\bar y_1 - \bar y_2)^2}{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = t^2$.

[Slide 50]
Analyze > Fit Y by X, select Means/Anova/Pooled t
- Pooled t-test: see Slide 20 of the notes for testing with two independent samples.
- $(2.58847)^2 = 6.7002$
- Equal p-values (two-sided): $(t_{df})^2 = F_{1, df}$

[Slide 51] JMP: One-Way Fixed Effects ANOVA (alternative method)
Analyze > Fit Model
1. Specify Y, Model Effects
2. Run Model

[Slide 52]
Once again: we will ignore this part (for now)!
Same results as we obtained using Fit Y by X; various diagnostics that we've looked at (as well as others) are available.

[Slide 55] JMP: One-Way Random Effects ANOVA (purely for illustration of 'Fit Model')
1. Specify Y, Model Effects
2. Highlight effect
3. Select Attributes (Random Effect)

[Slide 56]
4. Select Method
   - EMS: expected mean square (tries to do ANOVA; works fine with balanced data, not recommended otherwise)
   - REML: alternative approach that uses "maximum likelihood"; current state of the art for all linear random effects models.
5. Run Model

[Slide 57]
Identical ANOVA table, F-test results (Slide 51), as expected.
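For balanced data, the expected mean square idea behind the EMS option can be sketched by hand. Assuming n observations per group, error variance $\sigma^2$, and between-group variance $\sigma_\alpha^2$ (notation introduced here, not in the slides), the standard results $E(MSE) = \sigma^2$ and $E(MSB) = \sigma^2 + n\sigma_\alpha^2$ suggest estimating $\sigma_\alpha^2$ by $(MSB - MSE)/n$. The sketch below uses made-up ratings and a hypothetical function name; it is not a reproduction of JMP's output.

```python
from statistics import mean

def ems_variance_components(groups):
    """Method-of-moments estimates for a balanced one-way random effects model."""
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes balanced data")
    k = len(groups)
    overall = mean(y for g in groups for y in g)
    msb = n * sum((mean(g) - overall) ** 2 for g in groups) / (k - 1)
    mse = sum((y - mean(g)) ** 2 for g in groups for y in g) / (k * (n - 1))
    # E(MSE) = sigma^2 and E(MSB) = sigma^2 + n * sigma_alpha^2, so solve for
    # sigma_alpha^2; truncate at 0 because the moment estimate can be negative.
    sigma2_group = max(0.0, (msb - mse) / n)
    return {"F": msb / mse, "sigma2_error": mse, "sigma2_group": sigma2_group}

# Made-up ratings: 3 hypothetical interviewers, 4 applicants each.
ratings = [[6.0, 7.0, 5.0, 6.0], [8.0, 9.0, 8.0, 7.0], [5.0, 4.0, 6.0, 5.0]]
print(ems_variance_components(ratings))
```

The F ratio computed here is the same MSB / MSE as in the fixed effects table, which is consistent with the identical ANOVA table and F-test noted above; what changes is the interpretation, in terms of variance components rather than fixed group effects.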