STAT 8200 — Design and Analysis of Experiments for Research Workers — Lecture Notes

Basics of Experimental Design

Terminology

Response (Outcome, Dependent) Variable: (y) The variable whose distribution is of interest.
• Could be quantitative (size, weight, etc.) or qualitative (pass/fail, quality rated on a 5-point scale).
– I'll assume the former (easier to analyze).
– Typically we are interested in the mean of y and how it depends on other variables.
– E.g., differences in mean response between varieties, drugs.

Explanatory (Predictor, Independent) Variables: (x's) Variables that explain (predict) variability in the response variable.
• E.g., variety, rainfall, predation, soil type, subject age.

Factor: A set of related treatments or classifications used as an explanatory variable.
• Often qualitative (e.g., variety), but can be quantitative (0, 100, or 200 units of fertilizer).

Treatment or Treatment Combination: A particular combination of the levels of all of the treatment factors.

Nuisance Variables: Other variables that influence the response variable but are not of interest.
• E.g., rainfall, level of predation, subject age.
• Systematic bias occurs when treatments differ with respect to a nuisance variable. If so, it becomes a confounding variable, or confounder.

Experimental Units: The units or objects that are independently assigned to a specific experimental condition.
• E.g., a plot assigned to receive a particular variety, a subject assigned a particular drug.

Measurement Units: The units or objects on which distinct measurements of the response are made.
• Not necessarily the same as the experimental units. The distinction is very important!
• E.g., a plant or fruit within a plot.

Experimental Error: Variation among experimental units that have received the same experimental conditions.
• The standard against which differences between treatments are to be judged.
• Treatment differences must be large relative to the variability we would expect in the absence of a treatment effect (experimental error) to infer the difference is real (statistical significance).
• If two varieties have mean yields that differ by d units, there is no way to judge how large d is unless we can estimate the experimental error (requires replication).

(2) Replication: Repeating the experimental run (one entire set of experimental conditions) using additional similar, independent experimental units.
• Allows estimation of the experimental error, without which treatment differences CANNOT be inferred.
• Increases the precision/power of the experiment.

BEWARE OF PSEUDO-REPLICATION!

Example 3: Suppose we randomize plots in a crop row to two treatments as so:

A B B B A B A A A B

and we measure the size of all 10 plants in each plot.
⇒ we have 50 measurements per treatment.
⇒ we have 5 plots per treatment.

Q: What's the sample size per treatment that determines our power to statistically distinguish varieties A and B?

A: 5 per treatment, not 50. The experimental unit here is the plot, not the plant.

Replication:
• Taking multiple measurements per experimental unit is called subsampling or pseudo-replication.
– It is a useful means to reduce measurement error in characterizing the response at the experimental unit level.
– If we are not interested in estimating this measurement error, the easiest analysis is to average the subsamples in each experimental unit and analyze these averages as "the data" (see the sketch below).
– How to allocate resources between experimental units and measurement units is complicated, but generally there is more bang for the buck in adding experimental units than measurement units. There is little gain in going beyond 2 or 3 measurement units per experimental unit.
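To make the averaging strategy concrete, here is a minimal R sketch for a layout like Example 3; the data values are hypothetical (simulated), not from the notes:

```r
# Hypothetical data mimicking Example 3: 10 plots, 10 plants measured per plot
set.seed(1)
trt_by_plot <- c("A","B","B","B","A","B","A","A","A","B")
dat <- data.frame(
  plot = factor(rep(1:10, each = 10)),
  trt  = factor(rep(trt_by_plot, each = 10)),
  size = rnorm(100, mean = 20, sd = 2)
)

# Average the 10 subsamples (plants) within each experimental unit (plot);
# these 10 plot means become "the data" for the treatment comparison.
plot_means <- aggregate(size ~ plot + trt, data = dat, FUN = mean)

# The test is then based on n = 5 plots per treatment, not 50 plants.
t.test(size ~ trt, data = plot_means, var.equal = TRUE)
```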
Replication: What determines the number of replicates?
• Available resources. Limitations on cost, labor, time, experimental material available, etc.
• Sources of variability in the system and their magnitudes.
• Size of the difference to be detected.
• Required significance level (α = 0.05?).
• Number of treatments.

Effect of the number of replicates per treatment on the smallest difference in treatment means that can be detected at α = .05 in a simple one-way design: [figure omitted: difference between means necessary for significance at level .05, plotted against n per treatment]

Blocking:

Example: In our greenhouse example, our completely randomized assignment of varieties A, B happened to assign variety B to all three benches on the west end of the building. Suppose the heater is located on the west end, so that there is a temperature gradient from west to east.

This identifiable source of heterogeneity among the experimental units is a natural choice of blocking variable. Solution: rearrange the benches as so, and assign both treatments randomly within each column:

A B B A B B
B A A B A A

• The design above is called a Randomized Complete Block Design (RCBD).
• Blocking is often done within a given site. E.g., a large field is blocked to control for potential heterogeneity within the site. This is useful, but only slightly, as it results in only small gains relative to a CRD.
• However, if a known source of variability exists where there is likely to be a gradient in the characteristics of the plots, then blocking within a site is definitely worthwhile.

Blocking:
• In the presence of a gradient, plots should be oriented appropriately, and blocks laid out accordingly: [plot orientation figures omitted*]
• Placement of blocks should take into account the physical features of the site: [figure omitted]

* Plots on this page are from Petersen (1994).

Blocking:

There are a number of factors that often create heterogeneity in the experimental units that can, and typically should, form the basis of blocks:
• Region.
• Time (season, year, day).
• Facility (e.g., if multiple greenhouses are to be used, or multiple labs to take measurements, or if patients are recruited from multiple clinics).
• Personnel to conduct the experiment.

Block effects are typically assumed not to interact with treatment effects.
• E.g., while we allow that a region (block) effect might raise the yield for all varieties, we assume that differences in mean yield between varieties are the same in each region.
• If each treatment occurs just once in each block, this assumption MUST be made in order to analyze the data, and the analysis will likely be wrong if this assumption is violated.
• If treatment effects are expected to differ across blocks,
– use ≥ 2 reps of every treatment within each block, and
– consider whether the blocking factor is more appropriately considered to be a treatment factor.

(5) Balance: A balanced experiment is one in which the replication is the same under each level of each experimental factor. E.g., an experiment with three treatment groups each with 10 experimental units is balanced; an experiment with three treatment groups of sizes 2, 18, and 10 is unbalanced.

Occasionally, the response in a certain treatment, or comparisons with a certain treatment, are of particular interest.
If so, then extra replication in that treatment may be justified. Otherwise, it is desirable to achieve as much balance as possible subject to practical constraints. Balance
• increases the power of the experiment, and
• simplifies the statistical analysis.

(6) Limiting Scope/Use of Sequential Experimentation: Large experiments with many factors and many levels of the factors are
– hard to perform,
– hard to analyze, and
– hard to interpret.
• If the effects of several factors are of interest, it is best to do several small experiments and build up to an understanding of the entire system.
• One can either use several factors each at a small number (e.g., 2) of levels, or do sequential experiments each examining a subset of factors (2 or 3).

(7) Adjustment for Covariates: Nuisance variables other than the blocking factors that affect the response are often measured and compensated for in the analysis.
• E.g., measure and adjust for rainfall differences in the statistical analysis (can't block by rainfall).
• Avoids systematic bias.
• Increases the precision of the experiment.

Steps in Experimentation:
• Statement of the objectives
• Identify sources of variation
– Selection of treatment factors and their levels
– Selection of experimental units, measurement units
– Selection of blocking factors, covariates
• Selection of experimental design
• Consideration of the data to be collected (what variables, how measured)
• (Perhaps) run a pilot experiment
• Outline the statistical analysis
• Choice of number of replications
• Conduct the experiment
• Analyze data and interpret results
• Write up a thorough, readable, stand-alone report summarizing the research

Structures of an Experimental Design:

Treatment Structure: The set of treatments, treatment combinations, or populations under study. (The selection and arrangement of treatment factors.)

Design Structure: The way in which experimental units are grouped together into homogeneous units (blocks).

These structures are combined with a method of randomization to create an experimental design.

Types of Treatment Structures:
(1) One-way.
(2) n-way Factorial. Two or more factors combined so that every possible treatment combination occurs (factors are crossed).
(3) n-way Fractional Factorial. A specified fraction of the total number of possible combinations of n treatments occurs (e.g., Latin Square).

Types of Design Structures:
(1) Completely Randomized Designs. All experimental units are considered as a single homogeneous group (no blocks). Treatments are assigned completely at random (with equal probability) to all units (see the sketch below).
(2) Randomized Complete Block Designs. Experimental units are grouped into homogeneous blocks, within which each treatment occurs c times (c = 1, usually).
(3) Incomplete Block Designs. Fewer than the total number of treatments occur in each block.
(4) Latin Square Designs. Blocks are formed in two directions, with n experimental units in each combination of the row and column levels of the blocks.
(5) Nested (Hierarchical) Design Structures. The levels of one blocking factor are superficially similar but not the same for different levels of another blocking factor. For example, suppose we measure tree height growth in 5 plots on each of 3 stands of trees. The plots are nested within stands (rather than crossed with stands). This means that plot 1 in stand 1 is not the same as plot 1 in stand 2.
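To make the completely randomized assignment in (1) concrete, here is a minimal R sketch; the number of units and treatment labels are hypothetical, not from the notes:

```r
# Completely randomized design: 4 treatments (A-D), 5 replicates each,
# assigned with equal probability to 20 experimental units.
set.seed(42)
trts <- rep(c("A", "B", "C", "D"), each = 5)
crd  <- data.frame(unit = 1:20, trt = sample(trts))  # random permutation
crd
```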
Example 4: Latin Square

An investigator is interested in the effects of tire brand (brands A, B, C, D) on tread wear. Suppose that he wants 4 observations per brand. Latin square design:

          Wheel Position
Car     1    2    3    4
 1      A    B    C    D
 2      B    C    D    A
 3      C    D    A    B
 4      D    A    B    C

In this design, 4 replicates are obtained for each treatment without using all possible wheel position × tire brand combinations. (Economical design.)

Treatment Structure:
Design Structure:

Considerations when Designing an Experiment
• The experimental design should give unambiguous answers to the questions of interest.
• The experimental design should be "optimal". That is, it should have more power (sensitivity) and estimate quantities of interest more precisely than other designs.
• The experiment should be of a manageable size. Not too big to perform, analyze, or interpret.
• The objectives of the experiment should be clearly defined.
– What questions are we trying to answer?
– Which questions are most important?
– What populations are we interested in generalizing to?
• Appropriate response and explanatory variables must be determined, and nuisance variables should be identified.
– What levels of the treatment factors will be examined?
– Should there be a control group?
– Which nuisance variables will be measured?
• The practical constraints of the experiment should be identified.
– How much time, money, personnel, and raw material will it take, and how much do I have?
– Is it practical to assign and apply the experimental conditions to experimental units as called for by the experimental design?
• Identify blocking factors.
– In what respects that can be expected to be relevant to the response variable are the experimental units dissimilar?
– Will the experiment be conducted over several days (weeks, seasons, years)? Then block by day (week/season/year).
– Will the experiment involve several experimenters (e.g., several different crews to measure trees, lab technicians, etc.)? Then block by experimenter.
– Will the experiment involve units from different locations, or units that are dispersed geographically/spatially? Then form blocks by location, or as groups of units that are near one another.
• The statistical analysis of the experiment should be planned in detail to meet the objectives of the experiment.
– What model will be used?
– How will nuisance variables be accounted for?
– What hypotheses will be tested?
– What "effects" will be estimated?
• The experimental design should be economical, practical, timely.

For a continuous random variable, the probability that it takes on any particular value is 0, so plotting Pr(Y = y) against y doesn't describe a continuous probability distribution. Instead we use the probability density function f_Y(y), which gives (roughly) the probability that Y takes a value "close to" y.

E.g., assuming Y = adult female height in the U.S. is normally distributed with mean 63.5 in. and s.d. 2.5 in.: [figure omitted: density f_Y(y) of female height, y from 50 to 75 inches]

• Essentially, most statistical problems boil down to questions about what that distribution looks like. The two most important aspects of what the distribution looks like are
1. where it is located (mean, median, mode, etc.), and
2. how spread out it is (variance, std. dev., range, etc.).

The mean, or expected value (often written as µ), of a probability distribution is the mean of a random variable with that distribution, taken over the entire population.

The variance (often written as σ²) of a probability distribution is the variance of a r.v. with that distribution, taken over the entire population.
• The population standard deviation σ is simply the (positive) square root of the population variance: σ = √σ².

A few useful facts about expected values and population variances: Let c represent some constant, and let X, Y denote two random variables with population means µ_X, µ_Y and population variances σ²_X, σ²_Y, respectively. Then:
1. E(c) = c.
2. E(cY) = cE(Y) = cµ_Y.
3. E(X + Y) = E(X) + E(Y) = µ_X + µ_Y.
4. E(XY) = E(X)E(Y) if (and only if) X and Y are statistically independent (i.e., knowing something about X doesn't tell you anything about Y).
5. var(c) = 0.
6. var(cX) = c²var(X) = c²σ²_X. Notice that this implies Pop.S.D.(cX) = |c| Pop.S.D.(X) = |c|σ_X.
7. var(X ± Y) = var(X) + var(Y) ± 2cov(X, Y), where cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] is the covariance between X and Y, and measures the strength of the linear association between them. Note that
• cov(X, Y) = cov(Y, X),
• cov(cX, Y) = c·cov(X, Y), and
• if X, Y are independent, then cov(X, Y) = 0.
8. This last property implies that var(X ± Y) = var(X) + var(Y) if X, Y are independent of each other.

To find out about a probability distribution for Y (how Y behaves over the entire population), we typically examine behavior in a randomly selected subset of the population — a sample: (y₁, ..., y_n). We can compute sample quantities corresponding to the population quantities of interest:

Sample mean: $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ estimates the population mean µ_Y.

Sample variance: $s_Y^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2$ estimates σ²_Y.

Digression on Notation: Recall the summation notation: $\sum_{i=1}^k x_i$ is shorthand for x₁ + x₂ + ··· + x_k. Often we'll use a double sum when dealing with a variable that has two indexes. E.g.,

Subject   Trt 1   Trt 2   Trt 3
   1       y₁₁     y₂₁     y₃₁
   2       y₁₂     y₂₂     y₃₂
  ...      ...     ...     ...
   n       y₁ₙ     y₂ₙ     y₃ₙ

Here, y_ij denotes the response of subject j in group i, in an experiment in which each of n subjects receives all three treatments.
• This is an example of what's known as a crossover design.

The Central Limit Theorem is the main reason that the normal distribution plays such a major role in statistics. In plain English, the CLT says (roughly speaking) that any random variable that can be computed as a sum or mean of n independent, identically distributed (or iid, for short) random variables (e.g., measurements taken on a simple random sample) will have a distribution that is well approximated by the normal distribution as long as n is large.

How large must n be? It depends on the problem (on the distribution of the Y's).

Importance?
• Many sample quantities of interest (including means, totals, proportions, etc.) are sums of iid random variables, so the CLT tells us that these quantities are approximately normally distributed.
• Furthermore, the iid part is often not strictly necessary, so there are a great many settings in which random variables that are sums or means of other random variables are approximately normally distributed.
• The CLT also suggests that many elementary random variables themselves (i.e., measurements on experimental units or sample members, rather than sums or means of such measurements) will be approximately normal, because it is often plausible that the uncertainty in a random variable arises as the sum of several (often many) independent random quantities.
– E.g., reaction time depends upon the amount of sleep last night, whether you had coffee this morning, whether you happened to blink, etc.
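The CLT is easy to see by simulation. A minimal R sketch; the sample size and the (skewed) exponential population are arbitrary choices for illustration, not from the notes:

```r
# Means of n iid draws from a skewed Exp(1) population are approximately
# normal for moderately large n; Exp(1) has mean 1 and sd 1.
set.seed(1)
n <- 30
means <- replicate(10000, mean(rexp(n, rate = 1)))

hist(means, breaks = 50, freq = FALSE,
     main = "Means of 30 iid Exp(1) draws")
curve(dnorm(x, mean = 1, sd = 1/sqrt(n)), add = TRUE)  # CLT approximation
```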
The Chi-square Distribution

If $Y_1, \ldots, Y_n \overset{iid}{\sim} N(0,1)$, then $X = Y_1^2 + \cdots + Y_n^2$ has a (central) chi-square distribution with d.f. = n, the number of squared normals that we summed up. To denote this we write X ∼ χ²(n).

Important Result: If $Y_1, \ldots, Y_n \sim N(\mu, \sigma^2)$, then
$$\frac{SS}{\sigma^2} = \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{\sigma^2} \sim \chi^2(n-1).$$
• Here, SS stands for "sum of squares" or, more precisely, "sum of squared deviations from the mean."

(Student's) t Distribution

If Z ∼ N(0,1), X ∼ χ²(n), and Z and X are independent, then the r.v.
$$t = \frac{Z}{\sqrt{X/n}}$$
has a (central) t distribution with d.f. = n, and we write t ∼ t(n).

F Distribution

If X₁ ∼ χ²(n₁), X₂ ∼ χ²(n₂), and X₁ and X₂ are independent, then the r.v.
$$F = \frac{X_1/n_1}{X_2/n_2}$$
has a (central) F distribution with n₁ numerator d.f. and n₂ denominator d.f. We write F ∼ F(n₁, n₂).

Note that the square of a random variable with a t(n) distribution has an F(1, n) distribution: if $t = Z/\sqrt{X/n} \sim t(n)$, then $t^2 = \frac{Z^2/1}{X/n} \sim F(1, n)$.

Cases:

Case 1: σ₁², σ₂² both known (may or may not be equal). For example, we happen to know that the variance of words/minute for Empl 1 is σ₁² = 121, and the variance of words/minute for Empl 2 is σ₂² = 81. Then
$$\mathrm{var}(\bar{y}_1 - \bar{y}_2) = \frac{121}{10} + \frac{81}{10} \ \text{(there's nothing to estimate)} \quad\Rightarrow\quad \mathrm{s.e.}(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{121 + 81}{10}},$$
from which we obtain our test statistic:
$$z = \frac{\bar{y}_1 - \bar{y}_2}{\mathrm{s.e.}(\bar{y}_1 - \bar{y}_2)} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} = \frac{65 - 60}{\sqrt{(121+81)/10}} = 1.11.$$
The p-value for this test tells us how much evidence our result provides against the null hypothesis: p = the probability of getting a result at least as extreme as the one obtained.
$$p = \Pr(z \le -1.11) + \Pr(z \ge 1.11) = 2\Pr(z \ge 1.11) = .267$$
• Comparison of the p-value against a pre-selected significance level α (.05, say) leads to a conclusion. In this case, since .267 > .05, we fail to reject H₀ at significance level .05. (There's insufficient evidence to conclude that the mean typing speeds of the two workers differ.)

Case 2: σ₁², σ₂² unknown, but assumed equal (σ₁² = σ₂² = σ², say).
$$\mathrm{var}(\bar{y}_1 - \bar{y}_2) = \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$
Two possible estimators come to mind: s₁² = the sample variance from the 1st sample, and s₂² = the sample variance from the 2nd sample. Under the assumption that σ₁² = σ₂² = σ², both are estimators of the same quantity σ², each based on only a portion of the total number of relevant observations available. Better idea: combine these two estimators by taking their (weighted) average:
$$\hat{\sigma}^2 = s_P^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
$$\Rightarrow\ \mathrm{s.e.}(\bar{y}_1 - \bar{y}_2) = \sqrt{s_P^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \quad\Rightarrow\quad t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_P^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim t(n_1 + n_2 - 2)$$
Suppose we calculate s₁² = 110.2, s₂² = 89.4 from the data. Then
$$s_P^2 = \frac{(10-1)110.2 + (10-1)89.4}{10 + 10 - 2} = 99.8, \qquad t = \frac{65 - 60}{\sqrt{99.8\left(\frac{1}{10} + \frac{1}{10}\right)}} = 1.12$$
$$\Rightarrow\ p = \Pr(t_{18} \le -1.12) + \Pr(t_{18} \ge 1.12) = 2\Pr(t_{18} \ge 1.12) = .277$$

Case 3: σ₁², σ₂² both unknown, but assumed different.
$$\mathrm{var}(\bar{y}_1 - \bar{y}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \quad\Rightarrow\quad \mathrm{s.e.}(\bar{y}_1 - \bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \quad\Rightarrow\quad t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \,\dot\sim\, t(\nu),$$
where
$$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} \quad \text{(round down to the nearest integer)}.$$
• Here, $\dot\sim$ means "is approximately distributed as".

For our results,
$$t = \frac{65 - 60}{\sqrt{\frac{110.2}{10} + \frac{89.4}{10}}} = 1.12, \qquad \nu = \frac{\left(\frac{110.2}{10} + \frac{89.4}{10}\right)^2}{\frac{(110.2/10)^2}{10-1} + \frac{(89.4/10)^2}{10-1}} = 17.8$$
$$\Rightarrow\ p \approx \Pr(t_{17} \le -1.12) + \Pr(t_{17} \ge 1.12) = 2\Pr(t_{17} \ge 1.12) = .278$$
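Cases 2 and 3 are easy to reproduce numerically. A minimal R sketch working from the summary statistics above (with raw data, `t.test(y1, y2, var.equal = TRUE)` and `t.test(y1, y2)` give the same two tests):

```r
# Summary statistics from the typing-speed example
ybar1 <- 65; ybar2 <- 60; s1sq <- 110.2; s2sq <- 89.4; n1 <- 10; n2 <- 10

# Case 2: pooled-variance t test
sPsq <- ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)  # 99.8
t2   <- (ybar1 - ybar2) / sqrt(sPsq * (1/n1 + 1/n2))         # 1.12
p2   <- 2 * pt(-abs(t2), df = n1 + n2 - 2)                   # .277

# Case 3: Welch (unequal-variance) t test with Satterthwaite d.f.
se3 <- sqrt(s1sq/n1 + s2sq/n2)
t3  <- (ybar1 - ybar2) / se3                                 # 1.12
nu  <- se3^4 / ((s1sq/n1)^2/(n1 - 1) + (s2sq/n2)^2/(n2 - 1)) # 17.8
p3  <- 2 * pt(-abs(t3), df = nu)   # .278 (tables round nu down to 17)
```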
Case 2 (paired data): σ²_d unknown, so it must be estimated.
$$\mathrm{var}(\bar{d}) = \frac{\sigma_d^2}{n}$$
• We estimate σ²_d with s²_d, the sample variance of the dᵢ's.
$$\Rightarrow\ \mathrm{s.e.}(\bar{d}) = \sqrt{\frac{s_d^2}{n}} \quad\Rightarrow\quad t = \frac{\bar{d}}{\sqrt{s_d^2/n}} \sim t(n-1)$$
Suppose we get the following data:

Letter (i)   Empl 1   Empl 2   dᵢ
    1          72       61     11
    2          78       69      9
   ...        ...      ...    ...
   10          74       61     13

with d̄ = 5 and s²_d = 76.2. Then
$$t = \frac{5}{\sqrt{76.2/10}} = 1.81 \quad\Rightarrow\quad p = 2\Pr(t_9 > 1.81) = 0.104.$$

The One-way Layout

An Example: Gasoline additives and octane. Suppose that the effect of a gasoline additive on octane is of interest. An investigator obtains 20 one-liter samples of gasoline and randomly divides these samples into 5 groups of 4 samples each. The groups are assigned to receive 0, 1, 2, 3, or 4 cc/liter of additive, and octane measurements are made. The resulting data are as follows:

Treatment     Observations
A (0 cc/l)    91.7  91.2  90.9  90.6
B (1 cc/l)    91.7  91.9  90.9  90.9
C (2 cc/l)    92.4  91.2  91.6  91.0
D (3 cc/l)    91.8  92.2  92.0  91.4
E (4 cc/l)    93.1  92.9  92.4  92.4

• This is an example of a one-way (single factor) layout (design).
– Comparisons among the mean octane levels in the five groups are of interest.
– The analysis is a generalization of the two-sample t-test of equality of means.

In general, we have a single treatment factor with a ≥ 2 (5 in the example) levels (treatments), and nᵢ (n₁ = ··· = n₅ = 4 in the example) replicates for each treatment.

Data: y_ij, i = 1, ..., a, j = 1, ..., nᵢ.

Model: y_ij = µᵢ + e_ij   (means model)

Three assumptions are commonly made about the e_ij's:
(1) e_ij, i = 1, ..., a, j = 1, ..., nᵢ, are independent;
(2) e_ij, i = 1, ..., a, j = 1, ..., nᵢ, are identically distributed with mean 0 and variance σ² (all have the same variance);
(3) e_ij, i = 1, ..., a, j = 1, ..., nᵢ, are normally distributed.

Alternative, equivalent form of the model:
$$y_{ij} = \mu + \alpha_i + e_{ij}, \quad \text{where } \sum_{i=1}^a \alpha_i = 0 \qquad \text{(effects model)}$$
• Here, the restriction $\sum_{i=1}^a \alpha_i = 0$ is part of the model. This restriction gives the parameters the following interpretations:
µ: the grand mean, averaging across all treatment groups;
αᵢ: the treatment effect (deviation up or down from the grand mean) of the ith treatment.
• The relationship between the parameters of the cell means model (the µᵢ's) and the parameters of the effects model (µ and the αᵢ's) is simply µᵢ = µ + αᵢ.

Technical points:
• The restriction Σᵢ αᵢ = 0 is not strictly necessary in the effects model. Even without it, the effects model is equivalent to the means model. However, without the restriction, the effects model is overparameterized, and that causes some technical complications in the use of the model. In addition, without the sum-to-zero restriction, the parameters of the effects model don't have the nice interpretations that I've given above.
• It is also possible to use the restriction Σᵢ nᵢαᵢ = 0 (as in our book), or any one of a large number of other restrictions, instead of Σᵢ αᵢ = 0. Under the restriction Σᵢ nᵢαᵢ = 0, µ has a slightly different interpretation: it represents a weighted average (weighted by sample size) of the treatment means rather than a simple average.

Minimizing the least squares criterion L is a calculus problem. The way to do it is to take derivatives of L, set them equal to zero, and solve the resulting equations, which are called the normal equations. The µ̂ᵢ's that solve the normal equations are
$$\hat{\mu}_1 = \bar{y}_{1\cdot}, \quad \hat{\mu}_2 = \bar{y}_{2\cdot}, \quad \ldots, \quad \hat{\mu}_a = \bar{y}_{a\cdot}, \qquad \text{or, in general, } \hat{\mu}_i = \bar{y}_{i\cdot}.$$
• That is, we estimate the population mean for the ith treatment with the sample mean of the data in our experiment from the ith treatment. Simple!
In the effects version of the model, y_ij = µ + αᵢ + e_ij, so the least squares criterion becomes
$$L = \sum_{i=1}^a \sum_{j=1}^{n_i} e_{ij}^2 = \sum_{i=1}^a \sum_{j=1}^{n_i} \{y_{ij} - (\mu + \alpha_i)\}^2,$$
which, along with the restriction† Σᵢ αᵢ = 0, leads to the estimators
$$\hat{\mu} = \frac{1}{a}(\bar{y}_{1\cdot} + \cdots + \bar{y}_{a\cdot}), \qquad \hat{\alpha}_i = \bar{y}_{i\cdot} - \frac{1}{a}(\bar{y}_{1\cdot} + \cdots + \bar{y}_{a\cdot}), \quad i = 1, \ldots, a.$$
• In the balanced case, these estimators simplify to
$$\hat{\mu} = \bar{y}_{\cdot\cdot}, \qquad \hat{\alpha}_i = \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}, \qquad \hat{\mu}_i = \bar{y}_{i\cdot}. \qquad (*)$$

Note that there is no disagreement between the cell means and effects versions of the model. They are completely consistent. The cell means model says that the data from the ith treatment have mean µᵢ, and the effects model just breaks that µᵢ up into two pieces: µᵢ = µ + αᵢ. The consistency of the two model versions can be seen in that the above relationship holds for the parameter estimators too:
$$\hat{\mu}_i = \bar{y}_{i\cdot} = \hat{\mu} + \hat{\alpha}_i$$

† Under the alternative restriction Σᵢ nᵢαᵢ = 0, we get the estimators given in (*) in both the balanced and unbalanced cases.

Example: Gasoline Additives (Continued)

                                              Total    Mean
i   Treatment   y_i1   y_i2   y_i3   y_i4     y_i·     ȳ_i·
1      A        91.7   91.2   90.9   90.6    364.4    91.10
2      B        91.7   91.9   90.9   90.9    365.4    91.35
3      C        92.4   91.2   91.6   91.0    366.2    91.55
4      D        91.8   92.2   92.0   91.4    367.4    91.85
5      E        93.1   92.9   92.4   92.4    370.8    92.70

So the parameter estimates from the cell means model are µ̂₁ = ȳ₁· = 91.10, µ̂₂ = ȳ₂· = 91.35, ..., µ̂₅ = ȳ₅· = 92.70. Or, if we prefer the effects version of the model, y·· = 364.4 + ··· + 370.8 = 1834.2, so

µ̂ = ȳ·· = 1834.2/20 = 91.71
α̂₁ = 91.10 − 91.71 = −0.61
α̂₂ = 91.35 − 91.71 = −0.36
α̂₃ = 91.55 − 91.71 = −0.16
α̂₄ = 91.85 − 91.71 = 0.14
α̂₅ = 92.70 − 91.71 = 0.99

• Under assumptions (1) and (2) on the e_ij's, the method of least squares gives the Best (minimum variance) Linear Unbiased Estimators (BLUE) of the parameters.
– The point being that we use least squares to fit the model not just because it makes sense intuitively; there is also theory establishing that it is an optimal approach in some well-defined sense.

There's one more parameter of the model: σ², the error variance. How do we estimate σ²? The value of the least squares criterion L when evaluated at our fitted model (what we get when we plug in our parameter estimates) is a measure of how well our model fits the data (it's the sum of squared differences between the actual and fitted values):
$$L_{min} = \sum_i \sum_j (y_{ij} - \hat{\mu}_i)^2 = \sum_i \sum_j (y_{ij} - \bar{y}_{i\cdot})^2 \equiv SS_E, \quad \text{the Sum of Squares due to Error.}$$
The Mean Square due to Error is defined to be
$$MS_E = \frac{SS_E}{N - a}$$
and has expected value
$$E(MS_E) = \frac{E(SS_E)}{N - a} = \frac{\sigma^2(N - a)}{N - a} = \sigma^2. \qquad (**)$$
• That is, MS_E is an unbiased estimator of σ², and therefore, to complete the process of "fitting the model", we take as our estimator of the error variance σ̂² = MS_E.
• The divisor in MS_E (in this case N − a) is called the error degrees of freedom, or d.f._E.
• MS_E is analogous to s²_P in the t-test with equal but unknown variances. It is a pooled estimate of σ².
• Note that the second equality in (**) follows from the fact that SS_E/σ² ∼ χ²(N − a), where N = n₁ + ··· + n_a. This was a result we noted earlier in the notes, when we introduced the chi-square distribution.
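These estimates are straightforward to reproduce in R. A minimal sketch for the gasoline data (the variable names are my own):

```r
# Gasoline additive data: 5 treatments, 4 replicates each
octane <- c(91.7, 91.2, 90.9, 90.6,   # A (0 cc/l)
            91.7, 91.9, 90.9, 90.9,   # B (1 cc/l)
            92.4, 91.2, 91.6, 91.0,   # C (2 cc/l)
            91.8, 92.2, 92.0, 91.4,   # D (3 cc/l)
            93.1, 92.9, 92.4, 92.4)   # E (4 cc/l)
trt <- factor(rep(c("A", "B", "C", "D", "E"), each = 4))

tapply(octane, trt, mean)  # cell means: 91.10, 91.35, 91.55, 91.85, 92.70

fit <- aov(octane ~ trt)
summary(fit)  # should reproduce SS_Trt = 6.108, SS_E = 3.370, MS_E = 0.2247
```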
How does the decomposition of SS_T into SS_Trt + SS_E help us test for differences in the treatment means? We've seen that
$$E(MS_E) = E\left(\frac{SS_E}{d.f._E}\right) = \sigma^2,$$
i.e., MS_E estimates σ². In addition, when H₀ is true, it can be shown that MS_Trt = SS_Trt/d.f._Trt also estimates σ². Therefore, if H₀ is true, MS_Trt/MS_E should be ≈ 1. However, when H₀ is false, it can be shown that MS_Trt estimates something bigger than σ².

Therefore, to determine whether H₀ is true or not, we can look at how much larger than 1 MS_Trt/MS_E is. This ratio of mean squares becomes our test statistic for H₀. That is, by some tedious calculations using the rules for how expectations work (listed earlier), it can be shown that
$$E(MS_{Trt}) = \sigma^2 + \frac{\sum_i n_i \alpha_i^2}{a - 1} \ \text{in general}, \qquad = \sigma^2 \ \text{if } H_0 \text{ is true}.$$
Therefore, MS_Trt/MS_E is the ratio of an estimator of something larger than σ² to an estimator of σ² if H₀ is false, and the ratio of two estimators of σ² if H₀ is true.
• If MS_Trt/MS_E ≫ 1, then it makes sense to reject H₀.

How much larger than 1 should MS_Trt/MS_E be to reject H₀?
• It should be large in comparison with its distribution under H₀.
• Notice that MS_Trt/MS_E can be written as
$$\frac{MS_{Trt}}{MS_E} = \frac{SS_{Trt}/d.f._{Trt}}{SS_E/d.f._E},$$
the ratio of two chi-square distributed sums of squares, each divided by its d.f. It can also be shown that SS_Trt and SS_E are independent. Therefore, under H₀,
$$F = \frac{MS_{Trt}}{MS_E} \sim F(a - 1, N - a),$$
and our test statistic becomes an F-test. We reject H₀ for large values of F in comparison to an F(a − 1, N − a) distribution.

Result: An α-level test of H₀: α₁ = ··· = α_a = 0 is: reject H₀ if F > F_α(a − 1, N − a).
• Reporting simply the test result (reject/not reject) is not as informative as reporting the p-value. The p-value quantifies the strength of the evidence provided by the data against the null hypothesis, not just whether that evidence was sufficient to reject.

The test procedure may be summarized in an ANOVA table:

Source of    Sum of    d.f.    Mean      E(MS)                      F
Variation    Squares           Squares
Treatments   SS_Trt    a − 1   MS_Trt    σ² + Σᵢnᵢαᵢ²/(a − 1)       MS_Trt/MS_E
Error        SS_E      N − a   MS_E      σ²
Total        SS_T      N − 1

A Note on Computations: We have defined SS_T, SS_Trt, and SS_E as sums of squared deviations. Equivalent formulas for SS_T and SS_Trt are
$$SS_T = \sum_{i=1}^a \sum_{j=1}^{n_i} y_{ij}^2 - \frac{y_{\cdot\cdot}^2}{N}, \qquad SS_{Trt} = \sum_{i=1}^a \frac{y_{i\cdot}^2}{n_i} - \frac{y_{\cdot\cdot}^2}{N}.$$
SS_E is typically computed by subtraction: SS_E = SS_T − SS_Trt.

Gasoline Additive Example (Continued):
• See the handout labeled gasadd1.sas.
$$SS_{Trt} = \frac{(364.4)^2 + (365.4)^2 + (366.2)^2 + (367.4)^2 + (370.8)^2}{4} - \frac{(1834.2)^2}{20} = 6.108$$
Similarly, SS_T = 9.478 and SS_E = 3.370.

For example, the one-way ANOVA F test compares

full model: y_ij = µ + αᵢ + e_ij
partial model: y_ij = µ + e_ij

In this case, it is easy to show mathematically that
$$F = \frac{SS_{H_0}/d.f._{H_0}}{MS_E(\text{full model})} = \frac{MS_{Trt}(\text{full})}{MS_E(\text{full})}.$$
The equivalence can be demonstrated using our Gasoline Additives Example.
• See gasadd1.R, where we use the anova() function to obtain a test of nested models. Notice that this gives a result identical to the one given in the ANOVA table for the one-way model (the full model). A sketch of the same comparison appears below.
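In the same spirit as gasadd1.R (which is not reproduced here), a minimal sketch of the nested-model comparison, reusing `octane` and `trt` from the earlier code block:

```r
full    <- lm(octane ~ trt)  # y_ij = mu + alpha_i + e_ij
partial <- lm(octane ~ 1)    # y_ij = mu + e_ij

# F test of the nested models; the F and p-value match the treatment
# line of the one-way ANOVA table for the full model.
anova(partial, full)
anova(full)
```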
Comparisons (Contrasts) among Treatment Means

Once H₀ is rejected, we usually want more information: which µᵢ's differ, and by how much? To answer these questions we can make
(1) a priori (planned) comparisons; or
(2) data-based (unplanned, a posteriori, or post hoc) comparisons (a.k.a. data snooping).

Ideally, we avoid (2) altogether. The experiment should be thought out well enough that all hypotheses of interest take the form of planned comparisons. But comparisons of type (2) are sometimes necessary, particularly in preliminary or exploratory studies.

It is important to understand the multiple comparisons problem inherent in doing multiple comparisons of either type, but especially of type (2).
• When performing a single statistical hypothesis test, we try to avoid incorrectly rejecting the null hypothesis (a "Type I Error") for that test by setting the probability of such an error to be low.
– This probability is called the significance level, α, and we typically set it to be α = .05 or some other small value.
• This approach controls the probability of a Type I error on that one test.
• However, when we conduct multiple hypothesis tests, the probability of making at least one Type I error increases the more tests we perform.
– The more chances to make a mistake you have, the more likely it is that you will make one.
• The problem is exacerbated when doing post hoc tests, because post hoc hypotheses are typically chosen by examining the data, doing multiple (many, usually) informal (perhaps even subconscious) comparisons, and deciding to do formal hypothesis tests for those comparisons that look "promising".
– That is, even just a single post hoc hypothesis test really involves many implicit comparisons, which inflates its Type I error probability.
• We'll return to the multiple comparisons problem and statistical methods to help "solve" it later.

Contrasts: A contrast takes the form
$$\psi = \sum_{i=1}^a c_i \mu_i, \quad \text{where } \sum_i c_i = 0,$$
and is estimated by $C = \sum_i c_i \hat{\mu}_i = \sum_i c_i \bar{y}_{i\cdot}$.

For example, suppose we have three treatments with population means µ₁, µ₂, µ₃ that we estimate with the corresponding sample means ȳ₁·, ȳ₂·, ȳ₃·. A contrast among the treatment population means is a linear combination of the form
$$\psi = c_1\mu_1 + c_2\mu_2 + c_3\mu_3 \quad \text{such that} \quad c_1 + c_2 + c_3 = 0.$$
Simplest example: the pairwise contrast µᵢ − µᵢ′. In our three-treatment example we could compare treatments 1 and 3 with the pairwise contrast
$$\psi = 1\mu_1 + 0\mu_2 + (-1)\mu_3 = \mu_1 - \mu_3, \quad \text{which we estimate with } C = \bar{y}_{1\cdot} - \bar{y}_{3\cdot}.$$

Scab disease example: The experiment was conducted in a completely randomized design in which 32 plots were randomly assigned to the 7 treatments, so that 8 replicates were obtained in the control treatment and 4 replicates in each other treatment. The questions of interest here were:
(1) Is there an effect of treating the soil with sulphur?
(2) What time of year should the soil be treated?
(3) What is the effect of dose?

These questions can be answered through the use of planned comparisons. Each question corresponds to a hypothesis that a contrast of the form c₁µ₁ + ··· + c₇µ₇ is equal to 0. Appropriate choices for these contrasts are as follows:
(1) ψ₁ = 6µ₁ − µ₂ − µ₃ − µ₄ − µ₅ − µ₆ − µ₇   (c₁ = 6, c₂ = ··· = c₇ = −1)
(2) ψ₂ = µ₂ + µ₃ + µ₄ − µ₅ − µ₆ − µ₇   (c₁ = 0, c₂ = c₃ = c₄ = 1, c₅ = c₆ = c₇ = −1)
(3) Two contrasts:
(a) ψ₃ = 2µ₂ − µ₃ − µ₄ + 2µ₅ − µ₆ − µ₇   (300 vs. 600 & 1200)
(b) ψ₄ = −µ₃ + µ₄ − µ₆ + µ₇   (600 vs. 1200)
• These two contrasts can be tested separately or simultaneously.

We can also determine whether or not there is an interaction between dose and time of application by testing the contrasts formed by multiplying the contrast coefficients in (2) and (3):
(4) Interaction:
(a) ψ₅ = 2µ₂ − µ₃ − µ₄ − 2µ₅ + µ₆ + µ₇
(b) ψ₆ = −µ₃ + µ₄ + µ₆ − µ₇
• Again, these two contrasts can be tested separately or simultaneously.

Computations:
$$C_1 = 6\bar{y}_{1\cdot} - \bar{y}_{2\cdot} - \bar{y}_{3\cdot} - \bar{y}_{4\cdot} - \bar{y}_{5\cdot} - \bar{y}_{6\cdot} - \bar{y}_{7\cdot} = 6(22.625) - 9.50 - 15.50 - 5.75 - 16.75 - 18.25 - 14.25 = 55.75$$
$$\widehat{\mathrm{var}}(C_1) = MS_E \sum_i \frac{c_i^2}{n_i} = 44.9\left[\frac{6^2}{8} + \frac{(-1)^2}{4} + \frac{(-1)^2}{4} + \frac{(-1)^2}{4} + \frac{(-1)^2}{4} + \frac{(-1)^2}{4} + \frac{(-1)^2}{4}\right] = 44.9(6) = 269.4,$$
so
$$t = \frac{C_1 - 0}{\mathrm{s.e.}(C_1)} = \frac{55.75}{\sqrt{269.4}} = 3.40.$$
Since $t_{.05/2}(32 - 7) = 2.060$, we reject H₀: ψ₁ = 0 at α = .05 and conclude that there is a difference in the mean scab index between the control and active treatments. (Adding sulphur helps reduce scab disease.)

Alternatively, we could use the equivalent F test:
$$SS_{C_1} = \frac{C_1^2}{\sum_i c_i^2/n_i} = \frac{55.75^2}{6} = 518.010, \qquad F = \frac{MS_{C_1}}{MS_E} = \frac{518.010/1}{44.915} = 11.53.$$
• The p-value is the same with either test: p = .0023.
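A minimal R sketch of the C₁ computation, working from the summary statistics above (treatment means, sample sizes, and MS_E as given in the notes; the full analysis is in scab1.sas):

```r
# Scab disease example: control vs. the six active treatments
ybar <- c(22.625, 9.50, 15.50, 5.75, 16.75, 18.25, 14.25)
n    <- c(8, 4, 4, 4, 4, 4, 4)
MSE  <- 44.915   # error mean square on N - a = 32 - 7 = 25 d.f.
cc   <- c(6, -1, -1, -1, -1, -1, -1)

C1  <- sum(cc * ybar)             # 55.75
se  <- sqrt(MSE * sum(cc^2 / n))  # sqrt of about 269.5
tC1 <- C1 / se                    # about 3.40
2 * pt(-abs(tC1), df = 25)        # p about .0023

# Equivalent F test on (1, 25) d.f.
SSC1 <- C1^2 / sum(cc^2 / n)      # about 518.0
SSC1 / MSE                        # F about 11.53
```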
The CONTRAST and ESTIMATE statements in PROC GLM

• See the handout scab1.sas.

In PROC GLM in SAS, a constant term is always included in all models unless you specify otherwise by using the NOINT option on the MODEL statement. Therefore, by default, ANOVA models are parameterized as effects models. E.g., the one-way ANOVA model is parameterized as y_ij = µ + αᵢ + e_ij rather than y_ij = µᵢ + e_ij.

A contrast is a linear combination of the model parameters µ, α₁, α₂, ..., α_a. The MODEL statement here would be

model y=A;

where A is the treatment factor with a levels corresponding to the effects α₁, ..., α_a (A must be specified as a factor by including A in a CLASS statement). The syntax of the CONTRAST statement is

contrast 'contrast label' intercept c0 A c1 c2 ... ca;

Here, c0, c1, ..., ca are the contrast coefficients corresponding to µ, α₁, ..., α_a, respectively. "intercept" indicates that the next coefficient will be for µ, and "A" indicates that the next coefficients will be for α₁, ..., α_a.

• Note that, by default, the levels of factor A are ordered alphabetically. This ordering can be changed with the ORDER= option on the PROC GLM statement. ORDER=data orders the levels as they appear in the data set. The ordering of the factor levels is important, because the contrast coefficients are matched to the levels of the factor in the order that SAS is currently using. The factor-level ordering used can be seen in the summary of the CLASS variables that appears at the beginning of the PROC GLM output.
• SAS allows you to omit terms from the CONTRAST statement. How it fills in the contrast coefficients for omitted terms is complicated in general, but for one-way ANOVA models, omitted coefficients are assumed to equal 0.
– E.g., the following three CONTRAST statements for the scab disease example are equivalent. Each one tests µ₂ − µ₃ = 0:

CONTRAST 'mu2-mu3 (a)' intercept 0 trt 0 1 -1 0 0 0 0;
CONTRAST 'mu2-mu3 (b)' trt 0 1 -1 0 0 0 0;
CONTRAST 'mu2-mu3 (c)' trt 0 1 -1;

Orthogonal Contrasts

Two contrasts ψ₁ = Σᵢ c₁ᵢµᵢ and ψ₂ = Σᵢ c₂ᵢµᵢ are orthogonal if
$$\sum_i c_{1i} c_{2i} / n_i = 0 \qquad \left(\sum_i c_{1i} c_{2i} = 0 \text{ in the balanced case}\right).$$
• Sample versions of orthogonal contrasts are independent. Consider the balanced case:
$$\mathrm{cov}\!\left(\sum_i c_{1i}\bar{y}_{i\cdot},\ \sum_j c_{2j}\bar{y}_{j\cdot}\right) = \sum_i \sum_j c_{1i}c_{2j}\,\mathrm{cov}(\bar{y}_{i\cdot}, \bar{y}_{j\cdot}) = \sum_i c_{1i}c_{2i}\,\mathrm{var}(\bar{y}_{i\cdot}) = \frac{\sigma^2}{n}\sum_i c_{1i}c_{2i} = 0.$$
• The interpretation of this independence is that orthogonal contrasts correspond to distinct, non-redundant comparisons among the treatment means. When we test hypotheses on each of several contrasts that are mutually orthogonal, we are asking a set of "non-overlapping" or "non-redundant" questions in some sense.
• At most a − 1 mutually orthogonal contrasts can be constructed on a means.
• Orthogonal contrasts can be used to partition the treatment SS into a − 1 independent components, each with d.f. = 1:
$$SS_{Trt} = SS_{C_1} + SS_{C_2} + \cdots + SS_{C_{a-1}}$$
for any set {C₁, ..., C_{a−1}} of mutually orthogonal sample contrasts.

Example: In a balanced experiment comparing the mean response from 2 drugs (µ₁, µ₂) and a placebo (µ₃), a natural set of orthogonal contrasts is
$$\psi_1 = \mu_1 - \mu_2, \qquad \psi_2 = \mu_1 + \mu_2 - 2\mu_3.$$
ψ₁ and ψ₂ are orthogonal since
$$\sum_i c_{1i}c_{2i} = (1)(1) + (-1)(1) + (0)(-2) = 0.$$
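A minimal R sketch checking orthogonality and the SS partition numerically, reusing `octane` and `trt` from the gasoline example (the particular orthogonal set, the polynomial contrasts, is my own choice; any mutually orthogonal set partitions SS_Trt the same way):

```r
# Four mutually orthogonal contrasts on a = 5 balanced treatment means
cmat <- contr.poly(5)        # columns: linear, quadratic, cubic, quartic
round(crossprod(cmat), 10)   # off-diagonal zeros => mutually orthogonal

ybars <- tapply(octane, trt, mean)
n     <- 4
SSC   <- n * (t(cmat) %*% ybars)^2 / colSums(cmat^2)  # one SS per contrast
sum(SSC)                     # equals SS_Trt = 6.108
```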
Scab Disease Example: The 7 treatments can be represented by a tree diagram. [tree diagram omitted: it splits the control (A) from the active treatments (B–G), then by time of application, then by dose, corresponding to ψ₁ through ψ₄.] If we construct contrasts between the branches at the same level, then these contrasts will be orthogonal.

Two contrasts that are not orthogonal are
$$\psi_1 = \mu_1 - \tfrac{1}{6}(\mu_2 + \mu_3 + \mu_4 + \mu_5 + \mu_6 + \mu_7) \quad \text{and} \quad \psi_2 = \mu_1 - \tfrac{1}{5}(\mu_2 + \mu_3 + \mu_4 + \mu_5 + \mu_6).$$
• It is clear that the sample versions of these contrasts will not be independent, and that there is some redundancy in asking the questions (i) does ψ₁ = 0? and (ii) does ψ₂ = 0? Clearly, if we reject (i), we are more likely to reject (ii), and vice versa.

Orthogonal Polynomials — Refer to the Gasoline Example

Orthogonal polynomials are a specific type of orthogonal contrast, useful when a treatment factor is quantitative. They are especially convenient and easy to use when the treatments are evenly spaced and replication is balanced (e.g., the gasoline example).

Suppose we plot mean octane vs. amount of additive. Do we have a straight-line relationship? If so, is the slope equal to 0? [figure omitted: mean octane (about 91.1 to 92.7) plotted against additive (0 to 4 cc/l)]

Typically, it is of interest to address the question of whether the relationship between the mean of y and the quantitative factor is linear or not. That is usually where the use of orthogonal polynomials ends. But occasionally, if the relationship is not linear, it may be of interest to check whether or not the relationship is quadratic. That is, if the relationship is nonlinear, we may want to proceed and compare MS_Cquad/MS_E to F(1, N − a). If MS_Cquad/MS_E > F_α(1, N − a), then the β₀, β₁, and β₂ terms belong in the model. To determine whether higher-order terms are also necessary, we test for lack of fit based on the second-degree (quadratic) model. That is, we would then need to test H₀: ψ_cubic = ··· = ψ_{a−1} = 0 (the polynomial contrasts of order 3 through a − 1).
• We can continue in this manner if necessary to check for cubic, quartic, etc. relationships.
• Caveat: But remember, two points determine a line, three points determine a quadratic curve, four points determine a cubic curve, etc. That means that when a = 2, a straight-line relationship is guaranteed to hold just as an artifact of studying only two levels of the factor! That doesn't mean the true relationship is linear. Similarly, the means for a three-level quantitative factor (a = 3) are guaranteed to be perfectly fit by a quadratic curve. Etc.
– So relationships found via orthogonal polynomials should be taken with a grain of salt when there are not many factor levels.
• So, unless a is large (5 or more, say) and there's good reason to hypothesize a quadratic or higher-order relationship, it is best to test only whether or not the relationship is linear.

Gasoline Additive Example:
• Here we do some computations "by hand", but also refer to gasadd1.sas and its output to see how to do these things in SAS.

From Table D.6, we see that the linear contrast coefficients are (−2, −1, 0, 1, 2). From the output, SS_Clin = 5.476 on 1 d.f., so MS_Clin = 5.476, and we conclude that the trend is at least linear, since
$$F = \frac{MS_{C_{lin}}}{MS_E} = \frac{5.476}{0.225} = 24.37$$
exceeds its critical value.

We can test H₀: mean octanes are linear in the amount of additive versus H₁: mean octanes are nonlinear by computing
$$SS_{L.O.F.} = SS_{Trt} - SS_{C_{lin}} = 6.108 - 5.476 = 0.632 \quad \text{on } d.f._{L.O.F.} = 3.$$
We do not reject H₀, since
$$F = \frac{MS_{L.O.F.}}{MS_E} = \frac{0.632/3}{0.225} = 0.94$$
is not significant.

Conclusion: Octane is linear in the amount of additive over the range 0–4 cc/l. We can use SAS PROC REG to obtain the fitted relationship ŷ_ij = 90.97 + 0.37xᵢ.
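A minimal R sketch of the linear-trend and lack-of-fit computation, again reusing `octane`, `trt`, and `ybars` from the earlier blocks (the integer coefficients (−2, −1, 0, 1, 2) are the Table D.6 scaling; `contr.poly` uses a normalized version of the same contrast):

```r
clin  <- c(-2, -1, 0, 1, 2)        # linear contrast, evenly spaced doses
Clin  <- sum(clin * ybars)         # 3.70
SSlin <- Clin^2 / sum(clin^2 / 4)  # 4 * 3.70^2 / 10 = 5.476
SSlin / 0.2247                     # F about 24.4 on (1, 15) d.f.

# Lack of fit beyond the straight line
SSlof <- 6.108 - SSlin             # 0.632 on 3 d.f.
(SSlof / 3) / 0.2247               # about 0.94, not significant
```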
Multiple Comparisons Procedures

Recall that when performing a hypothesis test, there are two types of errors that we could make: Type I errors and Type II errors.

                          The Truth
Our Conclusion        H₀ is True       H₀ is False
Fail to Reject H₀     Correct          Type II Error
Reject H₀             Type I Error     Correct

• Recall also that the significance level α of a test is the Type I error rate for that test.

The problem: Suppose we have K hypotheses, H₁, H₂, ..., H_K, and we choose to test each one at significance level α. If H₁, ..., H_K are all true, then the probability of making at least one Type I error (rejecting at least one of the Hᵢ's) is larger (often much larger) than α. Suppose the K test statistics appropriate for testing H₁, ..., H_K happen to be independent. In this case,
$$\Pr[\text{at least one } H_i \text{ rejected} \mid \text{all } H_i \text{ true}] = 1 - \Pr[\text{all } H_i \text{ accepted} \mid \text{all } H_i \text{ true}] = 1 - (1 - \alpha)^K > \alpha.$$
So as K → ∞, Pr[at least one Type I error] → 1.
• It is not as easy to compute the Type I error rate for the family of hypotheses H₁, ..., H_K when their test statistics are not independent, but the problem persists: in general (unless the tests are perfectly correlated), the probability of at least one Type I error in the family of inferences will be greater than the per-inference Type I error rate.

Multiple comparison procedures are designed to allow one to conduct several inferences (perform several tests or compute several confidence intervals) without exceeding a pre-specified error rate for the entire "family" of inferences.

1. Bonferroni Method: With this strategy (testing each of K comparisons at comparison-wise error rate α/K), we get
$$SFWER \le \sum_{j=1}^K \Pr(R_j \mid T_j) = \sum_{j=1}^K \frac{\alpha}{K} = \alpha,$$
so that by setting CWER = α/K for each comparison, we ensure that SFWER ≤ α. Operationally, this just means that we test each of the j = 1, ..., K hypotheses as follows: reject $H_j: \sum_{i=1}^a c_{ji}\mu_i = 0$ if
$$t = \frac{\left|\sum_i c_{ji}\bar{y}_{i\cdot}\right|}{\sqrt{MS_E \sum_i c_{ji}^2/n_i}} > t_{\alpha/(2K)}(N - a).$$
• That is, we just do the usual t-test, but compare t with t_{α/(2K)}(N − a) rather than t_{α/2}(N − a). Equivalently, compare the usual p-value to α/K rather than α.

Simultaneous confidence intervals with a controlled error rate for the contrasts ψ₁, ..., ψ_K can be formed based on the Bonferroni method. One can be at least 100(1 − α)% confident that ψ₁, ..., ψ_K will all be contained in the intervals
$$\sum_i c_{ji}\bar{y}_{i\cdot} \pm t_{\alpha/(2K)}(N - a)\sqrt{MS_E \sum_i \frac{c_{ji}^2}{n_i}}, \qquad j = 1, \ldots, K.$$
• Any critical value of the t distribution can be obtained with the tinv function in SAS. Suppose you want the upper .05/(2K) critical value of a t distribution with 9 d.f., where K = 3. This value is the 1 − .05/(2·3) = 1 − .008333 = .9917th quantile of the t(9) distribution. It can be obtained with the following short SAS program:

data junk; tcrit=tinv(.9917,9); run;
proc print; run;

• There is also an finv function for quantiles of the F distribution. To obtain the upper-α critical value of the F(a, b) distribution (the 1 − α quantile), use finv(1 − α, a, b).

2. Fisher's LSD Method:
• Controls the FWER but not the FDR or SFWER. Does not produce simultaneous confidence intervals with a controlled error rate.
• Fisher's LSD is typically substantially more powerful than Bonferroni or Scheffé.

Step 1. Do an F-test of H₀: µ₁ = ··· = µ_a at level α. If we reject, go to step 2; otherwise stop.
Step 2. Test contrasts, each with CWER α. That is, do the t-test or F-test for each contrast (either one) at level α.

• By making step 2 conditional on rejecting H₀ in step 1, we have controlled the FWER.
However, if H₀ does not hold, then the combined Type I error rate with this procedure may be > α (i.e., we haven't controlled the SFWER).
• Fisher's LSD is sometimes called the "protected LSD". This method should be distinguished from the ordinary LSD, which just does step 2 without first checking the overall ANOVA F test for significance. The ordinary LSD is not a multiple comparison technique at all (it ignores multiplicity).
• The term "LSD" stands for least significant difference. The reason for this terminology is as follows. Suppose the contrasts we are interested in are pairwise contrasts.
– By a pairwise contrast we mean a contrast of the form µⱼ − µₖ (a comparison between a pair of treatment means).
Then the LSD procedure is a t-test of H₀: µⱼ − µₖ = 0, with rejection rule: reject H₀ if
$$|\bar{y}_{j\cdot} - \bar{y}_{k\cdot}| > t_{\alpha/2}(N - a)\sqrt{MS_E\left(\frac{1}{n_j} + \frac{1}{n_k}\right)} = \underbrace{t_{\alpha/2}(N - a)\sqrt{MS_E\left(\frac{2}{n}\right)}}_{\text{the "LSD"}} \ \text{in the balanced case.}$$
Notice that in the balanced case, where n₁ = n₂ = ··· = n_a = n, the right-hand side does not depend on either j or k. This means that no matter which pair of means we compare, we conclude that they are significantly different from one another if the difference between the corresponding sample means exceeds the LSD.
– That is, the LSD is the least difference between the sample means that will result in concluding that they are significantly different from one another.

E.g., in the gasoline additives example, the LSD is
$$LSD = t_{\alpha/2}(N - a)\sqrt{MS_E\left(\frac{2}{n}\right)} = 2.131\sqrt{.2247\left(\frac{2}{4}\right)} = .7144,$$
which means that any pair of sample means that differ by more than .7144 will be declared significantly different from one another by the LSD method.

Tukey's "honest" significant difference (HSD) is an alternative to using the LSD for pairwise comparisons. In the balanced case, the HSD is
$$HSD = \frac{q_\alpha(a, d.f._E)}{\sqrt{2}}\sqrt{MS_E\left(\frac{1}{n} + \frac{1}{n}\right)}.$$
• In the balanced case, we conclude that a pair of means µⱼ, µₖ are significantly different from one another if |ȳⱼ· − ȳₖ·| > HSD.

In the unbalanced case, the HSD is no longer the same for all pairs of means. In this case, the HSD for comparing µⱼ, µₖ is
$$HSD_{jk} = \frac{q_\alpha(a, d.f._E)}{\sqrt{2}}\sqrt{MS_E\left(\frac{1}{n_j} + \frac{1}{n_k}\right)},$$
and we conclude that a pair of means µⱼ, µₖ are significantly different from one another if |ȳⱼ· − ȳₖ·| > HSD_{jk}.

Tukey's method can also be used to compute simultaneous confidence intervals for µⱼ − µₖ for all pairs µⱼ, µₖ with a combined error rate of α. These intervals are given by
$$\bar{y}_{j\cdot} - \bar{y}_{k\cdot} \pm HSD_{jk} \quad \text{for each pair } j, k.$$

Recommendations:

For simultaneous confidence intervals on multiple contrasts or other linear combinations of the model parameters:
1. If you wish to form confidence intervals on all pairwise differences among the treatment means, use Tukey's HSD method; for simultaneous intervals on other quantities, use Bonferroni intervals.

For hypothesis tests corresponding to planned comparisons:
2. Use Tukey's HSD for all pairwise comparisons; Dunnett's method for all pairwise comparisons with a single reference mean (the control, standard, best, or worst treatment); and Fisher's LSD for all other planned contrasts.

For unplanned comparisons and data snooping:
3. Use Scheffé's method.

• These recommendations are not universally agreed upon, but they are what I'll expect you to follow in this course when deciding which multiple comparison procedure to use for homework and test problems, unless I tell you explicitly to use some other approach. Other recommendations are given in our text and elsewhere.
• There are many opinions about multiple comparisons, many of which are equally valid. Although for a given error rate it is often possible to choose an optimal procedure, it is often unclear which error rate should be controlled in a given situation.
• The choice of error rate and α-level is really not a statistical issue, nor even a question that is amenable to objective analysis. Similarly, the choice of the "family" of inferences for which the error rate is to be controlled is also subjective. In some contexts there are guidelines based on custom that can be followed, but ultimately these choices are up to the researcher to decide, based upon his/her own tolerance for the risk of making incorrect conclusions.
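For all pairwise comparisons, R's built-in Tukey procedure implements the HSD machinery directly. A minimal sketch reusing `fit` from the gasoline example:

```r
# All 10 pairwise comparisons among the 5 additive levels,
# with familywise 95% simultaneous confidence intervals
TukeyHSD(fit, conf.level = 0.95)

# For comparison, the balanced-case HSD half-width by hand:
qtukey(0.95, nmeans = 5, df = 15) / sqrt(2) * sqrt(0.2247 * (1/4 + 1/4))
```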
Power Analysis / Choice of Sample Size

Error Types: There are two types of errors that can be made in choosing between H₀ and H_A:
I. Type I Error: Reject H₀ when H₀ is true.
II. Type II Error: Fail to reject H₀ when H₀ is false.
• The probability of I is called α, and it is usually fixed at α = .05 or α = .01 by the investigator.
• The probability of II is called β, and it cannot be fixed (set) by the investigator.
– Note that the power of a hypothesis test is power = Pr(reject H₀ | H₀ is false) = 1 − β, or the probability of establishing our scientific proposition given that it is true.
• We want power to be high (we want a high probability of detecting the effect we're looking for) and β to be low.
– However, power depends upon lots of things we can't control, and a few we can.

Typical Power Analysis for a Two-sample t Test:
1. Fix α = .05 (or .01, or some other value).
2. Assume values for µ₁ − µ₂ and σ based upon previous research (yours or from the literature), educated guesses, and considerations of clinical/practical/scientific (not statistical) significance.
3. Select a desired level of power, 80%, say. This may be dictated by the funding agency, if the power analysis is part of a grant proposal, or chosen by the investigator (it represents the risk of not being able to detect a true effect, so how much risk are you comfortable with?).
4. Compute power for each of several values of n. Select the smallest n that gives you power ≥ the level selected in 3.

Typically, step 2 is the hardest part. One way that people often try to simplify this step is to specify a generic or "canned" effect size.

Effect size. In the two-sample t-test, it can be shown that although the power of the test depends on |µ₁ − µ₂| and σ, it depends upon these quantities only through their ratio
$$\frac{|\mu_1 - \mu_2|}{\sigma} \equiv d,$$
which is known as the effect size* for the two-sample t test.
• In other experimental design/statistical hypothesis testing frameworks, quantities analogous to d can sometimes (but not always — especially in more complex contexts) be identified and defined as "the effect size". There are some famous definitions (due to Cohen, 1988) of what constitutes a "small", "medium", or "large" effect size. These effect-size descriptions (and their misuse) are especially common in the social sciences.

It is tempting, but not recommended, to avoid considering |µ₁ − µ₂| and σ separately and instead simply specify d, the effect size. I especially encourage you to avoid taking a generic "small", "medium", or "large" effect size as your target (e.g., computing the sample size necessary to achieve 80% power to detect a small/medium/large effect size d) without close consideration of the specific research context at hand. See the article by R. Lenth handed out in class for more on this topic.
* Or, more accurately, the standardized effect size for the two-sample t test.

How is power calculated? Remember, power is the probability of rejecting H₀ given that it is false. Whether or not H₀ is rejected is determined by a comparison of a test statistic with a critical value.
• Therefore, power computations are based upon the statistical test employed in the analysis!

In the two-sample t test situation, the null and alternative hypotheses are
$$H_0: \mu_1 - \mu_2 = 0 \quad \text{vs.} \quad H_A: \mu_1 - \mu_2 \ne 0$$
(assuming a two-tailed alternative), and the rejection rule is: reject H₀ if
$$t = \frac{|\bar{y}_1 - \bar{y}_2|}{s_P\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} > t_{\alpha/2}(n_1 + n_2 - 2),$$
where ȳ₁, ȳ₂ are the sample means in groups 1 and 2, s_P is the pooled sample standard deviation, and n₁, n₂ are the sample sizes in the two treatment groups.
• We can allow n₁ ≠ n₂, but typically it is best to design balanced experiments, so we allocate equal sample sizes n₁ = n₂ = n to the two groups. Thus the rule becomes: reject H₀ if
$$t = \frac{|\bar{y}_1 - \bar{y}_2|}{s_P\sqrt{2/n}} > t_{\alpha/2}(2n - 2).$$

Assumptions for the Sample Size Calculation: Although there was a difference of ȳ₁ − ȳ₂ = 132.86 − 127.44 = 5.42 in mean BP between OC users and non-users in the pilot study, this is not necessarily the value we want to assume for |µ₁ − µ₂|, the unstandardized effect size we wish to be able to detect. Instead, |µ₁ − µ₂| should be chosen based upon the answers to questions such as:
i. How much of a difference in mean BP do we expect OC use to be associated with? or
ii. How much of a difference in mean BP would be clinically significant (e.g., correspond to an elevated health risk)?

Suppose that we decide that a 5% increase (or reduction) in BP associated with OC use would be clinically significant. If we assume that µ₂, the population mean systolic BP for non-OC-users, is 127.44 (our sample value), then a 5% increase in mean BP would be 1.05(127.44) − 127.44 = 6.37 units, so we take |µ₁ − µ₂| = 6.37.

Because we don't know σ, we estimate it by the pooled standard deviation from our pilot study:
$$s_P = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(4-1)(15.34)^2 + (6-1)(18.23)^2}{4 + 6 - 2}} = 17.2.$$
These values give an effect size of
$$\frac{|\mu_1 - \mu_2|}{\sigma} = \frac{6.37}{17.2} = .37.$$
• This effect size can be used to determine the non-centrality parameter of the t distribution describing the distribution of the two-sample t statistic under H_A. Then tables, plots (called operating characteristic curves), or computer functions giving non-central t probabilities can be used to look up the power.
• Alternatively, computer programs can be used that short-cut this process.
– SAS does sample size and power computations in its "Analyst Application".
– There are also numerous online sample size/power tools. In particular, we will use Russ Lenth's Java applets for power and sample size (http://www.stat.uiowa.edu/~rlenth/Power/index.html).
• SAS Analyst is started from the Solutions menu in SAS. Click on "Solutions", then "Analysis", then "Analyst" to start this application.
• Then click on the "Statistics" menu, click on "Sample Size", and click on whichever statistical analysis you want to compute power or sample size for (in this case, click on "Two-Sample t-test...").
• These steps will lead to a dialog box in which you can specify the necessary assumptions for either a sample size calculation (for a given power value, or range of power values) or a power calculation (for a given sample size, or range of sample sizes).
• Some programs require direct input of the effect size or noncentrality parameter.
SAS Analyst requires the actual means and SDs. Note, however, that if we change the means and SDs without changing the effect size, the answer remains the same.
• For the values of |µ₁ − µ₂| and σ assumed in this example, it turns out that 116 subjects per group are required to achieve at least 80% power using a two-sample t test and a significance level of α = .05 (see the sketch below).

Sample Size/Power Analysis for the One-way Layout:

The two-sample design analyzed with a t test is among the simplest settings for a power analysis. Often, however, the design and statistical test will be more complex.
• E.g., suppose we have a treatments to compare rather than just 2. This is a one-way layout (design), for which a one-way analysis of variance is appropriate.

The same principles underlying power analysis in the two-sample situation apply here as well, but the test statistic differs, and the issue of effect size is more complex. In particular, for a groups, the probability of rejecting H₀: µ₁ = µ₂ = ··· = µ_a depends not just on the difference between two means, but on the differences between all possible pairs of means — i.e., on the spacing of the means. In addition, the statistical test is now an F test, not a t test. In fact, it can be shown that the t test is a special case of the F test corresponding to a = 2.
• So, power for a two-sample comparison of means is based on the non-central t distribution, and power for more general F tests based on the general linear model (e.g., the one-way ANOVA model) is based on the non-central F distribution.
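R's built-in power functions reproduce this kind of calculation. A minimal sketch for the OC/BP example and a one-way analogue (the set of assumed group means in the second call is hypothetical, chosen only to illustrate that power depends on the spacing of all a means):

```r
# Two-sample t test: n per group for 80% power, delta = 6.37, sd = 17.2
power.t.test(delta = 6.37, sd = 17.2, sig.level = 0.05, power = 0.80)
# n is about 115.9 per group, i.e. 116 subjects per group

# One-way layout: power is driven by the between-group spread of the means.
# Hypothetical means for a = 5 groups, within-group sd assumed to be 17.2:
mu <- c(127, 130, 133, 127, 130)
power.anova.test(groups = 5, n = 30,
                 between.var = var(mu), within.var = 17.2^2)
```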