Docsity
SAS Two-Way ANOVA: Sales Data, Factor A (Height), Factor B (Width), Slides of Designs and Groups

Multivariate Analysis, Inferential Statistics, Regression Analysis, Descriptive Statistics

An analysis of sales data using a two-way analysis of variance (ANOVA) model with two factors: the height and the width of a shelf display. Covers reading the data, the assumptions of the cell means model, parameter estimates, zero-sum constraints, and hypothesis tests for the factor A effect and the interaction effect using the SAS GLM procedure.

What you will learn

  • What are the factors A and B in this two-way analysis of variance (ANOVA) model?
  • What is the difference between the cell means model and the factor effects model?
  • How are the subscripts i, j, and k interpreted in this document?
  • What are the zero-sum constraints in this two-way analysis of variance (ANOVA) model?

Typology: Slides

2021/2022

Uploaded on 07/05/2022

lee_95

Statistics 512: Applied Linear Models
Topic 7

Topic Overview

This topic will cover
  • Two-way Analysis of Variance (ANOVA)
  • Interactions

Chapter 19: Two-way ANOVA

The response variable $Y$ is continuous. There are now two categorical explanatory variables (factors). Call them factor A and factor B instead of $X_1$ and $X_2$. (We will have enough subscripts as it is!)

Data for Two-way ANOVA

  • $Y$, the response variable
  • Factor A with levels $i = 1$ to $a$
  • Factor B with levels $j = 1$ to $b$
  • A particular combination of levels is called a treatment or a cell. There are $ab$ treatments.
  • $Y_{i,j,k}$ is the $k$th observation for treatment $(i, j)$, $k = 1$ to $n$

In Chapter 19, we assume for now an equal sample size in each treatment combination ($n_{i,j} = n > 1$; $n_T = abn$). This is called a balanced design. In later chapters we will deal with unequal sample sizes, but it is more complicated.

Notation

For $Y_{i,j,k}$ the subscripts are interpreted as follows:
  • $i$ denotes the level of factor A, $i = 1, \ldots, a$
  • $j$ denotes the level of factor B, $j = 1, \ldots, b$
  • $k$ denotes the $k$th observation in cell or treatment $(i, j)$, $k = 1, \ldots, n$

KNNL Example

  • KNNL page 832 (nknw817.sas)
  • The response $Y$ is the number of cases of bread sold.
  • Factor A is the height of the shelf display; $a = 3$ levels: bottom, middle, top.
  • Factor B is the width of the shelf display; $b = 2$ levels: regular, wide.
  • $n = 2$ stores for each of the $3 \times 2 = 6$ treatment combinations ($n_T = 12$)

Read the data

    data bread;
      infile 'h:\System\Desktop\CH19TA07.DAT';
      input sales height width;
    run;
    proc print data=bread;
    run;

    Obs  sales  height  width
      1     47       1      1
      2     43       1      1
      3     46       1      2
      4     40       1      2
      5     62       2      1
      6     68       2      1
      7     67       2      2
      8     71       2      2
      9     41       3      1
     10     39       3      1
     11     42       3      2
     12     46       3      2

Model Assumptions

We assume that the response variable observations are independent and normally distributed, with a mean that may depend on the levels of the factors A and B and a variance that does not (it is constant).

Cell Means Model

$$Y_{i,j,k} = \mu_{i,j} + \varepsilon_{i,j,k}$$

where
  • $\mu_{i,j}$ is the theoretical mean or expected value of all observations in cell $(i, j)$.
  • the $\varepsilon_{i,j,k}$ are iid $N(0, \sigma^2)$
  • $Y_{i,j,k} \sim N(\mu_{i,j}, \sigma^2)$, independent

There are $ab + 1$ parameters of the model: $\mu_{i,j}$, for $i = 1$ to $a$ and $j = 1$ to $b$; and $\sigma^2$.

      output out=avbread mean=avsales;
    proc print data=avbread;
    run;

    Obs  height  width  _TYPE_  _FREQ_  avsales
      1       1      1       0       2       45
      2       1      2       0       2       43
      3       2      1       0       2       65
      4       2      2       0       2       69
      5       3      1       0       2       40
      6       3      2       0       2       44

Plot the means

Recall the plotting syntax to get two separate lines for the two width levels. We can also do a plot of sales vs width with three lines for the three heights. This type of plot is called an "interaction plot" for reasons that we will see later.

    symbol1 v=square i=join c=black;
    symbol2 v=diamond i=join c=black;
    symbol3 v=circle i=join c=black;
    proc gplot data=avbread;
      plot avsales*height=width;
      plot avsales*width=height;
    run;

The Interaction plots (figures not included in this preview)

Questions

  • Does the height of the display affect sales? If yes, compare top with middle, top with bottom, and middle with bottom.
  • Does the width of the display affect sales?
  • Does the effect of height on sales depend on the width?
  • Does the effect of width on sales depend on the height?
  • If yes to the last two, that is an interaction.

Notice that these questions are not straightforward to answer using the cell means model.

Factor Effects Model

For the one-way ANOVA model, we wrote $\mu_i = \mu + \tau_i$ where $\tau_i$ was the factor effect.
For the two-way ANOVA model, we have $\mu_{i,j} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{i,j}$, where
  • $\mu$ is the overall (grand) mean; it is $\mu_{\cdot\cdot}$ in KNNL
  • $\alpha_i$ is the main effect of Factor A
  • $\beta_j$ is the main effect of Factor B
  • $(\alpha\beta)_{i,j}$ is the interaction effect between A and B.

Note that $(\alpha\beta)_{i,j}$ is the name of a parameter all on its own and does not refer to the product of $\alpha$ and $\beta$.

Thus the factor effects model is $Y_{i,j,k} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{i,j} + \varepsilon_{i,j,k}$.

A model without the interaction term, i.e. $\mu_{i,j} = \mu + \alpha_i + \beta_j$, is called an additive model.

Parameter Definitions

The overall mean is $\mu = \mu_{\cdot\cdot} = \frac{\sum_{i,j} \mu_{i,j}}{ab}$ under the zero-sum constraint (or $\mu = \mu_{a,b}$ under the "last = 0" constraint).

The mean for the $i$th level of A is $\mu_{i\cdot} = \frac{\sum_j \mu_{i,j}}{b}$, and the mean for the $j$th level of B is $\mu_{\cdot j} = \frac{\sum_i \mu_{i,j}}{a}$.

$\alpha_i = \mu_{i\cdot} - \mu$ and $\beta_j = \mu_{\cdot j} - \mu$, so $\mu_{i\cdot} = \mu + \alpha_i$ and $\mu_{\cdot j} = \mu + \beta_j$.

Note that the $\alpha$'s and $\beta$'s act like the $\tau$'s in the single-factor ANOVA model.

$(\alpha\beta)_{i,j}$ is the difference between $\mu_{i,j}$ and $\mu + \alpha_i + \beta_j$:

$$\begin{aligned}
(\alpha\beta)_{i,j} &= \mu_{i,j} - (\mu + \alpha_i + \beta_j) \\
&= \mu_{i,j} - (\mu + (\mu_{i\cdot} - \mu) + (\mu_{\cdot j} - \mu)) \\
&= \mu_{i,j} - \mu_{i\cdot} - \mu_{\cdot j} + \mu
\end{aligned}$$

These equations also spell out the relationship between the cell means $\mu_{i,j}$ and the factor effects model parameters.

Interpretation

$\mu_{i,j} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{i,j}$
  • $\mu$ is the overall mean
  • $\alpha_i$ is an adjustment for level $i$ of A.
  • $\beta_j$ is an adjustment for level $j$ of B.
  • $(\alpha\beta)_{i,j}$ is an additional adjustment that takes into account both $i$ and $j$.

Zero-sum Constraints

As in the one-way model, we have too many parameters and now need several constraints:

$$\begin{aligned}
\alpha_{\cdot} &= \sum_i \alpha_i = 0 \\
\beta_{\cdot} &= \sum_j \beta_j = 0 \\
(\alpha\beta)_{\cdot j} &= \sum_i (\alpha\beta)_{i,j} = 0 \quad \forall j \text{ (for all } j) \\
(\alpha\beta)_{i\cdot} &= \sum_j (\alpha\beta)_{i,j} = 0 \quad \forall i \text{ (for all } i)
\end{aligned}$$

Estimates for Factor-effects model

$$\begin{aligned}
\hat\mu &= \bar{Y}_{\cdot\cdot\cdot} = \frac{\sum_{i,j,k} Y_{i,j,k}}{abn} \\
\hat\mu_{i\cdot} &= \bar{Y}_{i\cdot\cdot} \quad \text{and} \quad \hat\mu_{\cdot j} = \bar{Y}_{\cdot j\cdot} \\
\hat\alpha_i &= \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot} \quad \text{and} \quad \hat\beta_j = \bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot} \\
\widehat{(\alpha\beta)}_{i,j} &= \bar{Y}_{i,j\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j\cdot} + \bar{Y}_{\cdot\cdot\cdot}
\end{aligned}$$

SS for ANOVA Table

$$\begin{aligned}
SSA &= \sum_{i,j,k} \hat\alpha_i^2 = \sum_{i,j,k} (\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2 = nb \sum_i (\bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2 && \text{factor A sum of squares} \\
SSB &= \sum_{i,j,k} \hat\beta_j^2 = \sum_{i,j,k} (\bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2 = na \sum_j (\bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2 && \text{factor B sum of squares} \\
SSAB &= \sum_{i,j,k} \widehat{(\alpha\beta)}_{i,j}^2 = n \sum_{i,j} \widehat{(\alpha\beta)}_{i,j}^2 && \text{AB interaction sum of squares} \\
SSE &= \sum_{i,j,k} (Y_{i,j,k} - \bar{Y}_{i,j\cdot})^2 = \sum_{i,j,k} e_{i,j,k}^2 && \text{error sum of squares} \\
SST &= \sum_{i,j,k} (Y_{i,j,k} - \bar{Y}_{\cdot\cdot\cdot})^2 && \text{total sum of squares} \\
SSM &= SSA + SSB + SSAB = SST - SSE && \text{model sum of squares} \\
SST &= SSA + SSB + SSAB + SSE = SSM + SSE
\end{aligned}$$

df for ANOVA Table

$$\begin{aligned}
df_A &= a - 1 \\
df_B &= b - 1 \\
df_{AB} &= (a - 1)(b - 1) \\
df_E &= ab(n - 1) \\
df_T &= abn - 1 = n_T - 1 \\
df_M &= a - 1 + b - 1 + (a - 1)(b - 1) = ab - 1
\end{aligned}$$

  • Type III SS is like Type II SS (variable added last) but it also adjusts for differing $n_{i,j}$. So if all cells have the same number of observations, SS1, SS2, and SS3 will all be the same. (Balanced designs are nice: the variables height and width in our example are independent, so there is no multicollinearity!)
  • More details on SS later.

Other output

    R-Square   Coeff Var   Root MSE   sales Mean
    0.962241    6.303040   3.214550     51.00000

Results

  • The interaction between height and width is not statistically significant ($F = 1.16$; $df = (2, 6)$; $p = 0.37$). NOTE: Check the interaction FIRST! If it is significant, then the main effects are left in the model, even if they are not significant themselves! We may now go on to examine main effects since our interaction is not significant.
  • The main effect of height is statistically significant ($F = 74.71$; $df = (2, 6)$; $p = 5.75 \times 10^{-5}$).
  • The main effect of width is not statistically significant ($F = 1.16$; $df = (1, 6)$; $p = 0.32$).

Interpretation

  • The height of the display affects sales of bread.
  • The width of the display has no apparent effect.
  • The effect of the height of the display is similar for both the regular and the wide widths.

Additional Analyses

  • We will need to do additional analyses to understand the height effect (factor A).
  • There were three levels: bottom, middle and top. Based on the interaction picture, it appears the middle shelf increases sales.
  • We could rerun the data with a one-way ANOVA and use the methods we learned in the previous chapters to show this (e.g. Tukey).

Parameter Estimation

Cell Means Model

$Y_{i,j,k} = \mu_{i,j} + \varepsilon_{i,j,k}$, where
  • $\mu_{i,j}$ is the theoretical mean or expected value of all observations in cell $(i, j)$.
  • $\varepsilon_{i,j,k} \sim_{iid} N(0, \sigma^2)$
  • $Y_{i,j,k} \sim N(\mu_{i,j}, \sigma^2)$ are independent
  • There are $ab + 1$ parameters of the model: $\mu_{i,j}$; $i = 1, \ldots, a$, $j = 1, \ldots, b$ and $\sigma^2$.

For the bread example, estimate the $\mu_{i,j}$ with $\bar{Y}_{i,j\cdot}$, which we can get from the means height*width statement:

$$\begin{aligned}
\hat\mu_{1,1} &= \bar{Y}_{1,1\cdot} = 45 & \hat\mu_{1,2} &= \bar{Y}_{1,2\cdot} = 43 \\
\hat\mu_{2,1} &= \bar{Y}_{2,1\cdot} = 65 & \hat\mu_{2,2} &= \bar{Y}_{2,2\cdot} = 69 \\
\hat\mu_{3,1} &= \bar{Y}_{3,1\cdot} = 40 & \hat\mu_{3,2} &= \bar{Y}_{3,2\cdot} = 44
\end{aligned}$$

As usual, $\sigma^2$ is estimated by MSE.

Factor Effects Model

$\mu_{i,j} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{i,j}$, where
  • $\mu$ is the overall (grand) mean; it is $\mu_{\cdot\cdot}$ in KNNL
  • $\alpha_i$ is the main effect of Factor A
  • $\beta_j$ is the main effect of Factor B
  • $(\alpha\beta)_{i,j}$ is the interaction effect between A and B.

Note that $(\alpha\beta)_{i,j}$ is the name of a parameter all on its own and does not refer to the product of $\alpha$ and $\beta$.

Overall Mean

The overall mean is estimated as $\hat\mu = \bar{Y}_{\cdot\cdot\cdot} = 51$ under the zero-sum constraint. (This is "sales Mean" in the glm output.) You can get a whole dataset of this value by using a model statement with no right hand side, e.g. model sales=; and storing the predicted values.

Main Effects

The main effect of A is estimated from the means height output. You can get a whole dataset of the $\hat\mu_{i\cdot}$ by running a model with just A, e.g. model sales = height; and storing the predicted values. To estimate the $\alpha$'s, you then subtract $\hat\mu$ from each height mean.

$$\begin{aligned}
\hat\mu_{1\cdot} &= \bar{Y}_{1\cdot\cdot} = 44 \Rightarrow \hat\alpha_1 = 44 - 51 = -7 \\
\hat\mu_{2\cdot} &= \bar{Y}_{2\cdot\cdot} = 67 \Rightarrow \hat\alpha_2 = 67 - 51 = +16 \\
\hat\mu_{3\cdot} &= \bar{Y}_{3\cdot\cdot} = 42 \Rightarrow \hat\alpha_3 = 42 - 51 = -9
\end{aligned}$$

This says that the "middle" shelf height has the effect of a relative increase in sales of 16, while bottom and top decrease the sales by 7 and 9 respectively. Notice that these sum to zero so that there is no "net" effect (there are only 2 free parameters, $df_A = 2$).
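The predicted-value route to the $\hat\alpha_i$ described above can be sketched as follows. This is a minimal sketch, assuming the bread data set read earlier; the data set names overall, meanA and effectsA and the variable names muhat, muhatA and alphahat are illustrative choices, not taken from the original program.

```sas
/* Grand mean: a model with no right-hand side; the predicted value is Ybar... */
proc glm data=bread;
  model sales=;
  output out=overall p=muhat;
run;

/* Height means: one-way model in height; the predicted value is Ybar_i.. */
proc glm data=bread;
  class height;
  model sales=height;
  output out=meanA p=muhatA;
run;

/* Zero-sum estimate of alpha_i: height mean minus grand mean.          */
/* One-to-one merge (no BY): both outputs keep the row order of bread.  */
data effectsA;
  merge overall meanA;
  alphahat = muhatA - muhat;
run;

proc print data=effectsA;
  var height width sales muhat muhatA alphahat;
run;
```

If the sketch behaves as intended, alphahat should take the values -7, +16 and -9 for heights 1, 2 and 3, matching the hand calculation.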
The main effect of B is similarly estimated from the means width output, or by storing the predicted values of model sales = width; and then subtracting $\hat\mu$ from each width mean.

$$\begin{aligned}
\hat\mu_{\cdot 1} &= \bar{Y}_{\cdot 1\cdot} = 50 \Rightarrow \hat\beta_1 = 50 - 51 = -1 \\
\hat\mu_{\cdot 2} &= \bar{Y}_{\cdot 2\cdot} = 52 \Rightarrow \hat\beta_2 = 52 - 51 = +1
\end{aligned}$$

Wide display increases sales by an average of 1, while regular display decreases sales by 1 (they sum to zero so there's only 1 free parameter, $df_B = 1$).

Interaction Effects

Recall that $\widehat{(\alpha\beta)}_{i,j} = \hat\mu_{i,j} - (\hat\mu + \hat\alpha_i + \hat\beta_j)$. This is the difference between the treatment mean and the value predicted by the overall mean and main effects only (i.e. by the additive model).

You can get the treatment means from the means height*width statement, or from the predicted values of model sales=height*width; then subtract the appropriate combination of the previously estimated parameters.

$$\begin{aligned}
\widehat{(\alpha\beta)}_{1,1} &= \bar{Y}_{1,1\cdot} - (\hat\mu + \hat\alpha_1 + \hat\beta_1) = 45 - (51 - 7 - 1) = +2 \\
\widehat{(\alpha\beta)}_{1,2} &= \bar{Y}_{1,2\cdot} - (\hat\mu + \hat\alpha_1 + \hat\beta_2) = 43 - (51 - 7 + 1) = -2 \\
\widehat{(\alpha\beta)}_{2,1} &= \bar{Y}_{2,1\cdot} - (\hat\mu + \hat\alpha_2 + \hat\beta_1) = 65 - (51 + 16 - 1) = -1 \\
\widehat{(\alpha\beta)}_{2,2} &= \bar{Y}_{2,2\cdot} - (\hat\mu + \hat\alpha_2 + \hat\beta_2) = 69 - (51 + 16 + 1) = +1 \\
\widehat{(\alpha\beta)}_{3,1} &= \bar{Y}_{3,1\cdot} - (\hat\mu + \hat\alpha_3 + \hat\beta_1) = 40 - (51 - 9 - 1) = -1 \\
\widehat{(\alpha\beta)}_{3,2} &= \bar{Y}_{3,2\cdot} - (\hat\mu + \hat\alpha_3 + \hat\beta_2) = 44 - (51 - 9 + 1) = +1
\end{aligned}$$

Notice that they sum in pairs (over $j$) to zero and also the sum over $i$ is zero for each $j$. Thus there are in reality only two free parameters here ($df_{AB} = 2$).

  • Main effect of B ($b$)
  • Interaction of A and B ($ab$)

There are $1 + a + b + ab$ parameters and $1 + a + b$ constraints, so there are $ab$ remaining unconstrained parameters (or sets of parameters), the same number of parameters for the means in the cell means model. This is the number of parameters we can actually estimate.
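As a check on this count for the bread example ($a = 3$, $b = 2$): the zero-sum system lists $2 + a + b$ equations, but one of the interaction constraints is implied by the others, because the row sums and the column sums of the $(\alpha\beta)_{i,j}$ both add up to the same total $\sum_{i,j} (\alpha\beta)_{i,j}$. That leaves $1 + a + b$ independent constraints. The worked numbers below are a hand calculation, not SAS output.

```latex
\begin{align*}
\text{parameters:}\quad & 1 + a + b + ab = 1 + 3 + 2 + 6 = 12 \\
\text{independent constraints:}\quad & 1 + a + b = 1 + 3 + 2 = 6 \\
\text{free parameters:}\quad & 12 - 6 = 6 = ab \\
\text{split by term:}\quad &
  \underbrace{1}_{\mu} + \underbrace{(a-1)}_{\alpha} + \underbrace{(b-1)}_{\beta}
  + \underbrace{(a-1)(b-1)}_{\alpha\beta} = 1 + 2 + 1 + 2 = 6
\end{align*}
```

The split by term matches the degrees of freedom $df_A = 2$, $df_B = 1$ and $df_{AB} = 2$, plus one for the overall mean.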
KNNL Example

  • KNNL page 823 (nknw817b.sas)
  • $Y$ is the number of cases of bread sold
  • A is the height of the shelf display, $a = 3$ levels: bottom, middle, top
  • B is the width of the shelf display, $b = 2$: regular, wide
  • $n = 2$ stores for each of the $3 \times 2$ treatment combinations

proc glm with solution

We will get different estimates for the parameters here because a different constraint system is used.

    proc glm data=bread;
      class height width;
      model sales=height width height*width/solution;
      means height*width;
    run;

Solution output

    Parameter              Estimate
    Intercept           44.00000000 B*   = $\hat\mu$
    height 1            -1.00000000 B    = $\hat\alpha_1$
    height 2            25.00000000 B*   = $\hat\alpha_2$
    height 3             0.00000000 B    = $\hat\alpha_3$
    width 1             -4.00000000 B    = $\hat\beta_1$
    width 2              0.00000000 B    = $\hat\beta_2$
    height*width 1 1     6.00000000 B    = $\widehat{(\alpha\beta)}_{1,1}$
    height*width 1 2     0.00000000 B    = $\widehat{(\alpha\beta)}_{1,2}$
    height*width 2 1     0.00000000 B    = $\widehat{(\alpha\beta)}_{2,1}$
    height*width 2 2     0.00000000 B    = $\widehat{(\alpha\beta)}_{2,2}$
    height*width 3 1     0.00000000 B    = $\widehat{(\alpha\beta)}_{3,1}$
    height*width 3 2     0.00000000 B    = $\widehat{(\alpha\beta)}_{3,2}$

It also prints out standard errors, t-tests and p-values for testing whether each parameter is equal to zero. That output has been omitted here but the significant ones have been starred. Notice that the last $\alpha$ and $\beta$ are set to zero, as well as the last $\widehat{(\alpha\beta)}$ in each category. They no longer sum to zero.

Means

The estimated treatment means are $\hat\mu_{i,j} = \hat\mu + \hat\alpha_i + \hat\beta_j + \widehat{(\alpha\beta)}_{i,j}$.

    height  width  N        Mean
         1      1  2  45.0000000   = 44 - 1 - 4 + 6
         1      2  2  43.0000000   = 44 - 1 + 0 + 0
         2      1  2  65.0000000   = 44 + 25 - 4 + 0
         2      2  2  69.0000000   = 44 + 25 + 0 + 0
         3      1  2  40.0000000   = 44 + 0 - 4 + 0
         3      2  2  44.0000000   = 44 + 0 + 0 + 0

ANOVA Table

    Source   df               SS     MS     F
    A        a - 1            SSA    MSA    MSA/MSE
    B        b - 1            SSB    MSB    MSB/MSE
    AB       (a - 1)(b - 1)   SSAB   MSAB   MSAB/MSE
    Error    ab(n - 1)        SSE    MSE
    Total    abn - 1          SSTO   MST

Expected Mean Squares

$$\begin{aligned}
E(MSE) &= \sigma^2 \\
E(MSA) &= \sigma^2 + \frac{nb}{a - 1} \sum_i \alpha_i^2 \\
E(MSB) &= \sigma^2 + \frac{na}{b - 1} \sum_j \beta_j^2 \\
E(MSAB) &= \sigma^2 + \frac{n}{(a - 1)(b - 1)} \sum_{i,j} (\alpha\beta)_{i,j}^2
\end{aligned}$$

Here, $\alpha_i$, $\beta_j$, and $(\alpha\beta)_{i,j}$ are defined with the usual zero-sum constraints.
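To make the table concrete, the sums of squares for the bread example can be worked out by hand from the zero-sum estimates ($\hat\alpha = (-7, 16, -9)$, $\hat\beta = (-1, 1)$, $\widehat{(\alpha\beta)} = (2, -2, -1, 1, -1, 1)$) and the raw data, with $a = 3$, $b = 2$, $n = 2$. This is a hand computation rather than SAS output.

```latex
\begin{align*}
SSA  &= nb \sum_i \hat\alpha_i^2 = 4\,(49 + 256 + 81) = 1544, & MSA  &= 1544/2 = 772 \\
SSB  &= na \sum_j \hat\beta_j^2 = 6\,(1 + 1) = 12, & MSB  &= 12/1 = 12 \\
SSAB &= n \sum_{i,j} \widehat{(\alpha\beta)}_{i,j}^2 = 2\,(4 + 4 + 1 + 1 + 1 + 1) = 24, & MSAB &= 24/2 = 12 \\
SSE  &= \sum_{i,j,k} (Y_{i,j,k} - \bar{Y}_{i,j\cdot})^2 = 8 + 18 + 18 + 8 + 2 + 8 = 62, & MSE  &= 62/6 \approx 10.33 \\
SSTO &= 1544 + 12 + 24 + 62 = 1642 \\
F_A  &= 772/10.33 \approx 74.71, \qquad F_B = 12/10.33 \approx 1.16, \qquad F_{AB} = 12/10.33 \approx 1.16
\end{align*}
```

These values agree with the reported output: $\sqrt{MSE} \approx 3.2146$ is the Root MSE, and $(1544 + 12 + 24)/1642 \approx 0.9622$ is the R-Square.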
Analytical strategies

  • Run the model with main effects and the two-way interaction.
  • Plot the data, the means and look at the residuals.
  • Check the significance test for the interaction.

What if AB interaction is not significant?

If the AB interaction is not statistically significant, you could rerun the analysis without the interaction (see discussion of pooling, KNNL Section 19.10). This will put the SS and df for AB into Error. Results of main effect hypothesis tests could change because MSE and denominator df have changed (more impact with small sample size).

If one main effect is not significant...
  • There is no evidence to conclude that the levels of this explanatory variable are associated with different means of the response variable.
  • Model could be rerun without this factor, giving a one-way ANOVA.

If neither main effect is significant...
  • Model could be run as Y=; (i.e. no factors at all)
  • A one population model
  • This seems silly, but this syntax can be useful for getting parameter estimates in the preferred constraint system (see below).

For a main effect with more than two levels that is significant, use the means statement with the Tukey multiple comparison procedure. Contrasts and linear combinations can also be examined using the contrast and estimate statements.

If AB interaction is significant but not important

  • Plots and a careful examination of the cell means may indicate that the interaction is not very important even though it is statistically significant.
  • For example, the interaction effect may be much smaller in magnitude than the main effects; or may only be apparent in a small number of treatments.
  • Use the marginal means for each significant main effect to describe the important results for the main effects.
  • You may need to qualify these results using the interaction.
  • Keep the interaction in the model.
  • Carefully interpret the marginal means as averages over the levels of the other factor.
  • KNNL also discuss ways that transformations can sometimes eliminate interactions.

If AB interaction is significant and important

The interaction effect is so large and/or pervasive that main effects cannot be interpreted on their own. Options include the following:
  • Treat as a one-way ANOVA with $ab$ levels; use Tukey to compare means; contrasts and estimate can also be useful.
  • Report that the interaction is significant; plot the means and describe the pattern.
  • Analyze the levels of A for each level of B (use a by statement) or vice versa

$$L = (-0.5\alpha_1 + \alpha_2 - 0.5\alpha_3) + \left(-0.25(\alpha\beta)_{1,1} - 0.25(\alpha\beta)_{1,2} + 0.5(\alpha\beta)_{2,1} + 0.5(\alpha\beta)_{2,2} - 0.25(\alpha\beta)_{3,1} - 0.25(\alpha\beta)_{3,2}\right)$$

Note the $\beta$'s do not appear in this contrast because we are looking at height only and averaging over width (this would not necessarily be true in an unbalanced design).

proc glm with contrast and estimate (nknw864.sas)

    proc glm data=bread;
      class height width;
      model sales=height width height*width;
      contrast 'middle vs others' height -.5 1 -.5
               height*width -.25 -.25 .5 .5 -.25 -.25;
      estimate 'middle vs others' height -.5 1 -.5
               height*width -.25 -.25 .5 .5 -.25 -.25;
      means height*width;
    run;

Output

    Contrast            DF   Contrast SS   Mean Square   F Value   Pr > F
    middle vs others     1   1536.000000   1536.000000    148.65   <.0001

                                       Standard
    Parameter            Estimate         Error   t Value   Pr > |t|
    middle vs others   24.0000000    1.96850197     12.19     <.0001

Check with means

    height  width  Mean
         1      1    45
         1      2    43
         2      1    65
         2      2    69
         3      1    40
         3      2    44

$$\hat{L} = \frac{65 + 69}{2} - \frac{45 + 43 + 40 + 44}{4} = 67 - 43 = 24$$

Combining with Quantitative Factors

Sometimes a factor can be interpreted as either categorical or quantitative. For example, "low, medium, high" or actual height above floor. If there are replicates for a quantitative factor we could use either regression or ANOVA. Recall that GLM will treat a factor as quantitative unless it is listed in the class statement. Notice that you can use ANOVA even if the relationship with the quantitative variable is non-linear, whereas with regression you would have to find that relationship.
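The role of the class statement can be sketched with the bread data. This is only an illustration: the height codes 1, 2, 3 are shelf positions rather than measured heights, so treating them as numeric is an assumption made here for the sake of the example, and the interaction is left out to keep the sketch short.

```sas
/* Height as a categorical factor: listed in CLASS, uses a - 1 = 2 df. */
proc glm data=bread;
  class height width;
  model sales = height width;
run;

/* Height as a quantitative predictor: NOT listed in CLASS.        */
/* Linear and quadratic terms are entered as ordinary regressors.  */
proc glm data=bread;
  class width;
  model sales = height height*height width;
run;
```

With only three height levels, the linear and quadratic terms together use the same 2 df as the categorical factor, so the two fits should give the same predicted values. With more levels, the polynomial model becomes a genuine restriction, which is where a lack of fit analysis helps.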
One Quantitative factor and one categorical

  • Plot the means vs the quantitative factor for each level of the categorical factor
  • Consider linear and quadratic terms for the quantitative factor
  • Consider different slopes for the different levels of the categorical factor; i.e., interaction terms.
  • Lack of fit analysis can be useful (recall trainhrs example).

Two Quantitative factors

  • Plot the means vs A for each level of B
  • Plot the means vs B for each level of A
  • Consider linear and quadratic terms.
  • Consider products to allow for interaction.
  • Lack of fit analysis can be useful.

Chapter 20: One Observation per Cell

For $Y_{i,j,k}$, as usual
  • $i$ denotes the level of factor A, $i = 1, \ldots, a$
  • $j$ denotes the level of factor B, $j = 1, \ldots, b$
  • $k$ denotes the $k$th observation in cell $(i, j)$

Now suppose we have $n = 1$ observation in each cell $(i, j)$. We can no longer estimate variances separately for each treatment. The impact is that we will not be able to estimate the interaction terms; we will have to assume no interaction.

Factor Effects Model

$\mu_{i,j} = \mu + \alpha_i + \beta_j$
  • $\mu$ is the overall mean
  • $\alpha_i$ is the main effect of A
  • $\beta_j$ is the main effect of B

Because we have only one observation per cell, we do not have enough information to estimate the interaction in the usual way. We assume no interaction.

Constraints

  • Text: $\sum \alpha_i = 0$ and $\sum \beta_j = 0$
  • SAS glm: $\alpha_a = \beta_b = 0$

ANOVA Table

    Source   df               SS     MS    F
    A        a - 1            SSA    MSA   MSA/MSE
    B        b - 1            SSB    MSB   MSB/MSE
    Error    (a - 1)(b - 1)   SSE    MSE
    Total    ab - 1           SSTO   MST

Expected Mean Squares

$$\begin{aligned}
E(MSE) &= \sigma^2 \\
E(MSA) &= \sigma^2 + \frac{b}{a - 1} \sum_i \alpha_i^2 \\
E(MSB) &= \sigma^2 + \frac{a}{b - 1} \sum_j \beta_j^2
\end{aligned}$$

Here, $\alpha_i$ and $\beta_j$ are defined with the zero-sum factor effects constraints.
KNNL Example

  • KNNL page 882 (nknw878.sas)
  • $Y$ is the premium for auto insurance
  • A is the size of the city, $a = 3$ levels: small, medium and large
  • B is the region, $b = 2$: East, West
  • $n = 1$
  • the response is the premium charged by a particular company

The lines are not quite parallel, but the interaction, if any, does not appear to be substantial. If it were, our analysis would not be valid and we would need to collect more data.

Plot the estimated model

    title1 'Plot of the model estimates';
    proc gplot data=preds;
      plot muhat*sizea=region;
    run;

Notice that the model estimates produce completely parallel lines.

Tukey test for additivity

If we believe interaction is a problem, this is a possible way to test it without using up all our df.

One additional term is added to the model ($\theta$), replacing the $(\alpha\beta)_{i,j}$ with the product:

$$\mu_{i,j} = \mu + \alpha_i + \beta_j + \theta \alpha_i \beta_j$$

We use one degree of freedom to estimate $\theta$, leaving one left to estimate error. Of course, this only tests for interaction of the specified form, but it may be better than nothing. There are other variations on this idea, such as $\theta_i \beta_j$.
Find $\hat\mu$ (grand mean) (nknw884.sas)

    proc glm data=carins;
      model premium=;
      output out=overall p=muhat;
    run;
    proc print data=overall;
    run;

    Obs  premium  size  region  muhat
      1      140     1       1    175
      2      100     1       2    175
      3      210     2       1    175
      4      180     2       2    175
      5      220     3       1    175
      6      200     3       2    175

Find $\hat\mu_A$ (treatment means)

    proc glm data=carins;
      class size;
      model premium=size;
      output out=meanA p=muhatA;
    run;
    proc print data=meanA;
    run;

    Obs  premium  size  region  muhatA
      1      140     1       1     120
      2      100     1       2     120
      3      210     2       1     195
      4      180     2       2     195
      5      220     3       1     210
      6      200     3       2     210

Find $\hat\mu_B$ (treatment means)

    proc glm data=carins;
      class region;
      model premium=region;
      output out=meanB p=muhatB;
    run;
    proc print data=meanB;
    run;

    Obs  premium  size  region  muhatB
      1      140     1       1     190
      2      100     1       2     160
      3      210     2       1     190
      4      180     2       2     160
      5      220     3       1     190
      6      200     3       2     160

Combine and Compute

    data estimates;
      merge overall meanA meanB;
      alpha = muhatA - muhat;
      beta = muhatB - muhat;
      atimesb = alpha*beta;
    run;
    proc print data=estimates;
      var size region alpha beta atimesb;
    run;

    Obs  size  region  alpha  beta  atimesb
      1     1       1    -55    15     -825
      2     1       2    -55   -15      825
      3     2       1     20    15      300
      4     2       2     20   -15     -300
      5     3       1     35    15      525
      6     3       2     35   -15     -525

    proc glm data=estimates;
      class size region;
      model premium=size region atimesb/solution;
    run;

                                 Sum of
    Source            DF        Squares    Mean Square   F Value   Pr > F
    Model              4    10737.09677     2684.27419    208.03   0.0519
    Error              1       12.90323       12.90323
    Corrected Total    5    10750.00000

    R-Square   Coeff Var   Root MSE   premium Mean
    0.998800    2.052632   3.592106       175.0000

    Source     DF     Type I SS    Mean Square   F Value   Pr > F
    size        2   9300.000000    4650.000000    360.37   0.0372
    region      1   1350.000000    1350.000000    104.62   0.0620
    atimesb     1     87.096774      87.096774      6.75   0.2339

                                   Standard
    Parameter       Estimate          Error   t Value   Pr > |t|
    Intercept    195.0000000 B   2.93294230     66.49     0.0096
    size 1       -90.0000000 B   3.59210604    -25.05     0.0254
    size 2       -15.0000000 B   3.59210604     -4.18     0.1496
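As a hand check of the atimesb line (not part of the SAS output), the one-degree-of-freedom sum of squares for Tukey's term can be computed directly from the alpha, beta and atimesb values in the estimates data set, using the standard form $SS_\theta = \left(\sum_{i,j} \hat\alpha_i \hat\beta_j Y_{i,j}\right)^2 / \left(\sum_i \hat\alpha_i^2 \sum_j \hat\beta_j^2\right)$. The symbol $SS_\theta$ is a label introduced here for illustration.

```latex
\begin{align*}
\sum_{i,j} \hat\alpha_i \hat\beta_j Y_{i,j}
  &= (-825)(140) + (825)(100) + (300)(210) + (-300)(180) + (525)(220) + (-525)(200) = -13500 \\
\sum_i \hat\alpha_i^2 &= 55^2 + 20^2 + 35^2 = 4650, \qquad
\sum_j \hat\beta_j^2 = 15^2 + 15^2 = 450 \\
SS_\theta &= \frac{(-13500)^2}{4650 \times 450} = \frac{182250000}{2092500} \approx 87.097 \\
SSE_{\text{additive}} &= 10750 - 9300 - 1350 = 100 = 87.097 + 12.903 \\
F &= \frac{87.097 / 1}{12.903 / 1} \approx 6.75
\end{align*}
```

So the 2 error df of the additive model split into 1 df for the Tukey term and 1 df for the remaining error, which reproduces the Type I SS of 87.096774 for atimesb and the error SS of 12.90323 shown above.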