Download Statistical Analysis of Fire Station Aid Response Data and Worker Output Data - Prof. Frie and more Assignments Statistics in PDF only on Docsity! Stat 421, Fall 2008 Fritz Scholz Homework 9 Solutions Problem 1: For the fire station aid response data from the previous homework perform pairwise means analyses using 95% confidence intervals, comparing one fire station against another for all choose(4,2) possible comparisons. Do this using the Fisher protected LSD method, the Bonferroni method, the Tukey-Kramer method and Scheffe's method. Write a function that computes these sets of 4 confidence intervals for each of the choose(4,2) comparisons and plots them in groups next to each other as done in the class slides, each group identified as to which pair of means is compared. Draw a horizontal line at the zero level. Also give the confidence interval results in tabular form. Suggestion: you may want to use write.csv(out,"out.csv"), where out is the appropriate tabular output from your function and then you can cut and paste the view of out.csv from Excel into Word. Give the plots, state and explain your conclusion, and give your function code. Solution: Given that the last HW showed that the p-value was .0182 or .0205 (normal theory F- test or randomization F-test), we see that we are below alpha=.05. Thus we go beyond the first step in Fisher’s protected LSD method. The function call out=task1() produced the plot below, comparing the four types of intervals. The code of task1 is given below. co nt ra st s -6 -4 -2 0 2 4 6 Fisher's protected LSD Tukey-Kramer Bonferroni Scheffe ST1 ST2 ST1 ST3 ST1 ST4 ST2 ST3 ST2 ST4 ST3 ST4 The tabular output of the confidence intervals, cut and pasted from Excel, after exporting out to out.csv via write.csv(out,”out.csv”), and minor editing of the spreadsheet (headers and decimals shown): Diff lowerLSD upperLSD lowerTK upperTK lowerBonf upperBonf lowerScheffe upperScheffe -1.00 -3.07 1.07 -3.78 1.78 -3.91 1.91 -4.03 2.03 -1.78 -3.86 0.29 -4.57 1.00 -4.69 1.13 -4.82 1.25 1.57 -0.51 3.64 -1.22 4.35 -1.34 4.48 -1.47 4.60 -0.78 -2.86 1.29 -3.57 2.00 -3.69 2.13 -3.82 2.25 2.57 0.49 4.64 -0.22 5.35 -0.34 5.48 -0.47 5.60 3.35 1.28 5.42 0.57 6.13 0.44 6.26 0.32 6.38 Conclusion: The conclusion is that the difference between Station 3 and Station 4 seems to be mainly responsible for the significant F-test, because (almost) all the other intervals for the pairwise mean differences miss zero. There is one miss for the comparison of Station 2 and Station 4 by the Fisher LSD method. Since that method does not fully protect against multiple comparison issues and since the other intervals cover zero, we should not jump to a conclusion here. The code of task1: task1=function (alpha = 0.05) { y = read.csv("aidresponse.csv") response = as.vector(as.matrix(y)) station = c(rep("station1", 6), rep("station2", 6), rep("station3", 6), rep("station4", 6)) out.lm = lm(response ~ as.factor(station)) out.anova = anova(out.lm) s = sqrt(out.anova[[3]][2]) Ybar = out.lm$fitted.values[c(1, 7, 13, 19)] Diff = NULL lower = NULL upper = NULL lowerTK = NULL upperTK = NULL lowerBonf = NULL upperBonf = NULL lowerScheffe = NULL upperScheffe = NULL tfac = qt(1 - alpha/2, 20) qfac = qtukey(1 - alpha, 4, 20) Ffac = sqrt((4 - 1) * qf(1 - alpha, 4 - 1, 24 - 4)) delta = tfac * s * sqrt(1/6 + 1/6) deltaScheffe = Ffac * s * sqrt(1/6 + 1/6) 1 2 3 4 5 6 7 8 9 10 2 5 0 3 0 0 3 5 0 4 0 0 worker p a rt o u tp u t p e r d a y The outputs look quite variable from worker to worker and we would expect strongly significant results in our ANOVA. Also, the variability seems to be somewhat higher for the higher boxplots. b) Analysis of Variance Table Response: y Df Sum Sq Mean Sq F value Pr(>F) as.factor(worker) 9 254380 28264 118.46 < 2.2e-16 *** Residuals 190 45335 239 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 The p-value of < 2.2e-16 is below the underflow level. Thus the means are significantly different. c) -3 -2 -1 0 1 2 3 -4 0 -2 0 0 20 40 Normal Q-Q Plot Theoretical Quantiles S am pl e Q ua nt ile s The normality looks quite good in the middle portion but the data seem to veer away from the fitted line, especially at the upper end. This could be due to the inhomogeneous variability from worker to worker. d) As for the modified Levine test: The ANOVA on the absolute deviations of the outputs from their respective group medians produced the ANOVA table below. Since .025 < .05 it is certainly significant at level .05, we should think about a variance stabilizing transform for the output data. Analysis of Variance Table Response: d Df Sum Sq Mean Sq F value Pr(>F) as.factor(worker) 9 1937.2 215.2 2.1798 0.02511 * Residuals 190 18762.1 98.7 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 e) k = 10 , n = 20 , Nsim = 10000 , a = 1/ 7 , Fmin.observed = 0.16694 mins1 2 sk 2 maxs1 2 sk 2 F re q u e n cy 0.0 0.2 0.4 0.6 0.8 1.0 0 1 0 0 2 0 0 3 0 0 4 0 0 0 .0 0 2 1 0 .0 0 7 7 The above distribution is the null-distribution of the test statistic Fmin=min(s12,…,sk2)/max(s12, …,sk2). Small values of Fmin would indicate that there is evidence that the standard deviations are not the same. The distribution above is simulated while using the same variance for each group, thus it is the null-distribution. The Fmin.test shows a p-value of .0077< .01, less than half of that in the Levene test, thus more strongly indicating that the variability of output is not sufficiently homogeneous from worker to worker. Thus we should think about transforming the response variable output. f) g) ( e) ) k = 10 , n = 20 , Nsim = 10000 , a = 1/ 7 , Fmin.observed = 0.49235 mins1 2 sk 2 maxs1 2 sk 2 F re q u e n cy 0.0 0.2 0.4 0.6 0.8 1.0 0 5 0 1 0 0 1 5 0 2 0 0 0 .0 0 2 9 0 .8 8 3 Here the Fmin.test also shows no significance whatsoever. g) ( f) ) 0.0030 0.0032 0.0034 0.0036 0.0038 0.0040 0.0042 0 .0 0 0 1 5 0 .0 0 0 1 6 0 .0 0 0 1 7 0 .0 0 0 1 8 0 .0 0 0 1 9 0 .0 0 0 2 0 0 .0 0 0 2 1 Y ~ i s~ i slope 0.067 Y ~ 1 Y s~ sample standard deviation of 1/Y After transforming the data to 1/Y and plotting again si versus Xbari on a log-log scale we don’t see much of a strong positive linear trend anymore, visually comparing the size of the slope with the magnitude of the scatter around the line. h) The F-statistic in the transformed case is even more extreme than in the untransformed case, i.e., 132.91 > 118.46. The analysis on the transformed data is preferable since it does not suffer from the variability inhomogeneity from worker to worker. Homogeneous variability is one of the assumptions of a proper ANOVA. The code producing the above analyses and plots follows. It is assumed that the data frame worker.data was read into the work space as follows worker.data=read.csv(“workerdata.csv”,header=T). worker.data.analysis=function () { { y=worker.data[,2] worker=worker.data[,1] boxplot(y~worker,xlab="worker",ylab="part output per day") readline("hit return\n") svec=NULL mvec=NULL svec1=NULL mvec1=NULL d=NULL d1=NULL for(i in 1:10){ med=median(y[worker==i]) med1=median(1/y[worker==i]) d=c(d,abs(y[worker==i]-med)) d1=c(d1,abs(1/y[worker==i]-med1)) mvec=c(mvec,mean(y[worker==i])) svec=c(svec,sqrt(var(y[worker==i]))) mvec1=c(mvec1,mean(1/y[worker==i])) svec1=c(svec1,sqrt(var(1/y[worker==i]))) } out.Levene0=anova(lm(d~as.factor(worker))) out.Levene1=anova(lm(d1~as.factor(worker))) Fmin.test(k=10,n=20,a.recip=7,Fmin.observed=min(svec^2)/ max(svec^2)) readline("hit return\n") out.lm=lm(y~as.factor(worker)) out.anova=anova(out.lm) qqnorm(out.lm$resid) qqline(out.lm$resid) readline("hit return\n") plot(mvec,svec,log="xy",xlab=expression(bar(Y)[i]),ylab= expression(s[i])) out=lsfit(log10(mvec),log10(svec)) text(min(mvec),max(svec),substitute("slope"~alpha==alph,list(alp h= format(signif(out$coef[2],3)))),adj=0) abline(out) readline("hit return\n") plot(mvec1,svec1,log="xy",xlab=expression(bar(tilde(Y)) [i]),ylab= expression(tilde(s)[i])) out1=lsfit(log10(mvec1),log10(svec1)) text(min(mvec1),max(svec1),substitute("slope"~alpha==alph,list(a lph= format(signif(out1$coef[2],3)))),adj=0) abline(out1) text(median(mvec1),max(svec1),expression(tilde(Y)==1/Y),adj=0.5) text(median(mvec1),.97*max(svec1),expression( tilde(s)==~"sample standard deviation of 1/Y"),adj=0.5) readline("hit return\n") out.lm1=lm(1/y~as.factor(worker)) out.anova1=anova(out.lm1) qqnorm(out.lm1$resid) qqline(out.lm1$resid) readline("hit return\n") Fmin.test(k=10,n=20,a.recip=7,Fmin.observed=min(svec1^2)/ max(svec1^2)) readline("hit return\n") boxplot(1/y~worker,xlab="worker",ylab="days per part") list(out$coef,out1$coef,out.Levene0,out.Levene1,out.anova,out.an ova1) }