Population: the entire collection of objects or individuals about which information is desired.
➔ It is easier to take a sample
  ◆ Sample: the part of the population that is selected for analysis
  ◆ Watch out for:
    ● Limited sample size that might not be representative of the population
  ◆ Simple Random Sampling: every possible sample of a certain size has the same chance of being selected

Observational Study: there can always be lurking variables affecting results
➔ e.g., strong positive association between shoe size and intelligence for boys
➔ **should never be used to show causation

Experimental Study: lurking variables can be controlled; can give good evidence for causation

Descriptive Statistics Part I
➔ Summary Measures
➔ Mean: arithmetic average of data values
  ◆ **Highly susceptible to extreme values (outliers); it is pulled towards extreme values
  ◆ The mean can never be larger than the max or smaller than the min, but it can equal the max or the min
➔ Median: in an ordered array, the median is the middle number
  ◆ **Not affected by extreme values
➔ Quartiles: split the ranked data into 4 equal groups
  ◆ Box and Whisker Plot
➔ Range = $X_{\text{maximum}} - X_{\text{minimum}}$
  ◆ Disadvantages: ignores the way in which data are distributed; sensitive to outliers
➔ Interquartile Range (IQR) = 3rd quartile - 1st quartile
  ◆ Not used that much
  ◆ Not affected by outliers
➔ Variance: the average squared distance from the mean
  $s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
  ◆ squaring gets rid of the negative values
  ◆ units are squared
➔ Standard Deviation: shows variation about the mean
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
  ◆ highly affected by outliers
  ◆ has same units as original data
  ◆ in finance: a horrible measure of risk (trampoline example)

Descriptive Statistics Part II

Linear Transformations
➔ Linear transformations change the center and spread of data
➔ $\mathrm{Var}(a + bX) = b^2\,\mathrm{Var}(X)$
➔ Average(a + bX) = a + b[Average(X)]
➔ Effects of Linear Transformations:
  ◆ $\text{mean}_{\text{new}} = a + b \cdot \text{mean}$
  ◆ $\text{median}_{\text{new}} = a + b \cdot \text{median}$
  ◆ $\text{stdev}_{\text{new}} = |b| \cdot \text{stdev}$
  ◆ $\text{IQR}_{\text{new}} = |b| \cdot \text{IQR}$
➔ Z-score: the new data set will have mean 0 and variance 1
  $z = \frac{X - \bar{X}}{S}$

Empirical Rule
➔ Only for mound-shaped data
➔ Approx. 95% of data is in the interval $(\bar{x} - 2s_x,\ \bar{x} + 2s_x) = \bar{x} \pm 2s_x$
➔ only use if you just have mean and std. dev.

Chebyshev's Rule
➔ Use for any set of data and for any number k greater than 1 (1.2, 1.3, etc.)
➔ At least $1 - \frac{1}{k^2}$ of the data falls within k standard deviations of the mean
➔ (Ex) for k = 2 (2 standard deviations), at least 75% of data falls within 2 standard deviations

Detecting Outliers
➔ Classic Outlier Detection
  ◆ doesn't always work
  ◆ $|z| = \left|\frac{X - \bar{X}}{S}\right| \geq 2$
➔ The Boxplot Rule
  ◆ Value X is an outlier if: $X < Q_1 - 1.5(Q_3 - Q_1)$ or $X > Q_3 + 1.5(Q_3 - Q_1)$

Skewness
➔ measures the degree of asymmetry exhibited by data
  ◆ negative values = skewed left
  ◆ positive values = skewed right
  ◆ if $|\text{skewness}| < 0.8$, don't need to transform data

Measurements of Association
➔ Covariance
  ◆ Covariance > 0 = larger x, larger y
  ◆ Covariance < 0 = larger x, smaller y
  ◆ $s_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
  ◆ Units = (units of x) × (units of y)
  ◆ The sign of the covariance is +, -, or 0; its magnitude can be any number
➔ Correlation: measures strength of a linear relationship between two variables
  ◆ $r_{xy} = \frac{\text{covariance}_{xy}}{(\text{std. dev.}_x)(\text{std. dev.}_y)}$
  ◆ correlation is between -1 and 1
  ◆ Sign: direction of relationship
  ◆ Absolute value: strength of relationship (-0.6 is a stronger relationship than +0.4)
  ◆ Correlation doesn't imply causation
  ◆ The correlation of a variable with itself is one

Combining Data Sets
➔ For $Z = aX + bY$:
➔ Mean(Z): $\bar{Z} = a\bar{X} + b\bar{Y}$
➔ Var(Z): $s_z^2 = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$

Portfolios
➔ Return on a portfolio: $R_p = w_A R_A + w_B R_B$
  ◆ weights add up to 1
  ◆ return = mean
  ◆ risk = std. deviation
➔ Variance of return of portfolio: $s_p^2 = w_A^2 s_A^2 + w_B^2 s_B^2 + 2 w_A w_B\, s_{A,B}$
  ◆ Risk (variance) is reduced when stocks are negatively correlated.
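The two-asset portfolio formulas above can be sketched in a few lines of Python. This is only an illustration: the function names `portfolio_return` and `portfolio_variance` and the sample numbers are hypothetical, and the weight of stock B is assumed to be `1 - w_a` because the weights add up to 1.

```python
def portfolio_return(w_a, r_a, r_b):
    """Return on a two-asset portfolio: R_p = w_A*R_A + w_B*R_B."""
    w_b = 1.0 - w_a  # weights add up to 1
    return w_a * r_a + w_b * r_b


def portfolio_variance(w_a, var_a, var_b, cov_ab):
    """Variance of the portfolio return:
    s_p^2 = w_A^2*s_A^2 + w_B^2*s_B^2 + 2*w_A*w_B*s_AB
    """
    w_b = 1.0 - w_a
    return w_a ** 2 * var_a + w_b ** 2 * var_b + 2.0 * w_a * w_b * cov_ab


# Hypothetical inputs: equal weights, variances 0.04 and 0.09,
# and a covariance of +0.03 versus -0.03.
var_pos = portfolio_variance(0.5, 0.04, 0.09, 0.03)
var_neg = portfolio_variance(0.5, 0.04, 0.09, -0.03)
print(var_pos, var_neg)
```

With the same weights and individual variances, flipping the sign of the covariance term is the only thing that changes, so the negatively related pair ends up with the smaller portfolio variance.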
(Stocks are negatively correlated when their covariance is negative.)

Probability
➔ measure of uncertainty
➔ the set of outcomes has to be exhaustive (all options possible) and mutually exclusive (no 2 outcomes can occur at the same time)
➔ Mean for uniform distribution: $E(X) = \frac{a + b}{2}$
➔ Variance for unif. distribution: $\mathrm{Var}(X) = \frac{(b - a)^2}{12}$

Normal Distribution
➔ governed by 2 parameters: $\mu$ (the mean) and $\sigma$ (the standard deviation)
➔ $X \sim N(\mu, \sigma^2)$

Standardize Normal Distribution: $Z = \frac{X - \mu}{\sigma}$
➔ Z-score is the number of standard deviations the related X is from its mean
➔ **P(Z < some value) is just the probability found in the table
➔ **P(Z > some value) is (1 - probability found in the table)

Normal Distribution Example

Sums of Normals

Sums of Normals Example:
➔ Cov(X, Y) = 0 b/c they're independent

Central Limit Theorem
➔ as n increases, $\bar{x}$ should get closer to $\mu$ (population mean)
➔ $\text{mean}(\bar{x}) = \mu$
➔ $\text{variance}(\bar{x}) = \sigma^2 / n$
➔ $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
  ◆ if population is normally distributed, n can be any value
  ◆ for any population, n needs to be $\geq 30$
➔ $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Confidence Intervals: tell us how good our estimate is
**Want high confidence, narrow interval
**As confidence increases, the interval also widens

A. One Sample Proportion
➔ $\hat{p} = \frac{x}{n} = \frac{\text{number of successes in sample}}{\text{sample size}}$
➔ 95% confidence interval: $\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
➔ We are thus 95% confident that the true population proportion is in the interval…
➔ We are assuming that n is large, $n\hat{p} > 5$, and our sample size is less than 10% of the population size.

Standard Error and Margin of Error

Example of Sample Proportion Problem

Determining Sample Size
$n = \frac{(1.96)^2\, \hat{p}(1 - \hat{p})}{e^2}$
➔ If given a confidence interval, $\hat{p}$ is the middle number of the interval
➔ No confidence interval; use worst case scenario
  ◆ $\hat{p} = 0.5$

B. One Sample Mean

For samples n > 30
Confidence Interval (95%): $\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$
➔ If n > 30, we can substitute s for $\sigma$ so that we get: $\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}$

For samples n < 30
T Distribution used when:
➔ $\sigma$ is not known, n < 30, and data is normally distributed
* Stata always uses the t-distribution when computing confidence intervals

Hypothesis Testing
➔ Null Hypothesis: $H_0$, a statement of no change that is assumed true until evidence indicates otherwise.
➔ Alternative Hypothesis: $H_a$ is a statement that we are trying to find evidence to support.
➔ Type I error: reject the null hypothesis when the null hypothesis is true. (considered the worst error)
➔ Type II error: do not reject the null hypothesis when the alternative hypothesis is true.

Example of Type I and Type II errors

Methods of Hypothesis Testing
1. Confidence Intervals **
2. Test statistic
3. P-values **
➔ C.I. and P-values are always safe to use because you don't need to worry about the size of n (can be bigger or smaller than 30)

One Sample Hypothesis Tests
1. Confidence Interval (can be used only for two-sided tests)
2. Test Statistic Approach (Population Mean)
3. Test Statistic Approach (Population Proportion)
4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ **If P is low (less than 0.05), $H_0$ must go: reject the null hypothesis

Assumptions of Simple Linear Regression
1. We model the AVERAGE of something rather than something itself
2.
  ◆ As $\varepsilon$ (noise) gets bigger, it's harder to find the line

Estimating $S_e$
➔ $S_e^2 = \frac{SSE}{n - 2}$
➔ $S_e^2$ is our estimate of $\sigma^2$
➔ $S_e = \sqrt{S_e^2}$ is our estimate of $\sigma$
➔ 95% of the Y values should lie within the interval $b_0 + b_1 X \pm 1.96\,S_e$

Example of Prediction Intervals:

Standard Errors for $b_1$ and $b_0$
➔ standard errors increase when noise increases
➔ $s_{b_0}$: amount of uncertainty in our estimate of $\beta_0$ (small s good, large s bad)
➔ $s_{b_1}$: amount of uncertainty in our estimate of $\beta_1$

Confidence Intervals for $b_1$ and $b_0$
➔ n small → bad
➔ $s_e$ big → bad
➔ $s_x^2$ small → bad (want the x's spread out for a better guess)

Regression Hypothesis Testing
*always a two-sided test
➔ want to test whether the slope ($\beta_1$) is needed in our model
➔ $H_0$: $\beta_1 = 0$ (don't need x)
  $H_a$: $\beta_1 \neq 0$ (need x)
➔ Need X in the model if:
  a. 0 isn't in the confidence interval
  b. $|t| > 1.96$
  c. P-value < 0.05

Test Statistic for Slope/Y-intercept
➔ can only be used if n > 30
➔ if n < 30, use p-values

Multiple Regression
➔ Variable Importance:
  ◆ higher t-value, lower p-value = variable is more important
  ◆ lower t-value, higher p-value = variable is less important (or not needed)

Adjusted R-squared
➔ k = # of X's
➔ Adj. R-squared will decrease as you add junk x variables
➔ Adj. R-squared will increase only if the x you add in is very useful
➔ **want Adj. R-squared to go up and $S_e$ low for a better model

The Overall F Test
➔ Always want to reject F test (reject null hypothesis)
➔ Look at p-value (if < 0.05, reject null)
➔ $H_0$: $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_k = 0$ (don't need any X's)
  $H_a$: at least one $\beta_j \neq 0$ (need at least 1 X)
➔ If no x variables are needed, then SSR = 0 and SST = SSE

Modeling Regression

Backward Stepwise Regression
1. Start with all variables in the model
2. At each step, delete the least important variable, based on the largest p-value above 0.05
3. Stop when you can't delete any more
➔ Will see Adj. R-squared increase and $S_e$ decrease

Dummy Variables
➔ An indicator variable that takes on a value of 0 or 1; allows intercepts to change

Interaction Terms
➔ allow the slopes to change
➔ interaction between 2 or more x variables that will affect the Y variable

How to Create Dummy Variables (Nominal Variables)
➔ If C is the number of categories, create (C - 1) dummy variables for describing the variable
➔ One category is always the "baseline", which is included in the intercept

Recoding Dummy Variables
Example: How many hockey sticks sold, with summer as the baseline (original equation)
  $\text{hockey} = 100 + 10\,\text{Wtr} - 20\,\text{Spr} + 30\,\text{Fall}$
Write the equation with winter as the baseline
  $\text{hockey} = 110 + 20\,\text{Fall} - 30\,\text{Spri} - 10\,\text{Summer}$
➔ **always need to get the same exact values as from the original equation (e.g., fall: 100 + 30 = 110 + 20 = 130)

Regression Diagnostics

Standardize Residuals

Check Model Assumptions
➔ Plot residuals versus Y-hat
➔ Outliers
  ◆ Regression likes to move towards outliers (shows up as $R^2$ being really high)
  ◆ want to remove an outlier that is extreme in both x and y
➔ Nonlinearity (ovtest)
  ◆ Plotting residuals vs. fitted values will show a relationship if data is nonlinear ($R^2$ also high)
  ◆ Log transformation accommodates nonlinearity, reduces right skewness in the Y, eliminates heteroskedasticity
  ◆ **Only take log of X variable so that we can compare models. Can't compare models if you take log of Y.
  ◆ Transformations cheatsheet
  ◆ ovtest: a significant test statistic indicates that polynomial terms should be added
  ◆ $H_0$: data = no transformation
    $H_a$: data ≠ no transformation
➔ Normality (sktest)
  ◆ $H_0$: data = normality
    $H_a$: data ≠ normality
  ◆ don't want to reject the null hypothesis.
  ◆ The p-value should be big
➔ Homoskedasticity (hettest)
  ◆ $H_0$: data = homoskedasticity
  ◆ $H_a$: data ≠ homoskedasticity
  ◆ Homoskedastic: band around the values
  ◆ Heteroskedastic: as x goes up, the noise goes up (no more band, fan-shaped)
  ◆ If heteroskedastic, fix it by logging the Y variable
  ◆ If heteroskedastic, fix it by making standard errors robust
➔ Multicollinearity
  ◆ when x variables are highly correlated with each other
  ◆ $R^2 > 0.9$
  ◆ pairwise correlation > 0.9
  ◆ correlate all x variables and include the y variable; of a highly correlated pair, drop the x variable that is less correlated with y

Summary of Regression Output
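As a rough illustration of where the numbers in a simple regression output come from, the sketch below computes b0, b1, SSE, Se, the standard error of the slope, its t-statistic, and R-squared in plain Python. The function name `simple_regression` and the toy data are hypothetical. The slope's standard error uses the standard textbook formula $s_{b_1} = S_e / \sqrt{\sum (x_i - \bar{x})^2}$, which is an assumption here because the sheet's own formula did not survive extraction.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x with basic summary quantities."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)

    se = math.sqrt(sse / (n - 2))   # Se^2 = SSE / (n - 2)
    s_b1 = se / math.sqrt(sxx)      # assumed textbook formula for s_b1
    return {
        "b0": b0,
        "b1": b1,
        "SSE": sse,
        "SST": sst,
        "Se": se,
        "s_b1": s_b1,
        "t_b1": b1 / s_b1,          # test statistic for H0: beta1 = 0
        "R2": 1.0 - sse / sst,
    }


# Hypothetical toy data, only to exercise the function.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_regression(x, y)
for name, value in fit.items():
    print(name, round(value, 4))
```

With only five points, the |t| > 1.96 rule of thumb does not apply (the sheet restricts it to n > 30), so the t-statistic here is shown only to make the arithmetic concrete.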