
Download Statistics Consulting Cheat Sheet and more Cheat Sheet Statistics in PDF only on Docsity! Statistics Consulting Cheat Sheet Kris Sankaran October 1, 2017 Contents 1 What this guide is for 3 2 Hypothesis testing 3 2.1 (One-sample, Two-sample, and Paired) t-tests . . . . . . . . . . . 4 2.2 Difference in proportions . . . . . . . . . . . . . . . . . . . . . . . 6 2.3 Contingency tables . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3.1 χ2 tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.2 Fisher’s Exact test . . . . . . . . . . . . . . . . . . . . . . 8 2.3.3 Cochran-Mantel-Haenzel test . . . . . . . . . . . . . . . . 8 2.3.4 McNemar’s test . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Nonparametric tests . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4.1 Rank-based tests . . . . . . . . . . . . . . . . . . . . . . . 9 2.4.2 Permutation tests . . . . . . . . . . . . . . . . . . . . . . 11 2.4.3 Bootstrap tests . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4.4 Kolmogorov-Smirnov . . . . . . . . . . . . . . . . . . . . . 13 2.5 Power analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5.1 Analytical . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.5.2 Computational . . . . . . . . . . . . . . . . . . . . . . . . 15 3 Elementary estimation 16 3.1 Classical confidence intervals . . . . . . . . . . . . . . . . . . . . 16 3.2 Bootstrap confidence intervals . . . . . . . . . . . . . . . . . . . . 16 4 (Generalized) Linear Models 17 4.1 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.2 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.3 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . 22 4.4 Poisson regression . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.5 Psueo-Poisson and Negative Binomial regression . . . . . . . . . 27 4.6 Loglinear models . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.7 Multinomial regression . . . . . . . . . . . . . . . . . . . . . . . . 29 4.8 Ordinal regression . . . . . . . . . . . . . . . . . . . . . . . . . . 30 1 5 Inference in linear models (and other more complex settings 32 5.1 (Generalized) Linear Models and ANOVA . . . . . . . . . . . . . 32 5.2 Multiple testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.2.1 Alternative error metrics . . . . . . . . . . . . . . . . . . . 35 5.2.2 Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.3 Causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.3.1 Propensity score matching . . . . . . . . . . . . . . . . . . 36 6 Regression variants 36 6.1 Random effects and hierarchical models . . . . . . . . . . . . . . 36 6.2 Curve-fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 6.2.1 Kernel-based . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.2.2 Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.3 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.3.1 Ridge, Lasso, and Elastic Net . . . . . . . . . . . . . . . . 37 6.3.2 Structured regularization . . . . . . . . . . . . . . . . . . 37 6.4 Time series models . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.4.1 ARMA models . . . . . . . . . . . . . . . . . . . . . . . . 37 6.4.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . 37 6.4.3 State-space models . . . . . . . . . . . . . . . . . . . . . . 37 6.5 Spatiotemporal models . . . . . . . . . . . . . . 
. . . . . . . . . . 37 6.6 Survival analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.6.1 Kaplan-Meier test . . . . . . . . . . . . . . . . . . . . . . 37 7 Model selection 37 7.1 AIC / BIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 7.2 Stepwise selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 7.3 Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 8 Unsupervised methods 38 8.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 8.2 Low-dimensional representations . . . . . . . . . . . . . . . . . . 41 8.2.1 Principle Components Analysis . . . . . . . . . . . . . . . 41 8.2.2 Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . 41 8.2.3 Distance based methods . . . . . . . . . . . . . . . . . . . 41 8.3 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 8.4 Mixture modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 8.4.1 EM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 9 Data preparation 41 9.1 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 9.2 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 9.3 Reshaping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2 Figure 1: Pairing makes it possible to see the effect of treatment in this toy example. The points represent a value for patients (say, white blood cell count) measured at the beginning and end of an experiment. In general, the treatment leads to increases in counts on a per-person basis. However, the inter-individual variation is very large – looking at the difference between before and after with- out the lines joining pairs, we wouldn’t think there is much of a difference. Pairing makes sure the effect of the treatment is not swamped by the varia- tion between people, by controlling for each persons’ white blood cell count at baseline. • In the two sample case, depending on the the sample sizes and population variances within groups, you would need to use different estimates of the standard error. • If the sample size is large enough, we don’t need to assume normality in the population(s) under investigation. This is because the central limit kicks in and makes the means normal. In the small sample setting however, you would need normality of the raw data for the t-test to be appropriate. Otherwise, you should use a nonparametric test, see below. Pairing is a useful device for making the t-test applicable in a setting where individual level variation would otherwise dominate effects coming from treat- ment vs. control. See Figure 1 for a toy example of this behavior. • Instead of testing the difference in means between two groups, test for whether the per-individual differences are centered around zero. • For example, in Darwin’s Rhea Mays data, a treatment and control plant are put in each pot. Since there might be a pot-level effect in the growth of the plants, it’s better to look at the per-pot difference (the differences are i.i.d). Pairing is related to a few other common statistical ideas, 5 • Difference in differences: In a linear model, you can model the change from baseline • Blocking: tests are typically more powerful when treatment and control groups are similar to one another. For example, when testing whether two types of soles for shoes have different degrees of wear, it’s better to give one of each type for each person in the study (randomizing left vs. 
right foot) rather than randomizing across people and assigning one of the two sole types to each person Some examples from past consulting quarters, • Interrupted time analysis • Effectiveness of venom vaccine • The effect of nurse screening on hospital wait time • Testing the difference of mean in time series • t-test vs. Mann-Whitney • Trial comparison for walking and stopping • Nutrition trends among rural vs. urban populations 2.2 Difference in proportions 2.3 Contingency tables Contingency tables are a useful technique for studying the relationship between categorical variables. Though it’s possible to study K-way contingency tables (relating K categorical variables), we’ll focus on 2× 2 tables, which relate two categorical variables with two levels each. These can be represented like in the table in Table 2.3. We usually imagine a sampling mechanism that leads to this table 2, where the probability that a sample lands in cell ij is pij . Hypotheses are then formulated in terms of these pij . A few summary statistics of 2 × 2 tables are referred to across a variety of tests, • Difference in proportions: This is the difference p12 − p22. If the columns represent the survival after being given a drug, and the rows correspond to treatment vs. control, then this is the difference in probabilities someone will survive depending on whether they were given the treatment drug or the control / placebo. • Relative Risk: This is the ratio p12p22 . This can be useful because a small difference near zero or near one is more meaningful than a small difference near 0.5. 2The most common are binomial, multinomial, or Poisson, depending on whether we con- dition on row totals, the total count, or nothing, respectively 6 A1 A2 total B1 n11 n12 n1. B2 n21 n22 n2. total n.1 n.2 n.. Table 1: The basic representation of a 2× 2 contingency table. • Odds-Ratio: This is p12p21p11p22 . It’s referred to in many test, but I find it useful to transform back to relative risk whenever a result is state in terms of odds ratios. • A cancer study • Effectiveness of venom vaccine • Comparing subcategories and time series • Family communication of genetic disease 2.3.1 χ2 tests The χ2 test is often used to study whether or not two categorical variables in a contingency table are related. More formally, it assesses the plausibility of the null hypothesis of independence, H0 : pij = pi+p+j The two most common statistics used to evaluate discrepancies the Pearson and likelihood ratio χ2 statistics, which measure the deviation from the expected count under the null, • Pearson: Look at the squared absolute difference between the observed and expected counts, using ∑ i,j (nij−µ̂ij)2 µ̂ij • Likelihood-ratio: Look at the logged relative difference between observed and expected counts, using 2 ∑ i,j nij log ( nij µ̂ij ) Under the null hypotheses, and assuming large enough sample sizes, these are both χ2 distributed, with degrees of freedom determined by the number of levels in each categorical variable. A useful follow-up step when the null is rejected is to see which cell(s) con- tributed to the most to the χ2-statistic. These are sometimes called Pearson residuals. 7 • Mann-Whitney test – The null hypothesis is H0 : P (X > Y ) = P (Y > X) , which is a strictly stronger condition than equality in the means of X and Y . 
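A minimal sketch of running these rank-based and sign-based procedures in Python, on made-up data (the sign and signed-rank tests are the ones described in the bullets that follow; scipy’s mannwhitneyu, wilcoxon, and binomtest are assumed available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two made-up independent samples (e.g., a measurement in control vs. treatment).
x = rng.normal(loc=0.0, scale=1.0, size=40)
y = rng.normal(loc=0.5, scale=1.0, size=35)

# Mann-Whitney U: compares the ranks of the pooled sample across groups.
u_stat, p_mw = stats.mannwhitneyu(x, y, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mw:.3f}")

# Paired setting: before/after measurements on the same subjects.
before = rng.normal(loc=10.0, scale=2.0, size=30)
after = before + rng.normal(loc=0.8, scale=1.5, size=30)

# Wilcoxon signed-rank test on the paired differences.
w_stat, p_sr = stats.wilcoxon(after, before)
print(f"signed-rank W = {w_stat:.1f}, p = {p_sr:.3f}")

# Sign test: count positive differences and compare with Bin(n, 1/2).
diffs = after - before
n_pos, n = int(np.sum(diffs > 0)), int(np.sum(diffs != 0))
p_sign = stats.binomtest(n_pos, n, p=0.5).pvalue
print(f"sign test: {n_pos}/{n} positive differences, p = {p_sign:.3f}")
```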
– Because the null concerns the whole distributions rather than just their means, care needs to be taken when interpreting a rejection in this and other rank-based tests – the rejection could be due to any difference between the distributions (for example, in their variances), not just a difference in means.

– The procedure does the following: (1) combine the two groups of data into one pooled set, (2) rank the elements in this pooled set, and (3) check whether the ranks in one group are systematically larger than in the other. If there is such a discrepancy, reject the null hypothesis.

• Sign test

– This is an alternative to the paired t-test when data are paired between the two groups (think of a change-from-baseline experiment).

– The null hypothesis is that the differences between paired measurements are symmetrically distributed around 0.

– The procedure first computes the sign of the difference within each pair. It then counts the number of positive signs and compares this count with a Bin(n, 1/2) distribution, which is how the count would be distributed under the null hypothesis.

– Since this test only requires the sign of the difference between pairs, it can be applied in settings with no numerical data (for example, a survey might record “likes” and “dislikes” before and after a treatment).

• Signed-rank test

– When it is possible to measure the size of the difference between pairs (not just its sign), the signed-rank test can often improve on the power of the sign test.

– Instead of only calculating the sign of the difference between pairs, we compute a measure of the size of the difference. For example, with numerical data we could use |x_i,after − x_i,before|.

– We then order the difference scores from largest to smallest and check whether one group is systematically overrepresented among the larger scores (the rejection threshold is typically tabulated, or a more generally applicable normal approximation can be used). If so, reject the null hypothesis.

Some examples from past consulting quarters,

• Evaluating results of a training program
• Nonparametric tests for mean / variance
• t-test vs. Mann-Whitney
• Trial comparison for walking and stopping

2.4.2 Permutation tests

Permutation tests are a kind of computationally intensive test that can be applied quite generally. The typical setting has two groups between which we believe there is some difference. The way we measure this difference might be more complicated than a simple difference in means, so no closed-form null distribution may be available.

The basic idea of the permutation test is that we can randomly create artificial groups in the data, so that there are no systematic differences between the groups. Computing the statistic on many such artificial data sets gives an approximation to the null distribution of that statistic. Comparing the value of the statistic on the observed data with this approximate null gives a p-value. See Figure 2 for a representation of this idea.

• More formally, the null hypothesis tested by a permutation test is that the group labels are exchangeable in the formal statistical sense6

• For the same reason that caution is needed when interpreting rejections from the Mann-Whitney test, be aware that a permutation test can reject the null for reasons other than a simple difference in means.

• The statistic you use in a permutation test can be whatever you want, and the test will still be valid (see the sketch below).
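A minimal sketch of a two-sample permutation test in Python (the data and the number of permutations are made up; the statistic here is a difference in means, but any function of the two groups could be substituted):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: two groups we want to compare.
group_a = rng.normal(loc=0.0, scale=1.0, size=25)
group_b = rng.normal(loc=0.7, scale=1.0, size=30)

def statistic(x, y):
    # Any function of the two groups works; here, a difference in means.
    return np.mean(x) - np.mean(y)

observed = statistic(group_a, group_b)

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_perm = 10_000

perm_stats = np.empty(n_perm)
for i in range(n_perm):
    # Shuffle the pooled data, i.e., randomly relabel group membership.
    shuffled = rng.permutation(pooled)
    perm_stats[i] = statistic(shuffled[:n_a], shuffled[n_a:])

# Two-sided p-value: fraction of relabelings at least as extreme as observed.
p_value = np.mean(np.abs(perm_stats) >= np.abs(observed))
print(f"observed difference = {observed:.3f}, permutation p-value = {p_value:.4f}")
```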
Of course, the power of the test will depend crucially on whether the statistic is tailored to the type of departures from the null which actually exist. • The permutation p-value of a test statistic is obtained by making a his- togram of the statistic under all the different relabelings, placing the ob- served value of the statistic on that histogram, and looking at the fraction of the histogram which is more extreme than the value calculated from the real data. See Figure 2. • A famous application of this method is to Darwin’s Zea Mays data7. In this experiment, Darwin planted Zea Mays that had been treated in two different ways (self vs. cross-fertilized). In each pot, he planted two of each plant, and he made sure to put one of each type in each pot, to control for potential pot-level effects. He then looked to see how high the plants grew. The test statistic was then the standardized difference in means, and this was computed many times after randomly relabeling the 6The distribution is invariant under permutations. 7R.A. Fisher also used this dataset to explain the paired t-test. 11 Figure 2: A representation of a two-sample difference in means permutation test. The values along the x-axis represent the measured data, and the colors represent two groups. The two row gives the values in the observed data, while each following row represents a permutation in the group labels of that same data. The crosses are the averages within those groups. Here, it looks like in the real data the blue group has a larger mean than the pink group. This is reflected in the fact that the difference in means in this observed data is larger here than in the permuted data. The proportion of times that the permuted data have a larger difference in means is used as the p-value. plants as self and cross-fertilized. The actual difference in heights for the groups was then compared to this histogram, and the difference was found to be larger than those in the approximate null, so the null hypothesis was rejected. Some examples of permutation tests recommended in past consulting quar- ters, • Diference of two multinomials • Comparing subcategories and time series • Changepoint in time course of animal behavior 2.4.3 Bootstrap tests While the bootstrap is typically used to construct confidence intervals (see Sec- tion 3.2), it is also possible to use the bootstrap principle to perform hypothesis test. Like permutation tests, it can be applied in a range of situations where classical testing may not be appropriate. • The main idea is to simulate data under the null and calculate the test statistic on these null data sets. The p-value can then be calculated by 12 2.5.1 Analytical Traditionally, power analysis have been done by deciding in advance upon the type of statistical test to apply to the collected data and then using basic sta- tistical theory to work out exactly the number of samples required to reject the null when the signal has some assumed strength. • For example, if the true data distribution is assumed to be N ( µ, σ2 ) , and we are testing against the null N ( 0, σ2 ) using a one-sample t-test, then the fact that x̄ ∼ N ( µ, σ 2 n ) can be used to analytically calculate the probability that the observed mean will be above the t-test rejection threshold. • The size of the signal is assumed known (smaller signals require larger sample sizes to detect). Of course this is the quantity of interest in the study – if it were known, there would be no point in doing the study. 
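A minimal sketch of this kind of analytical power calculation, using the normal approximation for a one-sample, two-sided test of H0: µ = 0 (the significance level, target power, and effect sizes are made up; packages such as statsmodels.stats.power provide ready-made calculators along these lines):

```python
import numpy as np
from scipy import stats

alpha = 0.05   # significance level (two-sided)
power = 0.80   # desired power

z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)

# Required n as a function of the standardized effect size delta = mu / sigma.
for delta in [0.2, 0.5, 0.8]:
    n_required = ((z_alpha + z_beta) / delta) ** 2
    print(f"effect size {delta:.1f}: need n of about {int(np.ceil(n_required))}")

# Conversely, approximate power at a fixed sample size n.
def approx_power(delta, n):
    # P(reject) is roughly Phi(sqrt(n) * delta - z_{1 - alpha/2}), ignoring the far tail.
    return stats.norm.cdf(np.sqrt(n) * delta - z_alpha)

print(f"power at n = 50, delta = 0.4: {approx_power(0.4, 50):.2f}")
```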
The idea, though, is to get a rough estimate of the number of samples required at a few different plausible signal strengths (sometimes a pilot study has been conducted previously, which gives an approximate range for the signal strength to expect).

• There are many power calculators available; these can be useful to share and walk through with clients.

2.5.2 Computational

When more complex tests or designs are used, it is typically impossible to work out an analytical form for the sample size as a function of signal strength. In this situation, it is common to set up a simulation experiment to approximate this function.

• The client needs to specify a simulation mechanism under which the data can plausibly be generated, along with a description of which knobs change the signal strength in which ways.

• The client needs to specify the actual analysis that will be applied to these data to declare significance.

• From here, many simulated datasets are generated for every configuration of signal strength, over a grid of sample sizes. The fraction of times the signal was correctly detected is recorded and used as an estimate of the power under each configuration of signal strength and sample size.

Some examples from past consulting quarters,

• Appropriate sample size calculations
• Land use and forest cover
• t-test vs. Mann-Whitney

3 Elementary estimation

While testing declares that a parameter θ cannot plausibly lie in the subset H0 of parameter values for which the null is true, estimation provides an estimate θ̂ of θ, typically accompanied by a confidence interval, which summarizes the precision of that estimate. In most multivariate situations, estimation requires the technology of linear modeling. However, in simpler settings, it is often possible to construct a confidence interval based on parametric theory or the bootstrap.

3.1 Classical confidence intervals

Classical confidence intervals are based on rich parametric theory. Formally, a confidence interval for a parameter θ is a random (data-dependent) interval [L(X), U(X)] such that, under data X generated with parameter θ, the interval contains θ some prespecified percentage of the time (usually 90, 95, or 99%).

• The most common confidence interval is based on the Central Limit Theorem. Suppose data x1, . . . , xn are sampled i.i.d. from a distribution with mean θ and variance σ². The fact that √n (x̄n − θ) ≈ N(0, σ²) for large n means that [x̄n − z_{1−α/2} σ/√n , x̄n + z_{1−α/2} σ/√n], where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution, is a 100(1 − α)% confidence interval for θ.

• Since proportions can be thought of as averages of indicator variables (1 if present, 0 if not), which have Bernoulli mean p and variance p(1 − p), the same reasoning gives confidence intervals for proportions.

• For the same reason that we might prefer a t-test to a z-test (sample sizes too small to put faith in the central limit theorem), we may sometimes prefer using a t-quantile instead.

• If a confidence interval is known for a parameter, it is easy to construct an approximate interval for any smooth function of that parameter using the delta method. For example, this is commonly used to calculate confidence intervals for log odds-ratios.

3.2 Bootstrap confidence intervals

There are situations in which the central limit theorem might not apply and no theoretical analysis can provide a valid confidence interval. For example, we might have defined a parameter of interest that is not a simple function of means of the data. In many of these cases, it may nonetheless be possible to use the bootstrap (a sketch follows; the mechanics are spelled out in the bullets after it).
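A minimal sketch, assuming the parameter of interest is a median (where no simple closed-form interval is available) and using the percentile interval described in the bullets below; the data and number of resamples are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up skewed data; suppose the parameter of interest is the median.
x = rng.exponential(scale=2.0, size=80)
theta_hat = np.median(x)

n_boot = 5_000
boot_stats = np.empty(n_boot)
for b in range(n_boot):
    # Sampling from F-hat is the same as resampling the data with replacement.
    resample = rng.choice(x, size=len(x), replace=True)
    boot_stats[b] = np.median(resample)

# Percentile bootstrap: report the alpha/2 and 1 - alpha/2 quantiles.
alpha = 0.05
lower, upper = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
se_boot = np.std(boot_stats, ddof=1)   # bootstrap estimate of standard error

print(f"median = {theta_hat:.2f}, 95% percentile CI = ({lower:.2f}, {upper:.2f}), "
      f"bootstrap SE = {se_boot:.2f}")
```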
• The main idea of the bootstrap is the “plug-in principle.” Suppose our goal is to calculate the variance of some statistic θ̂(X1:n) under the true sampling distribution F for the Xi. That is, we want to know VarF(θ̂(X1:n)) (usually so that we can calculate a confidence interval, see the next bullet), but this is unknown, since we don’t actually know F. However, we can plug in F̂ for F, which gives VarF̂(θ̂(X1:n)). This too is unknown, but we can sample from F̂ to approximate it – the more samples we draw from F̂, the better our estimate of VarF̂(θ̂(X1:n)). This pair of approximations (plugging in F̂ for F, then simulating to approximate the variance under F̂) gives a usable approximation v̂(θ̂) of VarF(θ̂(X1:n)). The square root of this quantity is usually called the bootstrap estimate of standard error.

• The bootstrap estimate of standard error can be used to construct confidence intervals, [θ̂(X1:n) − z_{1−α/2} √v̂(θ̂) , θ̂(X1:n) + z_{1−α/2} √v̂(θ̂)].

• Since sampling from F̂ is the same as sampling from the original data with replacement, the bootstrap is often explained in terms of resampling the original data.

• A variant of the above procedure skips the variance estimate and instead simply reports the lower and upper α/2 percentiles of the samples θ̂(X*1:n), for X*i ∼ F̂. This is sometimes called the percentile bootstrap, and it is the version more commonly encountered in practice.

• In consulting situations, the bootstrap gives very general flexibility in defining statistics on which to do inference – you can do inference on parameters motivated by specific statistical structure or domain knowledge.

4 (Generalized) Linear Models

Linear models provide the basis for most inference in multivariate settings. We won’t even begin to try to cover this topic comprehensively – there are entire course sequences that only cover linear models. Instead, we’ll try to highlight the main regression-related ideas that are useful to know during consulting. This section focuses on the big picture of linear regression and when we might want to use it in consulting; we defer a discussion of inference in linear models to Section 5.

Figure 5: In the simplest setting, an interaction between a continuous and a binary variable leads to two different slopes for the continuous variable. Here, we show the scatterplot of (xi1, yi) pairs observed in the data. We suppose a binary variable xi2 has also been measured, and we shade each point according to its value. Apparently, the relationship between x1 and y depends on the value of x2 (in the pink group, the slope is smaller). This can be captured exactly by introducing an interaction term between x1 and x2. In cases where x2 is not binary, we would have a continuum of slopes relating x1 and y – one for each value of x2.

[…] that aren’t just the ones that were collected originally might not be obvious to your client. For example, if you were trying to predict whether someone will have a disease12 based on a time series of some lab tests, you can construct new variables corresponding to the “slope at the beginning,” the “slope at the end,” the max, or the min across the time series. Of course, deciding which variables might actually be relevant for the regression will depend on domain knowledge; a sketch of computing such derived features appears below.
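A minimal sketch of building such derived features, assuming each subject contributes a short lab-test time series (the data layout and the particular summaries are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data: one lab-test time series (10 visits) per subject.
n_subjects, n_visits = 5, 10
times = np.arange(n_visits)
series = rng.normal(loc=10, scale=2, size=(n_subjects, n_visits)).cumsum(axis=1)

def derived_features(y, t):
    """Summaries of one series that could enter a regression as covariates."""
    slope_start = np.polyfit(t[:4], y[:4], deg=1)[0]   # slope over early visits
    slope_end = np.polyfit(t[-4:], y[-4:], deg=1)[0]   # slope over late visits
    return {
        "slope_start": slope_start,
        "slope_end": slope_end,
        "max": y.max(),
        "min": y.min(),
    }

features = [derived_features(series[i], times) for i in range(n_subjects)]
for i, f in enumerate(features):
    print(i, {k: round(float(v), 2) for k, v in f.items()})
```

The resulting one-row-per-subject summaries can then be joined with the response and any other covariates before fitting the regression.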
• One trick – introducing random effects – is so common that it gets it’s own section. Basically, it’s useful whenever you have a lot of levels for a particular categorical vector. Some examples where regression was used in past sessions, • Family communication genetic disease • Stereotype threat in virtual reality • Fish gonad regression • UV exposure and birth weight • Land use and forest cover • Molecular & cellular physiology 12In this case, the response is binary, so you would probably use logistic regression, but the basic idea of derived variables should still be clear. 20 Figure 6: Complex functions can be represented as simple mixtures of basis functions. This allows the modeling of nonlinear effects using just a linear model, by including the basis functions among the set of covariates. • Multiple linear regression for industry data • Prediction intervals 4.2 Diagnostics How can you assess whether a linear regression model is appropriate? Many types of diagnostics have been proposed, but a few of the most common are, • Look for structure in residuals: According to equation 1, the amount that the predictions ŷi = x T i β̂ is off from yi (this difference is called the residual ri = ŷi − yi) should be about i.i.d. N ( 0, σ2 ) . Whenever there is a systematic pattern in these residuals, the model is misspecified in some way. For example, if you plot the residuals over time and you find clumps that are all positive or negative, it means there is some unmeasured phenomena associated with these time intervals that influences the average value of the responses. In this case, you would define new variables for whether you are in one of these intervals, but the solution differs on a case- by-case basis. Other types of patterns to keep an eye out for: nonconstant spread (heteroskesdasticity), large outliers, any kind of discreteness (see Figure 7). • Make a qq-plot of residuals: More generally than simply finding large outliers in the residuals, we might ask whether the residuals are plausibly drawn from a normal distribution. qq-plots give one way of doing this – more often than not the tails are heavier than normal. Most people ignore this, but it can be beneficial to consider e.g. robust regression or considering logistically (instead of normally) distributed errors. • Calculate the leverage of different points: The leverage of a sample is a measure of how much the overall fit would change if you took that point out. Points with very high leverage can be cause for concern – it’s bad if your fit completely depended on one or two observations only – and these high leverage points often turn out to be outliers. See Figure 8 for an example of this phenomenon. If you find a few points have very high 21 Figure 7: Some of the most common types of “structure” to watch out for in residual plots are displayed here. The top left shows how residuals should appear – they look essentially i.i.d. N (0, 1). In the panel below, there is nonconstant variance in the residuals, the one on the bottom has an extreme outlier. The panel on the top-right seems to have means that are far from zero in a structured way, while the one below has some anomalous discreteness. leverage, you might consider throwing them out. Alternatively, you could consider a robust regression method. • Simulate data from the model: This is a common practice in the Bayesian community (“posterior predictive checks”), though I’m not aware if people do this much for ordinary regression models. 
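A minimal sketch of computing a few of these diagnostics by hand on made-up data (in practice, tools such as R’s plot() on a fitted lm object produce the corresponding residual, QQ, and leverage plots directly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Made-up regression data with one covariate and one gross outlier.
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
x[-1], y[-1] = 25.0, 0.0          # a single high-leverage, outlying point

X = np.column_stack([np.ones(n), x])             # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)
print("largest leverage:", round(float(leverage.max()), 3),
      "at index", int(leverage.argmax()))

# Normality of residuals (a qq-plot shows the same thing graphically).
print("Shapiro-Wilk p-value for residuals:", round(stats.shapiro(resid).pvalue, 4))

# Structure: in practice, plot resid against fitted values and each covariate,
# looking for trends, funnels (heteroskedasticity), clumps, or lone outliers.
print("largest |residual|:", round(float(np.abs(resid).max()), 2))
```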
The idea is simple though – simulate data from xTi β̂ + i, where i ∼ N ( 0, σ̂2 ) and see whether the new yi’s look comparable to the original yi’s. Characterizing the ways they don’t match can be useful for modifying the model to better fit the data. Some diagnostics-related questions from past quarters, • Evaluating regression model 4.3 Logistic regression Logistic regression is the analog of linear regression that can be used whenever the response Y is binary (e.g., patient got better, respondent answered “yes,” email was spam). 22 • Out of the box, the coefficients β fitted by logistic regression can be dif- ficult to interpret. Perhaps the easiest way to interpret them is in terms of the relative risk, which gives an analog to the usual linear regression interpret “when the jth feature goes up by one unit, the expected response goes up by βj .” First, recall that the relative risk is defined as P (yi = 1|xi) P (yi = 0|xi) = p (xi) 1− p (xi) , (4) which in logistic regression is approximated by exp ( xTi β ) . If we increase the jth coordinate of xi (i.e., we take xi → xi + δj), then this relative risk becomes exp ( (xi + δj) T β ) = exp ( xTi β ) exp (βj) . (5) The interpretation is that the relative risk got multiplied by exp (βj) when we increased the jth covariate by one unit. • Diagnostics: While you could study the differences yi−p̂ (xi), which would be analogous to linear regression residuals, it is usually more informative to study the Pearson or deviance residuals, which upweight small differences near the boundaries 0 and 1. These types of residuals take into account structure in the assumed bernoulli (coin-flipping) sampling model. • For ways to evaluate the classification accuracy of logistic regression mod- els, see Section 10.5, and for an overview of formal inference, see Section 5. • Teacher data and logistic regression • Conditional logistic regression • Land investment and logistic regression • Uganda 4.4 Poisson regression Poisson regression is a type of generalized linear model that is often applied when the responses yi are counts (i.e., yi ∈ {0, 1, 2, . . . , }). • As in logistic regression, one motivation for using this model is that using ordinary logistic regression on these responses might yield predictions that are impossible (e.g., numbers below zero, or which are not integers). • To see where the main idea for this model comes from, recall that the Poisson distribution with rate λ draws integers with probabilities Pλ [y = k] = λk exp (−λ) k! . (6) 25 Figure 11: A representation of Poisson regression with one feature. The x-axis encodes the value of this feature, and the y-axis gives the count value. Observed data points are sketched in purple. The mean count λ (x) increases as a function of x – the true mean is sketched in pink. Poisson regression models this mean count as an exponential in x, and the approximation is drawn in blue. The dashed lines represent the variation of the counts around there means. Note that larger counts are associated with larger variation, but that the Poisson regression underestimates the true variation – this data seem overdispersed. The idea of Poisson regression is to say that the different yi are drawn Poissons with rates that depend on the associated xi. • In more detail, we assume that the data have a joint likelihood p (y1, . . . , yn) = n∏ i=1 λ (xi) yi exp (−λ (xi)) yi! (7) and that the log of the rates are linear in the covariates, log λ (xi) ≈ xTi β. 
(8)

(Modeling the logs as linear ensures that the actual rates are always nonnegative, which the Poisson distribution requires.)

• We think of different regions of the covariate space as having higher or lower counts on average, depending on the function λ(xi). Moving from xi to xi + δj multiplies the rate from λ(xi) to exp(βj) λ(xi).

• As in logistic regression, it makes more sense to consider the deviance residuals when performing diagnostics.

• A deficiency of Poisson regression models (which often motivates clients to show up to consulting) is that real data often exhibit overdispersion with respect to the assumed Poisson model. The issue is that the mean and variance of counts in a Poisson model are tied together: if you sample from a Poisson(λ), the mean and the variance of the associated counts are both λ. In real data, the variance is often larger than the mean, so while the Poisson regression model might do a good job approximating the mean λ(xi) at xi, the observed variance of the yi near xi might be much larger than λ(xi). This motivates the methods in Section 4.5.

4.5 Pseudo-Poisson and Negative Binomial regression

Pseudo-Poisson and negative binomial regression are two common strategies for addressing overdispersion in count data.

In the pseudo-Poisson setup, a new parameter ϕ is introduced that sets the relative scale of the variance in comparison to the mean: Var(y) = ϕ E[y]. This is not associated with any real probability distribution, and the associated likelihood is called a pseudolikelihood. However, ϕ can be optimized along with β from the usual Poisson regression setup to provide a maximum pseudolikelihood estimate.

In negative binomial regression, the Poisson likelihood is abandoned altogether in favor of the negative binomial likelihood. Recall that the negative binomial (like the Poisson) is a distribution on the nonnegative counts {0, 1, 2, . . .}. It has two parameters, p and r,

P_{p,r}[y = k] = C(k + r − 1, k) p^k (1 − p)^r    (9)

which can be interpreted as the number of heads that appear before seeing r tails when flipping a coin with probability p of heads. More important than the specific form of the distribution is the fact that it has two parameters, which allow different variances even for the same mean,

E_{p,r}[y] = pr / (1 − p)    (10)

Var_{p,r}(y) = pr / (1 − p)² = E_{p,r}[y] + (1/r) (E_{p,r}[y])².    (11)

In particular, for small r the variance is much larger than the mean, while for large r the variance is about equal to the mean (it reduces to the Poisson). For negative binomial regression, the likelihood (9) is substituted for the Poisson likelihood when doing regression, and the mean is allowed to depend on covariates. On the other hand, while the variance is no longer fixed to the mean, it must be the same across all data points. This likelihood model is not exactly a GLM (the negative binomial is not in the exponential family), but various methods for fitting it are available.

There is a connection between the negative binomial and the Poisson that both illuminates potential sources of overdispersion and suggests new algorithms

[…] we increase the jth variable by one unit, so that xi → xi + δj. Then, the relative probability against class K changes according to

p_W(yi = k | xi + δj) / p_W(yi = K | xi + δj)
= [exp(w_k^T (xi + δj)) / Σ_{k′} exp(w_{k′}^T (xi + δj))] · [Σ_{k′} exp(w_{k′}^T (xi + δj)) / exp(w_K^T (xi + δj))]    (23)
= exp(w_k^T xi + w_{kj})    (24)
= exp(w_k^T xi) exp(w_{kj}),    (25)

where in the second line we used the fact that w_K is constrained a priori to be 0.
So, the K-class analog of relative risk for the kth class is multipled by exp (wkj) when we increase the j th feature by a unit. 4.8 Ordinal regression Sometimes we have K classes for the responses, but there is a natural ordering between them. For example, survey respondents might have chosen one of 6 values along a likert scale. Multinomial regression is unaware of this additional ordering structure – a reasonable alternative in this setting is ordinal regression. • The basic idea for ordinal regression is to introduce a continuous latent variable zi along with K − 1 “cutpoints” γ1, . . . , γK−1, which divides the real line into K intervals. When zi lands in the k th of these intervals, we observe yi = k. • Of course, neither the zi’s nor the cutpoints γk are known, so they must be inferred. This can be done using the class frequencies of the observed yis though (many yi = k means the k th bin is large). • To model the influence of covariates p (yi = k|xi), we suppose that zi = βTxi + i. When i is Gaussian, we recover ordinal probit regression, and when i follows the logistic distribution 13 we recover ordinal logistic regression. • An equivalent formulation of ordinal logistic regression models the “cu- mulative logits” as linear, log ( p (yi ≤ k) 1− p (yi ≤ k) ) = αk + β Txi. (26) Here, the αk’s control the overall frequencies of the k classes, while β controls the influence of covariates on the response. • Outside of the latent variable interpretation, it’s also possible to under- stand β in terms of relative risks. In particular, when we increase the jth coordinate by 1 unit, xi → xi + δj , the odds of class k relative to class k−1 gets multiplied by exp (βj), for every pair of neighboring classes k−1 and k. 13This is like the Gaussian, except it has heavier (double-exponential-like) tails. 30 Figure 12: The association between the latent variable and the observed re- sponse in ordinal regression. The top panel is a histogram of the latent zis. The dashed lines represent associated cutpoints, which determine which bin the observed response yi belong to. The correspondence is mapped in the bottom panel, which plots the latent variable against the observed yi – larger values of the latent variable map to larger values of yi, and the width of bins in the top panel is related to the frequency of different observed classes. Latent variables give a natural approach for dealing with ordered categories. 31 Figure 13: To incorporate the influence of features xi in ordinal regression, a model is fit between the xi and the latent zi. The x-axis represents the value of this feature, and the y-axis represents the latent zi. The histogram on the left is the same as the histogram in Figure 12. The (xi, zi) in the real data are in the purple scatterplot, and their projection onto the y-axis is marked by purple crosses. The bin in which the purple crosses appear (equivalently, which horizontal band the data point appears in) determines the observed class yi. In this data, larger values of xi are asociated with larger latent zi, which then map to higher ordinal yi. 5 Inference in linear models (and other more complex settings It’s worthwhile to draw a distinction between model estimation and statistical inference. Everything in section 4 falls under the purview of model estimation – we are building interpretable representations of our data by fitting different kinds of functions over spaces defined by the data. 
In contrast, the goal of inference is to quantify uncertainty in our analysis – how far off might our models be? The point is that critiquing the “insights” gained from a data analysis is just as scientifically relevant as the analysis itself. The testing and estimation procedures outlined in section 2 are a first step towards this kind of critical evaluation of data analysis, but those ideas become much more powerful when combined with GLMs. 5.1 (Generalized) Linear Models and ANOVA One of the reasons linear models are very popular across many areas of science is that the machinery for doing inference with them is (1) relatively straightfor- wards and (2) applicable to a wide variety of hypotheses. While we can fit linear models without making assumptions about the un- derlying data generating mechanism, these assumptions are crucial for doing valid inference. 32 interval (or region) for a parameter (or a set of parameters) contains all the configurations of those parameters that we would not reject, if they were set as the values for the hypothesized submodel. It’s better to recommend confidence intervals when possible, because they give an idea of practical in addition to statistical significance. While understanding these fundamentals is important for using them suc- cessfully during a consulting session, the real difficulty is often in carefully for- mulating the client’s hypothesis in terms of a linear model. This requires being very explicit about the data generating mechanism, as well as the pair / sub- model pair of interest. However, this additional effort allows testing in settings that are much more complex than those reviewed in section 2, especially those that require controlling for sets of other variables. • Testing the difference of mean in time series 5.2 Multiple testing Multiple testing refers to the practice of testing many hypotheses simultane- ously. The need to adapt classical methods is motivated by the fact that, if (say) a 100 tests were done all at a level of α = 0.05, and the the data for each hypothesis is in fact null, you would still reject about 5 on average. This comes from the interpretation of p-values as the probability under the null hypothesis that you observe a statistic at least as extreme as the one in your data set. 5.2.1 Alternative error metrics The issue is that in classical testing, errors are measured on a per-hypothesis level. In the multiple testing setting, two proposals that quantify the error rate over a collection of hypotheses are, • Family-Wise Error Rate (FWER): This is defined as the probability that at least one of the rejections in the collection is a false positive. It can be overly conservative in settings with many hypothesis. That said, it is the expected standard of evidence in some research areas. • False Discovery Rate (FDR): This is defined as the expected proportion of significant results that are false positives. Tests controlling this quantity don’t need to be as conservative as those controlling FWER. 5.2.2 Procedures The usual approach to resolving the multiple testing problem is (1) do classical tests for each individual hypothesis, and then (2) perform an adjustment of the p-values for each test, to ensure one of the error rates defined above is controlled. Two adjustments are far more common than the rest, • Bonferroni: Simply multiple each p-value by the number of tests N that were done (equivalently, divide each sigificance threshold by N). This procedure controls the FWER, by a union bound. 
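Both the Bonferroni adjustment just described and the Benjamini-Hochberg procedure described in the next bullet are only a few lines of code. A by-hand sketch on made-up p-values (statsmodels.stats.multitest.multipletests implements these and other corrections):

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up p-values: 90 true nulls (uniform) plus 10 real signals (tiny p-values).
pvals = np.concatenate([rng.uniform(size=90), rng.uniform(0, 0.001, size=10)])
N = len(pvals)
alpha = 0.05

# Bonferroni: multiply each p-value by N (capped at 1); controls the FWER.
p_bonf = np.minimum(pvals * N, 1.0)
print("Bonferroni rejections:", int(np.sum(p_bonf <= alpha)))

# Benjamini-Hochberg: find the largest i with p_(i) <= i * alpha / N, then
# reject the hypotheses with the i smallest p-values; controls the FDR.
order = np.argsort(pvals)
sorted_p = pvals[order]
below = sorted_p <= (np.arange(1, N + 1) * alpha / N)
n_reject = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
reject_bh = np.zeros(N, dtype=bool)
reject_bh[order[:n_reject]] = True
print("Benjamini-Hochberg rejections:", int(reject_bh.sum()))
```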
35 • Benjamini-Hochberg: Plot the p-values in increasing order. Look for the last time the p-values drop below the line iαN (viewed as a function of i). Reject all hypotheses associated with the p-values up to this point. This procedure controls the FDR, assuming independence (or positive dependence) of test statistics among individual hypotheses. A few other methods are worth noting, • Simes procedure: This is actually always more powerful than Bonferroni, and applicable in the exact same setting, but people don’t use it as often, probably because it’s not as simple. • Westfall-Young procedure: This is a permutation-based approach to FWER control. • Knockoffs: It’s possible to control FDR in high-dimensional linear models. This is still a relatively new method though, so may not be appropriate for day-to-day consulting (yet). Here are some problems from previous quarters related to multiple testing, • Effectiveness of venom vaccine • Multiple comparisons between different groups • Molecular & cellular physiology • Pyschiatric diagnosis and human behavior 5.3 Causality • Land use and forest cover 5.3.1 Propensity score matching • The effect of nurse screening on hospital wait time 6 Regression variants 6.1 Random effects and hierarchical models • Systematic missing test • Trial comparison for walking and stopping 6.2 Curve-fitting • Piecewise Linear Regression 36 6.2.1 Kernel-based • Interrupted time analysis 6.2.2 Splines 6.3 Regularization 6.3.1 Ridge, Lasso, and Elastic Net 6.3.2 Structured regularization 6.4 Time series models • UV exposure and birth weight • Testing the difference of mean in time series • Nutrition trends among rural vs. urban populations 6.4.1 ARMA models 6.4.2 Hidden Markov Models • Change point in time course of animal behavior 6.4.3 State-space models 6.5 Spatiotemporal models 6.6 Survival analysis 6.6.1 Kaplan-Meier test • Classification and survival analysis 7 Model selection 7.1 AIC / BIC 7.2 Stepwise selection 7.3 Lasso • Culturally relevant depression scale 37 • Jaccard: This is a distance between pairs of length p binary sequences xi and xj defined as d (xi, xj) = 1− ∑p k=1 I (xik = xjk = 1) p , (35) or one minus the fraction of coordinates that are 0/1. The motivation for this distance is that coordinates that are both 0 should not contribute to similarity between sequences, especially when they may be dominated by 0s. We apply this distance to the binarized version of the species counts. • Dynamic time warping distance: A dynamic time warping distance is useful for time series that might not be quite aligned. The idea is to measure the distance between the time series after attempting to align them first (using dynamic programming). • Mixtures of distances: Since a convex combination of distances is still a distance, new ones can be tailored to specific problems accordingly. For example, on data with mixed types, a distance can be defined to be a combination of a euclidean distance on the continuous types and a jaccard distance on the binary types. In contrast, probabilistic clustering techniques assume latent cluster indica- tor zi for each sample and define a likelihood model (which must itself be fit) assuming these indicators are known. Inference of these unknown zi’s provides the sample assignments, while the parameters fitted in the likelihood model can be used to characterize the clusters. Some of the most common probabilistic clustering models are, • Gaussian mixture model: The generative model here supposes that there are K means µ1, . . . , µK . 
Each sample i is assigned to one of K categories (call it zi = k), and then the observed sample is drawn from a gaussian with mean µk and covariance Σ (which doesn’t depend on the class assign- ment). This model can be considered the probabilistic analog of K-means (K means actually emerges as the small-variance limit). • Multinomial mixture model: This is identical to the gaussian mixture model, except the likelihood associated with the kth group is Mult (ni, pk), where ni is the count in the i th row. This is useful for clustering count matrices. • Latent Class Analysis: • Hidden Markov Model: This is a temporal version of (any of) the earlier mixture models.The idea is that the underlying states zi transition to one another according to a markov model, and the observed data are some emission (gaussian or multinomial, say) from the underlying states. The point is that a sample might be assigned to a centroid different from the closest one, if it would allow the state sequence to be one with high probability under the markov model. 40 Related to these probabilistic mixture models are mixed-membership models. They inhabit the space between discrete clustering continuous latent variable model methods – each data point is thought to be a mixture of a few underlying “types.” These our outside of the scope of this cheatsheet, but see .... for details. • Medwhat learning algorithm • Unsupervised learning for classifying materials • Clustering survey data • CS 106A survey 8.2 Low-dimensional representations 8.2.1 Principle Components Analysis • Relationship power scale • Survey data underlying variables • Analyzing survey data • Teacher data and logistic regression • Unsupervised learning for classifying materials 8.2.2 Factor analysis • Culturally relevant depression scale 8.2.3 Distance based methods 8.3 Networks • Molecular & cellular physiology • Variational bounds on network structure 8.4 Mixture modeling 8.4.1 EM 9 Data preparation 9.1 Missing data • Systematic missing test 9.2 Transformations • Normalizing differences for geology data 41 9.3 Reshaping 10 Prediction • Homeless to permanent housing 10.1 Feature extraction • Classification based on syllable data • Unsupervised learning for classifying materials 10.2 Nearest-neighbors 10.3 Tree-based methods 10.4 Kernel-based methods 10.5 Metrics • Unsupervised learning for classifying materials 10.6 Bias-variance tradeoff 10.7 Cross-validation • Classification and survival analysis • UV exposure and birth weight • Land investment and logistic regression 11 Visualization The data visualization literature reflects influences from statistics, computer science, and psychology, and a range of principles have emerged to guide the design of visualizations that facilitate meaningful scientific interpretation. We’ll review a few of those here, along with a few practical recommendations for how to brainstorm or critique data visualizations during a consulting session. Some general principles developed by the data visualization community are • Information density: Visualizations that maximize the “data-to-ink” ratio are much more informative than those that represent very little quantita- tive information using a large amount of space. For example, a histogram of a sample is more informative than just a mean + standard error. The flipside is tha reducing the amount of irrelevant “ink” (e.g., excessively dense grid lines) or white space can make it easier to inspect the true data patterns. The general philosophy is that an intricate but informative 42