Statistical Analysis for Hypothesis Testing: Variables, Descriptive Stats, and Tests, Study notes of Ecology and Environment

An introduction to statistical hypotheses, the process of collecting data to test hypotheses, and the role of statistics in determining the significance of differences between means or associations among variables. It covers the distinction between discrete, continuous, and categorical variables, measures of dispersion, and statistical tests such as t-tests, ANOVA, chi-square tests, and correlations. The document also includes examples and instructions for performing statistical analyses using Excel.

Introduction to Statistics and Hypothesis Testing

Overview

We will be using some basic statistics this semester. Because a statistics class is not a required prerequisite for this course, we are going to use the first lab to go over and practice applying some very basic statistical concepts. You will use these concepts repeatedly in your lab write-ups and for your Student-Driven Independent Projects. This abridged statistics manual is organized into three sections:

I) An introduction to hypothesis testing.
II) An introduction to statistical concepts, and the settings in which to apply them. At the end of this section you will find a summary table and a flow chart that may help you decide when to use a particular statistical test.
III) The last section provides the mathematical procedures for each of the introduced statistical concepts, and (where applicable) how to run each test using Excel.

Terms in bold are terms you should know the definitions of.

References

Modified from Cáceres, C.E. 2007. IB 449, Limnology Lab Manual.
Modified from Augspurger, C.K. and G.O. Batzli. 2008. IB 203, Exercises in Ecology Course Manual.

I) Introduction to Hypothesis Testing

Asking questions in a scientifically meaningful way

Statistics are of little use unless you first have a clear hypothesis and a specific quantitative prediction to test. Consider a simple example. You are interested in the research question of whether herbivores affect the amount of algae present in lake water. You make that speculation a bit more specific and develop a general hypothesis: The presence of herbivores reduces the abundance of algae in lake water. Now you have something to work with, but it is still a pretty vague notion. What you need to do is apply this hypothesis to a specific study system.
You know from surveys of local lakes that some lakes have Daphnia (large herbivorous zooplankton) whereas others do not. So, based on your hypothesis, you can now make a specific, testable prediction: The mean abundance of algae will be greater in lakes without Daphnia than in lakes where Daphnia are present. The hypothesis and its prediction can be formally written as an if-then statement as follows: If herbivores affect the amount of algae in lake water, then the mean abundance of algae will be greater in lakes without Daphnia than in lakes with Daphnia.

Once you have a quantitative prediction firmly in mind for your experiment or study system, you can proceed to the next step - forming statistical hypotheses for testing. Statistical hypotheses are possible quantitative relationships among means or associations among variables. The null hypothesis (Ho) states that there is no association between two variables, or no difference among means. For our example, it would be stated as "Ho: algal abundance in Daphnia-absent lakes = algal abundance in Daphnia-present lakes" (or, Ho: A = B). The alternative hypothesis (H1) states the pattern in the data that is expected if your prediction holds true. For our example, "H1: algal abundance in Daphnia-absent lakes > algal abundance in Daphnia-present lakes" (or, H1: A > B). Once our question has taken this shape, we can begin thinking about experimental design and start collecting data.

Answering your questions

Now you have to go out, collect data, and evaluate whether or not the data support your hypothesis. Ideally, we might want to know the mean abundance of algae in ALL lakes with and without Daphnia, a population of lakes that would constitute the universal set. However, this is rarely if ever feasible or possible.
Instead, we are usually restricted to picking a subset of lakes (a sample set, or sample for short) from each type of lake, which allows us to obtain an estimate of the overall relationship between algal abundance and Daphnia presence in the universal set. Hopefully, the sample set is representative of the universal set. Selecting unbiased, random methods of sampling can go a long way toward ensuring that the sample set has a greater likelihood of being representative of the universal set.

B) Descriptive Statistics (Univariate Statistics)

If we take the time to closely investigate any biological species or phenomenon, we will find that variation is the norm, not the exception. If we are to describe a particular characteristic of a species, for example, it thus does not suffice to choose a "perfect specimen" to measure. One of Charles Darwin's great insights was to realize that for species the interesting thing was the variation within a population - the spread - and not the population mean or a representative abstraction (the Platonic ideal or holotype). Variation - the deviations from the mean - can play a critical role in understanding a natural system.

There are several commonly used descriptive statistics for discrete and continuous variables. These include measures of central tendency, measures of dispersion, and measures of symmetry and shape. A brief description of the most important of each of these follows below.

Histograms as an exploratory tool

A good starting point for understanding the variation in a data set is to determine the distribution of values of the measured variable. A distribution is a mathematical map of how the frequency of occurrence changes as a function of variable magnitude. The best way to see and first investigate the distribution is by constructing a histogram. The x-axis is the variable value (e.g. size), and the y-axis is a measure of the frequency with which a value or range of values occurs.
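The counting that underlies a histogram can be sketched in a few lines of code (Python here, purely for illustration; the manual itself uses Excel). The function name, bin width, starting boundary, and sample values below are made-up assumptions:

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin of width bin_width, beginning at start.
    A value equal to a bin boundary counts in the bin to its right (a >= rule)."""
    nbins = int((max(values) - start) // bin_width) + 1
    counts = [0] * nbins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts

# e.g. sizes sorted into bins of width 2 starting at 0: [0,2), [2,4), [4,6), [6,8)
sizes = [1.2, 2.0, 2.5, 3.1, 4.4, 4.9, 5.0, 7.8]
print(histogram_counts(sizes, 2.0, 0.0))  # → [1, 3, 3, 1]
```

Note that 2.0 lands in the [2,4) bin, not [0,2), matching the boundary rule described later in Section III.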
A histogram gives you a quick, first visual impression of the data you collected. The shape of the distribution that we plotted as a histogram can be described in terms of three categories: descriptors of central tendency, of dispersion, and of symmetry. The following description applies mainly to unimodal distributions (distributions with only one peak). Bimodal or polymodal distributions often (but not always) imply that the sample set consists of more than one population, and the sampling scheme may have to be rethought.

Measures of central tendency

mode: the most frequently occurring value (the highest peak of the frequency distribution).
median: the value such that half the values are lower and half are higher.
arithmetic mean: a.k.a. the average. The sum of all the values (xi) divided by the number of data points (n): x̄ = (Σxi)/n = (x1 + x2 + x3 + … + xn)/n

Measures of dispersion

range: the difference between the maximum and minimum value: xmax - xmin
variance: the average of the squared deviations from the mean in a sample or population. The variance considers how far each point is from the mean. The variance for a population (the universal set) is denoted by σ² and is calculated as:

σ² = SS(population) / N, WHERE: SS = Sum of Squares = Σ(xi - x̄)²

The best estimate of the population variance is the sample variance, s²:

s² = SS(sample) / (n - 1)

standard deviation: the square root of the variance [st. dev. = σ = √(σ²)]. The standard deviation is a description of the spread of data points around the mean. An approximate way to think about the standard deviation is as an average of the amount that the observations deviate from the mean. If the observations form a normal distribution (see below) around the mean, and ONLY if they form a normal distribution, 68% of the observations are within +/- 1 standard deviation, 95.4% of the observations are within +/- 2 standard deviations, and 99.7% of the observations are within +/- 3 standard deviations of the mean.
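As a cross-check on the formulas above, here is a minimal sketch (in Python rather than Excel, for illustration only) computing the sample mean, variance, standard deviation, and standard error; the function name and the sample numbers are invented:

```python
import math

def dispersion(sample):
    """Sample mean, variance, standard deviation, and standard error, as defined above."""
    n = len(sample)
    mean = sum(sample) / n                     # arithmetic mean: (sum of xi) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squares: sum of (xi - mean)^2
    s2 = ss / (n - 1)                          # sample variance s^2 = SS / (n - 1)
    sd = math.sqrt(s2)                         # standard deviation
    se = sd / math.sqrt(n)                     # standard error = st. dev. / sqrt(n)
    return mean, s2, sd, se

mean, s2, sd, se = dispersion([2, 4, 4, 4, 5, 5, 7, 9])
```

Note the n - 1 divisor: it is what distinguishes the sample variance s² from the population variance σ² given above.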
standard error: The standard error is defined as the standard deviation divided by the square root of the number of observations, or:

SE = standard deviation / √n

Technically, the standard error is not a measure of dispersion. Instead, it is a measure of how certain we are that the sample mean is a good representation of the universal (or population) mean that we are trying to estimate with our sample. Note that the standard error is large for small sample sizes and becomes progressively smaller as sample size increases, and hence as our estimate of the "true population mean" gets better. Plotting the standard error as whiskers around the sample mean indicates that the true population mean likely lies somewhere within this range.

When to plot the standard deviation versus the standard error? If we are interested in how the individual observations are distributed around the mean, we plot the standard deviation. If we are interested in knowing the range of possible values of the population mean that we are trying to estimate (i.e. what uncertainty is associated with our estimate of the population mean), we plot the standard error.

Measures of symmetry

As stated above, variation is the norm in the natural world. For example, imagine an insect population of identical genetic makeup with determinate growth. In a perfectly uniform environment, individuals of this insect population might all grow to 3.0 cm in length. However, such environmental consistency does not exist, even in carefully controlled laboratory settings. Slight variations in temperature, humidity, food availability, competition with sibling larvae, food quality, and many other effects lead to variation in the final size of the adult insect.
So long as these sources of variation (or dispersion from the mean) are the result of many random and independent contributions, the mean size of adult individuals of this insect population will still be 3.0 cm, but with a symmetric distribution of sizes surrounding this mean in the form of a normal or Gaussian distribution. Many natural phenomena have distributions approaching a normal distribution. Additionally, normal distributions have convenient properties. For example, all the information we need to plot any given normal distribution is the mean and the standard deviation. Additionally, if, and only if, a distribution is normally distributed, 68.2% of the observations are within +/- 1 standard deviation, 95.4% of the observations are within +/- 2 standard deviations, and 99.7% of the observations are within +/- 3 standard deviations of the mean (see figure below).

A great wealth of statistical techniques (in fact, all the tests included in this introductory statistics chapter) require that the data be normally distributed. It is thus necessary to test whether your data are normally distributed. The following simple measure of symmetry will suffice for our purposes in this course. Note that although it is advantageous if your data follow a normal distribution, not all is lost if they don't. Often, the data can be transformed so as to fit a normal distribution. If that does not help, a number of statistical tests that do not require normally distributed data can be substituted for the ones we discuss here. If you run into these issues, especially for your independent project, COME SEE ONE OF THE INSTRUCTORS FOR HELP!

… of dragonfly food in each tank after a specified amount of time compared to what we placed in the tank initially, and use a t-test to see whether there was any difference between the mean amount of food in the 10 tanks with just dragonflies (treatment 1) and the mean amount of food in the 10 tanks with added predators (treatment 2).
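One common form of the t statistic behind such a two-treatment comparison is the pooled (equal-variance) two-sample t. The following Python sketch is illustrative only - the manual itself runs t-tests in Excel, the pooled form is an assumption (the manual's exact variant is not shown in this excerpt), and the data are invented:

```python
import math

def two_sample_t(a, b):
    """Pooled (equal-variance) two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    ss1 = sum((x - m1) ** 2 for x in a)          # sum of squares, group 1
    ss2 = sum((x - m2) ** 2 for x in b)          # sum of squares, group 2
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)            # pooled variance estimate
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2                        # compare |t| to a critical value for df

# hypothetical food amounts remaining: dragonflies alone vs. dragonflies + predators
t, df = two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Deciding significance still requires comparing |t| to a tabulated critical value (or computing a p-value, as Excel does); that step is not shown here.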
ANOVA (Analysis of Variance)

ANOVA looks for differences among treatments (like a t-test) by examining the variance around the means. The main difference from a t-test is that ANOVA can look for differences among more than two means. In fact, a t-test is really just a simple type of ANOVA. What you need to know is that if the ANOVA is significant, then there is a difference among the treatments. However, if there is a significant difference, ANOVA does not tell you anything about which treatments are significantly different from the others; you must use other tests, such as multiple pair-wise comparisons via t-tests, to determine which ones are significantly different from the others.

So why bother doing an ANOVA first - why not just start with multiple t-tests? The reason is similar to gambling at a slot machine. Although the likelihood of winning is very small for each individual game, if you pull that slot machine often enough, eventually you will win, just by chance. Similarly, every time you do a t-test, you stand a 5% (0.05) chance of finding a difference even though there is no difference. The more t-tests you run, the greater the chance that one of your pair-wise comparisons, just by chance, shows a significant difference. This means that if you are comparing multiple means of a data set via pair-wise t-tests, the overall likelihood of finding a difference SOMEWHERE in your data set, even though there is in reality no difference, is greater than 0.05. The ANOVA does not have such a compounding problem. Hence, the approach is to FIRST run an ANOVA on your data set. IF the ANOVA found significant differences between means in your data set, you are then justified in looking for where these differences are by doing pair-wise comparisons of the means.

An example: Let's modify the t-test experiment that had two treatments.
Instead, we now have three treatments: dragonflies alone (treatment 1), dragonflies and predator type 1 (treatment 2), and dragonflies and predator type 2 (treatment 3). We would use an ANOVA to see if there is a difference among the three treatments in the mean amount of food consumed by the dragonfly larvae.

Correlation

A t-test is used to test for statistically significant differences between the means of two different treatments. Correlations require a different kind of data. Here you have a list of subjects (individual animals, plots, ponds) and take measurements of two different continuous variables for each subject. Do subjects with a high level of variable X also tend to have a high level of variable Y? These kinds of systematic associations between two variables are described by correlation. A correlation establishes whether there is some relationship between two continuous variables - but it does not determine the nature of that relationship, i.e. it cannot be translated into cause and effect. The intensity of the relationship is measured by the correlation coefficient "r". The value of r varies between -1 and 1: -1 indicates a strong negative relationship between the two variables, 0 indicates no relationship, and 1 a strong positive relationship. If the associated p-value is <0.05, we can conclude that the positive or negative relationship between the two variables is unlikely to have resulted by chance.

An example: You suspect that there is some relationship between the length of a person's thighbone and their overall height. To test this, you take a sample of people and measure the height and thighbone length of each of them. We then run a correlation using thighbone length as the X variable and height as the Y variable, or vice versa.
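A correlation like the one just described boils down to computing r from sums of cross-products and squared deviations; a minimal Python sketch (illustrative only - the Excel steps appear in Section III, and the function name and data are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists of observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # co-variation of x and y
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))           # spread of x
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))           # spread of y
    return cov / (sx * sy)

# hypothetical thighbone lengths (cm) and heights (cm)
r = pearson_r([42, 45, 48, 51], [160, 168, 175, 181])
```

Because x and y enter the formula symmetrically, swapping the two lists gives the same r, matching the point that a correlation has no dependent or independent variable.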
Since in a correlation we do not assume that one variable has a direct effect on the other, but simply look for a systematic relationship in their variation, there are no dependent or independent variables in a correlation, and hence it does not matter which variable is on which axis (BUT compare to regression!). In our example case we would probably find a relatively strong positive correlation - an r-value above 0.8 or so - indicating that tall individuals tend to have long thighbones and vice versa. Note, however, that we did not imply that height caused long thighbones, or that long thighbones caused greater height. Nor did we determine whether thighbone length is a good predictor of height.

Regression

A regression is similar to a correlation. One of the key differences is that a regression DOES test whether changes in one variable are predictive of changes in the other, and it provides a simple mathematical model of this relationship that can be used to predict additional values of variable Y for an untested value of variable X. For simple linear regression, this mathematical model is a straight line, following the familiar y = mx + b (m = slope, b = y-intercept). Once this line is established, we can predict values of y from any value of x. A word of caution: it is not advisable to try to infer values of y for values of x that exceed the range of x values experimentally tested in establishing the regression. The regression loses its reliability beyond the tested range.

How well we can predict y from x - in other words, how strong the relationship between the two variables is - is measured by the coefficient of determination, or "r2" (the correlation coefficient squared). r2 ranges from 0 to 1, with 1 indicating that the values of x perfectly predict y; that is, if we regress the values, all values fall exactly on the regression line (this never happens in ecology).
An interesting note is that the r2 value tells us how much of the variation in our dataset is explained by our statistical model (the regression line). For example, an r2 value of 0.62 indicates that our model captures 62% of the variation in the relationship of y to x. We use the p-value to assess whether the relationship observed between the two variables is due to sampling error or indicative of a real pattern. p-values < 0.05 allow you to reject the Ho that the observed relationship is due to chance - you have found a real association between the variables.

An example: We have data on light intensity and algal biomass for twenty locations within a reach of stream. We suspect that higher light intensities may lead to more algal biomass. So we could run a regression on light intensity and algal biomass, using light as our x variable and algal biomass as our y variable. We end up with an equation for the regression line and an r2 value. With a high r2 value we can feel confident in using the equation to predict values of y based on x.

NOTE: As opposed to a correlation, a regression does have dependent versus independent variables. Reread the sentence above stating that we suspect higher light intensities may lead to more algal biomass. That sentence assumes directionality in the relationship of the two variables. Light intensity is suspected to influence algal biomass, not the other way around. So light is independent, whereas algal biomass is dependent on light intensity. By convention, we plot the independent variable (light) on the x-axis and the dependent variable (algal biomass) on the y-axis.
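For simple linear regression, the slope m, intercept b, and r2 can all be computed directly from sums of squares; a Python sketch (illustrative only - the manual's Excel route is given in Section III, and the names and numbers below are assumptions):

```python
def linear_regression(xs, ys):
    """Least-squares fit y = m*x + b plus r^2; minimizes vertical (y) deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    m = sxy / sxx                    # slope of the best-fit line
    b = my - m * mx                  # y-intercept
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - ss_res / ss_tot         # proportion of variation in y explained by the line
    return m, b, r2

# hypothetical light intensities (x) and algal biomass values (y)
m, b, r2 = linear_regression([10, 20, 30, 40], [1.1, 2.3, 2.9, 4.2])
```

Note that only the vertical deviations enter ss_res, which is why swapping x and y gives a different line: the fit is not symmetric in the two variables, unlike the correlation coefficient.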
Another way of thinking about this is that the independent variable is the variable manipulated by us during the experiment (light) to test the response in the dependent variable (algal biomass). We are not likely to try to manipulate algal biomass and expect the light intensity striking the water surface to change as a result. Which variable is independent vs. dependent, and hence whether it is plotted on the x- or the y-axis, DOES make a difference in a regression. The reason is that a regression determines the deviations of each data point from the best-fit line (the regression line). There are two directions in which these deviations can be measured: a horizontal direction and a vertical direction (see figures below). When we ask how well X can predict Y, we are interested only in the deviations from the line in the vertical, y-direction. If we confuse which variable is dependent vs. independent, we calculate a best-fit line in the horizontal, or x-direction, and receive a different (and wrong) answer.

III) Calculating your statistics

(*Note: these instructions are written for the Microsoft Office 2003/2004 (PC/Mac) versions, not the 2007/2008 Office.)

Here you will find instructions on how to calculate and plot each of the statistical procedures described in Section II), in the sequence they were introduced.

A) HISTOGRAMS

Construction of histograms by hand

1. First decide on the best bin width (and hence the number of bins needed to cover the data range) based on the maximum and minimum population values and the sample size. If most bins don't have at least several values that fall within them, then the bin size should be increased. Try something like sample size divided by number of bins equal to or larger than 5-10. Another rule of thumb is to choose your number of bins as the square root of n (n = sample size). Your bins should all be the same width.
2. Decide on what value to start your bin boundaries with.
As you will see, it can make a difference, especially for sample sets with a smaller n.
3. Count the number of population values that fall within a certain bin. If a value is equal to a bin boundary, it typically counts in the bin to the right (a greater-than-or-equal-to rule).
4. Plot the bin boundaries along the x-axis. Plot columns in the y direction whose height is equal to the number of values from the population within a given bin. Alternatively, you may also plot columns as the %-value of the counts within a bin with respect to the total sample size.

Construction of histograms in Excel

The initial steps are the same as described above. There are two ways to obtain a histogram in Excel:

Using the FREQUENCY function:
1. I suggest you avoid this until you are more familiar with Excel. How it operates differs from one Excel version to another. In some situations it can provide greater flexibility.
2. Insert the sample data in a column. If you have a lot of data, you may want to sort it first to find the minimum and maximum numbers.
3. Create an array of bin values in descending order.
4. Use the FREQUENCY function, selecting the data and bin arrays when asked.
5. The FREQUENCY function should return an array of numbers that are the counts in each bin. You may have to drag/copy the first cell downwards to create the array.
6. Idiosyncrasies of the Excel FREQUENCY function: if you have your bin values in ascending order, it computes a cumulative frequency; if you have your bin values in descending order, it computes a standard frequency. Bottom line: always check your numbers to make sure they make sense. For older versions of Excel you can try using control-shift-enter for the FREQUENCY function. One Excel source indicates that this should yield an array instead of a single number. In some versions of Excel on some platforms you may want to fix your data cell reference with a $.
7. Use a column plot to look at the results.
You can plot multiple histograms at once. If you look in the Series window there is a place to label the bin intervals by choosing your bin array. Check your x-axis labeling to make sure you are plotting bin intervals. Unfortunately, Excel is not very good about the placement of x labels.

Using the Histogram routine (the suggested route):
1. The Histogram routine is found under Tools, under Data Analysis. This is an add-in, and if you can't find it you may need to install it: look under Tools at the Add-Ins option. If you are lucky you won't need the CD. It varies from platform to platform and by Excel version.
2. Choose Histogram, and insert the data and bin arrays in the appropriate spots in the input box. Choose what type of output you want. Make sure you select the chart option. Notice that this will put the results into a separate sheet. Select finish and, that's it, you're done.

Population distributions vs. histograms: with increasing n (n = sample size) and smaller and more bins, the histogram should approach a curve, i.e. a continuous frequency distribution.

B) DESCRIPTIVE STATISTICS

The formulae and definitions for calculating the various descriptive statistics are given with their explanations in Section II) of this booklet.

Excel univariate statistical functions: Below is a list of functions you may find useful in describing distribution characteristics. You can find a list of Excel functions under Insert > Function. You should familiarize yourself with the array of functions available to you. If you want to enter a formula into a cell, start with the equal sign. Then you can build the equation using standard mathematical operators and by inserting function commands. Available functions include: AVERAGE, FREQUENCY, MEDIAN, MODE, RANK, SKEW, SORT, STDEV (for standard deviation), VAR (for variance). The function arguments (the numbers it acts on) are placed within parentheses after the function.
For statistical functions, an array of numbers is input into the function, usually using cell references for the beginning and end of the array. For example, typing "=AVERAGE(A1:A36)" in a cell will calculate the average of the numbers in the array of cells from A1 through A36. Note that Excel describes what a function does when you insert it into the formula bar. Help will also give you a more complete description. The FREQUENCY function can be a bit tricky to use.

C) T-TEST

Requires: 1 categorical independent variable with 2 treatments and 1 dependent variable.

a) Use Excel to make a figure
1. Calculate the average and the SE (standard error) for each treatment. (Think about why you need to use the standard error, and not the standard deviation!) Create the following type of table in the Excel spreadsheet:

Treatment | Average | SE
Category 1 | [Fill in your calculated values] | …
Category 2 | … | …

2. Highlight the 6 cells in the first 2 columns of the table. MAC: click on the chart wizard (bar chart) icon located in the toolbar. PC: from the menu bar, pull down Insert and click 'Chart'. Click on the type of chart and subtype you want to make. Click 'next'. (We recommend using a bar graph in this instance.)
3. Do NOT give the figure a title. Label the x-axis (horizontal, with the independent variable); label the y-axis (vertical, with the dependent variable).
4. Clean up your graph while on the sheet for labeling axes (remove extra legends, gridlines, background coloration, etc.; click on gridlines in the menu and uncheck the box next to gridlines, etc.). i) Select 'next'; select 'as object in sheet 1'; select 'finish'. ii) To change the x-axis so that it starts at 0, click on the axis. Select scale. Minimum = 0.
5. Add standard error bars to each average. i) MAC: double click inside one bar; select 'Y error bars'. PC: right click inside a bar; select Format Data Series, then select 'Y error bars'. For both MAC and PC: ii) For display, select 'both'. For Error Amount, select 'Custom'.
Select the triangle (or icon) to the right. iii) Highlight the SE data from the worksheet (for the + bar). Repeat for the - bar.
6. In a cell BELOW the figure, create a figure caption (we also call this a figure legend). The rule of thumb is that the reader should completely understand the figure without reading any of the text of the paper. Include what the bar and vertical lines indicate (mean +/- SE). Keep it to one complete 'sentence'. For example: "Figure 1. Comparison of plant species richness (mean +/- SE) between two fields abandoned for one vs. five years."

E) CORRELATION / REGRESSION

Requires: two continuous variables.

How to create the figure in Excel:
1. Have the data for the 2 variables in the first two columns of the worksheet.
2. Make an (XY) scattergram (scatterplot) figure:
   i. Select the chart wizard in the toolbar.
   ii. Chart type: XY (Scatter); chart subtype: upper left.
   iii. Data range: select all cells on the worksheet with data.
   iv. Don't give a title to your figure. Label the x- (include units) and y-axes.
   v. Save as object in Sheet 1. Move to the lower left of the page.
   vi. In a cell BELOW the figure, create a figure caption (we also call this a figure legend). The rule of thumb is that the reader should completely understand the figure without reading any of the text of the paper. Keep it to one complete 'sentence'. For example: "Figure 1. Number of fruits as a function of total leaf area (cm2) of a pawpaw tree."

How to calculate the CORRELATION statistic in Excel (no cause-effect in x and y):
1. Calculate the correlation coefficient r, which measures the strength of association between the two variables.
   a. Type 'correlation' in a cell below the column A data. Select the cell to its right.
   b. Select fx in the toolbar, then 'statistical', then 'correl'.
   c. Array 1: select the data in column A.
   d. Array 2: select the data in column B. OK.

Interpretation of results:
1. Whether r is significant depends on the degrees of freedom.
The critical value that r must be greater than to be significant (P<0.05) will have to be looked up in a correlation table (found at the end of this section). A p-value <0.05 indicates that the relationship (either positive or negative) that you observe between the two variables is unlikely to be due to chance: there is a systematic relationship.
2. Based on your data, what can you conclude? Do the data support your hypothesis?
3. To write up the "Results" section of a manuscript, summarize the main pattern in your figure. State the statistical output and refer to Fig. 1 in parentheses at the end of the sentence. Keep the results of a correlation to 1 sentence. Example: "Number of fruits did not increase significantly as total leaf area of the tree increased (r = 0.523, P > 0.05) (Fig. 1)."

How to calculate the REGRESSION statistic in Excel (assuming cause-effect in x and y):

THE QUICK AND DIRTY WAY:
1. Add a regression line to the above figure.
   a. MAC: highlight the chart area within the figure. From the chart menu select 'Add Trendline'; for the Type tab, select 'linear'.
   b. PC: right click on one point in the figure. Then do as above for MAC.
2. Calculate r2 (values range from 0 to 1).
   a. Highlight the chart area within the figure.
   b. From the chart menu select 'Add Trendline'.
   c. For the Type tab, select 'options'.
   d. Check 'display r2 value on chart'.
   e. Interpret the r2 value. How well does the line represent the data points?
3. Get an equation for the regression line.
   a. Highlight the chart area.
   b. From the chart menu, select 'Add Trendline'.
   c. For the Type tab, select 'options'.
   d. Check 'Display equation on chart'.
   e. This equation allows you to predict further y values for any given x value.

THE THOROUGH WAY (the only way to get a p-value for your regression):
1. Select 'TOOLS', then 'DATA ANALYSIS'. (If you don't see it, select 'Add-Ins' and then select 'Analysis ToolPak' to install the data analysis tools; this has to be done only once. Go back and select 'Data Analysis' from the Tools menu.)
2. Select 'REGRESSION'.
3.
Input the dependent variable in the field "INPUT Y RANGE" by highlighting the appropriate values in the spreadsheet. Input the independent variable in the field "INPUT X RANGE" similarly.
4. Under "OUTPUT OPTIONS" select "NEW WORKSHEET PLY". OK.
5. Interpretation of output: you get an r2 value, as well as two p-values. The r2 value indicates what proportion of the variation is explained by your model (the linear regression line). The p-value you are interested in, to test whether the relationship between your variables is due to sampling error, is the p-value associated with the x-variable. (Ignore the p-value of the y-intercept for the purposes of this course.) A p-value of <0.05 indicates that the relationship you are testing is unlikely to be due to sampling error.

F) ANOVA

Requires: 1 categorical independent variable with > 2 treatments.

How to create a figure in Excel: follow the directions under t-test.

How to do an ANOVA test in Excel:
1. The data must be organized with each treatment in a separate column.
2. Go to the Tools menu in Excel. Select "Data Analysis" at the bottom of the menu. (If you don't see it, select "Add-Ins" and then select "Analysis ToolPak" to install the data analysis tools; this has to be done only once. Go back and select "Data Analysis" from the Tools menu.)
3. Select "ANOVA: Single Factor", then "input range" (the data you want to analyze), grouped by "column", "Labels in first row", alpha = 0.05, and "output range" = an empty cell in the spreadsheet.

Interpretation of results: Whether the ANOVA is significant depends on the degrees of freedom (DF). The critical value that the F-statistic must be greater than to be significant (P < 0.05) will have to be looked up in a table (ask your TA). Based on your analysis, what can you conclude? Do the data support your hypothesis? Recall that the ANOVA tells you only whether there is a significant difference among treatment means, not WHICH treatment means are significantly different from each other.
If the ANOVA found a significant difference, you now have to do multiple comparison tests (in our case, one t-test for every treatment pair) to determine which treatments are significantly different from each other. To write the "Results" section of a manuscript, summarize the following, each in one sentence: 1) The conclusion based on the ANOVA. Look at the figure and the ANOVA output. At the end of the sentence, put the statistical test and p-value in parentheses and refer to the figure: "Conclusion 1 (ANOVA, P < 0.??) (Fig. 1)." 2) The conclusion based on multiple comparisons. Look at the t-tests. At the end of the sentence, put the statistical test and p-value in parentheses: "Conclusion 2 (t-test, P < 0.??)."