Docsity
Quantitative methods for social sciences book summary, Statistics notes

Quantitative methods for social sciences book summary, Prof. Uberti

Type: Notes

2019/2020

On sale since 25/09/2020

gemma-sabatini 🇮🇹

Partial preview of the text

Statistics is the art and science of designing studies and analyzing the data that those studies produce. Its ultimate goal is translating data into knowledge and understanding of the world around us.

There are three main components for answering a statistical question:
- design refers to planning how to obtain data that answer the questions of interest.
- description summarizes and analyzes the data that are obtained.
- inference makes decisions and predictions based on the data for answering the statistical question.

Statistical description and inference are complementary: description provides useful summaries and helps you find patterns in the data, while inference helps you make predictions and decide whether observed patterns are meaningful. Probability is fundamental to statistical inference.

The entities that we measure in a study are called subjects. The population is the total set of subjects in which we are interested. In practice, we usually have data for only some of the subjects who belong to that population. A sample is the subset of the population for whom we have (or plan to have) data, often randomly selected.

Descriptive statistics refers to methods for summarizing the collected data, where the data constitute either a sample or a population. The summaries usually consist of graphs and numbers such as averages and percentages, and a descriptive analysis usually combines graphical and numerical summaries. Inferential statistics are used when data are available for a sample only, but we want to make a decision or prediction about the entire population.

A parameter is a numerical summary of the population. A statistic is a numerical summary of a sample taken from the population.
Randomness means that each subject in the population has the same chance of being included in the sample.

A variable is any characteristic observed in a study. The data values that we observe are called observations, each of which can be a number or a category.
- a variable is called categorical if each observation belongs to one of a set of categories.
- a variable is called quantitative if observations on it take numerical values that represent different magnitudes of the variable.

Graphs and numerical summaries describe the main features of a variable:
- for quantitative variables, key features to describe are the centre and the variability (spread) of the data.
- for categorical variables, a key feature to describe is the relative number of observations in the various categories.

For a quantitative variable, each value it can take is a number, and we classify quantitative variables as either discrete or continuous.
- a quantitative variable is discrete if its possible values form a set of separate numbers such as 0, 1, 2, 3, …
- a quantitative variable is continuous if its possible values form an interval.

The first step in numerically summarizing data about a variable is to look at the possible values and count how often each occurs.
- for a categorical variable, each observation falls in one of the categories. The category with the highest frequency is called the modal category.
- for a quantitative variable, the numerical value that occurs most frequently is the mode.

The proportion of observations that fall in a certain category is the frequency (count) of observations in that category divided by the total number of observations. The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies and serve as a way to summarize the measurements in the categories of a categorical variable.

A frequency table is a listing of possible values for a variable together with the number of observations for each value.
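As a rough illustration of the frequency table, proportions, percentages and modal category described above, here is a minimal Python sketch. The observations are made-up data, and the names `observations`, `frequency_table` and `modal_category` are chosen only for this example.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (made-up data).
observations = ["bus", "car", "car", "bike", "car", "bus", "walk", "car"]

counts = Counter(observations)  # frequency (count) of each category
n = len(observations)           # total number of observations

# Frequency table: for each category, its count, proportion and percentage.
frequency_table = {
    category: {
        "count": count,
        "proportion": count / n,        # relative frequency
        "percentage": 100 * count / n,  # proportion multiplied by 100
    }
    for category, count in counts.items()
}

# The modal category is the category with the highest frequency.
modal_category = counts.most_common(1)[0][0]

for category, row in frequency_table.items():
    print(f"{category:5s} {row['count']:2d} {row['proportion']:.3f} {row['percentage']:.1f}%")
print("modal category:", modal_category)
```

The proportions in such a table always add up to 1 (and the percentages to 100), which is a quick check that no observation was lost while counting.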
The two primary graphical displays for summarizing a categorical variable are:
- a pie chart: a circle having a slice of pie for each category. The size of the slice corresponds to the percentage of observations in the category.
- a bar graph: displays a vertical bar for each category. The height of the bar is the percentage of observations in the category. Typically, the vertical bars for the categories are drawn apart, not side by side.

A histogram is a graph that uses bars to portray the frequencies of the possible outcomes for a quantitative variable.

Steps for constructing a histogram:
- first, divide the range of the data into intervals of equal width. For a discrete variable with few values, use the actual possible values.
- then, count the number of observations (the frequency) in each interval, forming a frequency table.
- on the horizontal axis, label the values or the endpoints of the intervals. Draw a bar over each value or interval with height equal to its frequency.

A graph for a data set describes the distribution of the data, that is, the values the variable takes and the frequency of occurrence of each value. The distribution of the data (the data distribution) can also be described by a frequency table.

The shape of the distribution is often described as symmetric or skewed. To skew means to pull in one direction: a distribution is skewed to the left when the left tail is longer than the right tail, and skewed to the right when the right tail is longer.

A data set collected over time is called a time series.

The mean is the sum of the observations divided by the number of observations. It is interpreted as the balance point of the distribution. The median is the middle value of the observations when they are ordered from the smallest to the largest.

The mean refers to averaging, that is, adding up the data points and dividing by how many there are. The median is the point that splits the data in two, with half the data below it and half above it.
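The definitions of the mean and the median above can be tried out with Python's standard `statistics` module. This is only a sketch on made-up numbers (the list `incomes` and its long-tailed variant are hypothetical), comparing the two measures on a roughly symmetric data set and on one with a long right tail.

```python
import statistics

# Hypothetical, roughly symmetric data (e.g. incomes in thousands).
incomes = [20, 22, 25, 27, 30, 32, 35]

mean_value = statistics.mean(incomes)      # sum of observations / number of observations
median_value = statistics.median(incomes)  # middle value of the ordered observations

# Same data, but the largest value is replaced by a very large one,
# which gives the distribution a long right tail.
incomes_right_tail = incomes[:-1] + [350]

tail_mean = statistics.mean(incomes_right_tail)
tail_median = statistics.median(incomes_right_tail)

print("symmetric: mean =", round(mean_value, 2), "median =", median_value)
print("right tail: mean =", round(tail_mean, 2), "median =", tail_median)
```

In this sketch the median stays at the same middle observation, while the mean moves toward the single large value.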
Basic properties of the mean:
- the mean is the balance point of the data.
- for a skewed distribution, the mean is pulled in the direction of the longer tail, relative to the median.
- the mean can be highly influenced by an outlier, which is an unusually small or large observation.

An outlier is an observation that falls well above or well below the overall bulk of the data.

The shape of a distribution influences whether the mean is larger or smaller than the median:
- perfectly symmetric: the mean equals the median.
- skewed to the right: the mean is larger than the median.

[…]

… variable should be the response variable. If there is a clear explanatory/response relation, that dictates which way we compute the conditional proportions.

Three cases arise when investigating an association:
- both variables could be categorical (e.g. food type): in this case, the data are displayed in a contingency table and we can explore the association by comparing conditional proportions.
- one variable could be quantitative and one could be categorical (e.g. income and gender): we can compare the categories using summaries of centre and variability for the quantitative variable and graphics such as side-by-side boxplots.
- both variables could be quantitative: in this case, we analyze how the outcome on the response variable tends to change as the value of the explanatory variable changes.

The scatterplot is a graphical display for two quantitative variables, using the horizontal (x) axis for the explanatory variable X and the vertical (y) axis for the response variable Y. The values of X and Y for a subject are represented by a point relative to the two axes. The observations for the n subjects are n points on the scatterplot.
- two quantitative variables X and Y are said to have a positive association when high values of X tend to occur with high values of Y, and low values of X tend to occur with low values of Y. As X goes up, Y tends to go up.
- two quantitative variables have a negative association when high values of one variable tend to pair with low values of the other.

When the data points follow a roughly straight-line trend, the variables are said to have an approximately linear relationship.
- a summary measure called the correlation describes the strength of the linear association.

The correlation summarizes the direction of the association between two quantitative variables and the strength of its linear (straight-line) trend.
- denoted by r, it takes values between −1 and +1.
- a positive value of r indicates a positive association, and a negative value indicates a negative association.
- the closer r is to ±1, the closer the data points fall to a straight line and the stronger the linear association is.
- the closer r is to zero, the weaker the linear association is.
- the value of the correlation does not depend on the variables' units.

Two variables have the same correlation no matter which is treated as the response variable.

For an observation x on the explanatory variable, let z_x denote the z-score that represents the number of standard deviations, and the direction, that x falls from the mean of X (and likewise z_y for an observation y on the response variable):
o to obtain r, you calculate the product z_x z_y for each observation, then find a typical value of those products.

The product of the z-scores for any point in the upper right or lower left quadrant is positive; such points contribute to a positive correlation. In the upper left and lower right quadrants the product is negative; such points contribute to a negative correlation.

The regression line predicts the value of the response variable Y as a straight-line function of the value X of the explanatory variable. Let Y-hat denote the predicted value of Y. The equation for the regression line has the form: Y-hat = A + B X
- in this formula, A is the Y-intercept and B is the slope.
- the Y-intercept is the predicted value of Y when X equals 0.
- the slope B equals the amount that Y-hat changes when X increases by 1 unit.
- for two X values that differ by 1, the Y-hat values differ by B.

Each observation has a residual, the difference Y − Y-hat between the actual and the predicted value. A positive residual occurs when the actual Y is larger than the predicted value Y-hat, so that Y − Y-hat > 0. In a scatterplot, the vertical distance between the point and the regression line is the absolute value of the residual.

The actual summary measure used to evaluate the regression line is the residual sum of squares: $\sum (\text{residual})^2 = \sum (Y - \hat{Y})^2$.

Among the many possible lines that could be drawn through the data points in a scatterplot, the least squares method gives what we call the regression line. This method produces the line Y-hat = A + B X that has the smallest value of the residual sum of squares.

Regression formulas for the slope and the Y-intercept:
$B = r \, (s_y / s_x)$
$A = \bar{y} - B \, \bar{x}$

Extrapolation refers to using a regression line to predict Y values for X values outside the observed range of the data.

Be cautious with influential outliers. An observation is said to be influential when it has a large effect on the results of a regression analysis. For this to happen, two conditions must hold:
- its X value is relatively low or high compared to the rest of the data.
- the observation is a regression outlier, falling quite far from the trend that the rest of the data follow.
When both of these happen, the line tends to be pulled toward that data point and away from the trend of the rest of the points.

A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest. Correlation does not imply causation; more generally, association does not imply causation.

Lurking variables can affect an association in many ways.
- a lurking variable may be a common cause of both the explanatory and the response variable.
More commonly, there are multiple causes, and the association among them makes it difficult to study the effect of any single variable.
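The correlation and regression formulas given earlier (r from products of z-scores, slope B = r·s_y/s_x, intercept A = ȳ − B·x̄) can be sketched in a few lines of Python. The data below are made up, and two choices are assumptions of this sketch rather than something the notes state: the "typical value" of the z-score products is taken as their sum divided by n − 1, and the standard deviations use the n − 1 divisor.

```python
import math

# Hypothetical paired observations (explanatory x, response y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    # Standard deviation with the n - 1 divisor (an assumption of this sketch).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


x_bar, y_bar = mean(x), mean(y)
s_x, s_y = sample_sd(x), sample_sd(y)

# z-scores: number of standard deviations each observation falls from its mean.
z_x = [(xi - x_bar) / s_x for xi in x]
z_y = [(yi - y_bar) / s_y for yi in y]

# Correlation: "typical value" of the products z_x * z_y (sum divided by n - 1).
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Regression line Y-hat = A + B X.
slope_b = r * s_y / s_x
intercept_a = y_bar - slope_b * x_bar

# Residuals (actual minus predicted) and the residual sum of squares.
predicted = [intercept_a + slope_b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, predicted)]
residual_sum_of_squares = sum(e ** 2 for e in residuals)

print(f"r = {r:.4f}, slope B = {slope_b:.4f}, intercept A = {intercept_a:.4f}")
print(f"residual sum of squares = {residual_sum_of_squares:.4f}")
```

With this definition of r, the slope B agrees with the usual least squares slope, and the residuals of the fitted line sum to zero up to rounding error.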
When two explanatory variables are both associated with the response variable but are also associated with each other, confounding occurs.

Experimental study:
- a researcher conducts an experimental study, or more simply an experiment, by assigning subjects to certain experimental conditions and then observing outcomes on the response variable. The experimental conditions, which correspond to the assigned values of the explanatory variable, are called treatments.

Observational study:
- in an observational study, the researcher observes values of the response variable and explanatory variables for the sampled subjects, without anything being done to the subjects.

A sample survey selects a sample of subjects from a population and collects data from them. The sampling frame is the list of subjects in the population from which the sample is taken. A (simple) random sample of n subjects from a population is one in which each possible sample of that size has the same chance of being selected.

Selecting a simple random sample:
- first, number the subjects in the sampling frame.
- then, generate a set of those numbers randomly.
- finally, sample the subjects whose numbers are generated.

When using a simple random sample of n subjects, the approximate margin of error is $\frac{1}{\sqrt{n}} \times 100\%$.

Types of bias in sample surveys:
- Sampling bias: bias may result from the sampling method. The main way this occurs is if the sample is not random.
- Nonresponse bias: a second type of bias occurs when some sampled subjects cannot be reached or refuse to participate.
- Response bias: a third type of bias occurs in the actual responses made.

Key parts of a sample survey:
- identify the population of all the subjects of interest.
- construct a sampling frame, which attempts to list all the subjects in the population.
- use a random sampling design, implemented using random numbers, to select n subjects from the sampling frame.
- be cautious about sampling bias due to nonrandom samples and undercoverage, response bias from subjects not giving their true response or from poorly worded questions, and nonresponse bias from the refusal of subjects to participate.

Control (comparison) group:
- an experiment normally has a primary treatment of interest, but it should also have a second treatment for comparison, to help you analyze the effectiveness of the primary treatment. The subjects who receive this second treatment are called the control group.

Randomization helps to prevent bias from one treatment group tending to be different from the other in some way. It is important that the treatment groups be treated as equally as possible. Ideally, the subjects are blind to the treatment to which they are assigned. When neither the subject nor those having contact with the subject know the treatment assigned, the study is called double-blind.

Clusters:
Divide the population into a large number of clusters, such as city blocks. Select a simple random sample of the clusters. Use the subjects in those clusters as the sample. This is a cluster random sample.
- it is a preferable sampling design if:
o a reliable sampling frame is not available, or the cost of selecting a simple random sample is excessive.
- a disadvantage is that:
o we usually need a larger sample size with a cluster random sample than with a simple random sample in order to achieve a particular margin of error.

A stratified random sample divides the population into separate groups, called strata, and then selects a simple random sample from each stratum.

Probability:
With a randomized experiment, a random sample, or another random phenomenon (such as a simulation), the probability of a particular outcome is the proportion of times that the outcome would occur in a long run of observations.

Independent trials:
Different trials of a random phenomenon are independent if the outcome of any one trial is not affected by the outcome of any other trial.
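The sampling designs described earlier (simple random, stratified and cluster random samples) and the approximate margin of error of 1/√n × 100% can be sketched with Python's standard `random` module. Everything concrete here is an assumption made for illustration: the sampling frame of 1000 numbered subjects, the "urban"/"rural" strata, the clusters of 20 consecutive subjects, and all function names.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 1000 numbered subjects, each with a stratum label.
frame = [{"id": i, "stratum": "rural" if i % 4 == 0 else "urban"} for i in range(1, 1001)]


def simple_random_sample(frame, n):
    # Every possible sample of size n has the same chance of being selected.
    return rng.sample(frame, n)


def stratified_random_sample(frame, n_per_stratum):
    # Split the frame into strata, then take a simple random sample from each one.
    strata = {}
    for subject in frame:
        strata.setdefault(subject["stratum"], []).append(subject)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_random_sample(frame, cluster_size, n_clusters):
    # Form clusters of consecutive subjects, randomly select whole clusters,
    # and use every subject in the selected clusters.
    clusters = [frame[i:i + cluster_size] for i in range(0, len(frame), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [subject for cluster in chosen for subject in cluster]


def approx_margin_of_error_percent(n):
    # Approximate margin of error for a simple random sample: 1 / sqrt(n) * 100%.
    return 100 / math.sqrt(n)


srs = simple_random_sample(frame, 100)
stratified = stratified_random_sample(frame, 50)
cluster_sample = cluster_random_sample(frame, cluster_size=20, n_clusters=5)

print(len(srs), len(stratified), len(cluster_sample))
print("approximate margin of error for n = 100:", approx_margin_of_error_percent(100), "%")
```

The margin of error formula applies to the simple random sample only. As the notes point out, a cluster random sample usually needs a larger sample size to reach the same margin of error.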
A random variable is a numerical measurement of the outcome of a random phenomenon. Often the randomness results […]

… expected behavior of the sampling distribution of the sample mean when the population distribution is normally distributed:
- for a random sample of size n from a normally distributed population having mean μ and standard deviation σ, regardless of the sample size n, the sampling distribution of the sample mean is also normally distributed, with its centre described by the population mean μ and its variability described by the standard deviation of the sampling distribution, which equals the population standard deviation divided by the square root of the sample size, σ/√n.
- for a random sample of size n from a population having mean μ and standard deviation σ, the sampling distribution of the sample mean has its centre described by the population mean μ and its variability described by the standard deviation of the sampling distribution, which equals σ/√n.
- for a random sample of size n from a population having mean μ and standard deviation σ, as the sample size n increases, the sampling distribution of the sample mean approaches a normal distribution.

REVIEW QUESTIONS:

Difference between descriptive and inferential statistics:
- we collect data for a sample of subjects. They are usually just a small part of the population, which is the set of all subjects in which we are interested.
- we use descriptive statistics to summarize the sample data with numbers and graphs.
- inferential statistics are used to make decisions and predictions about the entire population based on the sample data.

Difference between categorical and quantitative variables:
- a categorical variable has observations that fall into one of a set of categories, such as preferred place to shop.
- a quantitative variable takes numerical values, such as grade point average.

How can you describe a set of data graphically?
What are the advantages and disadvantages of histograms?
- for categorical variables, data are displayed using pie charts and bar graphs.
- for quantitative variables, the histogram is a graph of a frequency table, showing bars above intervals of values. The histogram does not show the individual observations, but it more easily handles large amounts of data.

How can you numerically summarize categorical data?
- categorical data are summarized using the frequencies (counts) of the categories, the proportions of observations in a category, or the percentages of observations in a category.
- we describe the category with the most observations as the modal category.

How can you describe the centre of a set of data numerically?
- for a quantitative variable, numerical summaries describe the centre and variability of the data, and relative positions.
- measures of the centre describe a representative or typical observation.
o the mean is the balance point of a distribution and is calculated as the sum of the observations divided by the number of observations.
o the median divides the ordered data set into two parts with equal numbers of observations, half below and half above that point.
o the mode is the most frequent value.
- for highly skewed distributions, or distributions having extreme outliers in one direction, the mean is drawn in the direction of the longer tail of the distribution and may be a misleading summary.

How can you describe the variability of a set of data numerically? How can you interpret the value of a standard deviation?
- the range is the difference between the largest and smallest observations.
- the deviation of an observation is its distance from the mean.
- the square root of the average squared deviation, called the standard deviation, describes the typical distance from the mean.
- for a bell-shaped distribution, by the empirical rule, about 68%, 95%, and 99.7% (all or nearly all) of the data fall within 1, 2, and 3 standard deviations of the mean, respectively.
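To make the range, the deviations, the standard deviation and the empirical rule above more concrete, here is a small Python sketch on simulated bell-shaped data. The data are generated with `random.gauss` (mean 100 and standard deviation 15 are arbitrary choices for this example), and the standard deviation uses the n − 1 divisor, which is an assumption of the sketch.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical bell-shaped data: 10,000 simulated observations.
data = [rng.gauss(100, 15) for _ in range(10_000)]
n = len(data)

mean_value = sum(data) / n
data_range = max(data) - min(data)  # largest minus smallest observation

# Deviation of each observation: its (signed) distance from the mean.
deviations = [x - mean_value for x in data]

# Standard deviation: square root of the "average" squared deviation
# (divisor n - 1, an assumption of this sketch).
sd = math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))


def share_within(k):
    # Proportion of observations within k standard deviations of the mean.
    return sum(abs(d) <= k * sd for d in deviations) / n


shares = {k: share_within(k) for k in (1, 2, 3)}

print(f"range = {data_range:.1f}, standard deviation = {sd:.2f}")
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {100 * share:.1f}%")
```

For data like these, the three printed percentages should come out close to the 68%, 95% and 99.7% of the empirical rule, though not exactly, because the sample is random.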
What is a z-score? What values of z would be unusual if a distribution is bell-shaped?
- the z-score tells us the number of standard deviations that an observation falls from the mean.
o for example, z = −1.4 means that the observation falls 1.4 standard deviations below the mean.
- for an approximately bell-shaped distribution, since all or nearly all the observations fall within 3 standard deviations of the mean, z-scores above 3 or below −3 would be unusual.

What are the measures of position, and how do a boxplot and the IQR summarize positions?
- the percentiles are measures of position that tell us the points above which or below which a certain percentage of the data falls.
- the lower quarter of the observations falls below the 25th percentile, called the first quartile (Q1).
- the upper quarter falls above the 75th percentile, called the third quartile (Q3).
- these two quartiles span the middle half of the data.
- the five-number summary consists of these quartiles, the median (which is the 50th percentile and the second quartile), and the minimum and maximum values.
- the boxplot portrays this five-number summary, using a box for the middle half of the data, between Q1 and Q3, while highlighting potential outliers.
- the interquartile range (IQR) is a measure of variability that equals the distance between Q3 and Q1.

Difference between response and explanatory variables:
- in practice, most studies have more than one variable and explore associations between the variables. For example, do students who spend more time studying tend to have higher GPAs?
- the statistical analysis studies how the value of a response variable (the outcome of interest, such as GPA) depends on the value of an explanatory variable (such as time spent studying).

What is a contingency table?
- for categorical variables, contingency tables summarize the counts of observations at the various combinations of categories of the two variables.
They can show the percentages in the various categories of the response variable (such as yes or no for whether one believes in life after death) separately for each category of the explanatory variable (such as gender or religious affiliation).

How can you use an equation to describe the relation between two quantitative variables?
- a regression line Y-hat = a + bx describes how the predicted value Y-hat of the response variable relates to the explanatory variable X when the scatterplot indicates a linear relation.
- the slope b of this line describes the effect on Y-hat of a one-unit increase in X.

How can you describe the association between two quantitative variables graphically and numerically?
- for quantitative variables, scatterplots display the data, with one point for each subject, using the x and y axes to represent the two variables. They show whether the association is positive (trending upward) or negative (trending downward).
- when a relationship approximately follows a straight line, the correlation r describes the direction and the strength of the association.
- it satisfies −1 ≤ r ≤ 1, with values farther from 0 representing stronger straight-line associations.

Why does association not imply causation?
- an association may occur merely because of a lurking variable that is associated with both of the variables.
- for example, we might observe a positive association between a math achievement test score and height for children from various grade levels in a school. But this association may be explained by the way that a child's age is associated with both math achievement and height.

What are the main ways of gathering data?
- an experiment randomly assigns subjects to experimental conditions, called treatments, and then observes the outcome.
- randomly assigning units to treatments in an experiment balances the groups with respect to lurking variables and leaves two possible causes for a difference in the response to the treatment: either the treatment or random variation.
- random assignment in experiments allows for cause-and-effect conclusions.
- an observational study merely observes an available sample of subjects without conducting an experiment. Another type of non-experimental study uses a sample survey to take a sample of subjects from a population and observes them, usually by obtaining answers to questions on a questionnaire.
- with random sampling, each subject in the population has an opportunity of being in the sample. Random sampling enables us to make inferences from the sample to the population.

Difference among sampling bias, nonresponse bias, and response bias in sample surveys:
- sampling bias can occur from using nonrandom samples or from having undercoverage (some subjects have no chance of being sampled).
- nonresponse bias occurs when sampled subjects refuse to participate, and it can be severe when many do. Response bias occurs when subjects respond incorrectly (perhaps lying) or when a question asked is confusing or misleading.

What is meant by the probability of an outcome?
- the probability of an outcome is the proportion of times it occurs in the long run.

Two properties that the probability of an outcome must satisfy:
- it must fall between 0 and 1.
- the total of the probabilities of all the possible outcomes equals 1.

What is a random variable?
- a random variable is a numerical measurement of the outcome of a random phenomenon.
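The idea of probability as a long-run proportion, together with its two properties, can be illustrated with a small simulation. This is a sketch under stated assumptions: the random phenomenon is a fair six-sided die simulated with Python's `random` module, and the function name `long_run_proportions` is made up for this example.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is reproducible

outcomes = [1, 2, 3, 4, 5, 6]  # possible outcomes of a (simulated) fair die


def long_run_proportions(n_trials):
    # Simulate n_trials independent rolls and return, for each outcome,
    # the proportion of trials in which it occurred.
    counts = Counter(rng.choice(outcomes) for _ in range(n_trials))
    return {outcome: counts[outcome] / n_trials for outcome in outcomes}


for n_trials in (60, 6_000, 60_000):
    proportions = long_run_proportions(n_trials)
    rounded = {outcome: round(p, 3) for outcome, p in proportions.items()}
    print(n_trials, rounded, "total =", round(sum(proportions.values()), 6))
```

Each proportion lies between 0 and 1 and the proportions always total 1. As the number of trials grows, they tend to settle near 1/6, the probability of each face of a fair die.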
Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved