
Applied Statistics for Business Analytics (Unicatt), Module 1 – Applied Statistics lecture notes

Detailed notes of all the lectures of the first module of the course Applied Statistics for Business Analytics, Prof. Viganò (academic year 2021-2022, ITEM Unicatt). They also include all the R-Studio and Radiant labs, with screenshots and written explanations, as well as the corrections of the exercises in preparation for the midterm exam (midterm grade: 27/28; final grade: 30).

Type: lecture notes – Academic year 2021/2022

Partial preview of the text

APPLIED STATISTICS FOR BUSINESS ANALYTICS (ASBA)
PROF. GIOVANNI VIGANÒ, PROF. EMILIO GREGORI

10.01.2022 – STATISTICAL THINKING IN DECISION MAKING
Statistical thinking in decision making: first we define the problem, then we design the research (sources of data), we collect the data and we use statistical procedures for the data analysis; at the end we obtain information, which we hopefully use to generate knowledge. This information, the outcome of the data analysis, combined with theoretical experience and the literature, lets us communicate a decision that solves the problem that originated the whole decision-making process.
Statistics is fundamental in business analytics. It is useful to generate innovation and prediction, basing our decisions on the available data. To do so, we apply statistical methods within business intelligence, drawing on other sources as well: business management, computer science and past data.

KEYWORDS
VARIABLE: a specific characteristic of a set of individuals or objects. If the units are the private clients of a bank, the characteristics could be age, gender, amount of deposits and level of satisfaction. Characteristics can be of very different types: qualitative, quantitative, continuous or not.
DATA: any observations that have been collected about a characteristic.
POPULATION: the complete set of all the items that interest an investigator.
SAMPLE: an observed subset of the population.
PARAMETER: a quantitative measure that describes a specific characteristic of a population, for example the proportion of the bank's private clients who are female.
STATISTIC: a quantitative measure that describes a specific characteristic of a sample.
The difference between parameter and statistic: the first characterizes the population, the second describes a characteristic of a sample.
PROCESS OF INFERENCE: inference means basing our analysis on a sample and then extending the conclusions to the entire population. We take only a small sample of data because collecting the information on the entire population would be too time consuming.

FREQUENCY DISTRIBUTION TABLE
It is the most widely used technique to describe the data we have: a table used to organize data. The left column (classes or groups) includes all possible responses on the variable being studied; the right column lists the frequencies, i.e. the number of observations, for each class. A relative frequency distribution is obtained by dividing each frequency by the number of observations and multiplying the resulting proportion by 100%. An example: hospital patients by unit.

BAR DIAGRAM EXAMPLE
A frequency distribution is a list or table containing class groupings (intervals or ranges within which the data fall) and the corresponding frequencies with which data fall within each class. The distribution condenses the raw data into a more useful form and allows a quick visual interpretation of the data. You have to choose the number of classes (depending on the number of observations) and the width of each class (equal or unequal widths); classes must be inclusive and non-overlapping.
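A minimal R sketch of a frequency distribution table and bar diagram as described above; the vector of hospital units is hypothetical, only the idea of the example comes from the notes.

# Hypothetical data for the "hospital patients by unit" example
unit <- c("Cardiac", "Emergency", "Intensive care", "Maternity",
          "Surgery", "Cardiac", "Emergency", "Emergency", "Surgery")
freq <- table(unit)                  # absolute frequencies per class
rel  <- prop.table(freq) * 100       # relative frequencies (%)
cbind(Frequency = freq, Percent = round(rel, 1))
barplot(freq, main = "Hospital patients by unit", ylab = "Frequency")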
FREQUENCY DISTRIBUTION FOR CLASSES OF VALUES
Example: a manufacturer of insulation randomly selects 20 winter days and records the daily high temperature. The class width is 10. The classes do not overlap because they are defined as, e.g., ≤ 20 and > 20. The relative frequency is the absolute frequency divided by the total; the frequency density is the ratio between the relative frequency and the width of the class.

HISTOGRAMS
A histogram is a graphical representation of the distribution of discrete or continuous variables classified into intervals. The intervals are reported on the horizontal axis and a rectangle is drawn over each interval. For classes of equal width (which is what software usually assumes) the height of each rectangle represents the frequency of the class. Otherwise, each rectangle is drawn with an area proportional to the relative frequency of its class; the height is then obtained by dividing that area by the class width, so the height represents the frequency density, which can be interpreted as the amount of frequency per unit interval (it is the area of the rectangle that is proportional to the frequency).

MEASURES OF CENTRAL TENDENCY
MEAN OR AVERAGE: the sum of the observations divided by the number of observations.
MEDIAN: the observation that falls in the middle of the ordered sample. Example: number of multimedia devices at home, a sample of size 6 with values 4, 2, 9, 2, 4, 2.
1. Put the values in ascending (or descending) order: 2, 2, 2, 4, 4, 9.
2. Find the position of the 50th percentile: 0.5(n + 1) = 0.5(6 + 1) = 3.5; this is the position where the median falls.
3. Take the value at the nearest integer position below and the one at the nearest integer position above and compute their mean, i.e. the mean of the 3rd value (2) and the 4th value (4): (2 + 4)/2 = 3, so the median is 3.
The median is sometimes a better measure of central tendency than the mean, because it is not affected by extreme values.

STANDARD DEVIATION
The deviation of an observation from the mean is the difference between them. The variance is the corrected mean of the squared deviations, and the standard deviation is the square root of the variance.
Population variance: σ² = Σ(xᵢ − µ)² / N.
Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1); we need to introduce the correction factor n − 1 (not n) at the denominator.
Taking the square root gives σ and s: they measure the variation around the mean and have the same units as the original data.

EMPIRICAL RULE
If the data are bell-shaped, the interval µ ± 1σ contains about 68% of the values in the population or the sample, µ ± 2σ contains about 95%, and µ ± 3σ contains almost all of them (about 99.7%).

Z-SCORE
A measure of position for a single observation: z = (observation − mean) / standard deviation. It is the number of standard deviations by which the value falls above or below the mean. For a bell-shaped distribution, if |z| > 3 the observation is an outlier.
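A minimal sketch of these descriptive measures in R, using the multimedia-devices example from the notes (values 4, 2, 9, 2, 4, 2).

x <- c(4, 2, 9, 2, 4, 2)
mean(x)                      # arithmetic mean
median(x)                    # 3, as computed above
var(x)                       # sample variance, n - 1 at the denominator
sd(x)                        # sample standard deviation
z <- (x - mean(x)) / sd(x)   # z-scores; |z| > 3 would flag an outlier
z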
11.01.2022 – BIVARIATE DESCRIPTIVE STATISTICS
How can we display the type of relationship linking two variables? There are two main tools, cross tables and scatterplots: displays of the relationship between two variables, starting from the raw data in the dataset.
- Cross tables (also called contingency tables) list the number of observations for every combination of values of two categorical or ordinal variables, or of discrete numerical variables with a small number of distinct values.
We have a dataset with 750 cases and for all of them we have information on geographical location and industry type (qualitative, nominal variables). We can compute relative frequencies by total, by row or by column. For example, 0.4 is the relative frequency of manufacturing within the eastern area. In the table of relative frequencies by total, each cell is divided by the overall total N = 750. If we are interested in differences across geographical locations within each type of industry, we should use the relative frequencies by column: we isolate each industry type (column) and compute the relative frequencies within it. If instead we want to show that the distribution of industry type differs across geographical areas, we use relative frequencies by row (it is the rows, not the columns, that sum to 1000 per thousand).
Conditional frequencies can be displayed with a side-by-side bar chart or with a stacked bar chart. We can build graphs from the crosstabulation too, choosing the type of graph according to the information we want to deliver: if we do not want to compare groups but only show the distributions, a side-by-side chart is better; when the conditional distributions look essentially the same across groups, this points towards independence between the two variables. When we produce a crosstabulation, we have to carefully choose the best way to present the data.

PROBABILITY
Random experiment: a phenomenon whose outcome cannot be predicted with certainty; something we can observe or run whose outcome is not certain.
Sample space: the set of all possible outcomes of the random experiment (in flipping a coin, heads or tails).
The probability of an outcome is the proportion of times that the outcome would occur out of all the possible outcomes of one experiment (Bayesian approach) or in a very long sequence of experiments (frequentist approach). If the experiment is flipping a fair coin, the probability we expect for heads is 50% and for tails is 50%.

THE NORMAL DISTRIBUTION
What distinguishes the curves in the example is the variance of the distribution: in the red case we have a bell-shaped, symmetric distribution with few cases in the tails and most in the centre; the blue and black distributions are not like that, with more cases in the tails (a few more for the blue one, a lot for the black one).
Example: suppose X is normal with mean 8.0 and standard deviation 5.0, X ~ N(8.0, 5.0).
Problem 1: find P(X < 8.6). We draw a bell-shaped density with µ = 8.0; 8.6 lies slightly to its right, and P(X < 8.6) is the area to the left of 8.6. Using Radiant we obtain P(X < 8.6) = 0.548, while the complement is P(X > 8.6) = 0.452; the two probabilities sum to 100%.
Problem 2: find the value x such that only 20% of all values are below it (i.e. the value x such that the probability of observing a smaller value is 20%). We need to compute the specific value x that gives a probability of 20%.
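A minimal sketch reproducing the two normal-distribution problems above in base R (the notes use Radiant; the numbers are the same).

pnorm(8.6, mean = 8, sd = 5)        # Problem 1: P(X < 8.6) = 0.548
1 - pnorm(8.6, mean = 8, sd = 5)    # P(X > 8.6) = 0.452
qnorm(0.20, mean = 8, sd = 5)       # Problem 2: x with 20% of values below = 3.792
pnorm(1.13)                         # F(1.13) = 0.8708, the value a standard normal table would give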
I have to consider the 20% probability in the lower part of the distribution and identify x. Using Radiant we input the probability (0.2), shown as the red area under the curve, and obtain x = 3.792.

STANDARDIZATION
What we have used so far is the general version of the normal distribution, with mean µ and variance σ². In the continuous case we can transform the distribution: for any pair of values a and b (including plus and minus infinity) we can standardize as P(a < X < b) = P((a − µ)/σ < Z < (b − µ)/σ). It is a linear transformation.
The standard normal table shows the values of the cumulative standard normal distribution function: for a given z-value a, the table shows F(a), the area under the curve from minus infinity to a, so for every z value we have the corresponding F(z). In the table, the first column gives z to the first decimal and the first row (0, 0.01, 0.02, 0.03, …) gives the second decimal; the value at the intersection is the cumulative probability. For example, if z = 1.13 the probability is 0.8708, so F(1.13) ≈ 87%.

SAMPLING
It is possible to obtain statistical results of sufficiently high precision based on samples. A parameter is a quantitative measure that describes a specific characteristic of a POPULATION; a statistic is a quantitative measure that describes a specific characteristic of a SAMPLE.
We have a simple random sample if:
- every object in the population has the same probability of being selected;
- the objects are selected independently (IID: independent and identically distributed).
A simple random sample is the ideal against which other sampling methods are compared.
Before observing them, the sample units are unknown and random: drawing a sample means running n random experiments, and each sample unit's characteristic is the outcome of a random experiment. The characteristics are numerical values from each sample unit (for qualitative characteristics, 1 vs 0 for having or not having the characteristic), i.e. one from each random experiment, so we have n random variables. The formula for a statistic is valid regardless of the values actually observed on the sample units: it is a function (a linear combination) of n random variables. A statistic is therefore a random variable!
Functions of a random variable: if P(x) is the probability function of a discrete random variable X and g(x) is some function of X, then the expected value of g is E[g(X)] = Σₓ g(x)·P(x).
The same holds for a statistic before observing the data in the sample; the probability distribution of a statistic is useful to assess how reliable the result is for the entire population (the risk of being wrong, or the probability of being correct).
A sampling distribution is the probability distribution of all the possible values of a statistic for a given sample size, selected from a population: it is the distribution of the outcomes I get when I select my sample.
Example: assume we have a population of size N = 4 and consider the random variable x, the age of the individuals. The values are x: 18, 20, 22, 24 (years); these are parameters, and the distribution is uniform, 25% each. Now consider all possible samples of size n = 2 out of that population of size N = 4: it means considering all the possible pairs I can draw from that population.
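A minimal sketch of this sampling-distribution example: the population of ages 18, 20, 22, 24 and all the samples of size 2 drawn with replacement.

ages  <- c(18, 20, 22, 24)
pairs <- expand.grid(first = ages, second = ages)   # the 16 possible samples (with replacement)
xbar  <- rowMeans(pairs)                            # sample mean of each pair
table(xbar) / 16                # sampling distribution of the mean: bell-shaped, not uniform
mean(xbar)                      # 21, equal to the population mean
sigma <- sqrt(mean((ages - 21)^2))
sigma / sqrt(2)                 # standard error of the mean, sigma / sqrt(n)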
I can have 16 possible pairs. Clearly the samples we are considering are samples with replacement (I can select the same person twice). Once I have drawn all the possible samples of size two out of the population, I compute the sample mean for each of the 16 samples and I can draw the distribution of the sample mean: it is not uniform anymore, it is bell-shaped.
Why are sampling distributions relevant?
1. The sampling distribution allows us to find probabilities and quantiles of the statistic used as estimator;
2. probabilities and quantiles are the basis for dealing with the uncertainty of the estimates;
3. the sampling distribution is the basis of statistical inference: confidence intervals and hypothesis testing.

SAMPLE MEAN
Let X1, X2, …, Xn represent a random sample from a population. The sample mean of these observations is defined as X̄ = (X1 + X2 + … + Xn)/n; if x1, x2, …, xn are the observed data, the sample mean computed over them is x̄ = Σxᵢ/n.
When we compute the mean over different samples there is variation. The standard error of the mean (also called the standard deviation of the sample mean) is a measure of the variability of the mean from sample to sample: it measures the sampling error. The lower the standard error, the more precise the estimate; to increase the accuracy you increase the sample size (which increases the denominator).
Is the distribution of the sample mean normal? CENTRAL LIMIT THEOREM: for any population distribution, the sample mean will be approximately normally distributed as long as the sample size is large enough. If the sample size is large enough we can be quite sure that the distribution of the sample mean is approximately normal; as the sample size grows, the sampling distribution becomes almost normal, for ANY distribution of the population.
Mean of the sample mean: E(X̄) = E[(X1 + X2 + … + Xn)/n] = [E(X1) + E(X2) + … + E(Xn)]/n = (µ + µ + … + µ)/n = nµ/n = µ (proof not examinable).

SAMPLE PROPORTION
What about the shape? It is not normal: each observation can assume only two outcomes, each with a specific probability, so by definition the distribution is binomial, with P(x) the probability of x successes in n trials. Something very helpful for estimating probabilities of proportions is that, for any population distribution, the sample mean is approximately normal when the sample size is large enough, and the sample proportion is a sample mean. So we can use the central limit theorem also for proportions, and in particular the standardization process, provided we have information about the characteristics of the population. To apply this rule we need n ≥ 30; if not, probabilities and quantiles can be obtained from the binomial distribution, which is difficult to compute by hand but easy with software.

DATASET E-BOTTLE.RDA
We open Radiant (open R-Studio and type radiant::radiant()), then import the dataset by loading it in Radiant. E-bottle is a dataset about the consumption of water.
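A minimal sketch of loading the dataset directly in R and computing the two sample quantities that the objectives below will ask for; the file name comes from the notes, while the name of the object inside the .rda file (here e_bottle) is an assumption.

load("e-bottle.rda")                 # assumed to create a data frame named e_bottle
mean(e_bottle$Should)                # sample mean of the "should drink" quantity
mean(e_bottle$Think == 1)            # sample proportion answering "Absolutely yes"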
The information is the following: these are all the data included in the data frame. We have two objectives:
1. estimate the average quantity of daily water that people think they should drink to be healthy;
2. estimate the percentage of people thinking that the e-bottle is absolutely useful for drinking the needed quantity of water to be healthy.
How do we solve this with Radiant? The first request is a mean, so we use the Basics menu and enter the Means > Single mean sub-menu. The variable is Should (numeric), so we put it in the variable selection. The output shows mean = 2.208: it is the mean of the sample, not of the population. Can we be 100% sure that the mean we have is the same as the population mean? No.

Questionnaire and codebook:
- Should: How many liters of water per day should you drink to be healthy?
- Do: How many liters of water per day do you usually drink?
- Think: Do you think the e-bottle can help to drink the daily quantity of water a person needs to be healthy? (1 Absolutely yes, 2 Maybe yes, 3 I don't know, 4 I guess no, 5 Absolutely not)
- Gender: gender of the person (0 Male, 1 Female)
- Years_study: How many years have you spent in education?
- Age: How old are you? (years)
- Timephones: Approximately, how many hours per day do you spend on your smartphone?

Using the plotting functions in Radiant we can display the results in a graph: the blue bars are the number of times each response occurs, the black vertical line is the observed mean value, and the dashed lines are very useful.

UNBIASEDNESS
A point estimator T is said to be an unbiased estimator of the parameter θ if its expected value is equal to that parameter: E(T) = θ. It means that the distribution of the estimator is centred on θ; in the example, T1 is unbiased and T2 is biased. The bias can be expressed with a formula: let T be an estimator of θ; the bias of T is the difference between its mean and θ, Bias(T) = E(T) − θ, and the bias of an unbiased estimator is 0. A point estimator Tn is said to be an asymptotically unbiased estimator of θ if its expected value equals the parameter only when the sample size is large enough.
Examples: the sample mean x̄ is an unbiased estimator of µ, the sample variance s² is an unbiased estimator of σ², and the sample proportion p̂ is an unbiased estimator of P.

EFFICIENCY
It is better to have an estimator with small variance. The most efficient estimator (the minimum variance unbiased estimator) of θ is the unbiased estimator with the smallest variance: if I have two unbiased estimators, I choose the one with the lowest variance. Let T1 and T2 be two unbiased estimators of θ, based on the same number of sample observations: T1 is said to be more efficient than T2 if Var(T1) < Var(T2), and the relative efficiency of T1 with respect to T2 is the ratio of their variances, Var(T2)/Var(T1).
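A small simulation sketch (not in the notes) illustrating unbiasedness: the sample mean and the sample variance with the n − 1 correction are unbiased, while dividing by n underestimates σ² on average.

set.seed(1)
n <- 10; mu <- 5; sigma2 <- 4
xbar <- replicate(10000, mean(rnorm(n, mu, sqrt(sigma2))))
s2   <- replicate(10000, var(rnorm(n, mu, sqrt(sigma2))))     # denominator n - 1
s2_n <- replicate(10000, { x <- rnorm(n, mu, sqrt(sigma2)); sum((x - mean(x))^2) / n })
mean(xbar)   # close to mu = 5
mean(s2)     # close to sigma^2 = 4 (unbiased)
mean(s2_n)   # close to (n - 1)/n * sigma^2 = 3.6 (biased)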
Therefore, the smaller the standard error, the more reliable the estimator. However, we cannot base our conclusions about the population parameter on the point estimate alone: we cannot say that we are 100% sure. We therefore introduce the concept of confidence interval: a point estimate is a single number, while a confidence interval provides additional information about the variability. The confidence interval is built around the point estimate, but it adds further information: the width of the interval, whose extremes are the lower and the upper confidence limits.
How can we define a confidence interval estimator? It is a rule for determining an interval that is likely to include the parameter; the corresponding estimate is called a confidence interval estimate.

FOR THE POPULATION MEAN: THE STUDENT'S T DISTRIBUTION
We start our process from the population, and we need to know its distribution: is the sample large enough to consider the sample mean normally distributed? Is the population itself already normal? In these cases the sample mean is normally distributed: bell-shaped, symmetric, the mean of the sample mean equals the mean of the population, its standard deviation is known, and so on.
We also need the notation for a confidence interval. Say that A and B are two values: if P(A < θ < B) = 1 − α, then the interval from a to b, computed on the observed data, is called a 100(1 − α)% confidence interval for θ. The quantity 100(1 − α)% is the confidence level of the interval; α is always between 0 and 1 and is a quantity we can set in order to control the confidence. In repeated samples from the population, the true value of the parameter θ would be contained in 100(1 − α)% of the intervals calculated this way. The confidence interval calculated in this manner is written as a < θ < b with 100(1 − α)% confidence.
The confidence interval is built on two quantities: the point estimate (based on the sample data) and the margin of error (ME), which accounts for the sampling error: estimate ± ME. The ME is based on the reliability factor (closely linked to the α level, which we can change) and on the standard error, which carries the information on the variability of the specific variable we are considering. Some of these quantities can be set and fully controlled by the researcher. Let's see the cases.

CASE 1: THE VARIANCE OF THE POPULATION IS KNOWN (very unusual)
The assumptions are that the population variance is known and that the population is normally distributed; if it is not normal, we need a large sample (CLT). We are now able to estimate the confidence interval: x̄ ± z(α/2) · σ/√n, where x̄ is the sample mean (the estimator), the ME is added and subtracted, and z(α/2) is basically representing the confidence.
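A minimal sketch of the known-variance interval above; the sample mean 2.208 and n = 120 come from the notes, while σ = 0.5 is a hypothetical value for illustration.

xbar <- 2.208; sigma <- 0.5; n <- 120; alpha <- 0.05
me   <- qnorm(1 - alpha / 2) * sigma / sqrt(n)   # margin of error
c(lower = xbar - me, upper = xbar + me)          # 95% confidence interval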
α can be set by the researcher and reflects the confidence we want to have that the true parameter is included in the confidence interval.

CASE 2: THE VARIANCE OF THE POPULATION IS UNKNOWN
Since I am using an estimate of the population variance, I need to introduce a distribution which is different from, but very similar to, the normal distribution: the Student's t distribution. We use it any time we do not know the variance of the population. The Student's t has n − 1 degrees of freedom; it is essentially a transformation, closely related to the estimation of s.
Characteristics of the t distribution:
- it depends on the degrees of freedom (d.f.), the number of observations free to vary after the sample mean has been calculated: d.f. = n − 1;
- its shape is symmetric and bell-shaped; the difference with respect to the normal is that the tails are fatter, so there is more area under them, i.e. more uncertainty. As the d.f. increase, the shape becomes very similar to the normal, so with high d.f. we can treat it as a normal. How do the d.f. change? With the sample size: when the sample is quite big the Student's t looks like a normal and I can replace the t value with the usual z score; as n → ∞, t → Z.
The shape of the confidence interval: point estimate ± t value × standard error, where the t value is bigger than the corresponding z value. The interval is therefore wider than the one based on the normal distribution, because it reflects the higher uncertainty. The ME is the part of the formula to the right of the ±, and the width of the interval is 2·ME.
___________________________________________________________________
Let's see it in R-Studio (e-bottle example). I want a 95% confidence interval for the average amount of daily water people think they should drink to be healthy. The variable to consider is Should. To run it I use the command t.test: I give the dataset where the variable lives (e-bottle), then $, which selects the variable, then Should. If I do not specify the confidence level, R-Studio considers 95% by default (forget the alternative hypothesis for today). If I want to change the confidence level, I specify conf.level = 0.9 for a 90% confidence level. Has the size changed? Yes: the width is smaller.

CONFIDENCE INTERVAL FOR A PROPORTION
What we have seen so far is for the population mean, but we can compute a confidence interval also for a proportion, provided the CLT applies. The idea is the same, using the standard error of the sample proportion. By default, in the single proportion estimation menu, we get the lower and upper confidence limits: 2.5% = 2.037 and 97.5% = 2.38.
I can use these same results to reject or not reject claims about the characteristics of the population: I use the functions provided, the alternative hypothesis and the comparison value we are interested in. If the question is objective 1, I have to use "greater than", 95% confidence and a comparison value of 2 liters, and read the output. I can do the same with the proportion.
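A minimal sketch of the t-based intervals described above; the object name e_bottle_2 is the one used later in the notes for this data frame.

t.test(e_bottle_2$Should)                      # 95% confidence interval by default
t.test(e_bottle_2$Should, conf.level = 0.90)   # narrower 90% interval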
HYPOTHESIS TESTING
What is a hypothesis? It is a claim, and the claim is always about a population parameter. There are different ways to set up a hypothesis, but the approach is the following: we can never prove that something is true, but we can prove that something is wrong, so we need to collect evidence that the statement is wrong. The statement to be tested is the contrary of what we are interested in: we assume this statement to be true and try to prove that it is wrong.
Example (Fineco): the claim is "If a new service is offered for free to clients, the average amount of deposits will be greater than 1.2k€"; the current average amount of deposits per client is 1.2k€. We always consider two hypotheses, a null and an alternative:
H0 (null hypothesis): "If a new service is offered for free to clients, the average amount of deposits will remain equal to 1.2k€", i.e. µ = 1.2;
H1 (alternative hypothesis): "If a new service is offered for free to clients, the average amount of deposits will be greater than 1.2k€", i.e. µ > 1.2.
Decision: if we can support the claim, do the investment; otherwise, do nothing. Can we prove that the claim is true? How? The amount of deposits per client has a normal distribution; the new service is offered to a sample of 100 clients and after one month the sample mean of deposits is checked. The claim to test is the null one ("the average amount of deposits will remain equal to 1.2k€"): does the sample mean give evidence that this is wrong?

TYPES OF ERROR
The outcome can be positive or negative. The point of view is: we are basing our decision on a sample, and the sample does not always tell the truth, so sometimes the test we are running is wrong. Which are the possible situations? Put the actual situation on the columns and the decision on the rows. Suppose the null hypothesis is true, for real: I can either reject it or accept it, and it is better if we accept it. Sometimes we make a mistake: the TYPE 1 ERROR is rejecting a null hypothesis that is true. Its probability is α, the significance of the test, so we can control this type of error; the aim is to minimize it, because it is the more severe mistake. The TYPE 2 ERROR occurs when the null hypothesis is false but we do not reject it: we are missing an opportunity, which is less severe. The power of the test is the probability of correctly rejecting a false null hypothesis, 1 − β, and the aim is to make it as high as possible.
We start by fixing α, our significance level; this is very important. It should be lower than 10%, and by default it is usually 5%; it is the reliability factor translated into hypothesis testing. We always start from α and then derive the power of the test. Then a decision rule is established: the hypothesis test assumes that the null hypothesis is the status quo, i.e. that it is true; we then formulate a decision rule based on the sample data we have collected, designed to minimize the Type 1 error. α is a probability, the level of significance; it reflects the size of the possible error we can make when running the test. Going back to the distribution, we can give a graphical representation of α: as we said, α is a probability.
α can be identified as an area in the tail of the distribution, and it identifies the critical value x̄*: a threshold we use to decide whether or not to reject the null hypothesis. Setting α gives an area, which in turn gives a critical value (similarly to what we did with the table). The threshold x̄* splits the distribution into two regions: a rejection region (to the right of x̄*) and an acceptance region (to the left). If we reject the null hypothesis, we make the investment. This is how we can say whether an observed value is "far enough", without basing the answer on a purely graphical judgment: any time the sample mean falls to the right of the critical value, we reject the null hypothesis, and in the Fineco case we do the investment.

TYPES OF HYPOTHESIS
a. Simple hypothesis: it specifies a single value for the population parameter of interest (e.g. H0: µ = 3).
b. Composite hypothesis: it specifies a range of values for the population parameter (e.g. H0: µ ≥ 3).
c. One-sided alternative hypothesis: it involves all possible values of the population parameter on one side of the value specified by a simple or composite null hypothesis (e.g. H1: µ > 3).
d. Two-sided alternative hypothesis: it involves all possible values of the population parameter other than the value specified by a simple null hypothesis (e.g. H1: µ ≠ 3).

THE LEVEL OF SIGNIFICANCE α
It is very important! It defines the unlikely values of the sample statistic if the null hypothesis is true, i.e. it defines the rejection region of the sampling distribution. It is denoted by α (level of significance); typical values are 0.01, 0.05 or 0.10. It is selected by the researcher at the beginning and it provides the critical value(s) of the test.
Decision rule: reject H0 if the test statistic is more extreme than the critical value (i.e. it falls into the rejection region). For a test on the mean of a population the test statistic is t = (x̄ − µ0) / (s/√n), where µ0 is the value under the null hypothesis (1.2 in the Fineco example), s is the estimate of the standard deviation of the population and n is the sample size. Do we know all the quantities involved? Yes. This statistic is distributed as a Student's t with n − 1 degrees of freedom: bell-shaped, symmetric, and if n is large the CLT applies, so approximately normal. It works only if the population is normally distributed or the CLT applies.
These are the possible levels of significance and rejection regions we can set, chosen according to the type of hypothesis. In a two-sided test we split α in two, so we have two critical values; in a one-sided test we have just one critical value; the upper-tail and the lower-tail test are the remaining two cases.
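A minimal sketch of this decision rule for a Fineco-like upper-tail test; µ0 = 1.2 and n = 100 come from the notes, while the observed sample mean and standard deviation are hypothetical. The last line also computes the p-value that the next section introduces.

mu0 <- 1.2; xbar <- 1.25; s <- 0.2; n <- 100; alpha <- 0.05
t_stat <- (xbar - mu0) / (s / sqrt(n))     # test statistic
t_crit <- qt(1 - alpha, df = n - 1)        # upper-tail critical value
t_stat > t_crit                            # TRUE: reject H0, do the investment
1 - pt(t_stat, df = n - 1)                 # upper-tail p-value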
ALTERNATIVE WAY: THE P-VALUE
The p-value is the probability of obtaining a test statistic more extreme (≥ or ≤) than the observed sample value, given that H0 is true. Before, we started by setting α; now, taking the Fineco example again, we start from the sample mean and we do not determine the critical regions. We compute the area corresponding to the observed value of x̄ (in the previous example it is the area under the right tail, to the right of x̄). Suppose the value of that area is 0.075: this is the observed p-value. If we take α = 10%, we reject the null hypothesis, because the p-value (7.5%) is smaller than α (10%): α defines the critical value and, since the p-value is smaller, we are in the rejection region. With α = 5% we do not reject. So our decision can change with the level of confidence we choose. It is simply another way of doing the same thing. The p-value is also called the observed level of significance: the smallest value of α for which H0 can be rejected. If the p-value is less than α, the test statistic falls into the rejection region, therefore we reject H0 (and vice versa).
DECISION RULE: reject H0 if p-value < α. For a two-sided test on the proportion of a population with a large sample: p-value = 2·Pr(P̂ > p̂ | H0) if p̂ lies above the hypothesized value, or 2·Pr(P̂ < p̂ | H0) if it lies below.
In the output of the test statistic for the mean we get the degrees of freedom and the p-value corresponding to the t value. The p-value is small, even lower than 1%, so we reject the null hypothesis: we can base our decision just by looking at the p-value.
The second example is about a proportion, and the process is very similar to the one for the confidence interval. We want to know whether the proportion of people who say "absolutely yes" is greater than 10%: we take the frequencies of the Think variable, their total, and the number of people responding 1, and we run the test with the same command used for the confidence interval on the proportion. We get a p-value of 0.003; with α set at 5%, we reject the null hypothesis.

USING CONFIDENCE INTERVALS FOR A TEST
We can also use a confidence interval to reject or accept a hypothesis; let's see how with R-Studio and the e-bottle case. We ask: is the average quantity of daily water that people think they should drink to be healthy different from 2 liters? We use t.test on e_bottle_2$Should, with null value equal to 2 liters and a two-sided alternative; α is 5%, so we can omit that specification. The output gives the 95% confidence interval, ranging from 2.037 to 2.37: it is the range of acceptable values for the characteristic we asked about, and 2 is outside it, so we reject H0. The p-value, 1.7%, is less than α, so we also reject the null hypothesis from the p-value point of view. There is a clear link between the t.test results and the confidence interval we set: a 95% confidence interval corresponds to α = 5%, and it excludes the value 2, which means we can reject the null hypothesis. We have to compare the confidence interval with the null-hypothesis value, not with the observed data. We can modify the confidence level by changing α: with α = 1% we have a 99% confidence level. What happens? The confidence interval is wider. Do we still reject? No: 2 is now included and, moreover, the p-value is bigger than α. Depending on the value of α we choose, the outcome of our decision may change. It is another way to verify our claims.
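A minimal sketch of the two tests just described. The proportion test uses prop.test as a stand-in for the command used in class, which the notes do not show explicitly.

t.test(e_bottle_2$Should, mu = 2, alternative = "two.sided")   # 95% CI excludes 2 -> reject H0
n_yes <- sum(e_bottle_2$Think == 1)                            # respondents saying "absolutely yes"
prop.test(n_yes, n = nrow(e_bottle_2), p = 0.10,
          alternative = "greater")                             # H0: proportion = 10%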
GROUP COMPARISONS
We might sometimes be interested in questions such as:
1. Is the average quantity of daily water that people think they should drink to be healthy greater than the average quantity of water they effectively drink?
2. Is the average quantity of daily water that people think they should drink to be healthy related to gender (is the average for males different from the average for females)?
To answer the first one, just looking at the first 8 IDs of the sample we see that there is a gap between Should and Do. Are the variables $Should and $Do somehow linked? Yes: they are two dependent samples, or paired samples, from two populations, because it is always the same person providing both pieces of information. For the second question, $Should compared across $Gender: are males and females dependent? No, they are different people with different characteristics, so I have independent samples from two populations: $Should for males and $Should for females. In group comparisons we always have to distinguish dependent and independent samples.
A. Dependent samples: matched pairs (two objects/subjects that are very similar apart from the characteristic under study); two characteristics measured on each object/subject; repeated measurements on each object/subject (e.g. before–after).
B. Independent samples: a characteristic measured on two different random samples; a measurement on two sub-groups of sample items identified by a categorical variable (e.g. male–female).
These are the tests for the two questions. On average people tend to drink less water than they think they should, and we can place this difference in a range of about 0.48 to 0.76 liters. This was for dependent samples.
Confidence interval (e.g. 99%):
> t.test(e_bottle_2$Should, e_bottle_2$Do, alternative = "two.sided", paired = TRUE, conf.level = 0.99)
Paired t-test
data: e_bottle_2$Should and e_bottle_2$Do
t = 11.957, df = 119, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval: 0.4849122 0.7567545
sample estimates: mean of the differences 0.6208333
The interval is d̄ ± t(n−1, α/2) · s_d/√n: I am 99% confident that the difference between the perceived adequate quantity of water and the quantity actually drunk is between 0.48 and 0.76 liters.
How can we run the comparison for independent samples? To check, we use question 2. We need to tell R-Studio how the groups are defined, by typing the tilde sign (Should ~ group). You have to specify paired = TRUE if the samples are dependent, otherwise R-Studio automatically considers them independent. Looking at the p-value, it is about 0.000002, very small, so we reject H0: there is a difference between men and women.
If we had to make these calculations by hand we would have to assume that the two variances are equal; with the software we do not need that assumption. The pooled sample variance is a weighted average of the two sample variances (see the formula in the next section). To run the test with the pooled variance for independent samples we have to specify var.equal = TRUE.
Pay attention: we can only test TWO GROUPS AT A TIME, not all groups simultaneously. If we are interested in a factor with more than two levels, we select two groups at a time and compare them, running the test with the same procedure seen before. We can also do it with Radiant; we will see it in the Lab.

GOODNESS OF FIT TEST
We have already touched on this concept when we saw the tools to explore the distribution of variables.
A COMMON ASSUMPTION
The populations have the same variance, σ_X² = σ_Y² = σ². The common variance is estimated through the "pooled" sample variance, a weighted average of the two sample variances: s_p² = [(n_X − 1)s_X² + (n_Y − 1)s_Y²] / (n_X + n_Y − 2).

HYPOTHESIS TESTING (AIM 2) – ANOTHER WAY
#1) Transform e_bottle_2$Gender into a factor: gndr <- factor(e_bottle_2$Gender)
#2) Run the test with the formula method:
t.test(e_bottle_2$Should ~ gndr, alternative = "two.sided", conf.level = 0.99, var.equal = TRUE)
Two Sample t-test
data: e_bottle_2$Should by gndr
t = -4.5486, df = 118, p-value = 1.318e-05
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval: -1.1585207 -0.3120676
sample estimates: mean in group 0 = 1.889706, mean in group 1 = 2.625000
At any level of significance we reject H0 (H0: µ_X − µ_Y = 0) and accept HA: the average quantity of daily water that people think they should drink to be healthy is significantly different between males and females. The test statistic is t = (x̄ − ȳ − 0) / √(s_p²/n_X + s_p²/n_Y), distributed as a T with n_X + n_Y − 2 d.f., with the population variances assumed equal.
Pay attention with independent samples: only two groups at a time! For a factor with more than two levels, select cases: create a new data frame (newdata) from DFname, for instance selecting two grouping categories ("Level1" and "Level2") of the variable varname:
newdata <- DFname[which(DFname$varname == 'Level1' | DFname$varname == 'Level2'), ]

GOODNESS OF FIT TEST (continued)
For example: CITY – HILLS – MOUNTAINS. Is there a difference in the average amount of water drunk by people living in these different areas? Are the cases evenly distributed across the areas or not? If I have 3 categories (city, hills, mountains) and I measure the liters people drink, what percentage should I find in each category if they were evenly distributed? 1/3 (0.3333) each. The null hypothesis we want to test here is whether, for a particular distribution, the cases are evenly distributed or not, i.e. whether we can reasonably conclude that the distribution differs from a uniform distribution, where every category has the same chance of occurring. This is called the goodness of fit test. We want to apply it to the e-bottle case: the variable has 5 categories, so under the null each category should contain 20% of the cases. The test is a chi-squared test on the distribution of the variable; its null hypothesis is that there are no differences between the observed distribution and the hypothesized one. We can use the p-value, and the approach is always the same (see the R sketch at the end of this section).

QUALITATIVE VARIABLES
Sometimes we have qualitative data and we want to make group comparisons. In this chapter we will see how to make group comparisons on proportions, and then we will introduce the chi-square test for independence.
1. Is there an association between the opinion about the usefulness of the e-bottle and gender? Do opinions depend on gender?
2. Is the proportion of people with a positive opinion about the e-bottle (absolutely & maybe yes) among males different from the proportion among females?
Let's try to answer question 1. Is there a difference? Yes.

LAB: DESCRIPTIVE STATISTICS IN R-STUDIO
First of all we can compute the mean: we type mean(nameofdataset$variable) and it calculates the mean. Then we compute the median with the same method, and min and max; we can also compute the range, the difference between max and min; the quantile, specifying in the parentheses which variable and which quantile we want (as a proportion between 0 and 1); the IQR, the interquartile range; the standard deviation; and the variance. We can do the same for the other variables: for some of them these measures are meaningful, for others they are not.
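A minimal sketch of the goodness-of-fit test described above, using chisq.test as a stand-in for the command used in class: the 5 categories of Think against a uniform null of 20% each (assuming all five response categories occur in the data).

obs <- table(e_bottle_2$Think)        # observed frequencies of the 5 categories
chisq.test(obs, p = rep(1/5, 5))      # H0: categories evenly distributed (20% each)

And the descriptive commands listed in the lab paragraph, collected in one place; the variable name age is assumed, since the notes only show the data-frame name Elderly explicitly.

mean(Elderly$age); median(Elderly$age)
range(Elderly$age); max(Elderly$age) - min(Elderly$age)
quantile(Elderly$age, 0.25)           # quantiles take a proportion between 0 and 1
IQR(Elderly$age); sd(Elderly$age); var(Elderly$age)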
Let’s try to make a graph. We want to have a box plot of the variable age. We just have to write boxplot and the name of the variable. How can we read it? We have the minimum at 72, the maximum at 84, we have the first quartile like 76, the third on 81 about, and the median value is the black line, about 77. It’s called 5 number summary. There’s a kind of tail on the positive side of out distribution. We can copy and past the image in the format we prefer. We can also make an histogram, with command hist, we just need the name of the variable and we have our hist ogr am. This is the basic level of the histogram. 66 We can be interested in understanding how the variable gender changes, we could create two box plot and then create a comparison. You have to remember which number is the gender, 1 and 2. And we can see that for males we have the lower part of the distribution a little bit more wire. We can also compute the absolute frequencies, for the ages it’s not so useful, so we’ll use the income. We have the possible outcome of the variable income. 121 people are rich, 243 middle income and 85 poor. If we want to know the relative frequencies. 27% of people are rich, 54% middle income, 19% are poor. If we want to see the same thing but comparing with the gender, we have this kind of table, with joint frequencies: we have 69% male poor; 13% rich and female. 67 Another option, with conditional frequencies. With margin = 1. If we sum up each line we have 100%. So we are standardizing. If we put the margin = 2 we have still 100%. It depends on which way you want to standardize, row or column. Margin means that we are standardizing looking at the total number of one of the variable we have. In margin = 1 the relative frequencies is compute to regard to total number of people that are male or female; if you sum up on row (30+53+15). In the last, with margin = 2 you have 100% if you sum up by column. Q2. Explore the social and relational participation of the elderly people that have been interviewed. Do women spend more time alone rather than men? Is it true that people with a higher education level spend more time outside than people with a lower education level? We have to check if these two groups have different behavior. The idea is that we have to check the mean of the two groups. We take the mean of the time spent alone by male and compare it with the one of females. We use the command aggregate with the variable standalone, and we aggregate by gender, so by=list(Elderly$gender), the last command is what we want to compute, so the mean. We can have problems. If we look at the variable time out we have some N/A data. We have to explain to R-studio how to behave when we have N/A data. 70 The difference between the two means is the mid point of the interval. 24.01.2022 SIMPLE LINEAR REGRESSION Relationship between numerical variables We start from an example. E-bottle case 1. Is there a relationship between the quantity of daily water that persons think they should drink to be healthy and the years of education? 2. Can the quantity of daily water that persons think they should drink be expressed as a linear function of the years of education? We study the relationship between Should and years study. By looking at the scatterplot, which is the relationship? Seems positive, as the years of study grow, we also have a growth in the number of liters recommended to drink. What we might be interested in, is measuring the intensity of this relationship. Covariant, correlation.. 
they are measure of relationship. COVARIANCE 71 This measure gives us the idea of the size and sign of the relationship. The problem is that it’s not standardized, so it’s hard to make comparison. The correlation index is more usable index. When we specify ro is the population correlation, when we have r it’s refer to the sample. CORRELATION COEFFICIENT By using the correlation coefficient, when it’s close to -1 we have perfect negative linear association; when +1 perfect positive linear association; in case of 0 we have independence or other type of relationship, not linear. 72 correctly set the hypothesis, it’s an hypothesis testing. The test is t, we know the distribution, the parameter is the linear correlation. The null hypothesis is ro=0, the linear correlation. P value is the smaller probability to which we can reject the null hypothesis. It’s small enough, so we reject the null hypothesis. INTRODUCTION TO LINEAR MODELS Regression analysis is used to: - Predict the value of a dependent variable based on the value of at least one independent variable; - Explain the impact of changes in an independent variable of the dependent variable 75 The blue one is the coefficient of the slope, the blue one of the intercept. We can have a look at the summary results of the model. We saw that the output was divided in two different components. The coefficients table and the summary (from residual to F-statistic) table. We are now trying to understand how to interpret these results and which are the information they provide. INFERENCE AND EVALUATION Also in the estimation of the parameters we can make inference, in order to understand if they are significant or not. We’ll evaluate the good of the model. The model can be good or bad, it’s good if the model is close to the observed value in the sample. We need measures in order to understand how much were we able to minimize the variable. Let’s start form the table of coefficient: What is the hypothesis test that we want to verify on the B parameter? The main aim focuses on the B1 slope. The set of hypothesis we need to verify is if the B1 76 value is equal to 0 (status quo) or if different. We have done something very similar when we saw the test or r coefficient correlation. We start from the Multiple R-squared, it’s SSR/SST. The main aim is to minimize the error. As small as possible. To do so we define the value Yi-Y marked; in order to assess whether the goodness is bad or not, we take the sum of all the distances Tabl of coefficients Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.004325 0.074045 -27.07 <2e-16 *** Years_study 0.275637 0.004732 58.26 <2e-16 *** t = b1 - b1 sb1 = b1 sb1 The test statistics has a ^ƚƵĚĞŶƚ Ɛ͛ T distribution with n-2 d.f. 
H0: ɴ1 = 0 (no effect of X on Y) H1: ɴ1 z 0 (otherwise) sb1 = VARIANCE OF THE LINEAR MODEL SAMPLE DEVIANCE OF X1 xi - x( )2 = n - 1( ) sX2 i=1 n å P-Value Reject H0: $Years_study has a significant impact on $Should The coefficient of determination « Residual standard error: 0.1744 on 118 degrees of freedom Multiple R-squared: 0.9664, Adjusted R-squared: 0.9661 F-statistic: 3394 on 1 and 118 DF, p-value: < 2.2e-16 > anova(m1) Analysis of Variance Table Response: Should Df Sum Sq Mean Sq F value Pr(>F) Years_study 1 103.203 103.20 3393.7 < 2.2e-16 *** Residuals 118 3.588 0.03 --- SignifFRGHVµ ¶µ ¶µ ¶µ¶µ ¶ SSR ¦ i(yÖ  y)2 ¦(y  yÖ )2 i i SSE ¦(y  y)SST 2 i Total Sum of Squares Regression Sum of Squares Error (residual) Sum of Squares R2 = SSR / SST = 1 ʹSSE/SST 96.64% of the variability of $Should is explained by the linear regression model 77 between the predicted and the observed. We take the square (we remove the fact that they can be positive or negative). There’s another part of variability which is the overall variability. It’s the sum of square regression, it’s the sum of the square differences between the Yi and the mean. The distance from the sample mean and the fitted model. On the Anova table, it can be run in R studio. When we type, we have information on the sum of the square for the model and for the residuals. Residuals are the value 3.58, the sum of the square errors. By using those two measures I’m able to defined multiple r squared. It’s the ratio between some of the square regression, divided by total variability. SSR/SST We can even change the formula by replacing SST with 1-SSE/SST. Because SSR+SSE=SST. The top value for the R squared is 1 (100%). It’s the total variability. Since in SST I have two components, I want to minimize the error, we can range from 0 and 1. When 1= we are able to explain 100% of the overall variability. When it’s 0 I’m able to explain nothing. The greater, the better. I’m not just able to run a test, but also to provide a confidence interval of values, for the intercept and for the slope. The confidence interval can be run by confint setting the level of confidence. Otherwise you have 95%. By typing it we have the values of the confidence interval for b1… they are derived in the same way of when we were setting them for the mean. How can we interpret 0.27? it means that for every extra year of study, we have an increase of 0.27 in the amount of liters that the people think they should drink. 80 The representation is 3D. Drawing an area. We can obtain specific values and interpret them statistically. The model assumptions in this case, of the standard multiple regression model: 81 We are able to evaluate and compare the model. The method of the evaluation will not differ that much from the one for the simple linear regression model. How to do that? We have three models. 82 To understand which variable is useful, we watch the p value of each variable. According to the value of alfa. The third model is not the best. It’s not a great gain from the first model’s multiple-r squared. When you add variables you increase complexity. It’s not worth it. Adjusted coefficient of determination, R^2 The problem with R^2 is that never decreases when a new x variable is added to the model, even if the new variable is not an important predictor variable. This can be a disadvantage when comparing models. So the net effect of adding a new variable is losing a degree of freedom. 
The adjusted coefficient of determinator uses the same formula of R squared, but with somehow weighting. The adjusted R^2 provides a better comparison between multiple regression models with different numbers of independent variables. It penalizes the excessive use of unimportant independent variables. Usually the value is less then R^2. The f statistic is the hypothesis statistic to verify the joint significance of the model. It’s behind the multiple R^2. The F-statistic, providing information on the so called joint significance of the model. 85 25.01.2022 Yesterday we say the multiple regression model, we have set parameters B, B0 is the intercepts and then we have k Betas that represent the others variables influences. If we look at the size and signs of the coefficients, we understand how they influence, how much are they significant. But there’s a last point missed. Years has a higher impact then do? It depends by the units of measurement. We are measuring the change on the dependent variable by using liters and years! It’s difficult to understand which have the greater impact. The way we can do it is the STANDARDISATION PROCESS. We standardize coefficients in order to neutralize the unit of measurements problem that we observe. In a multiple linear regression model the impact of one independent variable on the dependent variable is related to the unit of measurement of the variables. To get a relative measure of the relevance of each independent variable Xj in influencing the dependent Y. This is the interpretation: 86 We can compute standardization by (not examinable): The bstar coefficient changed, higher impact on years_study influencing the variable study. QUANTITATIVE AND CATEGORICAL PREDICTORS There’s a way in which we can also use in a multiple regression model also categorical predictors. It’s not only about binary variables, but also categorical variables. Dependent variable must be a continuous quantitative variable. Covariates: independent variables We have the following questions (dataset: ebottle_2): 1. Study the joint relationship between should, education and gender 2. Study and evaluate the association between should and think 3. Do gender and education interact on should? DUMMY VARIABLES They are the ones that can assume only values 0 and 1. To convert a categorical variable with k categories in k-1 numerical variables: Value 1: the case presents the category Values 0: otherwise 87 We need only k-1 dummy variables: the category for which all the dummy variables are equal to zero is the baseline category (the one we use as reference). The regression coefficient of the dummy variable is the average difference of Y between the category and the baseline. This is an example I choose as reference category the gender male and I assign it 0, 1 to female. So male will be my reference category. For what concerns about the think variable, the possible categories are 5, so they generate 4 categories. We choose to consider as reference category the first one, absolutely yes. There’s no 1 for absolutely yes. REGRESSION WITH NOMINAL VARIABLES We’ll use R-Studio. So we have 52 females and 68 men. I want to check gender, so we check the various types of gender we have, the list in the dataset. (before it was a 1 and 2) Then we need to convert as numeric the variable gender by using the command as.numeric. We created the dummy, we associated the value 1 to female. 90 combination corresponds for one of the categories. If we separate gender, we have differential effects between group. 
In the corresponding plot, the slopes of the two interpolation lines are clearly different: there is a Z variable with some interaction with the X variable. We can treat the interaction of X and Z on Y by fitting a linear regression model that includes an interaction term, in order to find and measure the interaction between the two variables; in this way we isolate the effect of the interaction. To do that we create a new variable, which is the product of the two variables of interest. On average, for males, for a unit increase in the quantity of water actually drunk, the increase in the quantity of water people think they should drink is 1.02967 + 0.19770·(0) = 1.02967, ceteris paribus. If we consider females, on average, for a unit increase, the increase in the quantity they think they should drink is 1.02967 + 0.19770·(1) = 1.22737. Clearly the combined effect shows up for the category to which we assigned the value 1, so it becomes visible when we have the characteristic female. We can always create an interaction term, but we include it only if we believe there is some interaction effect. In this example the two slopes are not different enough; when one slope goes in a positive direction and the other in a negative direction, we have a clear interaction.

DIAGNOSTICS
Is there any place in R Studio where we can verify the assumptions we stated yesterday (the 5 assumptions)? Diagnostic techniques are generally used to explore possible violations of the linear model assumptions. We consider the linear regression model where Should is Y and the 3 independent variables are Do, Years_study and Timephones. The question is:
1. Predict the quantity of daily water that a person who drinks 2 liters of water a day, is 15 years …

From the regression output alone we cannot check the assumptions. The first diagnostic technique we explore concerns the linearity assumption: we want evidence on whether the relationship between Y and each X is linear or not. We can display a plot: looking at the two graphs, only the one on the left is linear. The component-plus-residual plot displays the relationship between the residuals and the original independent variable; we then interpret the results we get. In order to run it, you need to install the package car.

We can draw the histogram of the residuals and comment on it. How should the distribution of the residuals look in order to meet the assumptions? Normal and centered on 0. Here it is quite ok; a possible problem is that one tail is longer, with observations really far from the mean: they could be outliers. Another tool is the qqPlot: it provides information on the assumption of normality of the errors and on homoscedasticity, and it also flags outliers. If the points are distributed along the blue line, the assumptions are met; observations outside the blue band are outliers. Labels such as 95 and 83 mean that those records are probably outliers: they significantly influence the results, so they can be a problem, and we may consider dropping them from the data.

There are also other techniques we can use to check the assumptions, like the spread-level plot, which shows the relationship between the absolute standardized residuals and the fitted values: the two lines should overlap, so in this case we may have possible problems.
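A minimal R sketch of the interaction model and of the diagnostic plots discussed above, assuming the car package is installed and reusing the assumed names from the previous examples (the dummy 'female', the variables Do, Years_study, Timephones):

library(car)            # provides crPlots(), qqPlot(), spreadLevelPlot()

# interaction term: the formula Do * female expands to Do + female + Do:female
m_int <- lm(Should ~ Do * female, data = ebottle_2)
summary(m_int)          # the Do:female coefficient is the difference in slopes between groups

# diagnostics on the model with three predictors
m3 <- lm(Should ~ Do + Years_study + Timephones, data = ebottle_2)
crPlots(m3)             # component + residual plots: check the linearity assumption
hist(residuals(m3))     # residuals should look roughly normal and centered on 0
qqPlot(m3)              # normality of the errors; possible outliers are labelled
spreadLevelPlot(m3)     # constant variance (homoscedasticity) of the residuals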
Assumption n. 4 is that the random error terms are not correlated with one another. A possible display is a plot of the error terms against the original Y values: if you have a well-fitted model, the errors should float around the 0 line. We can also check whether there is statistical correlation between the error terms by running a test on them: the Durbin-Watson test, which checks for autocorrelation. The null hypothesis of the test is that the correlation between the errors is equal to 0. In order to exclude correlation in the errors, should we reject the null hypothesis or not? No, we should not reject it.

The final assumption we can check is that there is no linear relationship among the Xj's. If this assumption is violated the model cannot be fitted: the software neglects one of the collinear independent variables (e.g. the second one), so when you run your model you simply do not get a result for it.

MULTICOLLINEARITY
Multicollinearity is a problem. Collinearity means that a high correlation exists among two or more independent variables; the correlated variables therefore contribute redundant information to the multiple regression model. It is not always so evident: sometimes the relationship is weaker, so you need to set a threshold, i.e. use a diagnostic that provides a threshold and lets us decide correctly whether or not to keep certain variables (a possible way to run these checks in R is sketched below). Including two highly correlated explanatory variables can adversely affect the regression results:
- No new information provided
- Can lead to unstable coefficients (large standard errors: low t-values and large p-values)
- Coefficient signs and values may not match prior expectations.
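A short sketch of how these last two checks can be run with the car package; the model m3 is the one assumed above, and the variance inflation factor is a commonly used collinearity diagnostic (it is not named explicitly in the notes up to this point):

library(car)

durbinWatsonTest(m3)   # H0: autocorrelation of the errors = 0;
                       # a large p-value means we do NOT reject H0 (no evidence of correlated errors)

# variance inflation factors: values above a conventional threshold
# (often 5 or 10) flag predictors involved in strong collinearity
vif(m3)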