STATISTICS – SECOND PARTIAL

INTERVAL ESTIMATION

The interval estimator can be obtained by adding to and subtracting from an appropriate point estimator an amount called the margin of error (ME), which in turn depends on the standard error and on the distribution of the estimator.

Point estimator ± margin of error → Point estimator ± (reliability factor × standard error)

The reliability factor depends on the required confidence level (1−α) and on the distribution of the point estimator, and it determines the width of the interval. The confidence level (1−α) is the probability that the random confidence interval contains the unknown parameter. The interpretation is that (1−α)·100% of the intervals built from repeated samples contain the parameter, so we can be confident at the (1−α) level that the observed interval is one of those including the parameter.

Main cases (parameter; estimator and estimate; assumptions on the variances; standard error; population/sample conditions; distribution of the reliability factor):

- μ, estimated by $\bar{X}$ (estimate $\bar{x}$):
  - σ known: SE $= \sigma/\sqrt{n}$; Normal population or large sample; Normal percentile.
  - σ unknown: SE $= s/\sqrt{n}$; Normal population or large sample; Student's t with (n−1) df.
- μX − μY, estimated by $\bar{X}-\bar{Y}$ (estimate $\bar{x}-\bar{y}$), independent samples:
  - $\sigma^2_X$ and $\sigma^2_Y$ known: SE $= \sqrt{\sigma^2_X/n_X + \sigma^2_Y/n_Y}$; both Normal or large samples; Normal percentile.
  - $\sigma^2_X = \sigma^2_Y$ unknown: SE $= \sqrt{s^2_p/n_X + s^2_p/n_Y}$, with pooled variance $s^2_p = \frac{(n_X-1)s^2_X + (n_Y-1)s^2_Y}{n_X + n_Y - 2}$; both Normal or large samples; Student's t with (nX + nY − 2) df.
  - $\sigma^2_X \neq \sigma^2_Y$ unknown: SE $= \sqrt{s^2_X/n_X + s^2_Y/n_Y}$; both Normal or large samples; Student's t (only with R).
- μX − μY with paired samples (difference D = X − Y):
  - $\sigma^2_D$ known: SE $= \sigma_D/\sqrt{n} = \sqrt{(\sigma^2_X + \sigma^2_Y - 2\sigma_{XY})/n}$; joint Normal or large sample; Normal percentile.
  - $\sigma^2_D$ unknown: SE $= s_D/\sqrt{n} = \sqrt{(s^2_X + s^2_Y - 2s_{XY})/n}$; joint Normal or large sample; Student's t with (n−1) df.
- p, estimated by $\hat{P}$ (estimate $\hat{p}$): SE $= \sqrt{\hat{p}(1-\hat{p})/n}$; large sample; Normal percentile.
- pX − pY, estimated by $\hat{P}_X - \hat{P}_Y$ (estimate $\hat{p}_X - \hat{p}_Y$), independent samples: SE $= \sqrt{\hat{p}_X(1-\hat{p}_X)/n_X + \hat{p}_Y(1-\hat{p}_Y)/n_Y}$; large samples; Normal percentile.

The precision of an interval is determined by:
- Sample size.
- Confidence level (and distribution used).
- Dispersion: variance.

For the margin of error to be at most ME, it is necessary to select a sample with size $n \geq \left(\frac{z_{\alpha/2}\,\sigma}{ME}\right)^2$. If our estimator is the sample proportion, we set σ = 0.5 to account for the maximum possible value of the standard error.

- Chi-squared goodness-of-fit test: $\hat{\chi}^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ and $E_i$ are the observed and expected frequencies of category i.
- Chi-squared test of independence: $\hat{\chi}^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, computed on the r × c contingency table (a computational sketch of both tests follows below).
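As a quick numerical illustration of the interval-estimation recipe above (not part of the original notes), here is a minimal Python sketch using numpy and scipy; the data vector and the values of σ and ME are invented for the example.

```python
import numpy as np
from scipy import stats

# Invented sample, for illustration only
x = np.array([12.1, 9.8, 11.4, 10.7, 13.2, 10.1, 11.9, 12.5])
n, xbar, s = len(x), x.mean(), x.std(ddof=1)
se = s / np.sqrt(n)                                # standard error of the sample mean

alpha = 0.05
t_star = stats.t.ppf(1 - alpha / 2, df=n - 1)      # reliability factor: Student's t, n-1 df
me = t_star * se                                   # margin of error
print(f"95% CI for mu: ({xbar - me:.2f}, {xbar + me:.2f})")

# Sample size so that the margin of error is at most ME (sigma assumed known)
sigma, ME = 2.0, 0.5
z_star = stats.norm.ppf(1 - alpha / 2)
print("required n:", int(np.ceil((z_star * sigma / ME) ** 2)))
```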
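The two chi-squared statistics can be computed directly from the formulas above or, as sketched below, with scipy's built-in tests; the frequency tables are invented for the example.

```python
import numpy as np
from scipy import stats

# Goodness of fit: observed vs expected counts (invented; totals must match)
observed = np.array([18, 22, 30, 30])
expected = np.array([25, 25, 25, 25])
chi2_gof, p_gof = stats.chisquare(f_obs=observed, f_exp=expected)   # df = k - 1
print(f"goodness of fit: chi2 = {chi2_gof:.2f}, p-value = {p_gof:.3f}")

# Independence: r x c contingency table of observed counts (invented)
table = np.array([[20, 15],
                  [30, 35]])
chi2_ind, p_ind, dof, exp_freq = stats.chi2_contingency(table, correction=False)
print(f"independence: chi2 = {chi2_ind:.2f}, df = {dof}, p-value = {p_ind:.3f}")
```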
HYPOTHESIS TESTING: a procedure for assessing whether a given hypothesis on a population parameter θ is supported by the available empirical evidence. A hypothesis can be simple or composite, unilateral or bilateral. The procedure does not allow one to conclude whether a hypothesis is true or false; it rather assesses whether or not the observations in the random sample drawn from the population support a hypothesis on θ.

- H0 (null hypothesis): the hypothesis held to be true unless empirical evidence is clearly against it. It may be simple or composite, but it must always specify at least one value for the parameter.
- H1 (alternative hypothesis): the 'novelty' that contrasts the 'status quo'. It is typically the hypothesis one is interested in proving.

A statistical test is a procedure for comparing two contrasting hypotheses and deciding whether to:
- Accept (or fail to reject) H0.
- Reject H0.

A statistical test is typically based on a test statistic $\hat{\theta} = f(X_1, X_2, \dots, X_n)$. The distribution of the test statistic must depend on the parameter and must be fully determined once the parameter is known. A test based on the test statistic $\hat{\theta}$ defines a rejection (or critical) region: it includes all the samples corresponding to realisations of $\hat{\theta}$ that are unfavourable to H0 and thus lead to the rejection of H0 in favour of H1.

State of nature (never known, as the parameter is not known) vs statistical decision (based on a sample):
- H0 is true: rejecting H0 is a Type I error; failing to reject H0 is a correct decision.
- H1 is true: rejecting H0 is a correct decision; failing to reject H0 is a Type II error.

- Type I error: reject H0, given that H0 is true. The significance level of the test [α = Pr(R|H0)] is the maximum probability of incurring an error of the first type when the parameter is one of the values specified by H0. By convention this is the more serious error; this is why H0 is the status quo and we are conservative towards H0.
- Type II error: fail to reject H0, given that H0 is false. β = Pr(A|H1) is the probability of incurring an error of the second type when the parameter is one of the values specified by H1. It can be interpreted as the percentage of samples whose sample means wrongly lead to not rejecting H0: it does not refer to the decision made on the basis of a specific sample, but to the decision made on the basis of a generic sample.

The power of the test (π = 1 − β) is the probability of correctly rejecting H0, Pr(R|H1). Among all the tests with significance level α, the test with the lowest probability of committing an error of the second type (β) is chosen.

The p-value (observed significance level) is the probability, under the assumption that the null hypothesis is true, of observing a value of the test statistic at least as extreme as the observed realisation.
- If p-value ≥ α → do not reject H0. The observed sample is not particularly extreme, or at least not extreme enough with respect to the threshold defined by α, so it leads to not rejecting the null hypothesis.
- If p-value < α → reject H0. There is enough empirical evidence to reject H0 (see the sketch below).
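As an illustration of the decision rule (my own example, not from the notes), the sketch below runs a two-sided one-sample t test of H0: μ = 10 on an invented sample and applies the p-value rule.

```python
import numpy as np
from scipy import stats

x = np.array([10.8, 11.2, 9.9, 10.5, 11.7, 10.9, 11.4, 10.2])  # invented sample
mu0, alpha = 10.0, 0.05

t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)   # two-sided alternative by default
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

if p_value < alpha:
    print("Reject H0: the sample is extreme enough at the chosen significance level.")
else:
    print("Fail to reject H0: not enough empirical evidence against the status quo.")
```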
SIMPLE LINEAR REGRESSION ANALYSIS

The simple linear regression model of a random variable Y given a specific value of the explanatory variable is defined as $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where:
▪ $Y_i$ = dependent variable (random)
▪ $x_i$ = independent variable (deterministic, known)
▪ $\beta_0$, $\beta_1$ = intercept and slope of the linear model (unknown)
▪ $\varepsilon_i$ = error (random)

The least squares approach obtains the estimates b0 and b1 of the coefficients of the linear equation by minimizing the sum of squared residuals $SSE = \sum_{i=1}^{n} e_i^2$.

The total sum of squares (SST) is the sum of squared deviations of the observed values y1, y2, …, yn from their mean: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = (n-1)\,s_y^2$. Total sum of squares = dispersion of the yi.

The regression sum of squares (SSR) is the sum of squared deviations of the predictions from the mean of the dependent variable: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$. Regression sum of squares = dispersion explained by the line.

The sum of squared errors (SSE) is the sum of the squared differences between the observed values and those predicted using the regression line: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$. Error sum of squares = dispersion not explained by the line.

Note that: $SSR = SST \cdot r_{xy}^2 = (n-1)\,s_y^2\,r_{xy}^2 = (n-1)\,s_{\hat{y}}^2 = (n-1)\,b_1^2\,s_x^2$.

The least-squares regression line:
- $x_i$: observed value of X.
- $y_i$: value of Y observed at $X = x_i$.
- $\hat{y}_i = b_0 + b_1 x_i$: prediction of Y corresponding to $X = x_i$ using a straight line with intercept $b_0$ and slope $b_1$.
- $e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i$: the distance of each point $(x_i, y_i)$ from the linear equation, defined as the residual.

$b_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{s_{xy}}{s_x^2} = r_{xy}\,\dfrac{s_y}{s_x}$ is the slope of the regression line: for each additional unit of x, based on the regression line, y is 'expected' to change on average by $b_1$.

$b_0 = \bar{y} - b_1\bar{x}$ is the intercept of the regression line. It can be interpreted as the predicted y at x = 0.

By comparing SSE (or SSR) with its maximum value (SST), one can define a relative measure of goodness of fit of the regression line, the so-called coefficient of determination:
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST} = r_{xy}^2$
R2 measures the proportion of the total information contained in the data that is explained by the model (i.e. the proportion of the variation in the dependent variable explained by the regression line). It takes on values between 0 and 1 and is closely related to the strength of the linear relationship between the two variables.

The variances of the least-squares estimators depend on the variance of the errors, which can be estimated with its unbiased estimator, the mean square error: $s_e^2 = \dfrac{SSE}{n-2} = MSE$.

In order to study the properties of the estimators, it is necessary to make assumptions on the distribution of Y or, equivalently, on the distribution of the unique random component of the linear model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.

Weak assumptions:
- The error (erratic) component in the linear model is a random variable with expected value equal to zero: $E(\varepsilon_i) = 0$.
- The error component in the linear model is a random variable with constant variance $\sigma^2$ whatever the value of $x_i$ (homoscedasticity): $Var(\varepsilon_i) = \sigma^2$.
- The errors $\varepsilon_1, \dots, \varepsilon_n$ are uncorrelated: $Cor(Y_i, Y_j) = Cor(\beta_0 + \beta_1 x_i + \varepsilon_i,\ \beta_0 + \beta_1 x_j + \varepsilon_j) = Cor(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$. Thus the random variables, i.e. the sample components $(Y_1, \dots, Y_n)$, are also uncorrelated.

Strong assumptions:
- All the weak assumptions.
- The error component has a normal distribution, $\varepsilon_i \sim N(0, \sigma^2)$. Hence $Y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma^2)$.

Under the strong assumption, the estimators of the coefficients have the following distributions and estimated standard errors:
- $\hat{\beta}_0 \sim N\!\left(\beta_0,\ \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{(n-1)s_x^2}\right)\right)$; estimate $b_0$; estimated standard error $s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{(n-1)s_x^2}}$.
- $\hat{\beta}_1 \sim N\!\left(\beta_1,\ \dfrac{\sigma^2}{(n-1)s_x^2}\right)$; estimate $b_1$; estimated standard error $\dfrac{s_e}{\sqrt{(n-1)s_x^2}}$.

If the variance of the errors is not constant, the estimated residuals cannot be used to estimate the "common variance". In addition, since the standard errors of the least-squares estimators also depend on the standard error of the residuals, the t statistics used to test hypotheses on the regression coefficients are not fully reliable.

Predictions about the dependent variable (y) at a given value of the explanatory variable (x):
- On the expected value of Y, i.e. $\mu_g = E(Y_g)$. To a specific value $x_g$ corresponds a sub-population of values of Y whose average is estimated with $\hat{\mu}_g = b_0 + b_1 x_g$. It can be proved that $Var(\hat{\mu}_g) = \sigma^2\left(\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{(n-1)s_x^2}\right)$.
- On the exact value of Y, i.e. $Y_g$. The latter 'incorporates' the unobservable error $\varepsilon_g$, which causes the deviation of $Y_g$ from the regression line (i.e. from $\mu_g$): it incorporates the uncertainty associated with the deviation of the specific realisation of the dependent variable from the expected value predicted by the model. The variance of the prediction in this case must also include the extra variability due to the error associated with the forecast: $Var(\hat{Y}_g) = \sigma^2 + Var(\hat{\mu}_g)$.

Instead of using a point prediction, it is always advisable to build a confidence interval for the prediction, so that the uncertainty of the prediction is incorporated in the estimate. Extrapolating outside the observed range of x is very risky!
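A compact sketch of these quantities in Python with statsmodels (the simulated data and variable names are my own, not from the notes); get_prediction returns both the confidence interval for the mean response μg and the wider prediction interval for a single new Yg.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.8 * x + rng.normal(0, 1.0, size=x.size)   # simulated: beta0 = 2, beta1 = 0.8

X = sm.add_constant(x)               # design matrix with intercept column
model = sm.OLS(y, X).fit()           # least squares: minimizes SSE
b0, b1 = model.params
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, R^2 = {model.rsquared:.3f}")
print(f"s_e^2 = SSE/(n-2) = {model.mse_resid:.3f}")

# Prediction at x_g = 5: CI for the mean response vs prediction interval for Y_g
x_g = np.array([[1.0, 5.0]])         # row of the design matrix: [intercept, x_g]
frame = model.get_prediction(x_g).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```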
MULTIPLE LINEAR REGRESSION ANALYSIS

The multiple linear regression model relates a random variable Y to several explanatory variables, $Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + \varepsilon_i$, where:
▪ $Y_i$ = dependent variable (random)
▪ $x_{1i}, x_{2i}, \dots, x_{Ki}$ = independent variables (deterministic, known)
▪ $\beta_0$ = intercept of the linear model (unknown)
▪ $\beta_1, \beta_2, \dots, \beta_K$ = model parameters (unknown)
▪ $\varepsilon_i$ = error (random variable)

The coefficient b1 represents the average change in y corresponding to a unit change in x1, holding every other variable constant. With a categorical explanatory variable (e.g. F/M or A/B/C), the interpretation is descriptive: the average of group F differs from the average of group M by the coefficient b1.

The least-squares procedure computes the estimated coefficients by minimizing the sum of squared residuals (SSE): among all the (hyper)planes, we choose the one that minimizes the sum of squared errors.

The total sum of squares (SST) is the sum of squared deviations of the observed values y1, y2, …, yn from their mean: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = (n-1)\,s_y^2$. Total sum of squares = dispersion of the yi.

The regression sum of squares (SSR) is the sum of squared deviations of the predictions from the mean of the dependent variable: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$. Regression sum of squares = dispersion explained by the model.

The sum of squared errors (SSE) is the sum of the squared differences between the observed values and those predicted by the model: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$. Error sum of squares = dispersion not explained by the model.

The coefficient of determination assesses the goodness of fit of the model:
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST} = 1 - \dfrac{s_e^2\,(n-K-1)}{s_y^2\,(n-1)}$

In the case of multiple regression, the adjusted R2 is used, which also takes into account the sample size and, above all, the number of explanatory variables in the model; it is used to compare the fit of models with a different number of explanatory variables:
$\text{Adjusted } R^2 = 1 - \dfrac{SSE/(n-K-1)}{SST/(n-1)} = 1 - \dfrac{s_e^2}{s_y^2}$
SST is the total sum of squares of the dependent variable, that is the sum of squared deviations of the observed values from their mean; SSE is the sum of the regression's squared residuals, that is the sum of the squared differences between the observed values and their predictions based on the considered model.

The variance of the least-squares estimators depends on the variance of the errors, estimated by $s_e^2 = \dfrac{SSE}{n-K-1} = MSE$. Its square root, $s_e$, is called the standard error of the model or standard error of the residuals.

To determine whether several coefficients are all simultaneously equal to 0, we use the F test:
$F\text{-statistic} = \dfrac{SSR/K}{SSE/(n-K-1)} = \dfrac{MSR}{s_e^2}$
Note that this quantity assumes only non-negative values.

Multicollinearity is a problem that arises when some of the explanatory variables are highly correlated with each other. In this case, a change in one of these explanatory variables also leads to a change in the others. This leads to high standard errors of the coefficients, as it is not possible to vary a single variable 'while keeping all the others fixed'.
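To close, a sketch of a multiple regression fit with statsmodels on simulated data (my own example, not from the notes), reading off R2, adjusted R2, the F statistic and the standard error of the residuals, plus a variance inflation factor (VIF) check for multicollinearity, a diagnostic not covered in the notes.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)                  # mildly correlated with x1
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)  # simulated data

X = sm.add_constant(np.column_stack([x1, x2]))      # intercept + K = 2 regressors
model = sm.OLS(y, X).fit()

print("coefficients b0, b1, b2:", np.round(model.params, 3))
print(f"R^2 = {model.rsquared:.3f}, adjusted R^2 = {model.rsquared_adj:.3f}")
print(f"F = {model.fvalue:.2f} on ({model.df_model:.0f}, {model.df_resid:.0f}) df, "
      f"p-value = {model.f_pvalue:.3g}")
print(f"standard error of the residuals s_e = {np.sqrt(model.mse_resid):.3f}")

# Variance inflation factors for x1 and x2 (large values would signal multicollinearity)
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIF(x1), VIF(x2):", np.round(vifs, 2))
```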