Econometrics – Economics and Finance (Econometrics lecture notes)

Econometrics, Sergio Pastorello, CLEF (Economics and Finance) programme, academic year 2021/2022. These notes combine the slides, notes taken in class, and the textbook, and contain all the theory needed to pass the exam. Exam passed with a grade of 29.

Type: lecture notes
Author: AnnaM00
ECONOMETRICS

Information about the course
• Class starts at 8.30
• Office hours: Tuesday 3-5pm
• The course is an introduction to the main econometric tools and techniques: linear regression and least squares estimation; instrumental variables estimation; panel data and/or limited dependent variable models.
• Aims: learn how to conduct an empirical research project autonomously and correctly; learn how to interpret and discuss results reported by others.
• Textbook: HGL, Hill, Griffiths and Lim, Principles of Econometrics, international student version, 5th ed., Wiley, 2018; PER (not necessary): Colonescu, Principles of Econometrics Using R, 2016.
• Software: R and RStudio, which are much more advanced and appreciated than Stata, but more difficult. You typically need to install R first and then RStudio. Download the data, which you can find on the textbook website https://bookdown.org/ccolonescu/RPoE4
• How to succeed: go through the textbook and the scripts right after class; understand the logic; don't keep questions to yourself.
• Tools: textbooks, slides, notes, website, scripts, data, forum, examples from HGL, other exercises from HGL.
• Exam: 16 questions (5 multiple-choice questions on theory, 11 empirical exercises to be solved with the software). Weekly quiz: very much like the exam. The second midterm has the same format as the full exam.

List of topics discussed during classes
19/09/2022 (3): course intro, intro to econometrics, the simple linear regression model, economic model and econometric model, Least Squares Principle, Least Squares estimation (HGL chap. 1 complete, chap. 2 up to par. 2.3.1 included; slides intro complete, 1 complete, 2 1-15).
22/9/2022 (6): introduction to R and RStudio: loading the data, descriptive statistics and plots. Least squares estimation of the simple linear regression model. Economic interpretation of OLS estimates. Marginal effects and elasticities (HGL chap. 2 up to par. 2.3.2 included; slides 2 16-17).
29/9/2022 (9): Properties of the LS estimator: unbiasedness, variance. Gauss-Markov Theorem, distribution of the LS estimator. Estimating the errors' variance. Standard errors of the parameters. Applications in R (HGL chap. 2 up to par. 2.7.2 included; slides 2 23-39).
3/10/2022 (12): Nonlinear models. Confidence intervals and tests of simple hypotheses. Applications in R (HGL chap. 2 up to par. 2.9 included, chap. 3 up to par. 3.2.3 included; slides 2 complete, 3 1-13).
6/10/2022 (15): Tests of hypotheses on a single parameter using critical values / rejection regions and p-values. Point estimation, interval estimation and hypothesis testing of a linear combination of parameters. Applications in R (HGL chap. 3 complete, appendixes excluded; slides 3 complete).
10/10/2022 (18): Out-of-sample point and interval forecasts; goodness-of-fit, variance decomposition, R² coefficient, R² pros and cons (HGL chap. 4 up to par. 4.2.2 included; slides 4 1-14).
13/10/2022 (21): Effects of rescaling the data. Choosing the functional form of the model; residual plots, heteroskedasticity, misspecification. Normality, Jarque-Bera test. Applications in R (HGL chap. 4 up to par. 4.4.1 included; slides 4 15-38).
17/10/2022 (24): Natural and corrected prediction, and generalized R squared for log-linear and log-log models. Multiple linear regression model: introduction, Gauss-Markov assumptions, Least Squares estimation, properties of the LS estimator, distribution, interval estimation, tests of simple hypotheses.
Nonlinear in variables models and point estimation for nonlinear function of the parameters. Applications in R  (HGL chap 4 complete except for the  appendixes; chap 5 up to par. 5.6 included, except par. 5.2.4; slides 4 complete, 5 1-44). Homework Homework 1: Solve exercises 2.1, 2.2 (Hint: to compute probabilities under the normal distribution in R, look at the manual page for command "pnorm") and 2.16 of HGL5. Homework 2: Solve exercise 2.17 of HGL. Homework 3: Solve exercise 2.23 and 2.28 of HGL. How are Data generated? • Experimental: (almost never the case in economics) it is when we conduct or observe the outcome of an experiment. A key aspect of experimental data is that the values of the explanatory variables (i.e. control variables) can be fixed at specific values in repeated trials of the experiment. • Quasi-Experimental: pure experimental data are such that individuals are randomly assigned to the control and treatment groups. With quasi-experimental data, allocation to the control and treatment groups is not random but based on another criterion. • Non experimental: (almost always the case in economics). Observational, like survey data. Data on all variables are collected simultaneously, and the values are neither fixed nor repeatable. Economic Data Types • Data characteristics - aggregation level: micro or macro - flow (over a period of time) or stock (at a particular point in time) - quantitative or qualitative Cross-section (yᵢ, xᵢ, zᵢ) i=1,2, …, N Data where observational units are observed at the same time (or during the same particular time period). Time series (yt, xt, zt) t=1,2, …,T It is data collected over discrete intervals of time, e.g. observing stock indices for a certain lapse of time, like monthly, quarterly, etc. The key feature of time-series data is that the same economic quantity is recorded at a regular time interval. Panel (we are not going to have time to study it) (yit, xᵢt, zᵢt) i=1,2, …, N t=1,2, …,T Also knows as “longitudinal”, is data that has both dimensions: cross-sectional and temporal. A panel has observations on individual micro-units that are followed over time. If we have the same number of time period observations for each micro-unit, which is the case here, we have a balanced panel. Chapter 2: simple linear regression model Like all models, the regression model is based on assumptions, which are conditions under which the analysis in subsequent chapters is appropriate. An economic model Economic example in order to understand regression models: suppose that we are interested in studying the relationship between household income and expenditure on food, and only those whose income is $1000 per week. We randomly select a number of households from this population and ask about their food consumption per person last week. We denote as y the weekly food expenditure, which is a random variable, since its value is unknown to us until a household is selected and the question is asked and answered. Economic theory suggests a certain level of consumption for a household with 1000 income. In econometrics, the consumption is depicted by the curve below, which is a pdf (probability density function) f(y). The amount spent on food per person will vary from one household to another for a variety of reasons: food preference, ages, eating habits. These and many other factors, will cause weekly expenditures on food to vary from one household to another, even at the same level of income. 
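Before turning to the formal model, it can help to look at data of this kind directly. Here is a minimal R sketch for loading and summarising the textbook's food expenditure data; the file name food.rdata and the column names food_exp and income are assumptions based on the textbook website conventions, so adjust them to your copy of the data:

# Load the food expenditure data (assumed file and column names)
load("food.rdata")          # assumed to create a data frame called 'food'
head(food)                  # first observations
summary(food$food_exp)      # descriptive statistics of weekly food expenditure
summary(food$income)        # descriptive statistics of weekly income
plot(food$income, food$food_exp,
     xlab = "Weekly income", ylab = "Food expenditure")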
The pdf f(y) describes how expenditures are “distributed” over the population, and is conditional upon household income. If x = weekly income = 1000, then the conditional pdf is f(y|x=1000), with conditional mean of y equal to E(y|x=1000). The conditional variance of y is var(y|x=1000)=σ², measuring the dispersion of y around the mean. Economic theory tells us that expenditure on economic goods depends on income. Consequently, we call y the “dependent variable” and x the “independent” or “explanatory” variable. In econometrics, we recognize that real-world expenditures are random variables, and we want to use data to learn about the relationship. We can use econometrics to make predictions about changes. In order to investigate the relationship between expenditure and income, we must build an economic model and then a corresponding econometric model that forms the basis for a quantitative or empirical economic analysis. In order to use data, we must now specify an econometric model that describes how the data on household income and food expenditure are obtained and that guides the econometric analysis. This is the case of having two households: one is richer than the other. - The distribution is similar - The mean is different An Econometric model Given the economic reasoning in the previous section, and to quantify the relationship between food expenditure and income, we must progress to an econometric model. Suppose a three-person household has an unwavering rule that each week they spend $80 and then also spend 10 cents of each dollar of income received on food: y = 80 + 0.10x Knowing this relationship we know that in a week in which the income is $1000, the household will spend $180 on food or that if income increases by $100 then expenditure will be $190. These are predictions of food expenditure given income. Predicting the value of one variable given the value of another, or others, is one of the primary uses of regression analysis. A second primary use of regression analysis is to attribute, or relate, changes in one variable to changes in another variable. To that end, let “Δ” denote “change in” in the usual algebraic way. Much of economic and econometric analysis is an attempt to measure a causal relationship between two economic variables. Claiming causality here, that is, changing income leads to a change in food expenditure, is quite clear given the household’s expenditure rule. It is not always so straightforward. In reality, many other factors may affect household expenditure on food. To account for these realities, we suppose that the household’s food expenditure decision is based on the equation: y = β₁ + β₂x + e In addition to the unknown parameters β₁ and β₂, (instead of 80 and 0.10), the equation contains an error term e, which represents all those other factors affecting weekly food expenditure. Regression analysis is a statistical method that uses data to explore relationships between variables. A simple linear regression analysis examines the relationship between a y-variable and one x-variable. It is said to be “simple” because there is only one x- variable. The y-variable is called the dependent variable, the outcome variable, the explained variable, the left-hand-side variable, or the regressand. y = β₁ + β₂x + e is a simple linear regression model, and as such it requires assumptions: - the first assumption is that the above relationship holds for the members of the population under consideration - the unknown β₁ and β₂ are called population parameters. 
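To make the role of the error term e concrete, here is a small illustrative R simulation of the econometric model y = β₁ + β₂x + e. The intercept and slope follow the household rule above; the sample size, income range and error standard deviation are made-up values for illustration only:

set.seed(123)
N     <- 40
beta1 <- 80            # intercept, as in the household rule above
beta2 <- 0.10          # slope: 10 cents of each dollar of income
x     <- runif(N, 500, 2000)          # weekly incomes in dollars (arbitrary range)
e     <- rnorm(N, mean = 0, sd = 25)  # "everything else" affecting food expenditure
y     <- beta1 + beta2 * x + e        # food expenditure generated by the model
plot(x, y, xlab = "Income", ylab = "Food expenditure")
abline(a = beta1, b = beta2)          # the true regression line E(y|x)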
The field of statistics was developed because, in general, populations are large, and it is impossible (or impossibly costly) to examine every population member. Statistical and econometric methodology examines and analyses a sample of data from the population. After analyzing the data, we make statistical inferences, i.e. conclusions or judgments about a population based on the data analysis.
The data generating process
The sample of data, and how the data are actually obtained, is crucially important for subsequent inferences. For the household food expenditure example, let us assume that we can obtain a sample at a point in time (cross-sectional data), randomly selected. Let (yᵢ, xᵢ) denote the ith data pair, i = 1, ... , N. The variables yᵢ and xᵢ are random variables, because their values are not known until they are observed.
This is our third assumption, the homoskedasticity assumption: at each xᵢ the variation of the random error component is the same. Assuming the population relationship yᵢ = β₁ + β₂xᵢ + eᵢ, the conditional variance of the dependent variable is:
var(yᵢ|xᵢ) = var(β₁ + β₂xᵢ + eᵢ|xᵢ) = var(eᵢ|xᵢ) = σ²
The simplification works because by conditioning on xᵢ we are treating it as if it were known, and therefore not random, which in turn makes the component β₁ + β₂xᵢ not random. The conditional homoskedasticity assumption implies that at each level of income the variation in food expenditure about its mean is the same. That means that at each and every level of income we are equally uncertain about how far food expenditure might fall from its mean value E(yᵢ|xᵢ) = β₁ + β₂xᵢ. If this assumption is violated, and var(eᵢ|xᵢ) ≠ σ², then the random errors are said to be heteroskedastic.
Generalising the Exogeneity Assumption
So far we have assumed that data pairs have been drawn from a random sample and are iid. What happens if the sample values of the explanatory variable are correlated? And how might that happen? A lack of independence occurs naturally when using financial or macroeconomic time-series data. Suppose we observe the monthly report on new housing starts, yt, and the current 30-year fixed mortgage rate, xt, and we postulate the model yt = β₁ + β₂xt + et. The data yt, xt can be described as macroeconomic time-series data. In contrast to cross-section data, where we have observations on a number of units at a given point in time, with time-series data we have observations over time on a number of variables. Both yt and xt are random because we do not know the values until they are observed. Furthermore, each of the data series is likely to be correlated across time. The assumption that the pairs (yt, xt) represent random iid draws from a probability distribution is not realistic. When considering the exogeneity assumption for this case, we need to be concerned not just with possible correlation between xt and et, but also with possible correlation between et and every other value of the explanatory variable, namely, xs, s = 1, 2, ... , T. If xs is correlated with xt, then it is possible that xs (say, the mortgage rate in one month) may have an impact on yt (say, housing starts in the next month). Since it is xt, not xs, that appears in the equation yt = β₁ + β₂xt + et, the effect of xs will be included in et, implying that E(et|xs) ≠ 0. We can use xs to help predict et, meaning that the pairs (yt, xt) are not independent. To extend the strict exogeneity assumption to models where the values of x are correlated, we need to assume E(et|xs) = 0 for all s = 1, 2, ..., T.
This means that we cannot predict the random error at time t, using any of the values of the explanatory variable. To write this assumption in a more convenient form, we introduce the notation x = (x₁, x₂, … , xN). That is, we are using x to denote all sample observations on the explanatory variable. Then, a more general way of writing the strict exogeneity assumption is E(eᵢ|x)=0. It is a weaker assumption than E(eᵢ|xᵢ)=0, and independent pairs, and it enables us to derive a number of results for cases where different observations on x may be correlated as well as for the case where they are independent. Error Correlation Fourth assumption. It is also possible that we have correlation between the random error terms. With cross-sectional data, data on households, individuals, or firms collected at one point in time, there may be a lack of statistical independence between random errors for individuals who are spatially connected. We may also have error correlation in a time-series context, so that cov(et,et+1) ≠ 0, cov(et,et+2) ≠ 0 and so on. This is called serial correlation, or autocorrelation. The starting point in regression analysis is to assume that there is no error correlation. In time-series models, we start by assuming cov(et,es|x) = 0 for t ≠ s, and for cross- sectional data we start by assuming cov(eᵢ,ej|x) = 0 for i ≠ j. Variation in x Fifth assumption. In a regression analysis, one of the main objectives is to estimate β₂ = ΔE(yᵢ|xᵢ)/Δxᵢ. Recall from elementary geometry that “it takes two points to determine a line.” You will find out later that in fact the more different values of x, and the more variation they exhibit, the better our regression analysis will be. Point is, we cannot estimate the effect of changing income if we only analyse households with income $1000. Error normality We have explicitly made the assumption that food expenditures, given income, were normally distributed. We implicitly made the assumption of conditionally normally distributed errors and dependent variable by drawing classically bell-shaped curves. It is not at all necessary for the random errors to be conditionally normal in order for regression analysis to “work”, however, when samples are small, it is advantageous for statistical inferences that the random errors, and dependent variable y, given each x- value, are normally distributed. One argument for assuming regression errors are normally distributed is that they represent a collection of many different factors. The Central Limit Theorem says roughly that collections of many random factors tend toward having a normal distribution. When the assumption of conditionally normal errors is made, we write eᵢ|xᵢ ∼ N(β₁+β₂xᵢ,σ²). It is a very strong, but also optional and sixth assumption. Summarising the Assumptions If these SR assumptions hold (SR1-SR6), then regression analysis can successfully estimate the unknown population parameters β₁ and β₂ and we can claim that β₂=ΔE(yᵢ,xᵢ)/Δxᵢ = dE(yᵢ,xᵢ)/dxᵢ measures a causal effect. SR1. Econometric Model The data (yᵢ,xᵢ), i = 1,2,...,N come from a population satisfying: yᵢ = β₁ + β₂xᵢ + eᵢ SR2. Strict Exogeneity Conditionally on x = (x₁, x₂, . . . , xN)′, the errors have mean zero: E(eᵢ|x) = 0, i = 1,2,…,N Strict exogeneity implies: - E(yᵢ|x)=β₁+β₂xᵢ - E(eᵢ)=0 - Cov(eᵢ,xᵢ) = 0 for i = 1,2,...,N (We have used the Law of Iterated Expectations (LIE): y, x r.v. E(y) = E[E(y|x)]. 
The Law of Iterated Expectations states that the expected value of a random variable is equal to the expected value, i.e. the probability-weighted average, of its conditional expectation given a second random variable.)
SR3. Homoskedasticity
Var(eᵢ|x) = σ², i = 1,2,…,N
If the error eᵢ is conditionally homoskedastic, the dependent variable yᵢ must be, too:
Var(yᵢ|x) = Var(β₁ + β₂xᵢ + eᵢ|x) = Var(eᵢ|x) = σ²
SR4. Conditionally uncorrelated errors
Cov(eᵢ, eⱼ|x) = 0 if i ≠ j
- it can be strengthened by assuming independent eᵢ
- SR3 and SR4 allow us to compute the variances and covariances of the OLS estimator
SR5. Explanatory Variables Must Vary
In the sample, the x variable takes at least two distinct values; otherwise all observations would lie on a vertical line. This is crucial if we want to estimate the slope β₂ = ΔE(yᵢ|xᵢ)/Δxᵢ
SR6. Conditionally Normal Errors
eᵢ|x ∼ N(0, σ²), i = 1,2,…, N
This assumption is not crucial for most properties and results.
The random error e and the dependent variable y are both random variables. There is, however, one interesting difference between them: y is "observable" and e is "unobservable." If the regression parameters β₁ and β₂ were known, then for any values yᵢ and xᵢ we could calculate eᵢ = yᵢ - (β₁ + β₂xᵢ). What comprises the error term e? The random error e represents all factors affecting y other than x, or what we have called "everything else":
1. Any other factor that affects y, such as omitted explanatory variables.
2. Any approximation error that arises because the linear functional form we have assumed may be only an approximation to reality.
3. Any element of random behaviour that may be present in each individual, such as unpredictable human behaviour.
If we have omitted some important factor, or made any other serious specification error, then assumption SR2, E(eᵢ|x) = 0, will be violated, which will have serious consequences.
Estimating the Regression Parameters
The economic and econometric models we developed in the previous section are the basis for using a sample of data to estimate the intercept and slope parameters, β₁ and β₂. For illustration we examine typical data on household food expenditure and weekly income from a random sample of 40 households. Representative observations and summary statistics are given in Table 2.1. We will call the random estimators b₁ and b₂ the ordinary least squares estimators; "ordinary least squares" is abbreviated as OLS. Their formulas are b₂ = ∑(xᵢ - x̄)(yᵢ - ȳ) / ∑(xᵢ - x̄)² and b₁ = ȳ - b₂x̄. If we plug the sample values yᵢ and xᵢ into the formulas, then we obtain the least squares estimates of the intercept and slope parameters β₁ and β₂. It is interesting, however, and very important, that the formulas for b₁ and b₂ are perfectly general and can be used no matter what the sample values turn out to be. This suggests once again that b₁ and b₂ are random variables. When actual sample values are substituted into the formulas, we obtain numbers that are the observed values of random variables. To distinguish these two cases we call:
• the least squares estimators: the rules or general formulas for b₁ and b₂, which are random variables
• the least squares estimates: the numbers obtained when the formulas are used with a particular sample, applied to the observed data.
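In R, the least squares estimates for the food expenditure example can be obtained with lm(). A minimal sketch, assuming the food data frame with columns food_exp and income from the earlier loading example:

fit <- lm(food_exp ~ income, data = food)   # OLS of food expenditure on income
coef(fit)         # b1 (intercept) and b2 (slope) estimates
summary(fit)      # estimates, standard errors, t statistics, R-squared
plot(food$income, food$food_exp)
abline(fit)       # fitted regression line y-hat = b1 + b2*x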
A convenient way to report the values for b₁ and b₂ is to write out the estimated or fitted regression line, with the estimates rounded appropriately:
ŷᵢ = b₁ + b₂xᵢ
One of the characteristics of the fitted line is that it passes through the point defined by the sample means (x̄, ȳ), which follows from writing ȳ = b₁ + b₂x̄.
Interpreting the estimates
Once obtained, the least squares estimates are interpreted in the context of the economic model under consideration. In our food expenditure vs income example:
• The value b₂ = 10.21 is an estimate of β₂. The regression slope β₂ is the amount by which expected weekly expenditure on food per household increases when household weekly income increases by $100. Thus, we estimate that if weekly household income goes up by $100, expected weekly expenditure on food will increase by approximately $10.21, holding all else constant.
• The intercept estimate b₁ = 83.42 is an estimate of the expected weekly food expenditure for a household with zero income. In most economic models we must be very careful when interpreting the estimated intercept. The problem is that we often do not have any data points near x = 0, something that is true for the food expenditure data. If we have no observations in the region where income is zero, then our estimated relationship may not be a good approximation to reality in that region. So, although our estimated model suggests that a household with zero income is expected to spend $83.42 per week on food, it might be risky to take this estimate literally.
Elasticities
Income elasticity is a useful way to characterize the responsiveness of consumer expenditure to changes in income. The elasticity of a variable y with respect to another variable x is
ε = (percentage change in y) / (percentage change in x) = [100(Δy/y)] / [100(Δx/x)] = (Δy/Δx)·(x/y)
In the linear economic model we have shown that β₂ = ΔE(y|x)/Δx, so the elasticity of mean expenditure with respect to income is
ε = [ΔE(y|x)/Δx]·[x/E(y|x)] = β₂·x/E(y|x)
To estimate the elasticity we replace β₂ with b₂. Given that the elasticity depends on income, how do we summarise it with one number that represents the model? Most commonly the elasticity is calculated at the "point of the means" (x̄, ȳ), because it is a representative point on the regression line, or with the mean elasticity:
• Elasticity at the means: ε̂ = b₂·x̄/ȳ (not recommended)
• Mean elasticity: ε̂ = (1/N)∑ε̂ᵢ, where ε̂ᵢ = b₂·xᵢ/yᵢ (preferable)
This estimated income elasticity takes its usual interpretation. We estimate that a 1% increase in weekly household income will lead to a 0.71% increase in expected weekly household expenditure on food, when x and y take their sample mean values, (x̄, ȳ) = (19.60, 283.57).
Assessing the Least Squares Estimators
• The formulas for b₁ and b₂ define two estimators (random variables).
• In a sample, b₁ and b₂ define two estimates (realizations of a r.v.).
It is like flipping a coin: before you flip it, the outcome is random, and after you flip it, it is a number. Using data, we estimate the parameters of the regression model yᵢ = β₁ + β₂xᵢ + eᵢ using the least squares formulas, obtaining b₁ and b₂. We can wonder how good our least squares estimates b₁ and b₂ are, but this question is not answerable, given that we will never know the true values of the population parameters β₁ and β₂, and that estimates have no properties.
All we can do is to examine the quality of the least squares estimation procedure, which will then be applied to all the samples we might collect, each of which would lead to different b₁ and b₂. This sampling variation is unavoidable, as the yᵢ (household food expenditures) are random variables: their values are not known until the sample is collected. Consequently, when viewed as an estimation procedure, b₁ and b₂ are also random variables, because their values depend on the random variable y. In this context, we call b₁ and b₂ the least squares estimators. We can investigate the properties of the estimators b₁ and b₂, which are called their sampling properties, and deal with the following important questions:
1. If the least squares estimators b₁ and b₂ are random variables, then what are their expected values, variances, covariances, and probability distributions?
2. The least squares principle is only one way of using the data to obtain estimates of β₁ and β₂. How do the least squares estimators compare with other procedures that might be used, and how can we compare alternative estimators?
We examine these questions in two steps to make things easier.
- In the first step, we investigate the properties of the least squares estimators conditional on the values of the explanatory variable in the sample, x. It means that when we consider all possible samples, the x values in the sample stay the same from one sample to the next; only the random errors and food expenditure values change. This assumption is clearly not realistic, but it simplifies the analysis. By conditioning on x, we are holding it constant, or fixed, meaning that we can treat the x-values as "not random."
- In the second step we return to the random sampling assumption and recognise that the (yᵢ, xᵢ) data pairs are random. We will however note that most of our conclusions from step one still hold.
The Estimator b₂
We rewrite the formula for b₂ to facilitate analysis. Until now we have seen b₂ in the "deviation from the mean" form:
b₂ = ∑(xᵢ - x̄)(yᵢ - ȳ) / ∑(xᵢ - x̄)²
Using assumption SR1 we can write b₂ as a linear estimator,
b₂ = ∑wᵢyᵢ, where wᵢ = (xᵢ - x̄) / ∑(xᵢ - x̄)²
The term wᵢ depends only on x. Because we are conditioning our analysis on x, the term wᵢ is treated as if it is a constant. We remind you that conditioning on x is equivalent to treating x as given, as in a controlled, repeatable experiment. Any estimator that is a weighted average of the yᵢ's, as in this formula, is called a linear estimator. With more algebra, we can express b₂ in a theoretically convenient way, which is however not useful for computations, as it depends on β₂ and on the eᵢ's:
b₂ = β₂ + ∑wᵢeᵢ
The Expected Values of b₁ and b₂
The OLS estimator b₂ is a random variable since its value is unknown until a sample is collected. If our model assumptions hold, then E(b₂|x) = β₂; that is, given x the expected value of b₂ is equal to the true parameter β₂. When the expected value of any estimator of a parameter equals the true parameter value, then that estimator is unbiased. Therefore, the least squares estimator b₂ given x is an unbiased estimator of β₂. We will later show that the least squares estimator b₂ is unconditionally unbiased as well, E(b₂) = β₂. The intuitive meaning of unbiasedness comes from the sampling interpretation of mathematical expectation. The parameter β₂ is not random. It is a population parameter we are trying to estimate. Conditional on x we can treat xᵢ as if it is not random. Then, conditional on x, wᵢ is not random either, as it depends only on the values of xᵢ.
The only random factors in b₂ = β₂ + ∑wᵢeᵢ are the random error terms eᵢ. We can find the conditional expected value of b₂ using the fact that the expected value of a sum is the sum of the expected values:
E(b₂|x) = E(β₂ + ∑wᵢeᵢ|x) = β₂ + ∑E(wᵢeᵢ|x) = β₂ + ∑wᵢE(eᵢ|x) = β₂
We have used the following assumptions to state that b₂ is an unbiased estimator, meaning that E(b₂|x) = β₂:
• E(wᵢeᵢ|x) = wᵢE(eᵢ|x), because conditional on x the terms wᵢ are not random, and constants can be factored out.
• E(eᵢ|x) = 0
We have shown that, conditional on x and under SR1–SR5, the least squares estimator is linear and unbiased, which is an important sampling property: the unbiasedness property is related to what happens in all possible samples of data from the same population. The fact that b₂ is unbiased does not imply anything about what might happen in just one sample.
• If assumptions SR1–SR5 hold, then the least squares estimators are conditionally unbiased. This means that E(b₁|x) = β₁ and E(b₂|x) = β₂.
• Given x we have expressions for the variances of b₁ and b₂ and their covariance. Furthermore, we have argued that for any unbiased estimator, having a smaller variance is better, as this implies we have a higher chance of obtaining an estimate close to the true parameter value.
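The sampling interpretation of unbiasedness can be illustrated with a short Monte Carlo sketch in R. The parameter values, sample size and error standard deviation below are arbitrary; x is generated once and held fixed across replications, i.e. we condition on x:

set.seed(1)
N <- 40; beta1 <- 100; beta2 <- 10
x <- runif(N, 5, 30)                            # fixed explanatory variable
b2 <- replicate(10000, {
  y <- beta1 + beta2 * x + rnorm(N, sd = 50)    # new random errors each replication
  coef(lm(y ~ x))[2]                            # slope estimate b2 in this sample
})
mean(b2)   # close to beta2 = 10: b2 is (conditionally) unbiased
sd(b2)     # sampling variability of b2 across repeated samples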
Now we will state and discuss the famous Gauss-Markov theorem: given x and under the assumptions SR1–SR5 of the linear regression model, the estimators b₁ and b₂ have the smallest variance of all linear and unbiased estimators of β₁ and β₂. They are the best linear unbiased estimators (BLUE) of β₁ and β₂.
1) "Best" means in comparison to other linear and unbiased estimators, not the best out of all possible estimators.
2) They are the best within their class because they have the minimum variance.
3) If assumptions SR1-SR5 are not true, then b₁ and b₂ are not BLUE.
4) Gauss-Markov does not depend on the assumption of normality (SR6).
5) In the simple linear regression model, if we want to use a linear and unbiased estimator, then we have to do no more searching: b₁ and b₂ are the ones to use.
6) Gauss-Markov applies to the least squares estimators, not to estimates from a single sample.
Of course, if we drop the requirements of unbiasedness and linearity we might find better estimators, but here we are limiting our search to estimators that are linear, unbiased and efficient (minimum variance). Remember that this theorem holds for the estimators, not for the estimates.
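The Gauss-Markov result can also be illustrated by simulation: compare the OLS slope with another linear unbiased estimator, for example the "two-point" estimator that uses only the smallest and largest x. This is a toy comparison under assumed parameter values, not part of the course material:

set.seed(2)
N <- 40; beta1 <- 100; beta2 <- 10
x <- sort(runif(N, 5, 30))
sims <- replicate(10000, {
  y <- beta1 + beta2 * x + rnorm(N, sd = 50)
  c(ols   = coef(lm(y ~ x))[2],
    twopt = (y[N] - y[1]) / (x[N] - x[1]))   # also linear and unbiased, but inefficient
})
rowMeans(sims)        # both close to beta2: both estimators are unbiased
apply(sims, 1, var)   # OLS has the smaller variance, as Gauss-Markov predicts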
The Probability Distributions of the LS Estimators
The properties of the least squares estimators that we have developed so far do not depend in any way on the normality assumption SR6. If we also make this assumption, that the random errors eᵢ are normally distributed with mean zero and variance σ², then the conditional probability distributions of the least squares estimators are also normal. In fact (under SR1-SR6):
• If eᵢ is normal, then so is yᵢ
• The least squares estimators are linear estimators of the form b₂ = ∑wᵢyᵢ. Given x, this weighted sum of normal random variables is also normally distributed.
For example, b₂|x ∼ N(β₂, σ² / ∑(xᵢ - x̄)²).
What if the errors are not normally distributed? Can we say anything about the probability distribution of the least squares estimators? The answer is, sometimes, yes. Central limit theorem: If assumptions SR1–SR5 still hold, and if the sample size N is sufficiently large, then the least squares estimators have distributions that approximate the normal distributions described above.
Estimating the variance of the error term
The variance of the random error term, σ², is the one unknown parameter of the simple linear regression model that remains to be estimated. The conditional variance of the random error eᵢ is
var(eᵢ|x) = E(eᵢ²|x) - [E(eᵢ|x)]² = E(eᵢ²|x) = σ²
if the assumption E(eᵢ|x) = 0 is correct. Since the "expectation" is an average value, we might consider estimating σ² as the average of the squared errors,
σ̂² = ∑eᵢ² / N
This formula is unfortunately of no use, since the random errors eᵢ are unobservable! However, although the random errors themselves are unknown, we do have an analog to them, namely, the least squares residuals. Recall that the random errors are eᵢ = yᵢ - β₁ - β₂xᵢ; the least squares residuals are obtained by replacing the unknown parameters by their least squares estimates:
êᵢ = yᵢ - ŷᵢ = yᵢ - b₁ - b₂xᵢ
It seems reasonable to replace the random errors eᵢ by their analogs, the least squares residuals, so that
σ̂² = ∑êᵢ² / N
This estimator, though quite satisfactory in large samples, is a biased estimator of σ². But there is a simple modification that produces an unbiased estimator:
σ̂² = ∑êᵢ² / (N - 2)
The 2 that is subtracted in the denominator is the number of regression parameters in the model, in this case β₁ and β₂; N - 2 is the degrees of freedom, and this correction makes the estimator σ̂² unbiased, so that E(σ̂²|x) = σ².
Estimating the Variances and Covariances of the Least Squares Estimators
Having an unbiased estimator of the error variance means we can estimate the conditional variances of the least squares estimators b₁ and b₂ and the covariance between them. Replace the unknown error variance σ² in the formulas with σ̂² to obtain:
var̂(b₁|x) = σ̂² [∑xᵢ² / (N ∑(xᵢ - x̄)²)]
var̂(b₂|x) = σ̂² / ∑(xᵢ - x̄)²
cov̂(b₁, b₂|x) = σ̂² [-x̄ / ∑(xᵢ - x̄)²]
The square roots of the estimated variances are the "standard errors" of b₁ and b₂. These quantities are used in hypothesis testing and confidence intervals. They are denoted as se(b₁) and se(b₂).
Interpreting the Standard Errors
The standard errors of b₁ and b₂ are measures of the sampling variability of the least squares estimates b₁ and b₂ in repeated samples. b₁ and b₂ are random variables, whose values change with the sample data. As such, they have probability distributions, means and variances. If SR6 holds, and the random error terms eᵢ are normally distributed, then
b₂|x ~ N(β₂, var(b₂|x) = σ² / ∑(xᵢ - x̄)²)
The square root of the estimator's variance, which we might call the true standard deviation of b₂, measures the sampling variation of the estimates b₂ and determines the width of the pdf (in repeated samples). The standard error of b₂ is thus an estimate of what the standard deviation of many estimates b₂ would be in a very large number of samples, and is an indicator of the width of the pdf of b₂.
Estimating Nonlinear Relationships
Economic variables are not always related by straight-line relationships; in fact, many economic relationships are represented by curved lines and are said to display curvilinear forms. Fortunately, the simple linear regression model y = β₁ + β₂x + e is much more flexible than it looks at first glance, because the variables y and x can be transformations of other variables. Example: consider a model from real estate economics in which the price (PRICE) of a house is related to the house size measured in square feet (SQFT). As a starting point, we might consider the linear relationship
PRICE = β₁ + β₂SQFT + e
In this model, β₂ measures the increase in expected price given an additional square foot of living area, implying a constant marginal effect. However, it may be reasonable to assume that larger and more expensive homes have a higher value for an additional square foot of living area than smaller, less expensive homes. We will illustrate the use of two approaches to build an appropriate model: first, a quadratic equation in which the explanatory variable is SQFT²; and second, a log-linear equation in which the dependent variable is ln(PRICE). In each case, we will find that the slope of the relationship between PRICE and SQFT is not constant, but changes from point to point.
Quadratic Model
The quadratic model uses the squared explanatory variable, y = a + bx². The shape of the curve is determined by b: if b > 0 the curve is U-shaped, if b < 0 the curve has an inverted-U shape. The slope of the function is given by the derivative dy/dx = 2bx, which changes as x changes. The elasticity, or the percentage change in y given a 1% change in x, is ε = slope · x/y = 2bx²/y.
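A quadratic specification of the house-price model can be estimated in R by squaring the regressor inside the formula. A sketch assuming a data frame br with columns price and sqft, in the spirit of the textbook's house-price data; the object and column names, and the evaluation point, are assumptions:

fit_q <- lm(price ~ I(sqft^2), data = br)    # PRICE = a + b*SQFT^2 + e
summary(fit_q)                               # estimate of b with its standard error
b      <- coef(fit_q)[2]
sqft0  <- 2000                               # an arbitrary house size
price0 <- predict(fit_q, newdata = data.frame(sqft = sqft0))
slope0 <- 2 * b * sqft0                      # marginal effect dy/dx = 2*b*x at sqft0
elas0  <- slope0 * sqft0 / price0            # elasticity = 2*b*x^2 / y at sqft0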
Is this explanatory variable random or not and why does it matter? We address these questions in this section. Laboratory and other controlled experiments can claim that the values of the independent variable are fixed, not random. We are now going to treat cases in which x-values are random. Random and Independent x A scientist can use a random number between 0 and 100 to determine the amount of x. This can be a useful approach to avoid claims that the experiment is rigged to produce a particular outcome. What are the sampling properties of the least squares estimator in this setting? Is the least squares estimator the best, linear unbiased estimator in this case? In order to answer these questions, we make explicit that x is statistically independent of the error term e. The assumptions for the independent random-x model (IRX) are as follows: Compare the assumptions IRX2, IRX3, and IRX4 with the initial assumptions about the simple regression model, SR2, SR3, and SR4. You will note that conditioning on x has disappeared. The reason is because when x-values and random errors e are statistically independent E(eᵢ |xj) = E(eᵢ) = 0, var(eᵢ |xj) = var (eᵢ) = σ² and cov(eᵢ, ej|x) = cov (eᵢ, ej) = 0.
 Conditioning has no effect on expected value and variance of statistical independent random variables. Also, it is extremely important to recognize that “i” and “j” simply represent different data observations that may be cross-sectional data or time-series data. What we say applies to both types of data. The least squares estimators b₁ and b₂ are the best linear unbiased estimators of β₁ and β₂ if assumptions IRX1–IRX5 hold. The one apparent change is that an “expected value” appears in the formulas for the estimator variances. For example, We must take the expected value of the term involving x. In practice, this actually changes nothing, because we estimate the variance in the usual way. The estimator of the error variance remains σ̂² = ∑ êᵢ²/(N-2) and the interpretations remain the same. Thus, the computational aspects of least squares regression do not change. What has changed is our understanding of the DGP (data generating process). Furthermore, if IRX6 holds then, conditional on x, the least squares estimators have normal distributions. 
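Whatever the story about how x was generated (fixed, independent, or strictly exogenous), the computations are the same. In R, the estimated error variance and the standard errors can be recovered from a fitted lm object; a sketch continuing the hypothetical food expenditure fit from the earlier example:

s <- summary(fit)        # 'fit' is the lm object estimated earlier
s$sigma                  # sigma-hat: square root of sum(residuals^2) / (N - 2)
s$sigma^2                # sigma-hat squared, the estimate of the error variance
vcov(fit)                # estimated variance-covariance matrix of b1 and b2
sqrt(diag(vcov(fit)))    # standard errors se(b1) and se(b2)
sum(residuals(fit)^2) / df.residual(fit)   # the same sigma-hat squared computed by hand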
 Random and Strictly Exogenous x Statistical independence between xᵢ and ej, is a very strong assumption and most likely only suitable in experimental situations. A weaker assumption is that the explanatory variable x is strictly exogenous, which refers to a statistical assumption. In regression analysis models, the independent, explanatory variable x is also termed an exogenous variable because its variation affects the outcome variable y, but there is no reverse causality; changes in y have no effect on x. The independent variable x is strictly exogenous if E(eᵢ|xj) = 0 for all values of i and j. This is exactly assumption SR2. The implications of strict exogeneity Strict exogeneity implies quite a bit. If x is strictly exogenous, then the least squares estimator works the way we want it to and no fancier or more difficult estimators are required. If, on the other hand, strict exogeneity does not hold, then econometric analysis becomes more complicated, which, unfortunately, is often the case. There exist some statistical tests that can be used to check for strict exogeneity. The common practice is to check that the implications of strict exogeneity are true. If these implications don’t seem to be true, either based on economic logic or statistical tests, then we will conclude that strict exogeneity does not hold and deal with the consequences. The two direct implications of strict exogeneity are: 1) E(eᵢ)=0, the average of all factors omitted from the regression model is zero. 2) cov(xᵢ, eᵢ)=0, there is no correlation between the omitted factors associated with observation j and the value of the explanatory variable for observation i. If either of these implications is not true, then x is not strictly exogenous. Random Sampling The food expenditure example we have carried through this chapter is another case in which the DGP leads to an x that is random. We randomly sampled a population and selected 40 households. These are cross-sectional data observations. For each household, we recorded their food expenditure (yᵢ) and income (xᵢ). Because both of these variables’ values are unknown to us until they are observed, both the outcome variable y and the explanatory variable x are random. The same questions are relevant. What are the sampling properties of the least squares estimator in this case? Is the least squares estimator the best, linear unbiased estimator? Such survey data is collected by random sampling from a population. The idea is to collect data pairs (yᵢ, xᵢ) in such a way that the ith pair is statistically independent of the jth pair. This ensures that xj is statistically independent of eᵢ if i≠j. If the conditional expectation E(eᵢ|xᵢ)=0, then x is strictly exogenous, and the implications are E(eᵢ)=0 and cov(xᵢ,eᵢ)=0. Note also that if we assume that the data pairs are independent, then we no longer need make the separate assumption that the errors are uncorrelated.
 What are the properties of the least squares estimator under these assumptions? They are the same as in the cases of statistical independence between all xj and eᵢ and strict exogeneity in the general sense. The least squares estimators are the best linear unbiased estimators of the regression parameters and conditional on x they have a normal distribution if SR6 (or IRX6) holds.
 In this case, the data pairs are independent and identically distributed, iid. In statistics, the phrase random sample implies that the data are iid. When discussing examples of the implications of strict exogeneity, we showed how the strict exogeneity assumption can be violated when using time-series data if there is correlation between e and a future or past value x (t ≠ s). For an example of how strict exogeneity fails with random sampling of cross-sectional data, we need one where eᵢ is correlated with a value xᵢ corresponding to the same ith observation. Interval Estimation and Hypothesis Testing Interval Estimation The example’s estimate b₂ = 10.21 is a point estimate of the unknown population parameter β₂ in the regression model. Interval estimation proposes a range of values in which the true parameter β₂ is likely to fall, and the precision with which we have estimated it. Such intervals are often called confidence intervals or interval estimates. As we will see, our confidence is in the procedure we use to obtain the intervals, not in the intervals themselves. The t-Distribution Let us assume that assumptions SR1–SR6 hold for the SLR model. In this case, we know that given x, the least squares estimators b₁ and b₂ have normal distributions. For example, the normal distribution of b₂, the least squares estimator of β₂ is (the tilde means “exact”) A standardized normal random variable is obtained from b₂ by subtracting its mean and dividing by its standard deviation: 
 The Alternative Hypothesis The alternative hypothesis H₁ is accepted if the null hypothesis is rejected. For the null hypothesis H0∶βk = c, the three possible alternative hypotheses are as follows: • H1∶βk > c • H1∶βk < c • H1∶βk ≠ c. The Test Statistic The sample information about the null hypothesis is embodied in the sample value of a test statistic. Based on the value of a test statistic, we decide either to reject the null hypothesis or not to reject it. A test statistic has a special characteristic: its probability distribution is completely known when the null hypothesis is true, and it has some other distribution if the null hypothesis is not true. If the null hypothesis H0∶βk = c is true, then we can substitute c for βk and it follows that If the null hypothesis is not true, then the t-statistic does not have a t-distribution with N − 2 degrees of freedom, but some other. However, we will never know if H0 is true or false. The Rejection Region The rejection region depends on the form of the alternative. It is the range of values of the test statistic that leads to rejection of the null hypothesis. It is possible to construct a rejection region only if we have • A test statistic whose distribution is known when the null hypothesis is true • An alternative hypothesis • A level of significance The rejection region consists of values that are unlikely and that have low probability of occurring when the null hypothesis is true. The chain of logic is “if a value of the test statistic is obtained that falls in a region of low probability, then it is unlikely that the test statistic has the assumed distribution, and thus, it is unlikely that the null hypothesis is true.” If the alternative hypothesis is true, then values of the test statistic will tend to be unusually large or unusually small. The terms “large” and “small” are determined by choosing a probability α, called the level of significance of the test, which provides a meaning for “an unlikely event.” The level of significance of the test α is usually chosen to be 0.01, 0.05, or 0.10. If we reject the null hypothesis when it is true, then we commit what is called a Type I error. The level of significance of a test is the probability of committing Type I error, so P(Type I err) = α. If we do not reject a null hypothesis that is false, then we have committed a Type II error. A Conclusion When you have completed testing a hypothesis, you should state your conclusion. Do you reject the null hypothesis, or do you not reject the null hypothesis? As we will argue below, you should avoid saying that you “accept” the null hypothesis, which can be very misleading. Moreover, we urge you to make it standard practice to say what the conclusion means in the economic context of the problem you are working on and the economic significance of the finding. Rejection Regions for Specific Alternatives The level of significance of a test, α, is the probability that we reject the null hypothesis when it is actually true, which is called a Type I error. One-Tail Tests with Alternative ‘‘Greater Than’’ (>) When testing the null hypothesis H0∶βk = c against the alternative hypothesis H1∶βk > c, reject the null hypothesis and accept the alternative hypothesis if t ≥ t(1−α, N−2). If the null hypothesis is true, then the test statistic (3.7) has a t-distribution, and its value would tend to fall in the center of the distribution, to the left of the critical value, where most of the probability is contained. 
One-Tail Tests with Alternative “Less Than” (<) When testing the null hypothesis H0∶βk = c against the alternative hypothesis H1∶βk < c, reject the null hypothesis and accept the alternative hypothesis if t ≤ t(α, N−2). The rejection region for a one-tail test is in the direction of the arrow in the alternative. If the alternative is >, then reject in the right tail. If the alternative is <, reject in the left tail. Two-Tail Tests with Alternative ‘‘Not Equal To’’ (≠) When testing the null hypothesis H0∶βk = c against the alternative hypothesis H1∶βk ≠ c, reject the null hypothesis and accept the alternative hypothesis if t ≤ t(α/2, N−2) or if t ≥ t(1−α/2, N−2). When the null hypothesis is true, the probability of obtaining a value of the test statistic that falls in either tail area is “small.” Examples of hypothesis tests A 5-step procedure for a hypothesis test: 1. Determine H0 and H1 2. Specify the test statistic and its distribution under H0 3. Select α and determine the RR 4. Compute the sample value of the test statistic 5. State a conclusion 1. In the food data, H₀: β₂ = 0, H₁: β₂ >0 2. t= b₂/se(b₂) ∼ tN-2 if H₀ is true 3. α = 0.05 → tc = t(1−α,N−2) = t(0.95,38) = 1.686. RR: t ≥ tc 4. In the sample, t = 4.88 5. We reject H₀ and accept H₁ To open a new supermarket in some neighbourhood, the owner wants to be sure that there is strong evidence supporting β₂ > 5.5 1. H₀ : β₂ ≤ 5.5, H₁ : β₂ > 5.5 2. t=(b₂−5.5)/se(b₂) ∼ t(N-2) if H₀ is true 3. α=0.01 → tc =t(0.99,38) =2.429. RR:t≥tc 4. In the sample, t = (10.21 − 5.5)/2.09 = 2.25 5. We can't reject H0. The evidence supporting β2 > 5.5 is not very strong Left tail test 1. H₀ : β₂ ≥ 15, H₁ : β₂ < 15 2. t = (b₂ − 15)/se(b₂) ∼ t(N−2) under H0 3. α = 0.05 → tc = t(α,N−2) = t(0.05,38) = −1.686 RR: t ≤ tc 4. In the sample: t = (10.21 − 15)/2.09 = −2.29 5. We reject H0 (for α = 5%) and accept H1 On average, every household spends 7.5$ in food per additional 100$ income 1. H0 :β2 =7.5, H1 : β2 ≠7.5 2. t = (b₂ − 7.5)/se(b₂) ∼ t(N−2) if H₀ is true 3. α = 0.05 → t(α/2,N−2) = −2.024 and t(0.975,N−2) = 2.024
 RR: t ≤ −2.024 or t ≥ 2.024 4. In the sample, t = (10.21 − 7.5)/2.09 = 1.29 5. We do not reject H₀ Is the relationship between consumption and income statistically significant? 1. H₀: β₂ = 0, H₁: β₂ ≠ 0 2. t = b₂/se(b₂) ∼ t(N−2) if H₀ is true 3. α = 0.05 → t(α/2,N−2) = −2.024 and t(0.975,N−2) = 2.024
RR: t ≤ −2.024 or t ≥ 2.024
4. In the sample, t = 10.21/2.09 = 4.88
5. We reject H₀.
This test is computed automatically by R (and by any other software).
The p-Value
When reporting the outcome of statistical hypothesis tests, it has become standard practice to report the p-value (an abbreviation for probability value) of the test. If we have the p-value of a test, p, we can determine the outcome of the test by comparing the p-value to the chosen level of significance, α, without looking up or calculating the critical values. The p-value is the smallest significance level for which H₀ is rejected. The p-value rule is: reject the null hypothesis when the p-value is less than, or equal to, the level of significance α. That is,
- if p ≤ α, then reject H₀
- if p > α, then do not reject H₀.
Prediction
The task of predicting y₀ is related to the problem of estimating E(y₀|x₀) = β₁ + β₂x₀. The outcome y₀ = E(y₀|x₀) + e₀ = β₁ + β₂x₀ + e₀ is composed of two parts: the systematic, nonrandom part E(y₀|x₀) = β₁ + β₂x₀, and a random component e₀. We estimate the systematic portion using Ê(y₀|x₀) = b₁ + b₂x₀ and add an "estimate" of e₀ equal to its expected value, which is zero. Therefore, the prediction ŷ₀ is given by ŷ₀ = Ê(y₀|x₀) + 0 = b₁ + b₂x₀. We distinguish between them because, although E(y₀|x₀) = β₁ + β₂x₀ is not random, the outcome y₀ is random, as we are adding the error term, which is random. Consequently, we will see that there is a difference between the interval estimate of E(y₀|x₀) = β₁ + β₂x₀ and the prediction interval for y₀. Following from the discussion in the previous paragraph, the least squares predictor of y₀ comes from the fitted regression line, ŷ₀ = b₁ + b₂x₀. That is, the predicted value ŷ₀ is given by the point on the least squares fitted line where x = x₀. How good is this prediction procedure? The least squares estimators b₁ and b₂ are random variables; their values vary from one sample to another. It follows that the least squares predictor ŷ₀ = b₁ + b₂x₀ must also be random. To evaluate how well this predictor performs, we define the forecast error, which is analogous to the least squares residual:
f = y₀ − ŷ₀ = (β₁ + β₂x₀ + e₀) − (b₁ + b₂x₀)
We would like the forecast error to be small, implying that our forecast is close to the value we are predicting. Taking the conditional expected value of f, we find
E(f|x) = (β₁ + β₂x₀) + E(e₀|x) − [E(b₁|x) + E(b₂|x)x₀] = (β₁ + β₂x₀) − (β₁ + β₂x₀) = 0
which means that, on average, the forecast error is zero, and ŷ₀ is an unbiased predictor of y₀. However, unbiasedness does not necessarily imply that a particular forecast will be close to the actual value. The probability of a small forecast error also depends on the variance of the forecast error. Although we will not prove it, ŷ₀ is the best linear unbiased predictor (BLUP) of y₀ if assumptions SR1–SR5 hold. This result is reasonable given that the least squares estimators b₁ and b₂ are best linear unbiased estimators. Using what we know about the variances and covariances of the least squares estimators, we can show (see Appendix 4A) that the variance of the forecast error is
var(f|x) = σ² [1 + 1/N + (x₀ − x̄)² / ∑(xᵢ − x̄)²]
Notice that some of the elements of this expression appear in the formulas for the variances of the least squares estimators and affect the precision of prediction in the same way that they affect the precision of estimation. We would prefer that the variance of the forecast error be small, which would increase the probability that the prediction ŷ₀ is close to the value y₀ we are trying to predict.
Note that the variance of the forecast error is smaller when
i. the overall uncertainty in the model is smaller, as measured by the variance of the random errors σ²
ii. the sample size N is larger
iii. the variation in the explanatory variable is larger
iv. the value of (x₀ − x̄)² is small
The new addition is the term (x₀ − x̄)², which measures how far x₀ is from the center of the x-values. The more distant x₀ is from the centre of the sample data, the larger the forecast variance will become. Intuitively, this means that we are able to do a better job predicting in the region where we have more sample information, and we will have less accurate predictions when we try to predict outside the limits of our data. In practice we replace σ² by its estimator σ̂² to obtain
var̂(f) = σ̂² [1 + 1/N + (x₀ − x̄)² / ∑(xᵢ − x̄)²]
The square root of this estimated variance is the standard error of the forecast,
se(f) = √var̂(f)
Defining the critical value tc to be the 100(1 − α/2)-percentile from the t-distribution, we can obtain a 100(1 − α)% prediction interval as ŷ₀ ± tc·se(f). Our predictions for values of x₀ close to the sample mean x̄ are more reliable than our predictions for values of x₀ far from the sample mean x̄. A point prediction is given by the fitted least squares line, ŷ₀ = b₁ + b₂x₀. The prediction interval takes the form of two bands around the fitted least squares line. Because the forecast variance increases the farther x₀ is from the sample mean x̄, the bands are at their narrowest when x₀ = x̄, and they increase in width as |x₀ − x̄| increases. In the usual plot, the forecast intervals are the dotted bands around the fitted line (the smaller the variance, the higher the accuracy), and the straight line in the middle is the point forecast / point prediction.
Prediction in the food expenditure model
• For x₀ = $2,000 of weekly income (x₀ = 20, since income is measured in units of $100) and α = 0.05:
- ŷ₀ = b₁ + b₂x₀ = 83.4160 + 10.2096(20) = 287.6089
- var̂(f) = 8214.30
- se(f) = 90.6328
- ŷ₀ ± tc·se(f) = [104.1323, 471.0854]
• The forecast interval is large because the model residual variance is large: the model explains a low fraction of the observed FOOD_EXP variability.
PREDICTION VS FORECAST
• Prediction: estimating the conditional expectation of the dependent variable given a level of x, i.e. E(y|x₀). It is an estimation problem, so the OLS estimator is optimal according to Gauss-Markov.
• Forecast: approximating a random variable, the actual value of the dependent variable given some value of the explanatory variable. Not the expected value of y, but just y. It is not an estimation problem, as we are forecasting a random variable.
• In terms of the point value, the forecast and the prediction are the SAME. The difference arises when we look at the noise: the forecast error variance contains the additional term σ², coming from the error e₀.
Measuring Goodness-of-Fit
Two major reasons for analyzing the model
 yᵢ = β₁ + β₂xᵢ + eᵢ are to explain how the dependent variable (yᵢ) changes as the independent variable (xᵢ) changes and to predict y0 given an x0. These two objectives come under the broad headings of estimation and prediction. To develop a measure of the variation in yᵢ that is explained by the model, we begin by separating yᵢ into its explainable “systematic” and unexplainable random components: yᵢ =E(yᵢ|x)+eᵢ While both of these parts are unobservable to us, we can estimate the unknown parameters β₁ and β₂ and, analogous to what we just did, decompose the value of yᵢ into yᵢ = ŷᵢ + êᵢ where ŷᵢ =b₁ +b₂xᵢ and êᵢ = yᵢ − ŷᵢ. Subtract the sample mean ȳ from both sides of the equation to obtain yᵢ − ȳ = (ŷᵢ − ȳ)+êᵢ 4.10 The difference between yᵢ and its mean value ȳ consists of a part that is “explained” by the regression model ŷᵢ - ȳ and a part that is unexplained êᵢ. This breakdown leads to a decomposition of the total sample variability in y into explained and unexplained parts. Recall from your statistics courses that if we have a sample of observations y₁, y₂, ..., yN, two descriptive measures are the sample mean ȳ and the sample variance The numerator of this quantity, the sum of squared differences between the sample values yᵢ and the sample mean ȳ, is a measure of the total variation in the sample values. If we square and sum both sides of (4.10) and use the fact that the cross-product term ∑(ŷᵢ − ȳ)êᵢ = 0, we obtain ∑(yᵢ − ȳ)² = ∑(ŷi − ȳ)² + ∑ êᵢ² which gives us a decomposition of the “total sample variation” in y into explained and unexplained components. Specifically, these “sums of squares” are as follows: 1. ∑(yᵢ − ȳ)² = total sum of squares = SST: a measure of total variation in y about the sample mean. It is the observed deviance in the sample. 2. ∑(ŷᵢ − ȳ)² = sum of squares due to the regression = SSR: that part of total variation in y, about the sample mean, that is explained by, or due to, the regression. Also known as the “explained sum of squares.” Model deviance. 3. ∑êᵢ² = sum of squares due to error = SSE: that part of total variation in y about its mean that is not explained by the regression. Also known as the unexplained sum of squares, the residual sum of squares, or the sum of squared errors. this process, the least squares residuals will also be scaled. This will affect the standard errors of the regression coefficients, but it will not affect t-statistics or R². 3. If the scale of y and the scale of x are changed by the same factor, then there will be no change in the reported regression results for b₂, but the estimated intercept and residuals will change; t-statistics and R² are unaffected. The interpretation of the parameters is made relative to the new units of measurement. Choosing a Functional Form Although the world is not “linear,” a straight line is a good approximation to many nonlinear or curved relationships over narrow ranges. What does economics really say about the relation between food expenditure and income, holding all else constant? We expect there to be a positive relationship between these variables because food is a normal good: as income rises, we expect food expenditures to rise, but we expect such expenditures to increase at a decreasing rate. For a curvilinear relationship like this, the marginal effect of a change in the explanatory variable is measured by the slope of the tangent to the curve at a particular point. The simple linear regression model is much more flexᵢble than it appears at first glance. 
By transforming the variables y and x we can represent many curved, nonlinear relationships and still use the linear regression model. We have already introduced the quadratic and log-linear functional forms; now we introduce an array of other possibilities. The variable transformations that we begin with are as follows:
1. Power: if x is a variable, then xᵖ means raising the variable to the power p; examples are quadratic (x²) and cubic (x³) transformations.
2. The natural logarithm: if x is a variable, then its natural logarithm is ln(x).
Some Useful Functions, Their Derivatives, Elasticities, and Other Interpretation: (remember log functions can only be applied to positive variables)
A Linear-Log Food Expenditure Model
A linear-log equation has a linear, untransformed term on the left-hand side and a logarithmic term on the right-hand side, or y = β₁ + β₂ln(x). Because of the logarithm, this function requires x > 0. It is an increasing or decreasing function, depending on the sign of β₂. There is a convenient interpretation using approximations to changes in logarithms. Consider a small increase in x from x₀ to x₁. Then y₀ = β₁ + β₂ln(x₀) and y₁ = β₁ + β₂ln(x₁). Subtracting the former from the latter, and using the approximation 100[ln(x₁) − ln(x₀)] ≈ %Δx, gives
Δy = y₁ − y₀ = β₂[ln(x₁) − ln(x₀)] ≈ (β₂/100) %Δx
The change in y, represented in its units of measure, is approximately β₂/100 times the percentage change in x. (Semi-elasticity)
Remark
Given alternative models with different transformations of both variables, some of which have similar shapes, what are the guidelines to choose a functional form?
1. Choose a shape that is consistent with what economic theory tells us about the relationship.
2. Choose a shape that is sufficiently flexible to "fit" the data.
3. Choose a shape so that assumptions SR1-SR6 are satisfied, ensuring that the least squares estimators have the desirable properties described.
Using Diagnostic Residual Plots
When specifying a regression model, we may inadvertently choose an inadequate functional form (one not coherent with economic theory or that does not fit the data) or one where SR1-SR6 do not hold. We should ask whether there is any evidence that assumptions SR3 (homoskedasticity), SR4 (no serial correlation), and SR6 (normality) are violated. If there are no violations of the assumptions, then a plot of the least squares residuals versus x, y, or the fitted value of y, ŷ, should reveal no patterns.
- Figures 4.7(b)-(d) show patterns associated with heteroskedasticity. Figure b has a "spray-shaped" residual pattern that is consistent with the variance of the error term increasing as x-values increase; Figure c has a "funnel-shaped" residual pattern that is consistent with the variance of the error term decreasing as x-values increase; and Figure d has a "bow-tie" residual pattern that is consistent with the variance of the error term decreasing and then increasing as x-values increase.
- Figure e shows a typical pattern produced with time-series regression when the error terms display a positive correlation, corr(et, et−1) > 0 (violation of SR4). There are sequences of positive residuals followed by sequences of negative residuals.
- Figure f shows a typical pattern produced with time-series regression when the error terms display a negative correlation, corr(et, et−1) < 0. In this case, each positive residual tends to be followed by a negative residual, which is then followed by a positive residual and so on.
Heteroskedasticity: the variance of the errors is not constant.
Are the errors normally distributed? The Jarque-Bera test combines the skewness and kurtosis of the least squares residuals into a single statistic that, under the null hypothesis of normal errors, has (in large samples) a chi-square distribution with 2 degrees of freedom. For the linear-log model of food expenditure reported in Example 4.4, the Jarque-Bera test statistic value is 0.1999 with a p-value of 0.9049. We cannot reject the null hypothesis that the regression errors are normally distributed, and this criterion does not help us choose between the linear and linear-log functional forms for the food expenditure model. In these examples, we should remember that the Jarque-Bera test is strictly valid only in large samples.
- If H₀ is true, JB is small (and the p-value is large)
- If H₀ is false, JB is large (and the p-value is small)
Polynomial Models
Economics students will have seen many average and marginal cost curves (U-shaped) and average and marginal product curves (inverted-U shaped) in their studies. Higher order polynomials, such as cubic equations, are used for total cost and total product curves.
Quadratic and Cubic Equations
The general form of a quadratic equation y = a₀ + a₁x + a₂x² includes a constant term a₀, a linear term a₁x, and a squared term a₂x². Similarly, the general form of a cubic equation is y = a₀ + a₁x + a₂x² + a₃x³. We consider the simple quadratic and cubic forms, y = β₁ + β₂x² and y = β₁ + β₂x³. We have already discussed the properties of the simple quadratic function. Using derivative rules, the derivative, or slope, of the cubic equation is dy/dx = 3β₂x². The slope of the curve is always positive if β₂ > 0, except when x = 0, yielding a direct relationship between y and x. If β₂ < 0, then the relationship is an inverse one. The slope equation shows that the slope is zero only when x = 0. The term β₁ is the y-intercept. The elasticity of y with respect to x is ε = slope × x/y = 3β₂x² × x/y. Both the slope and elasticity change along the curve.
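A small R sketch of the diagnostics just described, again assuming the food data frame with food_exp and income (assumed names). The residual plot is used to spot heteroskedasticity or serial correlation patterns; for the Jarque-Bera test one option is jarque.bera.test() from the tseries package.

mod <- lm(food_exp ~ income, data = food)
plot(food$income, resid(mod), xlab = "income", ylab = "residuals")  # look for spray/funnel/bow-tie patterns
abline(h = 0)
library(tseries)                    # provides jarque.bera.test()
jarque.bera.test(resid(mod))        # large p-value: do not reject normality of the errors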
A wage equation
• r: return to one additional year of education; we want to know what r is
• If EDUC is the number of years of education,
WAGE = (1 + r)^EDUC × WAGE₀
log(WAGE) = log(WAGE₀) + log(1 + r) EDUC = β₁ + β₂EDUC
Given that log(1 + r) ≈ r, β₂ measures the return to education
• In the cps4_small dataset, ^log(WAGE) = 1.6094 + 0.0904 × EDUC
(s.e.) (0.0864) (0.0061)
• r̂ ≈ 9%
• 95% interval estimate of r: [7.8%, 10.2%]
• How can we measure the goodness-of-fit of this model? The standard R² is unlikely to be helpful: it uses the variability of log(WAGE), not of WAGE.
• How can we predict WAGE, instead of log(WAGE)? How can we measure goodness of fit in terms of y instead of log(y)?
Log-Linear Models: NEVER compare the R² of different models when they are not transformed into something comparable
1) log(y) = β₁ + β₂x + e, R² = 0.2 in terms of log(y)
2) y = β₁ + β₂x + e, R² = 0.1 in terms of y
Prediction in a Log-Linear Model
• "Natural" predictor of y: ŷn = exp(b₁ + b₂x)
• "Alternative", or "corrected", predictor (preferable): ŷc = exp(b₁ + b₂x + σ̂²/2) = ŷn e^(σ̂²/2)
• For EDUC = 12, ŷn = 14.7958 and ŷc = 16.9964
How does the correction affect our prediction? Recall that σ̂² must be greater than zero and e⁰ = 1. Thus, the effect of the correction is always to increase the value of the prediction because e^(σ̂²/2) is always greater than one. The natural predictor tends to systematically under-predict the value of y in a log-linear model, and the correction offsets the downward bias in large samples.
A Generalized R² Measure
• In the linear model, R² can be computed either from the variance decomposition or as the squared sample correlation between y and ŷ
• In a nonlinear model these definitions are not equivalent
• Usually we use the generalized R²: R²g = [corr(y, ŷn)]²
• In the wage data, R² = 0.1782 and R²g = 0.1859
Prediction intervals in the loglinear model
• To compute the interval forecast of y₀:
1. Compute ^log(y₀) = b₁ + b₂x₀
 2. Compute the standard error of the forecast error at x₀, se(f₀) 3. Compute the critical value tc corresponding to some confidence level α 
 4. The 95% interval forecast of log(y₀) is [^log(y₀) − tc se(f₀) , ^log(y₀) + tc se(f₀) ] 5. The 95% interval forecast of y₀ is computed by exponentiation: 
[exp(^log(y₀) − tc se(f₀)), exp(^log(y₀) + tc se(f₀))], i.e. exponentiate the bounds!
• Note: The confidence level is preserved because the exp function is monotonically increasing.
Log-log models
ln(y) = β₁ + β₂ln(x)
Prediction in a Log-log Model
The corrected predictor Q̂c is the natural predictor Q̂n adjusted by the factor e^(σ̂²/2). Therefore Q̂c = Q̂n e^(σ̂²/2) = exp(^ln(Q)) e^(σ̂²/2)
The multiple linear regression model: assumptions
MR1. Econometric Model (correct specification): yᵢ = β₁ + β₂xᵢ₂ + β₃xᵢ₃ + … + βKxᵢK + eᵢ, i = 1, 2, …, N
MR2. Strict Exogeneity
Conditionally on X = (xᵢ, i = 1,2,…,N) the errors have mean zero: E(eᵢ|X) = 0. Strict exogeneity implies:
- E(yᵢ|X) = β₁ + β₂xᵢ₂ + β₃xᵢ₃ + … + βKxᵢK
- E(eᵢ) = 0
- Cov(eᵢ, xjk) = 0 for k = 1,2,…,K and (i, j) = 1,2,…,N
(We have used the Law of Iterated Expectations (LIE): for random variables y and x, E(y) = E[E(y|x)]. The Law of Iterated Expectations states that the expected value of a random variable equals the expected value of its conditional expectation given a second random variable.)
MR3. Conditional Homoskedasticity
Var(eᵢ|X) = σ², i = 1,2,…,N. If the error eᵢ is conditionally homoskedastic, the dependent variable yᵢ must be, too: Var(yᵢ|X) = σ²; the variances do not depend on X.
MR4. Conditionally uncorrelated errors
Cov(yᵢ, yj|X) = Cov(eᵢ, ej|X) = 0 if i ≠ j. With cross-sectional data, this assumption implies that there is no spatial correlation between the errors. With time-series data, that there is no correlation over time.
MR5. No Exact Linear Relationship Exists Between the Explanatory Variables
It is not possible to express one of the explanatory variables as an exact linear function of the others: the only values of c₁, c₂, …, cK for which c₁xᵢ₁ + c₂xᵢ₂ + ··· + cKxᵢK = 0 for all observations i = 1,2,...,N are the values c₁ = c₂ = … = cK = 0. The x's are not exactly collinear. If one or more of the cK's can be nonzero, the assumption is violated, and it is not possible to separately estimate the effects of changes in each of the involved variables. Also, explanatory variables must vary, otherwise we cannot estimate the model. In other words, no regressor can be written as an exact linear function of the others, since it would add no new information. Necessary condition for this to hold: N > K (number of observations larger than the number of parameters).
MR6. Conditionally Normal Errors (optional)
eᵢ|X ∼ N(0, σ²), i = 1,2,…,N. This assumption implies that the conditional distribution of y is also normally distributed, yᵢ|X ∼ N(E(yᵢ|X), σ²). We call it optional for two reasons. First, it is not necessary for many of the good properties of the least squares estimator to hold. Second, as we will see, if samples are relatively large, it is no longer a necessary assumption for hypothesis testing and interval estimation.
Estimating the parameters of the multiple regression model
LS Estimation procedure
• To find an estimator for estimating the unknown parameters we follow the least squares procedure: we find those values of (β₁, β₂, β₃) that minimize the sum of squared differences between the observed values of yᵢ and their expected values E(yᵢ|X) = β₁ + xᵢ₂β₂ + xᵢ₃β₃. It is a straightforward calculus exercise. The solutions give us formulas for the least squares estimators of the β coefficients. You are not supposed to memorise these formulas, as computer software calculates them.
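A minimal sketch of estimating a multiple regression in R, assuming a data frame named andy with columns sales, price and advert (the Burger Barn example used later in these notes; the names are assumptions).

mod <- lm(sales ~ price + advert, data = andy)
summary(mod)     # b1, b2, b3 with standard errors, t statistics and R^2
sigma(mod)^2     # estimate of the error variance, SSE/(N - K)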
• The formulas for b₁, b₂, and b₃, obtained by minimising the sum of squared errors S(β₁, β₂, β₃) = ∑(yᵢ − β₁ − β₂xᵢ₂ − β₃xᵢ₃)², are estimation procedures, called the least squares estimators of the unknown parameters. Since their values are not known until the data are observed and the estimates calculated, the least squares estimators are random variables. The solution is b = (b₁, b₂, …, bK)′
• Remember that predictions are reliable only for "reasonable" values of the x.
Extrapolation is always risky!
Estimation of the error variance σ²
• For this parameter, we follow the same steps that were outlined in Section 2. Under MR1, MR2, MR3 we know that σ² = Var(eᵢ|X) = Var(eᵢ) = E(eᵢ²|X) = E(eᵢ²). Thus we can think of it as the population mean of the squared errors eᵢ². A natural estimator for this population mean would be σ̂² = ∑eᵢ²/N, but given that the errors are unobservable we replace them with the least squares residuals and divide by N − K instead of N, which makes the estimator unbiased: σ̂² = ∑êᵢ²/(N − K).
Measuring the Goodness-of-Fit
The proportion of variation in the dependent variable explained by all the explanatory variables included in the model. The coefficient of determination is R² = SSR/SST = 1 − SSE/SST.
• The value of R² is also equal to the squared sample correlation coefficient between ŷᵢ and yᵢ. R² is the fraction of observed sample variance explained by all regressors.
• The interpretation of R² relies on the sample mean of the fitted values ŷᵢ being equal to ȳ, which is true if the model includes an intercept (β₁).
Finite sample properties of the LS estimator
• What we can say about the sampling distribution of the least squares estimator depends on what assumptions can realistically be made for the sample of data being used for estimation. For the simple regression model introduced in Chapter 2 we saw that, under the assumptions SR1 to SR5, the OLS estimator is best linear unbiased in the sense that there is no other linear unbiased estimator that has a lower variance.
• The same result holds for the general multiple regression model under assumptions MR1-MR5. The Gauss-Markov Theorem: If assumptions MR1-MR5 hold, the least squares estimators are the Best Linear Unbiased Estimators (BLUE) of the parameters in the multiple regression model.
• The implications of adding assumption MR6, that the errors are normally distributed, are also similar to those from the corresponding assumption made for the simple regression model. Conditional on X, the least squares estimator is normally distributed. Using this result, and the error variance estimator σ̂², a t-statistic that follows a t-distribution can be constructed and used for interval estimation and hypothesis testing.
• These various properties, BLUE and the use of the t-distribution for interval estimation and hypothesis testing, are finite sample properties: as long as N > K, they hold irrespective of the sample size N. However, there may be situations where N approaches infinity.
• To accommodate such situations we use what are called large sample or asymptotic properties. These properties refer to the behavior of the sampling distribution of an estimator as the sample size approaches infinity.
Recap: Gauss-Markov Theorem: In the multiple regression model, and under assumptions MR1-MR5, the OLS estimator is BLUE.
• The GM Theorem is a finite sample result: it holds for any N (with N > K)
• Under MR6, the OLS estimator is (conditionally on X) normally distributed, another finite sample property.
• If MR6 doesn't hold, the OLS estimator is asymptotically (for large N) normally distributed.
• This is an asymptotic property: we will often use this kind of property later, in more complicated contexts.
The variances and covariances of the LS estimators
The variances and covariances of the least squares estimators give us information about the reliability of the estimators b₁, b₂, and b₃. Since the least squares estimators are unbiased, the smaller their variances, the higher the probability that they will produce estimates "near" the true parameter values.
Here is an example of these formulas: for K = 3, the conditional variance of b₂ is
Var(b₂|X) = σ² / [(1 − r²₂₃) ∑(xᵢ₂ − x̄₂)²]
where r₂₃ is the sample correlation between x₂ and x₃. It is important to understand the factors affecting the variance of b₂:
- Larger error variances σ² lead to larger variances of the LS estimators, as σ² measures the overall uncertainty in the model specification. If σ² is large, then the data are widely spread about the regression function, and less information is given about the parameter values.
- Larger sample sizes N imply smaller variances of the least squares estimators, as N is implicitly contained in the length of the summations.
- More variation in an explanatory variable around its mean leads to smaller variances of the LS estimator (sample dispersion of x).
- A larger correlation between x₂ and x₃ leads to a larger variance of b₂.
It is customary to arrange the estimated variances and covariances of the least squares estimators in a square array, which is called a matrix. This matrix has variances on its diagonal and covariances in the off-diagonal positions. It is called a variance-covariance matrix. To estimate variances and covariances, we need to review two important results in probability: the Law of Iterated Expectations and the Law of Iterated Variances.
• Law of Iterated Expectations
- Let Y and X be two r.v. (possibly vectors)
- E(Y) is a population constant
- E(Y|X) is a population function of X
- Then: E(Y) = E_X[E(Y|X)]
 → The inner expectation averages w.r.t. Y , given X → The outer expectation averages w.r.t. X - Application to the unbiasedness of the OLS estimator: E(b) = EX [E(b|X)] = EX[β] = β - b is conditionally and unconditionally unbiased • Law of Iterated Variances - E(Y) and Var(Y) are two population constants - E(Y |X ) and Var(Y |X ) are two population functions of X - Then: Var(Y)=EX [Var(Y|X)]+VarX [E(Y|X)] - Application to the variance of the OLS estimator: 
 Var(b) = EX [Var(b|X)] + VarX [E(b|X)] = EX [Var(b|X)] + 0 - For example, if K = 3:
 which is easy to estimate! Testing for elastic demand We want to check whether demand is elastic w.r.t. price 1. H₀ : β₂ ≥ 0, H₁ : β₂ < 0. Note: H₀ is rejected only if incompatible with the sample 2. Test statistic: if H₀ is true, t = b₂/se(b₂) ∼ t(N−K) 3. α=0.05 and N−K=72, RR: t≤t (0.05,72) = −1.666 4. In the sample, t = −7.908/1.096 = −7.215; 
 p-value = Prob(t(72) < −7.215) = 0.000 5. H0 : β₂ ≥ 0 is rejected 
 Testing advertising effectiveness We want to check whether advertising is repaid by a large enough increase in SALES 1. H₀ : β₃ ≤ 1, H₁ : β₃ > 1 2. Test statistic: If H₀ is true, t = (b₃ − 1)/se(b₃) ∼ t(N−K) 3. α=0.05 and N−K=72, RR: t>t (0.95,72) =1.666 4. In the sample, t = (1.8626 − 1)/0.6832 = 1.263; pv = Prob(t(72) > 1.263) = 0.105 5. H0 : β₃ ≤1 is not rejected 
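The two one-tail tests above can be reproduced in R from the lm output. A minimal sketch, assuming the andy data frame with sales, price and advert introduced earlier (assumed names); here we test H₀: β₃ ≤ 1 against H₁: β₃ > 1.

mod <- lm(sales ~ price + advert, data = andy)
b  <- coef(summary(mod))["advert", "Estimate"]
se <- coef(summary(mod))["advert", "Std. Error"]
t_stat  <- (b - 1) / se                       # t statistic for H0: beta3 <= 1
df      <- df.residual(mod)                   # N - K
p_value <- pt(t_stat, df, lower.tail = FALSE) # right-tail p-value
c(t_stat = t_stat, p_value = p_value)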
Hypothesis testing for a linear combination of coefficients
We are often interested in testing hypotheses about linear combinations of coefficients. Will changes in the values of two or more explanatory variables lead to a mean dependent variable change that exceeds a predefined goal?
Nonlinear relationships
One class of models is that of polynomial equations. Now that we are working within the framework of the multiple regression model, we can consider unconstrained polynomials with all their terms included. Another generalization is to include "cross-product" or "interaction" terms, leading to a model such as y = γ₁ + γ₂x₂ + γ₃x₃ + γ₄x₂x₃ + e.
Cost and product curves
A regression model linear in x is not well suited to describe cost or production functions.
• A polynomial in x fits the data better:
MC = β₁ + β₂Q + β₃Q² + e
TC = α₁ + α₂Q + α₃Q² + α₄Q³ + e
• These models are linear in the parameters: we can still use the OLS approach
• We can use the different powers of the same variables as if they were different variables
• Note: In nonlinear models the parameters are not slopes or marginal effects
• A model with x, x², ..., xᵖ can exhibit multicollinearity.
Extending the model for Burger Barn sales
It is reasonable to expect that the effect of advertising on sales is marginally decreasing. The optimal level of advertising is not the one that yields the highest sales, because advertising is costly, so we want it to at least repay itself.
SALES = β₁ + β₂PRICE + β₃ADVERT + β₄ADVERT² + e
- The ADVERT effect is not constant: we expect β₃ > 0 and β₄ < 0
Large sample properties of the LS estimator
It is nice to be able to use the finite sample properties of the OLS estimator or, indeed, any other estimator, to make inferences about population parameters. However, the assumptions we have considered so far are likely to be too restrictive for many data sets. To accommodate less restrictive assumptions, as well as carry out inference for general functions of parameters, we need to examine the properties of estimators as sample size approaches infinity, which provide a good guide to properties in large samples. They will always be an approximation, but it is an approximation that improves as sample size increases. Large sample approximate properties are known as asymptotic properties: they hold approximately if N is large enough. How large does the sample have to be? It depends on the model. In statistics/econometrics, finite sample properties are the exception; asymptotic properties are the general case. In this section, we introduce some large sample (asymptotic) properties and then discuss some of the circumstances where they are necessary.
1. Consistency
When choosing econometric estimators, we do so with the objective in mind of obtaining an estimate that is close to the true but unknown parameter with high probability. Suppose that for decision-making purposes we consider that obtaining an estimate of β₂ within "epsilon" of the true value is satisfactory. The probability of obtaining an estimate "close" to β₂ is P(β₂ − ε ≤ b₂ ≤ β₂ + ε). Intuitively, the distribution of b₂ depends on N. E.g., under assumptions MR1-MR5, we know that
E(b₂) = β₂. An estimator is said to be consistent if this probability converges to 1 as the sample size N → ∞. Or, using the concept of a limit, the estimator b₂ is consistent if
lim(N→∞) P(β₂ − ε ≤ b₂ ≤ β₂ + ε) = 1
We depict the probability density functions f(b₂) of the least squares estimator b₂ for sample sizes N₄ > N₃ > N₂ > N₁. As the sample size increases, the probability density function becomes narrower. Why is that so? First of all, the least squares estimator is unbiased if MR1-MR5 hold, so that E(b₂) = β₂. This property is true for all sample sizes, so as the sample size changes the center of the pdfs remains at β₂. However, the variance of the estimator b₂ becomes smaller.
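A small simulation (not from the textbook, purely illustrative) that reproduces this picture: the sampling variance of b₂ shrinks toward zero as N grows, with the distribution staying centered on the true β₂ = 2.

set.seed(123)
beta2 <- 2
sim_b2 <- function(N, reps = 2000) {
  replicate(reps, {
    x <- rnorm(N); e <- rnorm(N)
    y <- 1 + beta2 * x + e
    coef(lm(y ~ x))[2]                 # slope estimate from one simulated sample
  })
}
sapply(c(10, 50, 200, 1000), function(N) var(sim_b2(N)))  # variances shrink as N increases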
For large sample results, strict exogeneity can be relaxed to the weaker conditions E(eᵢ) = 0 and Cov(eᵢ, xᵢk) = 0: variables that satisfy these are said to be contemporaneously uncorrelated. We do not insist that Cov(et, xsk) = 0 for t ≠ s. Is b₂ consistent? Yes, as long as x is not "too dependent". Asymptotic normality can be shown by a central limit theorem.
Inference for a nonlinear function of coefficients β
• The need for large sample or asymptotic distributions is not confined to situations where assumptions MR1-MR6 are relaxed. Even if these assumptions hold, we still need to use large sample theory if a quantity of interest involves a nonlinear function of coefficients. To introduce this problem, we return to Big Andy's Burger Barn and examine the optimal level of advertising.
• Economic theory tells us to undertake all those actions for which the marginal benefit is greater than the marginal cost. This optimising principle applies to Big Andy's Burger Barn as it attempts to choose the optimal level of advertising expenditure. Recalling that SALES denotes sales revenue or total revenue, the marginal benefit in this case is the marginal revenue from more advertising.
• The required marginal revenue is given by the marginal effect of more advertising, β₃ + 2β₄ADVERT. The marginal cost of $1 of advertising is $1 plus the cost of preparing the additional products sold due to effective advertising. If we ignore the latter costs, the marginal cost of $1 of advertising expenditure is $1. Thus, advertising should be increased to the point where β₃ + 2β₄ADVERT₀ = 1, with ADVERT₀ denoting the optimal level of advertising. Using the least squares estimates for β₃ and β₄ in (5.25), a point estimate for ADVERT₀ is ADVERT̂₀ = (1 − b₃)/(2b₄) = 2.014, implying that the optimal monthly advertising expenditure is $2014.
• To assess the reliability of this estimate, we need a standard error and an interval estimate for (1 − b₃)/(2b₄). What makes it more difficult than what we have done so far is the fact that it involves a nonlinear function of b₃ and b₄. Variances of nonlinear functions are hard to derive. Suppose λ = (1 − β₃)/(2β₄) and λ̂ = (1 − b₃)/(2b₄); the approximate variance expression, which holds for any smooth nonlinear function of two estimators, is
Var(λ̂) ≈ (∂λ/∂β₃)² Var(b₃) + (∂λ/∂β₄)² Var(b₄) + 2(∂λ/∂β₃)(∂λ/∂β₄) Cov(b₃, b₄)
• We used the delta theorem to find the variance. This is something that computer software does.
• We estimate with 95% confidence that the optimal level of advertising lies between $1757 and $2271.
5.6 Interaction variables
Explanatory variables may have non-constant marginal effects, which are functions of another variable.
» A marginal effect may depend on another variable
» Consider a wage equation WAGE = β₁ + β₂EDUC + β₃EXPER + e
» ∂E(WAGE)/∂EDUC = β₂. We expect β₂ > 0
» ∂E(WAGE)/∂EXPER = β₃.
We expect β₃ > 0. The estimated equation is
WAGÊ = −17.729** + 2.583** EDUC + 0.200** EXPER
(se) (2.209) (0.136) (0.030)
5.6 Interaction variables
» It is plausible that the marginal effects of education and experience might be non constant, e.g., they might depend on the other variable
» Interaction: WAGE = β₁ + β₂EDUC + β₃EXPER + β₄(EDUC × EXPER) + e
Adding the interaction variable makes the marginal effects of education and experience non constant:
» ∂E(WAGE)/∂EDUC = β₂ + β₄EXPER
» ∂E(WAGE)/∂EXPER = β₃ + β₄EDUC
» Marginal effects of EXPER when EDUC = 8 or 16:
∂E(WAGE)/∂EXPER = b₃ + b₄EDUC = 0.238 − 0.00275 EDUC = 0.216 when EDUC = 8, 0.194 when EDUC = 16
Loglinear models
log(WAGE) = β₁ + β₂EDUC + β₃EXPER + e
» Effect of one additional year of experience: 100β₃%
» Interaction effect: log(WAGE) = β₁ + β₂EDUC + β₃EXPER + β₄(EDUC × EXPER) + e
» Effect of one additional year of experience: 100(β₃ + β₄EDUC)%
Loglinear models
» Interaction and quadratic effect: log(WAGE) = β₁ + β₂EDUC + β₃EXPER + β₄(EDUC × EXPER) + β₅EXPER² + e
A negative β₅ could express the fact that too much experience is not as valuable anymore
» Semielasticity of wage w.r.t. experience: β₃ + β₄EDUC + 2β₅EXPER
Further inference in the multiple regression model
Testing joint hypotheses
• Until now we have only considered simple null hypotheses, meaning that there was only one restriction on the parameters
• How can we jointly test more than one constraint?
y = β₁ + β₂x₂ + β₃x₃ + β₄x₄ + β₅x₅ + β₆x₆ + e
• We might be tempted to compute t-tests and p-values for the parameters separately, but it would be wrong.
H₀: β₄ = 0, β₅ = 0, β₆ = 0
• NB: The three t tests are not joint tests: in fact, when you apply a test on e.g. β₄ alone you are letting all the other βs be whatever, therefore you are not checking whether they are acting jointly.
Testing the effect of advertising: the F test
(An example of a joint hypothesis: testing to check if advertising has any effect on sales)
SALES = β₁ + β₂PRICE + β₃ADVERT + β₄ADVERT² + e
H₀: β₃ = 0, β₄ = 0, i.e. both β₃ and β₄ are equal to zero at the same time
H₁: β₃ ≠ 0, or β₄ ≠ 0, or both, i.e. either β₃ is different from zero, or β₄ is different from zero, or both are different from zero
With joint tests, the alternative hypothesis only admits "different from", so it is always going to be two sided.
• Unrestricted model (U): does not impose H₀; it is the general model without considering H₀.
SALES = β₁ + β₂PRICE + β₃ADVERT + β₄ADVERT² + e
• Restricted model (R): derived from (U) by imposing H₀; it is the model assuming H₀ is true.
SALES = β₁ + β₂PRICE + e
→ If U and R are similar, then H₀ is true
• J = number of constraints, i.e. hypotheses under H₀
• SSER − SSEU ≥ 0: the difference between the sum of squared residuals in the restricted model and that in the unrestricted model is always ≥ 0 because SSER ≥ SSEU
SSEU = ∑(sᵢ − b₁ − b₂Pᵢ − b₃Aᵢ − b₄Aᵢ²)², SSER = ∑(sᵢ − a₁ − a₂Pᵢ)²
• The test statistic is
F = [(SSER − SSEU)/J] / [SSEU/(N − K)] ∼ F(J, N−K) under H₀
• The denominator is the estimate of σ² in the unrestricted model
→ If H₀ is false, F is large
→ If H₀ is true, F is small
• This distribution holds under the Gauss-Markov assumptions and normally distributed errors: therefore this is an exact (finite sample) distribution
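A minimal R sketch of this F test, assuming the andy data frame (sales, price, advert) used above: fit the restricted and unrestricted models and compare them with anova().

unrestricted <- lm(sales ~ price + advert + I(advert^2), data = andy)
restricted   <- lm(sales ~ price, data = andy)
anova(restricted, unrestricted)   # reports F = [(SSE_R - SSE_U)/J] / [SSE_U/(N - K)] and its p-value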
The relationship between t and F tests
• If J = 1 (i.e. one restriction), a two-tail H₀ can be tested using a t or an F test
• The two test statistics are different, but they are linked: F(1, N−K) ≡ t²(N−K), and so are their critical values: Fc ≡ (tc)²
• Conclusion: the two tests provide the same p-value
• However, the F test is more restrictive: it only allows for two-sided alternatives (≠)
SALES = β₁ + β₂PRICE + β₃ADVERT + β₄ADVERT² + e
• H₀: β₂ = 0
• t = 7.30444; F = 53.355 = (7.30444)²
• tc = 1.9939; Fc = 3.976 = (1.9939)²
• P-value: 3.236 × 10⁻¹⁰
 Note: The equivalence only holds for two-tail tests More general F tests • Any J ≤ K linear restrictions can be tested using an F statistic SALES=β₁ +β₂ PRICE+β₃ ADVERT+β4 ADVERT² +e • Optimal advertising expenditure: β₃ + 2β4 ADVERT0 = 1 (1,000$) • H0 : ADVERT0 =$1,900 ⇔ H0 : β₃ +3.8β4 =1, H₁: β₃ +3.8β4≠1 • t = 0.968, p-value =0.337 
 Using computer software • The optimal advertising expenditure is 1,900$ • Budget conjecture: if PRICE=6$ and ADVERT=1,900$, SALES=80,000$ • This is equivalent to: H0 : β₃ + 3.8β4 = 1, β₁ + 6β₂ +1.9β₃ +1.9²β4 = 80 H₁ : At least one of the two constraints does not hold Large sample tests • To be valid, the F test needs two conditions to hold: 1. MR1-MR6 are valid
 2. The J constraints on β are linear • The F test is a finite sample one •We can derive a similar test under the asymptotic distribution and the weaker exogeneity assumption from the following property: The denominator is not a r.v. asymptotically, same for J. Therefore the denominator and J only matter when they are r.v.’s i.e. finite, not asymptotic. •Asymptotically, it is equivalent to use σ² or σ^² at the denominator (as σ²^ →p σ²): • Notice that F = V^₁ /J. The p-value of the two statistics are usually close, but different • V^₁ is the “asymptotic” version of the F test. It still works for linear constraints only • For nonlinear constraints, we must use the Delta Theorem (R takes care of all the details) 
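In R, tests of linear combinations and joint linear restrictions like the ones above can be run with linearHypothesis() from the car package. A sketch, assuming the andy data frame and an explicitly created squared-advertising column (advert2 is a hypothetical name); the coefficient names in the hypothesis strings must match the fitted model.

library(car)
andy$advert2 <- andy$advert^2
mod <- lm(sales ~ price + advert + advert2, data = andy)
linearHypothesis(mod, "advert + 3.8*advert2 = 1")       # H0: beta3 + 3.8*beta4 = 1
linearHypothesis(mod, c("advert = 0", "advert2 = 0"))   # joint significance of the advertising terms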
 Midterm II The Use of Nonsample Information In many estimation problems we have information over and above the information contained in the sample observations. To illustrate how we might go about combining sample and nonsample information, consider a model designed to explain the demand for beer. From the theory of consumer choice in microeconomics, we know that the demand for a good will depend on the price of that good, on the prices of other goods—particularly substitutes and complements—and on income. In the case of beer, it is reasonable to relate the quantity demanded (Q) to the price of beer (PB), the price of liquor (PL), the price of all other remaining goods and services (PR), and income (I). To estimate this demand relationship, we need a further assumption about the functional form: we assume, for this case, that log-log is the appropriate one: ln(Q) = β₁ + β2ln(PB) + β3ln(PL) + β4ln(PR) + β5ln(I) + e A relevant piece of nonsample information can be derived by noting that if all prices and income go up by the same proportion, we would expect there to be no change in the lhs — quantity demanded. This assumption is that economic agents do not suffer from “money illusion”. Let us impose this assumption on our demand model and see what happens. Having all prices and income change by the same proportion is equivalent to multiplying each price and income by a constant k. = β₁ + β2ln(k*PB) + β3ln(k*PL) + β4ln(k*PR) + β5ln(k*I) we can simplify this with the following line = β₁ + β2ln(PB) + β3ln(PL) + β4ln(PR) + β5ln(I) + (β2 + β3 + β4 + β5)*ln(k) what remains is (β2 + β3 + β4 + β5)*ln(k) = 0 Therefore, with absence of monetary illusion (AMI), meaning, for there to be no change in ln(Q) when all prices and income go up by the same proportion, it must be true that: β2 +β3 +β4 +β5 = 0 You can solve this with respect to β4 = - β₂ - β3 - β5 meaning that we have one less parameter to estimate Thus, we can say something about how Qd should not change when prices and income change by the same proportion, and this information can be written in terms of a specific restriction on the parameters of the demand model. We call such a restriction nonsample information, that we introduce by solving for one of the βk’s, like we did. Let's estimate the demand model imposing AMI using data in beer:
We substitute this inside the model:
ln(Q) = β₁ + β₂ln(PB) + β₃ln(PL) + (−β₂ − β₃ − β₅)ln(PR) + β₅ln(I) + e (R)
By using the restriction to replace β₄, and using the properties of logarithms, we have constructed the new variables ln(PB/PR), ln(PL/PR), and ln(I/PR). These variables have a natural economic interpretation as relative prices and real income, and the restricted model can be estimated by least squares using them as regressors.
Model specification: omitted variables
- Omission of a relevant variable (defined as one whose coefficient is nonzero) leads to an estimator that is biased: this is called omitted variable bias. Omitting a relevant variable amounts to imposing the incorrect nonsample information that its coefficient is zero.
• General case, K = 3, supposing all variables are statistically significant:
y = β₁ + β₂x₂ + β₃x₃ + e
• Imposing β₃ = 0, that is, omitting x₃, we get the "short" estimator b₂*, whose bias is
bias(b₂*) = E(b₂*|X) − β₂ = β₃ · Ĉov(x₂, x₃) / V̂ar(x₂)
This expression provides the bias and, under weaker conditions, the inconsistency. If you omit a variable that should be there, then the OLS estimator will be biased and inconsistent, unless the variables that you are omitting are uncorrelated with the variables we are using.
• Note: b₂* is conditionally unbiased only if Ĉov(x₂, x₃) = 0, and consistent only if Cov(x₂, x₃) = 0.
• In the example above, β₃ > 0 and Ĉov(x₂, x₃) > 0: positive, upward bias. Knowing the sign of β₃ and the sign of the covariance between x₂ and x₃ tells us the direction of the bias.
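A short simulation (illustrative only, not from the textbook) showing the omitted variable bias formula at work: with β₃ = 2 and positively correlated regressors, the short regression of y on x₂ alone overstates β₂ by roughly β₃·Cov(x₂,x₃)/Var(x₂).

set.seed(42)
N  <- 10000
x3 <- rnorm(N)
x2 <- 0.5 * x3 + rnorm(N)          # x2 and x3 positively correlated
y  <- 1 + 1 * x2 + 2 * x3 + rnorm(N)
coef(lm(y ~ x2 + x3))["x2"]        # close to the true beta2 = 1
coef(lm(y ~ x2))["x2"]             # biased upward, roughly 1 + 2*0.5/1.25 = 1.8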
 Another example: KL6 = Number of kids aged 6 or less The HEDU and WEDU estimates are back to their initial value, because ^Corr(HEDU, WEDU) = 0.59 but ^Corr(HEDU, KL6) = 0.10 and ^Corr(WE, KL6) = 0.13 The omitted variable bias for KL6 is not large, as correlation is quite low. Irrelevant Variables The consequences of omitting relevant variables may lead you to think that a good strategy is to include as many variables as possible in your model. However, doing so will not only complicate your model unnecessarily, it may inflate the variances of your estimates because of the presence of irrelevant variables, those whose coefficients are zero because they have no direct effect on the dependent variable. If the model includes irrelevant variables - The LS estimates are still unbiased - The standard errors are larger with X5, X6 being artificial variables, randomly generated, i.e. having no relation with the rest. P-values confirm that you can safely drop them. Control variables • Control variables are included in a model to avoid omitted variable bias in the (causal) coefficients of other variables. • Control variables may directly affect y, or simply proxy for other (unobserved) omitted variables • Let data be generated by yᵢ =β₁ +β₂xᵢ +β₃zᵢ + eᵢ, E(eᵢ|xᵢ,zᵢ)=0 • Suppose we are interested in β₂, and that zᵢ (ability) is unobserved • A great example would be imagining the wage function log(W) = β₁+β₂ED+β₃ABILITY+e Given that we cannot measure ability, we could try to use IQ as a proxy for it. • If we estimate 
 yᵢ= γ₁ + γ₂xᵢ + vᵢ then 
- vᵢ = eᵢ + β₃zᵢ
- the OLS estimate of γ₂ is not consistent for β₂ unless Cov(xᵢ, zᵢ) = 0
• Suppose that we observe another variable qᵢ that is a proxy for zᵢ. Can we estimate unbiasedly/consistently β₂ in the model yᵢ = δ₁ + δ₂xᵢ + δ₃qᵢ + uᵢ (using δ̂₃ as a proxy for β₃)?
• Yes, if (yᵢ, xᵢ, zᵢ, qᵢ) satisfy two assumptions:
 1. The conditional mean independence assumption (CMIA) E(zᵢ|xᵢ, qᵢ) = E(zᵢ|qᵢ) “Once you know qᵢ, knowing xᵢ doesn’t tell you anything about zᵢ” 2. qᵢ has no direct effect on yᵢ: E(yᵢ|xᵢ, zᵢ, qᵢ) = E(yᵢ|xᵢ, zᵢ) This is equivalent to E(eᵢ|xᵢ, zᵢ, qᵢ) = 0 • Both conditions can not be tested because zᵢ is not observed, so you must be convinced that they hold based on economic intuition or common sense. • In conclusion if we regress yᵢ on xᵢ and qᵢ (i.e. when using a proxy variable), we are usually able to estimate consistently: - the causal effect of xᵢ on yᵢ, i.e. β₂, the parameter of interest - but not the causal effect of zᵢ on yᵢ, i.e. β₃ Choosing the model 1. Do you want to estimate causal effects or predict y? 2. Choose variables and functional form coherent with economic theories and other a priori information 3. Estimates with the wrong sign or size can be symptoms of misspecification or omitted variable bias 4. Are there any systematic patterns in residuals plot? 5. Conduct significance test 6. Are there influential observations? 7. Are results robust to different specifications (esp. causal effects estimates)? 8. Use the RESET specification test 9. Check the model selection criteria 10. Out-of-sample prediction 11. … And keep track of all the checks and modifications that led you to the final model! 
 Note: - Significance tests: done on the βs (e.g. are they equal to zero?) - Specification (or diagnostic) test: done on the overall model (e.g. are the errors normally distributed? We run some diagnostic tests using tools which will tell us how to choose between competing models based on how well specified they are, and on whether they are omitting variables or not. RESET Ramsey test REgression Specification Error Test: designed to detect omitted variables and incorrect functional form. Note: apply this test only to nested models. Two models are nested if model M1 is a restricted version of model M2. To test which of the two models is the best, you have to test for the significance of the additional variable that differentiates them: - estimate the unrestricted model - test the restriction If we reject H0, then M2>>>M1 (i.e. is better than) Suppose that we have specified and estimated the regression model y = β₁ + β₂x₂ + β₃x₃ +e Let (b₁, b₂, b₃) be the least squares estimates, and let ŷ be the fitted values of y ŷ = b₁ + b₂x₂ + b₃x₃ (6.32) • Consider the following two artificial models (auxiliary regressions):
 y = β₁ + β₂x₂ + β₃x₃ + γ₁ŷ² +e (1) y = β₁ + β₂x₂ + β₃x₃ + γ₁ŷ² + γ₂ŷ³ +e (2) • Tests for misspecification: - H0 : γ₁=0, H₁ : γ₁≠0 in (1), i.e. model is well specified (t or F test)
- H₀: γ₁ = 0, γ₂ = 0, H₁: γ₁ ≠ 0 and/or γ₂ ≠ 0 in (2); rejection signals misspecification such as omitted variables (F test)
Rejection of H₀ implies that the original model is inadequate and can be improved. A failure to reject H₀ says that the test has not been able to detect any misspecification. To understand the idea behind the test, note that ŷ² and ŷ³ will be polynomial functions of x₂ and x₃. If you square and cube both sides of (6.32), you will get terms such as x₂², x₃³, x₂x₃, x₂x₃², and so on. Since polynomials can approximate many different kinds of functional forms, if the original functional form is not correct, the polynomial approximation that includes ŷ² and ŷ³ may significantly improve the fit of the model, and this will be detected through nonzero values of γ₁ and γ₂. Furthermore, if we have omitted variables and these variables are correlated with x₂ and x₃, then they are also likely to be correlated with terms such as x₂² and x₃², so some of their effect may be picked up by including the terms ŷ² and/or ŷ³. The point is: if we can significantly improve the model by artificially including powers of the predictions of the model, then the original model must have been inadequate. The test, however, will not detect an omitted variable z that is unrelated to the included regressors xh and their powers, h = 2, 3, …
Collinearity
Typical symptoms of (approximate) collinearity:
- Estimates are not statistically significant
- R² can still be large, indicating significant explanatory power of the model as a whole
- The F test of model significance rejects H₀
- The model may still predict well out-of-sample
- The estimates are very sensitive to the addition or the omission of some observation or of some variable, even if apparently irrelevant
Dummy variables trap
These are cases of exact collinearity, which make it impossible to estimate the model.
1) w = β₁ + β₂Ed + β₃Ex + … + βMM + βFF + e
M and F are dummy variables, meaning that:
Mᵢ = 1 if i is a man, 0 if i is a woman
Fᵢ = 1 if i is a woman, 0 if i is a man
There exists an exact linear relationship between M and F: 1 = M + F, implying exact collinearity. E.g., if we want to test whether there is discrimination between genders, i.e. H₀: βM = βF, H₁: βM > βF, we need to drop one of the two variables:
w = δ₁ + δ₂Ed + δ₃Ex + … + δFF + e
H₀: δF = 0, H₁: δF < 0; now we can estimate it.
2) Which day of the week do we expect returns to be higher?
Rt = β₁ + βM Mt + βT Tt + βW Wt + βTh Tht + βF Ft + e
Mt = 1 if t is a Monday, 0 otherwise (and similarly for the other days)
H₀: βM = βT = βW = βTh = βF → 4 linear restrictions → F test
1 = Mt + Tt + Wt + Tht + Ft: exact collinearity, the model cannot be estimated
Identifying and mitigating collinearity
• Compute the correlations between regressors
• Estimate the auxiliary regressions (this is how we check for approximate collinearity):
xk = a₁x₁ + a₂x₂ + ... + ak−1xk−1 + ak+1xk+1 + ... + aKxK + v
and compute the R² coefficient, denoted R²k, for k = 1,2,...,K
• If R²k > 0.8 for some k, a large fraction of the sample variability of xk is explained by the other x's
• The Variance Inflation Factor (VIF)
VIFk = 1/(1 − R²k) measures the factor by which the variance of bk is inflated when regressors other than xk are added to the rhs. If R²k = 0, then VIFk = 1. If R²k is higher, VIFk is higher, and Var(bk|X) is VIFk times higher than it would be if there were no collinearity.
• M1) y = β₁ + β₂xk + e → V̂ar(b₂)
M2) y = δ₁ + δ₂x₂ + δ₃x₃ + … + δkxk + … + δKxK + u → V̂ar(δ̂k), larger than the previous one: VIFk = V̂ar(δ̂k) / V̂ar(b₂)
If the R²k of the auxiliary regression is high, then a large portion of the variance of xk is explained by the other regressors, and that may have a detrimental effect on the precision of the estimation of β.
• Mitigate: collinearity amounts to insufficient information
- Add more data
- Introduce nonsample information
Using indicator variables
An indicator variable is an artificial variable which is equal to 1 if a certain logical condition is met, and equal to 0 otherwise.
PRICE = β₁ + β₂SQFT + e
• New explanatory variable: desirability of the neighbourhood
- D is qualitative
- D: binary, dichotomic, indicator, dummy (to indicate that we are creating a numeric variable for a qualitative, nonnumeric characteristic).
- 0 and 1 are just labels, not numbers, namely "true" or "false"
Intercept indicator variables
The most common use of indicator variables is to modify the model's intercept parameter. The effect of the inclusion of an indicator variable D into the regression model is best seen by examining the regression function in the two locations.
• δ = change in intercept, i.e. δ has the effect of shifting the intercept
• δ = location premium, i.e. change in price due to the location of the house
Choosing the reference group
A researcher can choose whichever neighbourhood is most convenient, for expository purposes, to be the reference group
• D = 0: reference group
• Alternatively, define LD ("least desirable"), with reference case LD = 0 (desirable): this indicator variable is defined just the opposite of D, and LD = 1 − D.
• Perfect collinearity: PRICE = β₁ + δD + λLD + β₂SQFT + e
Using D: P = β₁ + δD + β₂S + e, so that
E(P|D,S) = β₁ + δ + β₂S if D = 1 (good location); E(P|D,S) = β₁ + β₂S if D = 0 (bad location)
Using LD: P = γ₁ + λLD + γ₂S + e, so that
E(P|LD,S) = γ₁ + γ₂S if LD = 0 (good location); E(P|LD,S) = γ₁ + λ + γ₂S if LD = 1 (bad location)
To understand the relationship between (β₁, δ, β₂) and (γ₁, λ, γ₂), solve the system: γ₁ = β₁ + δ, γ₂ = β₂, δ = −λ, γ₁ + λ = β₁.
δ is the location premium and λ is the location penalty: the location premium has become a location penalty, and the intercept has also changed.
You may be tempted to include both D and LD in the regression model to capture the effect of each neighbourhood on house prices. That is, you might consider the model
PRICE = β₁ + δD + λLD + β₂SQFT + e
In this model, the variables are such that D + LD = 1. We have created a model with exact collinearity, where the least squares estimator is not defined. This error is sometimes described as falling into the dummy variable trap. By including only one of the indicator variables, either D or LD, the omitted variable defines the reference group.
Slope indicator variables
PRICE = β₁ + β₂SQFT + γ(SQFT × D) + e
• (SQFT × D): we can allow for a change in a slope by including in the model an additional explanatory variable that is equal to the product of an indicator variable and a continuous variable, an interaction effect, i.e. another form of location premium that depends on house size. Slope indicator variables don't change the intercept.
• Joint test of significance: Testing the equivalence of two regressions
• Is the regression equation the same if SOUTH = 0 or SOUTH = 1?
• Significance test of all the interactions with SOUTH: H₀: θ₁ = 0, θ₂ = 0, θ₃ = 0, θ₄ = 0, θ₅ = 0
• Note: This test is based on MR1-MR6 (σ² constant)
• This is known as the Chow test: an F-test for the equivalence of two regressions. (Too complicated; use the following example.)
By including an intercept indicator variable and an interaction variable for each additional variable in an equation, we allow all coefficients to differ based on a qualitative factor.
• Is the regression (all the coefficients) the same for men and women? E(W) = β₁ + β₂EDUC + β₃EXPER. By putting all observations together, we would miss the difference that we observe when we separate men from women.
W = β₁ + β₂ED + β₃EXPER + δ₁F + δ₂(F × ED) + δ₃(F × EXPER) + e
H₀: do males and females have the same regression function?
E(W|F = 0) = β₁ + β₂ED + β₃EXPER (men)
E(W|F = 1) = (β₁ + δ₁) + (β₂ + δ₂)ED + (β₃ + δ₃)EXPER (women)
Note that each variable has a separate coefficient for female and non-female workers.
H₀: δ₁ = 0, δ₂ = 0, δ₃ = 0. This is NOT the automatic F test of overall significance, as we are only checking the joint significance of the F indicator and its interactions. If we reject this null hypothesis, we conclude that there is some difference in the wage equation of females relative to that of non-females.
The linear probability model
To analyze and predict such outcomes using an econometric model, we represent the choice using an indicator variable, the value one if one alternative is chosen and the value zero if the other alternative is chosen. Because we are attempting to explain choice between two alternatives, the indicator variable will be the dependent variable rather than an explanatory variable in a regression model. It is a special case of linear regression.
• Until now indicator variables were included in the rhs
• The dependent variable can be a dummy, too
If we observe the choices that a random sample of individuals makes, then y is a random variable. If p is the probability that the first alternative is chosen, then P[y = 1] = p. The probability that the second alternative is chosen is P[y = 0] = 1 − p.
• If p = Prob(y = 1), y is a Bernoulli r.v. The probability function for the binary indicator y is:
pdf: f(y) = pʸ(1 − p)¹⁻ʸ, y = 0, 1
moments: E(y) = p, Var(y) = p(1 − p)
• Pros: very simple; quite good in prediction/forecasting
• Cons: quite bad if you are looking for consistency of the parameters; the linear functional form makes little sense for a probability (it is not bounded between 0 and 1); if you apply OLS, it is not efficient
• Linear probability model: E(y|X) = p = β₁ + β₂x₂ + … + βKxK
βk = ΔE(y|x)/Δxk = ΔProb(y = 1|x)/Δxk
If y is binary, E(y) = 1·p + 0·(1 − p) = p = β₁ + β₂x₂ + ... + βKxK, e.g. the impact of changing price on the probability of a household buying a product.
• The corresponding econometric model: y = E(y|X) + e = β₁ + β₂x₂ + ... + βKxK + e, where E(y|X) = p(x)
• One difficulty with using this model for choice behavior is that the usual error term assumptions cannot hold. The outcome y only takes two values, implying that the error term e also takes only two values, so that the usual "bell-shaped" curve describing the distribution of errors does not hold. The probability functions for y and e are
e = y − p = 1 − p if y = 1 (with probability p); e = −p if y = 0 (with probability 1 − p)
E(e|X) = (1 − p)p + (−p)(1 − p) = 0
Var(e|X) = E(e²|X) − [E(e|X)]² = (1 − p)²p + p²(1 − p) = p(1 − p)
Therefore the errors are heteroskedastic, as p (= p(x)) changes across observations. So Gauss-Markov does not hold, and the OLS estimator is not efficient.
• This error is not homoskedastic, so the usual formula for the variance of the least squares estimator is incorrect; instead we use heteroskedasticity-robust (White) standard errors.
• A second problem associated with the linear probability model is that predicted values, Ê(y) = p̂, can fall outside the (0, 1) interval (they can be negative or larger than 1), meaning that their interpretation as probabilities does not make sense. A common fix is to truncate:
p̃ = 0 if p̂ < 0; p̃ = p̂ if 0 ≤ p̂ ≤ 1; p̃ = 1 if p̂ > 1
p̂ = predicted probabilities, which should belong to (0, 1); ŷ = predicted choices ∈ {0, 1}, i.e. ŷ = 0 if p̂ ≤ ½ and ŷ = 1 if p̂ > ½, with threshold ½.
Treatment effects
STAR experiment (an example): regress total score (yᵢ) on the treatment indicator (dᵢ)
yᵢ = β₁ + β₂dᵢ + eᵢ, i = 1, 2, ..., N
• β₂ is the treatment effect we wish to measure
• Some algebra shows that b₂ is a difference estimator: b₂ = ȳ₁ − ȳ₀, where ȳ₁ and ȳ₀ are the sample means of y for treated and untreated observations, respectively.
• Is b₂ unbiased or consistent for β₂?
 - Unbiasedness requires MR2, i.e. strict exogeneity: E(eᵢ|d) = 0, i = 1,2,...,N - Consistency requires the weaker condition: E(eᵢ|dᵢ) = 0, i = 1,2,...,N • Both conditions amount to assume that, on average, all factors affecting the outcome y other then the treatment d do not depend on the treatment assignment, and therefore must be equal to the treatment and control groups. • Including additional regressors attenuates but almost never eliminates the issue: some factors could be unobservable Note: one event preceding another does not necessarily imply causation. Recapping it all: with random assignment, and the use of a large number of experiment subjects, we can be sure that E(e₁) = E(e₀) and E(b₂) = β₂. The STAR project • If individuals are randomly assigned to the treatment and control groups, the difference estimator works fine • In the STAR experiment, children were randomly assigned within schools into three types of classes: 
 1. Small classes with 13-17 students
 2. Regular-sized classes with 22-25 students
 3. Regular-sized classes with a full-time teacher aide • Student scores on tests were recorded, as was some information about the students, teachers, and schools 
 • It may be that assignment to treatment groups is related to one or more observable characteristics. That is, treatments are randomly assigned given an external factor. The way to adjust to “conditional” randomisation is to include the conditioning factors into the regression. In the STAR data, another factor that we might consider affecting the outcome is the school itself. The students were randomised within schools (conditional randomisation), but not across schools. Some schools may be located in wealthier school districts that can pay higher salaries, thus attracting better teachers. The students in our sample are enrolled in 79 different schools. One way to account for school effects is to include an indicator variable for each school. That is, we can introduce 78 new indicators. Differences-in-Differences How can we measure the increase of minimum wage on employment? We cannot conduct a formal experiment, however there was a regulation that looked like an experiment even if it wasn’t. • Natural experiments approximate what happens in randomised controlled experiments. Treatment appears as if it were randomly assigned. In this section, we consider estimating treatment effects using “before and after” data. We can isolate the effect of the treatment by using a control group that is not affected by the policy change. • DiD (or diff-in-diff) idea: 1. Consider the change in y after the treatment
 2. Compare the change in the treated and control groups under the common trend assumption - Treated observed change: C − B 
 - Control observed change: E − A 
 - Treatment effect: (C −B)−(E −A) = C−D=δ 
 • DiD estimator:
δ̂ = (ȳT,after − ȳT,before) − (ȳC,after − ȳC,before)
This simple estimator works, but it does not directly provide standard errors.
• δ̂ can also be conveniently estimated as a coefficient in a linear regression, a method which will give us standard errors and test statistics:
yit = β₁ + β₂dᵢ + β₃AFTERt + δ(dᵢ × AFTERt) + eit
with dᵢ being the treatment indicator
• Notice that we need a panel of observations (several individuals and two time periods)
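A minimal R sketch of this DiD regression, assuming a long-format data frame named njmin with an outcome fte and 0/1 indicators treat and after (hypothetical names); the four treatment/period combinations it relies on are listed next.

did_mod <- lm(fte ~ treat + after + treat:after, data = njmin)
summary(did_mod)   # the coefficient on treat:after is the DiD estimate of delta, with its standard error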
treat = 0, after = 0 [control before]
treat = 1, after = 0 [treatment before]
treat = 0, after = 1 [control after]
treat = 1, after = 1 [treatment after]
Heteroskedasticity
We proposed the simple population regression model FOOD_EXPᵢ = β₁ + β₂INCOMEᵢ + eᵢ. Given the parameter values, β₁ and β₂, we can predict food expenditures for households with any income. The random error eᵢ represents the collection of all the factors other than income that affect household expenditure on food. If the assumption of strict exogeneity holds, then the regression function is E(FOOD_EXPᵢ|INCOMEᵢ) = β₁ + β₂INCOMEᵢ.
The discussion above focuses on the level, or amount, of food expenditure. We now ask, "How much variation in household food expenditure is there at different levels of income?" If we observe many households with the median income (about $1000 a week), we would observe a wide range of actual weekly food expenditures. The variation arises because different households have differing tastes and preferences, differing demographic characteristics, and different life circumstances. We can expect to observe larger variations in weekly food expenditures by households with large incomes. Holding income constant, and given our model, what is the source of the variation in household food expenditures? It must be the random error, the collection of factors, other than income, that influence food expenditure. Recall that the random error in the regression is the difference between any observation on the outcome variable and its conditional expectation, that is, eᵢ = FOOD_EXPᵢ − E(FOOD_EXPᵢ|INCOMEᵢ). If the assumption of strict exogeneity holds, then the population average value of the random errors is E(eᵢ|INCOMEᵢ) = E(eᵢ) = 0.
Another way of describing the greater variation in food expenditures for high-income households is to say that the probability of observing large positive or negative random errors is higher for high incomes than it is for low incomes. The random error eᵢ has a higher probability of taking on a large value if its variance is large. In the context of the food expenditure example, we can capture the effect we are describing by assuming that Var(eᵢ|INCOMEᵢ) increases as income increases. Food expenditure can deviate further from its mean, or expected value, when income is large. In such a case, when the error variances for all observations are not the same, we say that heteroskedasticity exists. Conversely, if all observations come from probability density functions with the same variance, we say that homoskedasticity exists, and eᵢ is homoskedastic.
1. Testing for heteroskedasticity
Breusch-Pagan (Lagrange multiplier) test: specify a variance function in which σᵢ² depends on candidate variables z₂, …, zS, regress the squared least squares residuals êᵢ² on z₂, …, zS, and test the joint significance of the z's (for example with the statistic N × R² of this auxiliary regression, which is asymptotically chi-square with S − 1 degrees of freedom under the null of homoskedasticity). Note that the estimate of σᵢ² is êᵢ²: because each residual has a different variance, we can use only one observation to estimate each variance.
White test
• The Breusch-Pagan test, the one we just saw, requires knowledge of z, which is why we need the White test. One problem with the variance function test described so far is that it presupposes that we have knowledge of what variables will appear in the variance function if the alternative hypothesis of heteroskedasticity is true. In other words, it assumes we are able to specify z₂, z₃, ..., zS.
• In reality, we may wish to test for heteroskedasticity without precise knowledge of the relevant variables. With this point in mind, White suggested defining the z's as equal to the x's, the squares of the x's, and their cross-products.
• White helps us decide which z variables to use: as z, use the x's, their squares x², and their cross-products
• Suppose the regression model is yᵢ = β₁ + β₂xᵢ₂ + β₃xᵢ₃ + eᵢ
 z₂ =x₂ z₃ =x₃ z₄ =x₂² z₅ =x₃² and z₆ =x₂x₃ • Note: to avoid collinearity, it may be necessary to drop some of the z If the regression model contains quadratic terms (x₃ = x₂² for example), then some of the z’s are redundant and are deleted. Also if x₃ is an indicator variable, taking the values 0 and 1, then x₃² = x₃ which is also redundant. One difficulty with the White test is that it can detect problems other than heteroskedasticity. Thus, while it is a useful diagnostic, be careful about interpreting the result of a significant White test. It may be that you have an incorrect functional form, or an omitted variable. In this sense, it is something like RESET. The Goldfeld-Quandt test • This method is more limited than the Lagrange multiplier one, but is however exact: it is designed for the case where we have two subsamples with possibly different variances. The subsamples might be based on an indicator variable. • It checks whether the variance in a group of observations is the same as that of another group of observations. • Consider the cps5_small dataset, and the regression model:
 WAGE = β₁ + β₂EDUC + β₃EXPER + e 
 • We suspect that the variance of eᵢ depends on METRO, a binary indicator (= 1: metropolitan area (M); = 0: rural area (R)) 
 1. Estimate the same model in two groups: one on metropolitan workers and one on rural workers.
WAGE_Mi = βM1 + βM2 EDUC_Mi + βM3 EXPER_Mi + e_Mi, i = 1, ..., NM
WAGE_Ri = βR1 + βR2 EDUC_Ri + βR3 EXPER_Ri + e_Ri, i = 1, ..., NR
2. Compute σ̂²M = SSE_M/(NM − KM) and σ̂²R = SSE_R/(NR − KR) (here KM = KR = 3)
3. Under H₀: σ²M = σ²R (i.e. their ratio equals 1), the test statistic is
GQ = σ̂²M / σ̂²R ∼ F(NM − KM, NR − KR)
4. The rejection region depends on H₁ (one- or two-tail)
5. The test conclusion does not change whichever of the two variance estimates we place in the numerator
The food expenditure example
• If z is continuous (i.e. not binary):
 1. Order the sample w.r.t. z
 2. Conduct the GQ test by comparing the first and the second part (e.g., half) of the sample • In the food expenditure example, z = INCOME 2. Consequences of heteroskedasticity in the MR model 
 • Heteroskedasticity is a violation of MR3: instead of 
 Var(eᵢ|X) = σ², ∀i 
 we should assume
 Var(eᵢ|X) = σᵢ², i = 1,2,...,N 
 • We focus on violations of MR3 only, other assumptions still hold (in finite samples or asymptotically) • In particular: MR2 [E(eᵢ |X) = 0] or exogeneity of x [Cov(eᵢ, xᵢ ) = 0] still hold 
 • Consequences: 1. The OLS estimator is still unbiased and/or consistent, resp. 2. The OLS estimator is not efficient 3. Inference based on “classical” standard errors and OLS variances is not valid (t, F, interval estimates, …). “Classical” refers to those estimated assuming homoskedasticity. In other words, the least squares estimator is still a linear and unbiased estimator, but it is no longer best. There is another estimator with a smaller variance. The standard errors usually computed for the least squares estimator are incorrect. Confidence intervals and hypothesis tests that use these standard errors may be misleading. 
3. How should we address heteroskedasticity?
• Compute standard errors using robust formulae!
• Consider the simplest possible case, the simple regression model, where under heteroskedasticity
Var(b₂|x) = [Σᵢ (xᵢ − x̄)² σᵢ²] / [Σᵢ (xᵢ − x̄)²]²
This is a practical problem because your computer software has programmed into it the estimated variances and covariances of the least squares estimator under homoskedasticity. The solution is to use robust estimators. Fortunately, calculation of a correct estimate for the OLS variance is astonishingly simple.

Robust variance estimator: White heteroskedasticity-consistent estimator (HCE)
• White's HC (robust) variance estimator replaces the unknown σᵢ² in the formula above with the squared OLS residual êᵢ², so that Var(b₂|x) is estimated by [Σᵢ (xᵢ − x̄)² êᵢ²] / [Σᵢ (xᵢ − x̄)²]²
• Robust means that the estimator works whether or not MR3 is violated; thus, if we are not sure whether the random errors are heteroskedastic or homoskedastic, we can use a robust variance estimator and be confident that our standard errors, t-tests, and interval estimates are valid in large samples.
• This estimator is valid in large samples (i.e. asymptotically)
• It is an example of a sandwich formula
• The idea is simple, but the derivation is not!
• With robust s.e., we can compute robust confidence intervals and test statistics

HC robust standard errors
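A minimal sketch of how robust standard errors can be obtained in R with the sandwich and lmtest packages; the food model, file name and variable names carry over from the example above and are assumptions:

library(sandwich)   # vcovHC(): heteroskedasticity-consistent covariance matrices
library(lmtest)     # coeftest() and coefci()

food <- read.csv("food.csv")               # file name assumed
fit  <- lm(food_exp ~ income, data = food)

summary(fit)                               # classical (homoskedasticity-based) output

V_hc <- vcovHC(fit, type = "HC1")          # White robust covariance matrix
coeftest(fit, vcov. = V_hc)                # robust t-tests
coefci(fit, vcov. = V_hc, level = 0.95)    # robust confidence intervals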
Why LS estimation fails
• If Cov(eᵢ, xᵢ) > 0, the errors are still zero on average, but for large values of x they are mostly positive, while for small values of x they are mostly negative. When this covariance is positive, OLS will overestimate the effect of the explanatory variable (in reality, of course, you only observe the fitted regression line, not the true one)
• b₂ overestimates β₂. Why?
- LS estimation attributes ∆y entirely to ∆x
- But if exogeneity does not hold, a ∆x on average also implies some ∆e:
∆y = β₂∆x + ∆e, with all three changes positive when Cov(eᵢ, xᵢ) > 0
- The effect of ∆x is therefore overestimated
• The opposite argument holds if Cov(eᵢ, xᵢ) < 0
• OLS estimators are not consistent
• The bias remains even asymptotically
• You can skip [HGL 10.1.3]
• Are there any relevant models in which Cov(xᵢ, eᵢ) ≠ 0? We are now going to see three cases which automatically lead to endogeneity: omitted variables; measurement error; simultaneity/reverse causality

Omitted variables
• One case in which we know for sure that there will be endogeneity (and that OLS will not be consistent) is when we have omitted variables that are correlated with included explanatory variables. In other words, sometimes Cov(eᵢ, xᵢ) ≠ 0 because eᵢ contains omitted variables correlated with xᵢ.
• As an example, consider estimating a wage equation for married women:
log WAGEᵢ = β₁ + β₂EDUCᵢ + β₃EXPERᵢ + β₄EXPER²ᵢ + eᵢ
• This model omits:
 1. Job market conditions 
2. Geographic area
3. Industry
4. Union membership status
5. Ability
Cov(e, x) ≠ 0, i.e. x may be endogenous, because
- True model: y = δ₁ + δ₂x + δ₃z + u
- We estimate: y = β₁ + β₂x + e (δ₃z + u is absorbed into the error term e)
→ e = u + δ₃z
→ Cov(e, x) = Cov(u, x) + δ₃Cov(x, z) = δ₃Cov(x, z) [Cov(u, x) = 0, since x is exogenous in the true model]
If the omitted variable z is uncorrelated with x, this covariance is zero and OLS remains consistent. However, if the omitted variable is correlated with the included variable, then x is endogenous.
• Ability is correlated both with salary and with education: Cov(ABILITYᵢ, EDUCᵢ) > 0 → Cov(eᵢ, EDUCᵢ) > 0
• b₂ overestimates β₂, and the bias will not vanish asymptotically
• Using mroz.csv, estimate with OLS
log WAGEᵢ = β₁ + β₂EDUCᵢ + β₃EXPERᵢ + β₄EXPER²ᵢ + eᵢ
considering N = 428 married women in the labor force
• The estimated yearly return to education is 10.75%
• This is likely too large an effect
• Economic policy decision making needs accurate estimates of this parameter!
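A hedged R sketch of this OLS regression, assuming mroz.csv contains columns wage, educ and exper, with wage equal to 0 for women not in the labor force (file and variable names are assumptions):

mroz <- read.csv("mroz.csv")
mroz_lf <- subset(mroz, wage > 0)          # keep the N = 428 women in the labor force

wage_ols <- lm(log(wage) ~ educ + exper + I(exper^2), data = mroz_lf)
summary(wage_ols)
coef(wage_ols)["educ"]                     # roughly 0.107, i.e. about a 10.75% return per year of education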
Measurement error
Another case in which we know for sure that there will be endogeneity (and that OLS will not be consistent) is when there are measurement errors.
• When xᵢ is measured with error, we have a measurement error problem
• In this case, xᵢ and eᵢ are correlated
• As an example, consider the "permanent income hypothesis": savings yᵢ depend on "permanent income" xᵢ*:
yᵢ = β₁ + β₂xᵢ* + eᵢ
• As xᵢ* is unobservable, we decide to use current income xᵢ instead:
xᵢ = xᵢ* + uᵢ,  E(uᵢ) = 0,  Var(uᵢ) = σ²ᵤ,  Cov(uᵢ, eᵢ) = 0
• Replacing the unobservable xᵢ* in the model we get:
yᵢ = β₁ + β₂(xᵢ − uᵢ) + eᵢ
   = β₁ + β₂xᵢ + (eᵢ − β₂uᵢ)
   = β₁ + β₂xᵢ + vᵢ
• In this model (which we can estimate):
Cov(xᵢ, vᵢ) = E(xᵢvᵢ) = E[(xᵢ* + uᵢ)(eᵢ − β₂uᵢ)] = E(−β₂uᵢ²) = −β₂σ²ᵤ ≠ 0
• With measurement error in the regressors, OLS estimators are not consistent (there is an attenuation bias)
• You can check that OLS consistency still holds with measurement error in the dependent variable yᵢ
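The attenuation bias is easy to see in a small simulation; this is only a sketch with arbitrary parameter values, not an example from the textbook:

set.seed(123)
N     <- 10000
xstar <- rnorm(N, mean = 10, sd = 2)       # true (unobserved) regressor
e     <- rnorm(N)
y     <- 1 + 0.5 * xstar + e               # true slope: beta2 = 0.5

u <- rnorm(N, sd = 2)                      # measurement error
x <- xstar + u                             # observed, error-ridden regressor

coef(lm(y ~ xstar))["xstar"]               # close to 0.5
coef(lm(y ~ x))["x"]                       # biased toward zero (here around 0.25): attenuation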
Simultaneity/Reverse causality
The third case in which we know for sure that there will be endogeneity (and that OLS will not be consistent) is when there is simultaneity or reverse causality.
• Example: in a market, prices and quantities are determined simultaneously by an equilibrium condition
Qₜᵈ = β₁ + β₂Pₜ + eₜᵈ (Demand)
Qₜˢ = γ₁ + γ₂Pₜ + eₜˢ (Supply)
• In equilibrium, Qₜᵈ = Qₜˢ, which gives
Pₜ = (γ₁ − β₁ + eₜˢ − eₜᵈ) / (β₂ − γ₂)
implying Cov(Pₜ, eₜᵈ) ≠ 0 and Cov(Pₜ, eₜˢ) ≠ 0
• Pₜ and Qₜ are both endogenous: each one causes the other
• OLS estimation of demand or supply equations is inconsistent

To recap, there are three cases in which we know for sure that there will be endogeneity (and that OLS will not be consistent):
- omitted variables that are correlated with included explanatory variables;
- measurement error in the regressors;
- simultaneity/reverse causality.
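To close the endogeneity discussion, a small simulation sketch of the simultaneity case above (arbitrary parameter values, not from the textbook): prices are generated from the equilibrium condition, and OLS applied to the demand equation does not recover β₂.

set.seed(456)
N  <- 10000
ed <- rnorm(N)                             # demand shocks
es <- rnorm(N)                             # supply shocks
b1 <- 10; b2 <- -1                         # demand: Q = b1 + b2*P + ed
g1 <- 2;  g2 <- 1                          # supply: Q = g1 + g2*P + es

P <- (g1 - b1 + es - ed) / (b2 - g2)       # equilibrium price
Q <- b1 + b2 * P + ed                      # equilibrium quantity

coef(lm(Q ~ P))["P"]                       # far from b2 = -1 (here around 0): OLS is inconsistent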