Inferential Statistics (INTECO) – Lecture Notes on Inferential Statistics

Lecture notes for Inferential Statistics (INTECO), academic year 2021/2022.


INFERENTIAL STATISTICS – WEEK 1, LESSONS 1-2-3

Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description and their analysis, which often leads to drawing conclusions.
1. The part of statistics concerned with the description and summarization of data is called DESCRIPTIVE STATISTICS.
2. The part of statistics concerned with drawing conclusions from data is called INFERENTIAL STATISTICS.

EXAMPLE
We toss a coin 10 times and we get the sequence T T T H T H T T T T.
- Descriptive approach: we can summarise the outcome of the experiment by computing the proportion of tails we obtained, that is 80%.
- Inferential approach: 80% of tails — can we say that the coin is not fair, or is the coin fair and the 80% of tails can be explained by chance?
The possibility of chance must be considered in an inferential approach: we cannot just use the summary produced by the descriptive approach. It is usually necessary to make some assumptions about the chances of obtaining the different data values; the totality of these assumptions is referred to as a probability model for the data.

PROBABILITY THEORY FOR STATISTICAL INFERENCE
Statistical inference starts with the assumption that important aspects of the phenomenon under study can be described in terms of probabilities, and then it draws conclusions by using data to make inference about these probabilities. Probability theory therefore provides a key tool for statistical inference.

POPULATIONS AND SAMPLES
A typical inferential problem consists in drawing conclusions on a population by analysing only a subgroup of the total.
- POPULATION: the total collection of all the elements that we are interested in.
- SAMPLE: a subgroup of the population.
In order for the sample to be informative about the total population, it must be representative of that population. A sample of k members of a population is said to be a random sample if the members are chosen in such a way that all possible choices of the k members are equally likely. The statistical units composing a sample can be described by random variables.

RANDOM VARIABLES
There are 2 types of random variables: discrete and continuous.

1. DISCRETE RANDOM VARIABLES
DEFINITION: a random variable X is discrete if ∑ P(X = x) = 1, where the sum runs over the values in the range of X. The distribution of a discrete random variable X is given by the list of probabilities P(X = x) for all the values x in the range of X.

EXAMPLE: DISCRETE RANDOM VARIABLE
Random experiment: roll of 2 dice. S = sample space = {(1,1), (1,2), (1,3), …}. X = number of sixes out of the two dice. Range of X = {0, 1, 2}: X can take these values. X is the random variable, x a number, i.e. a realization of X.
Distribution of X:
x | P(X = x)
0 | 25/36
1 | 10/36
2 | 1/36

2. CONTINUOUS RANDOM VARIABLES
DEFINITION: a random variable X is continuous if P(X = x) = 0 for all x ∈ ℝ. Equivalently, X is continuous if there is a density function f such that
P(a ≤ X ≤ b) = ∫ from a to b of f(x) dx, whenever a ≤ b.

EXAMPLE: ASYMPTOTIC APPROXIMATION AND MONTE CARLO SIMULATION
Consider i.i.d. random variables X1, X2, … each taking the three values 1, 2, 3, and define Yn = (X1·X2·…·Xn)^(1/n).
For n = 2, the range of Y2 contains the values {1, √2, √3, 2, 3}, and we can compute the exact distribution of Y2.
But what is the distribution of Yn if the sample size n is larger, for example n = 20? The sample space now has 3^20 elements, so the explicit approach is not an option. We can use the asymptotic approximation: it is convenient to work with
log(Yn) = log((X1·X2·…·Xn)^(1/n)) = (1/n)·∑ log(Xi).
Thanks to the central limit theorem, log(Y20) ≈ N(0.4479, 0.1052²).

Approximation via Monte Carlo simulation. Replicate nr. 1:
1. simulate realizations of X1, …, X20 = x1, …, x20;
2. compute the value taken by Y20;
3. compute log(Y20).
We then repeat the procedure N times; in this way we have N values, one per replicate, with which we approximate the distribution of Y20.
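A minimal Monte Carlo sketch of the procedure above. The probabilities assigned to the values 1, 2, 3 are an assumption made only so that the example is runnable (they are chosen to be consistent with the stated approximation log(Y20) ≈ N(0.4479, 0.1052²)); the notes do not spell out the distribution of the Xi.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed distribution of each Xi (not given explicitly in the notes):
# values 1, 2, 3 with probabilities 1/2, 1/4, 1/4, which matches
# E[log X] ≈ 0.4479 and sd(log Y20) ≈ 0.1052 quoted above.
values = np.array([1, 2, 3])
probs = np.array([0.5, 0.25, 0.25])

n = 20          # sample size inside each replicate
N = 100_000     # number of Monte Carlo replicates

# Each row is one replicate: simulate X1,...,X20, then take the log geometric mean.
x = rng.choice(values, size=(N, n), p=probs)
log_y20 = np.log(x).mean(axis=1)          # log(Y20) = (1/n) * sum(log Xi)

print("mean of log(Y20):", round(log_y20.mean(), 4))   # ≈ 0.4479
print("sd of log(Y20):  ", round(log_y20.std(), 4))    # ≈ 0.1052
```

The empirical mean and standard deviation of the simulated log(Y20) values can then be compared with the CLT approximation N(0.4479, 0.1052²).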
EXAMPLE NR. 2
Two fair dice are rolled. X1 = outcome of die 1, X2 = outcome of die 2. We define the random variable W = X1·X2.
S = {(1,1), (1,2), (1,3), …}; range of W = {1, 2, 3, 4, 5, 6, 8, 9, …}.
We could compute the exact distribution, proceeding as before. To compute it approximately with the Monte Carlo method, 1st replicate:
- simulate a realization of X1, X2;
- compute the corresponding realization of W.
Repeat until the N-th replicate. Then, for example,
P(W = 1) ≈ #{replicates in which W = 1} / N,
and we repeat this for all the values in the range of W.

CONVERGENCE IN PROBABILITY
DEFINITION: let X1, X2, … be an infinite sequence of random variables, also denoted by {Xn}, and let Y be another random variable. We say that the sequence {Xn} converges in probability to Y if, for any ε > 0,
lim (n→∞) P(|Xn − Y| > ε) = 0.

EXAMPLE
Let Y = 0 and define the sequence of random variables {Xn} as Xn ~ Exponential(n). Show that {Xn} converges in probability to Y.
Firstly, we recall the exponential distribution: we say X ~ Exp(λ), λ > 0, if its PDF is given by
f(x) = λ·e^(−λx) if x ≥ 0, and 0 if x < 0,
i.e. X is a random variable taking values in [0, +∞) with E[X] = 1/λ.
For any ε > 0 we have P(|Xn − 0| > ε) = P(Xn > ε) = e^(−nε), which tends to 0 as n → ∞, so {Xn} converges in probability to 0.

RECALL (continuous random variables): if X is a continuous RV, by definition it has a PDF f associated to it, and its CDF is
F(x0) = P(X ≤ x0) = ∫ from −∞ to x0 of f(x) dx.

CONVERGENCE IN DISTRIBUTION
DEFINITION: let X1, X2, … be an infinite sequence of random variables, also denoted by {Xn}, and let Y be another random variable. We say that the sequence {Xn} converges in distribution to Y if
lim (n→∞) P(Xn ≤ x) = P(Y ≤ x) for any x such that P(Y = x) = 0.

OBSERVATIONS
- Convergence in probability studies the difference between random variables; convergence in distribution studies the difference between distributions (or CDFs).
- Why is this useful? Sometimes studying the distribution of Xn is tricky, while studying that of Y is easier. If we know that {Xn} converges in distribution to Y, then we can use the distribution of Y to approximate that of Xn (if n is large enough).
- How are the two notions of convergence related to each other? Convergence in probability is stronger than convergence in distribution: if Xn →P Y then Xn →D Y, but Xn →D Y does not always imply Xn →P Y.

EXAMPLE
Show that {Xn} converges in distribution to Y, where
Xn = 0 with probability 1/n, 1 with probability 1 − 1/n,
Y = 0 with probability 0, 1 with probability 1.
We want to show that lim (n→∞) P(Xn ≤ x) = P(Y ≤ x) for any x such that P(Y = x) = 0 (that is, in this case, for every x different from 1).

CDF of Xn:
- if x ∈ (−∞, 0): P(Xn ≤ x) = 0
- if x ∈ [0, 1): P(Xn ≤ x) = P(Xn = 0) = 1/n
- if x ∈ [1, +∞): P(Xn ≤ x) = P(Xn = 0) + P(Xn = 1) = 1
CDF of Y:
- if x ∈ (−∞, 1): P(Y ≤ x) = 0
- if x ∈ [1, +∞): P(Y ≤ x) = P(Y = 1) = 1

Is the limit of P(Xn ≤ x) equal to P(Y ≤ x)? We check that this is the case for every x different from 1:
- if x ∈ (−∞, 0): lim P(Xn ≤ x) = 0 = P(Y ≤ x)
- if x ∈ [0, 1): lim P(Xn ≤ x) = lim 1/n = 0 = P(Y ≤ x)
- if x ∈ [1, +∞): lim P(Xn ≤ x) = 1 = P(Y ≤ x)
So {Xn} →D Y.
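Both convergence examples above can be checked numerically. A minimal sketch (no assumptions beyond the examples themselves): for the exponential case it evaluates P(|Xn − 0| > ε) = e^(−nε), and for the two-point case it compares P(Xn ≤ x) with P(Y ≤ x) at a continuity point of the CDF of Y as n grows.

```python
import numpy as np

eps = 0.1
for n in [1, 10, 100, 1000]:
    # Convergence in probability: Xn ~ Exponential(n), Y = 0.
    # P(|Xn - 0| > eps) = exp(-n * eps), which vanishes as n grows.
    tail = np.exp(-n * eps)

    # Convergence in distribution: Xn = 0 w.p. 1/n, 1 w.p. 1 - 1/n; Y = 1 with prob. 1.
    # Compare the CDFs at x = 0.5 (a point where P(Y = x) = 0).
    cdf_xn = 1.0 / n          # P(Xn <= 0.5) = P(Xn = 0) = 1/n
    cdf_y = 0.0               # P(Y <= 0.5) = 0

    print(f"n={n:5d}  P(|Xn|>{eps})={tail:.4f}  P(Xn<=0.5)={cdf_xn:.4f}  P(Y<=0.5)={cdf_y}")
```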
INFORMAL STATEMENT OF THE CENTRAL LIMIT THEOREM
The sum of n i.i.d. random variables is approximately normal when n is large; equivalently, the mean of n i.i.d. random variables is approximately normal when n is large.

RECALL: NORMAL DISTRIBUTION
A normal random variable takes values on the real line; it is a continuous RV with PDF given by
f(x) = 1/(√(2π)·σ) · e^(−(x−µ)²/(2σ²)).
- For µ = 0 and σ² = 1 we have the special case of the standard normal, Z ~ N(0,1).
- It is a symmetric distribution: P(X < µ) = P(X > µ), so the median of the distribution is the mean of the distribution.

Now we investigate via simulation the statement: "the sum of n i.i.d. random variables is approximately normal when n is large, regardless of the population distribution" (informal statement of the central limit theorem).
- We assume that X1, X2, … are i.i.d. uniform random variables and we check how the distribution of the sum of the first n random variables of the sequence looks.

RECALL: UNIFORM DISTRIBUTION
If X is a uniform RV on [0,1], then X takes values in [0,1]; it is a continuous RV with PDF given by
f(x) = 1 if x ∈ [0,1], 0 otherwise,
with E[X] = 1/2 and Var(X) = 1/12.

Simulation plan:
- generate 10000 values from a uniform distribution on (0,1);
- look at the sum of 2, 3, 5, 100, 1000 uniforms (based on 10000 replicates).

More formally: let X1, X2, … be i.i.d. random variables from a distribution with finite mean µ and finite variance σ². The focus is on Sn = X1 + X2 + … + Xn and on its standardisation Zn.

RECALL: STANDARDIZATION OF A RANDOM VARIABLE
If X is any RV with mean µ and variance σ², then
Z = (X − E[X]) / sd(X) = (X − µ)/σ.
The distribution of Z has the same "shape" as that of X, but E[Z] = 0 and Var(Z) = 1.

Observe that:
E[Sn] = E[X1 + … + Xn] = n·µ
Var(Sn) = Var(X1 + … + Xn) = n·σ²
sd(Sn) = √Var(Sn) = σ·√n
Then Zn is defined as the standardized version of Sn:
Zn = (Sn − E[Sn]) / sd(Sn) = (Sn − n·µ)/(σ·√n).
Remark: Zn coincides with the standardized version of the sample mean Mn, since Sn = n·Mn. Indeed
Zn = (n·Mn − n·µ)/(σ·√n) = (Mn − µ)/(σ/√n).
Central limit theorem: Zn converges in distribution to N(0,1), so for large n we have Zn ≈ N(0,1).
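A small simulation along the lines described above, using 10000 replicates of Uniform(0,1) summands as in the notes (the random seed is an arbitrary choice); it compares the standardized sum Zn with the standard normal for a few values of n.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 10_000                 # number of replicates
mu, sigma2 = 0.5, 1 / 12   # mean and variance of Uniform(0,1)

for n in [2, 3, 5, 100, 1000]:
    s_n = rng.uniform(0, 1, size=(N, n)).sum(axis=1)    # N replicates of Sn
    z_n = (s_n - n * mu) / np.sqrt(n * sigma2)          # standardized sums Zn
    # If the CLT approximation is good, Zn should look like N(0,1):
    # mean ≈ 0, sd ≈ 1, and e.g. P(Zn <= 1.96) ≈ 0.975.
    print(f"n={n:5d}  mean={z_n.mean():+.3f}  sd={z_n.std():.3f}  "
          f"P(Zn<=1.96)={(z_n <= 1.96).mean():.3f}")
```

A histogram of the simulated Zn values against the N(0,1) density makes the same point visually.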
APPLICATION OF THE CLT: NORMAL APPROXIMATION TO ASSESS THE ERROR
Let X1, X2, …, Xn be a random sample from a population whose distribution has unknown mean µ. Say we want to estimate the unknown population mean µ, based on the observation of X1, X2, …, Xn. A very natural choice is to use the value taken by the sample mean
Mn = (1/n)·(X1 + … + Xn).
We call the random variable Mn the estimator of µ, and the value mn taken by Mn the estimate of µ. Mn, as an estimator of µ, has good properties:
- E[Mn] = µ → convenient: the estimator Mn "on average" takes the value µ (Mn is an unbiased estimator of µ);
- Var[Mn] = σ²/n → convenient: as n becomes larger, the variance becomes smaller.

Is the estimate we obtain accurate? We need to assess the error somehow. Thanks to the CLT we know that Mn ≈ N(µ, σ²/n), so (area under the normal curve)
P(µ − 1.96·σ/√n ≤ Mn ≤ µ + 1.96·σ/√n) ≈ 0.95,
which is, equivalently,
P(Mn − 1.96·σ/√n ≤ µ ≤ Mn + 1.96·σ/√n) ≈ 0.95.
The interval [Mn − 1.96·σ/√n, Mn + 1.96·σ/√n] is random and has probability 0.95 of containing the true value of µ:
Mn ± 1.96·σ/√n is the "95% confidence interval for µ".

PROBLEM: σ, the standard deviation of the population distribution (of X1, …, Xn), is likely to be unknown besides µ. If we can estimate σ² with an estimator, say Sn², then we have a way to assess the error in estimating µ:
Sn² = (1/(n−1))·∑(xi − Mn)².
It has good properties as an estimator of σ²:
- E[Sn²] = σ² → the estimator "on average" takes the value σ²;
- thanks to the WLLN, we can show that Sn² converges in probability to σ².
REMARK: if we estimate σ² with Sn², the (already approximate) distribution of Mn is no longer normal but a Student-t. Nonetheless, when n is large, the Student-t is well approximated by a normal distribution. Sn is called the sample standard deviation.
The 95% confidence interval for µ becomes Mn ± 1.96·Sn/√n.
- The value 1.96 comes from the normal distribution.
- When σ is estimated by Sn, we should in principle look for the equivalent number from the Student-t distribution.
- If n is large, the Student-t number is anyway ≈ 1.96.

LARGE-n CONFIDENCE INTERVALS WITH LEVEL OF CONFIDENCE 1−α
Confidence interval: Mn ± z(α/2)·Sn/√n.

Level of confidence | α | α/2 | z(α/2)
90% | 10% | 5% | 1.645
95% | 5% | 2.5% | 1.96
99% | 1% | 0.5% | 2.576

APPLICATION OF THE CLT: NORMAL APPROXIMATION FOR THE BINOMIAL DISTRIBUTION
Recall: Bernoulli and Binomial distributions.

BERNOULLI
X ~ Bernoulli(p), p ∈ [0,1], if X can take only 2 values, say 0 and 1 (failure and success), with distribution given by P(X = 1) = p and P(X = 0) = 1 − p.
E[X] = 1·P(X = 1) + 0·P(X = 0) = p, Var[X] = p(1 − p).

BINOMIAL DISTRIBUTION
Say X1, X2, …, Xn are independent Bernoulli experiments, and in each experiment p is the probability of success. Then Y = X1 + X2 + … + Xn = number of successes out of n experiments. The distribution of Y is binomial with parameters n and p: Y ~ Binomial(n, p).
What is the distribution of Binomial(n, p)? Consider P(Y = k) for a fixed k. An outcome such as 11…1 00…0 (k successes followed by n − k failures) has probability p·p·…·p·(1−p)·(1−p)·… = p^k·(1−p)^(n−k); each specific sequence with k successes and n − k failures has probability p^k·(1−p)^(n−k), because the order does not matter. So
P(Y = k) = (number of sequences with exactly k successes)·p^k·(1−p)^(n−k),
where the number of ways we can choose k elements out of n is the binomial coefficient
C(n,k) = n! / (k!·(n−k)!).
Hence P(Y = k) = C(n,k)·p^k·(1−p)^(n−k), the distribution of a Binomial(n, p), with E[Y] = np and Var[Y] = np(1−p).

APPROXIMATION OF THE BINOMIAL WITH THE NORMAL
Y ~ Binomial(n, p) with Y = X1 + X2 + … + Xn = Sn, Xi i.i.d. Bernoulli(p). We can apply the CLT: Y = Sn ≈ N(np, np(1−p)).

EXAMPLE
Y = X1 + X2 + … + Xn = number of non-smokers in a sample of n = 100. We know Y ~ Binomial(100, π), and Y/n = Mn = sample mean. By the CLT, Mn ≈ N(π, π(1−π)/n).
Confidence interval with level of confidence 1 − α: Mn ± z(α/2)·√(Mn(1−Mn)/n).
With α = 0.01 and Mn = 0.82: 0.82 ± 2.576·√(0.82·0.18/100) → [0.721, 0.919].
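A quick sketch reproducing the non-smoker example above (observed proportion 0.82 with n = 100, 99% confidence); the same small helper covers the generic interval Mn ± z(α/2)·√(Mn(1−Mn)/n).

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(m_n: float, n: int, alpha: float) -> tuple[float, float]:
    """Large-n confidence interval for a proportion: Mn ± z_{alpha/2} * sqrt(Mn(1-Mn)/n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)        # z_{alpha/2}: 2.576 for alpha = 0.01
    half_width = z * sqrt(m_n * (1 - m_n) / n)
    return m_n - half_width, m_n + half_width

# Non-smokers example from the notes: Mn = 0.82, n = 100, level of confidence 99%.
low, high = proportion_ci(0.82, 100, alpha=0.01)
print(f"99% CI: [{low:.3f}, {high:.3f}]")          # ≈ [0.721, 0.919]
```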
MONTE CARLO APPROXIMATION
The law of large numbers tells us that the sample mean Mn converges in probability to the population mean µ. Say µ is unknown: the LLN suggests that, conditionally on the observation of a sample X1, X2, …, Xn, we can use the average of the observations to estimate µ.
Monte Carlo approach: we can "create" observations by using a computer. Say we want to study some property of a distribution (e.g. the mean µ of the population distribution): if we know how to simulate realizations from that distribution, we can generate a sample (as large as we want) and use it to estimate the property of the distribution.
REMARK: numbers generated with a computer are the result of a sequence of deterministic operations and therefore they are not random but pseudo-random (in this course we skip this distinction).

EXAMPLE
Let Z ~ N(0,1) and Y = Z² + 1. Compute P(2 < Y < 3).
Assume we know how to simulate realizations from a standard normal via computer, and say we generate n realizations z1, …, zn of Z (with n large). For each zi we can compute the corresponding value taken by Y, yi = zi² + 1. Let us define the random variables W1, W2, …, Wn as
Wi = 1 if Yi ∈ (2,3), 0 otherwise.
The random variables W1, W2, …, Wn are i.i.d. Bernoulli(p), where p is the probability of success at each trial, p = P(Yi ∈ (2,3)) = P(2 < Y < 3). Moreover E[Wi] = p = P(2 < Y < 3), so by the WLLN the average of the Wi converges in probability to p.
So: we generate z1, …, zn, we compute y1, …, yn, next we compute w1, …, wn and take their average (1/n)·(w1 + … + wn) as an approximate value for P(2 < Y < 3).

STATISTICAL INFERENCE: FIRST FRAMEWORK
Let's start by assuming we know the probability distribution which describes a random experiment. To be more precise, we assume we know the probability measure P which describes the random experiment: a probability measure is a function which associates to each event its probability, that is, to each subset of the sample space S its probability. Using the notion of probability allows us to deal with the uncertainty about the outcome s ∈ S of the random experiment. Based on the knowledge of P we might want to make inference on the outcome s.
Examples of inference we can make on the outcome s (when we know P):
1. predict or estimate s, that is, provide a likely value for s;
2. construct a subset C of S such that there is a specified probability, say for example 95%, that the outcome s of the experiment falls in C;
3. assess whether a value s0 is a plausible outcome or not.

EXAMPLE 5.2.1
X = "life length of a machine". It is known that X ~ Exp(1), with PDF
f(x) = e^(−x) if x ≥ 0, 0 otherwise.
1. Predict the life duration of a new machine: if X ~ Exp(λ), then E[X] = 1/λ; in our case 1/1 = 1.
2. Find a value c such that, with probability 95%, the life length of the new machine will fall in the interval C = (0, c). We look for c such that P(X ∈ (0,c)) = 0.95, that is P(X ≤ c) = 0.95.
RECALL: if X ~ Exp(λ), then P(X ≤ x) = 1 − e^(−λx) (CDF). In our case P(X ≤ x) = 1 − e^(−x) for x ≥ 0, so P(X ≤ c) = 0.95 gives 1 − e^(−c) = 0.95 and, taking the log, c = −log(0.05) ≈ 2.99.
3. Is s0 = 5 a plausible value? Given the answer at point 2, it is not a plausible value. More formally:
P(X ≥ 5) = 1 − P(X ≤ 5) = 1 − (1 − e^(−5)) = e^(−5) ≈ 0.0067.
This is the probability of observing the value s0 or a more extreme value.
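A short numerical check of Example 5.2.1; nothing here goes beyond the Exp(1) model stated above.

```python
from math import exp, log

lam = 1.0                                   # X ~ Exp(1)

# 1) predicted life length: E[X] = 1 / lambda
print("1) E[X] =", 1 / lam)                 # 1.0

# 2) c such that P(X <= c) = 0.95, i.e. 1 - exp(-c) = 0.95
c = -log(0.05) / lam
print(f"2) c = {c:.3f}")                    # ≈ 2.996, the 2.99 quoted above

# 3) plausibility of s0 = 5: P(X >= 5) = exp(-5)
print(f"3) P(X >= 5) = {exp(-5 * lam):.4f}")  # ≈ 0.0067
```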
SECOND FRAMEWORK
In statistics we often face the opposite situation: we do not know P but we observe the outcome s, that is, the data. Based on the data s, we want to learn about P, that is, make inference on P. With respect to the case where P is known, here we have an increased level of uncertainty:
1. P deals with the uncertainty associated with the outcome s;
2. we also have uncertainty about P itself.
Making inference on Pθ consists of two steps:
1. choosing a value for the true parameter, based on the data;
2. providing a measure of uncertainty associated with such a choice.

EXAMPLE: BERNOULLI MODEL
Say we observe the data s = (x1, …, xn), where each xi has only 2 possible outcomes, 0 or 1 (failure or success).
ASSUMPTIONS: the Xi are independent and have the same distribution, that of a Bernoulli(θ) with θ ∈ [0,1]: Xi ~ Ber(θ).
MODEL (translation of the assumptions for one observation): M = {fθ : θ ∈ [0,1]}, where fθ is the distribution of a Bernoulli:
fθ(x) = P(X = x) = θ if x = 1, 1 − θ if x = 0; that is, fθ(x) = θ^x·(1−θ)^(1−x).
If Xi ~ Ber(θ), what is the probability of observing s = (x1, …, xn)? Since the observations are i.i.d.,
P(X1 = x1, X2 = x2, …) = P(X1 = x1)·P(X2 = x2)·… = θ^(∑xi)·(1−θ)^(n−∑xi) = θ^(n·x̄)·(1−θ)^(n(1−x̄)).
Statistical model for n observations s = (x1, …, xn):
M = {fθ(x1, …, xn) = θ^(n·x̄)·(1−θ)^(n(1−x̄)) : θ ∈ [0,1]}.

EXAMPLE 2: LOCATION-SCALE NORMAL MODEL
We assume the data are realizations of X1, …, Xn where Xi ~ N(µ, σ²), and we do not know µ and σ². What is the joint distribution of X1, …, Xn under these assumptions?
f(µ,σ²)(x1, …, xn) = ∏ f(µ,σ²)(xi) = ∏ 1/(√(2π)·σ)·exp{−(xi−µ)²/(2σ²)} = (2πσ²)^(−n/2)·exp{−[(n−1)s² + n(x̄−µ)²]/(2σ²)},
where x̄ is the sample mean and s² = (1/(n−1))·∑(xi−x̄)² is the sample variance.
M = {f(µ,σ²)(x1, …, xn) : θ = (µ, σ²) ∈ ℝ×ℝ⁺}.

OBSERVATION
As long as there is a one-to-one correspondence between one parametrisation and the other, the two models are equivalent, that is, they consist of the same collection of probability measures Pθ (or densities fθ). Example: in the normal model we could choose (µ, λ) where λ = 1/σ², and we would get two equivalent models.

LIKELIHOOD INFERENCE
Say we have 2 unfair coins:
- coin 1: P(H) = 0.9
- coin 2: P(H) = 0.1
One of the coins is picked (we do not know which one) and tossed once, that is, we observe one realization s ∈ S. Say s = HEAD: which coin do we believe was picked?
M = {fθ(x) = θ^x·(1−θ)^(1−x) : θ ∈ {0.1, 0.9}}.
- fθ(s | θ = 0.9) = 0.9 → if we had picked coin 1, the probability of observing s = H would be 0.9;
- fθ(s | θ = 0.1) = 0.1 → if we had picked coin 2, the probability of observing s = H would be 0.1.
The data we observed are more likely under θ = 0.9 than under θ = 0.1.

DEFINITION: LIKELIHOOD FUNCTION
Let M = {fθ : θ ∈ Ω} be a statistical model. The likelihood function is a function L(· | s): Ω → ℝ defined as L(θ | s) = fθ(s).
OBSERVATIONS
- L(θ | s) is a function of θ.
- The value taken by L(θ | s) coincides with:
1. the probability that the data take value s given that θ is the true parameter value, in case fθ is a discrete distribution;
2. the value of the joint density of the observations at s, given that θ is the true parameter value, in case fθ is a continuous distribution.
- L(θ | s) is NOT the probability of θ given s.

EXAMPLE 6.1.1
M = {fθ : θ ∈ {1,2}}, where f1 is the uniform distribution over the first 10³ positive integers {1, 2, 3, …, 10³} and f2 is the uniform distribution over the first 10⁶ positive integers {1, 2, 3, …, 10⁶}. Say we have one observation s = 12: under which value of θ is it more likely?
L(1 | s = 12) = f1(12) = 10⁻³ (larger: under θ = 1)
L(2 | s = 12) = f2(12) = 10⁻⁶
A useful instrument for comparing the likelihood of the 2 values of θ is the LIKELIHOOD RATIO:
L(1 | s) / L(2 | s) = 10⁻³ / 10⁻⁶ = 10³ > 1 → θ = 1 is 1000 times more likely.

LIKELIHOOD RATIO AND DEFINITION OF THE LIKELIHOOD FUNCTION UP TO A CONSTANT
Since we are comparing the likelihood of two values of θ, we could define the likelihood up to a multiplicative constant, L*(θ | s) = a·L(θ | s), and we would reach the same conclusion about which value of the parameter has the larger likelihood.
If two samples s = (x1, …, xn) and s' = (x1', …, xn') are such that the sample mean takes the same value for both, the likelihood function is the same for s and s': this observation leads to the definition of sufficient statistic (given below).

Can we provide a confidence region (for example an interval) for θ by using the likelihood approach? YES: for example we could choose the region composed of all the values of θ such that L(θ | s) > c, for a given threshold c:
C(s) = {θ ∈ Ω : L(θ | s) > c}.
NOTICE: we talk about a region and not about an interval because we are not guaranteed to obtain an interval: for example, if the likelihood function has two separated peaks, we get a region given by two intervals. Choosing the threshold c, though, is not obvious.
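A sketch of such a likelihood region for the Bernoulli model above. The data (n = 10 observations with 7 successes) and the threshold c are made up purely for illustration; they are not values from the notes.

```python
import numpy as np

# Hypothetical Bernoulli data: n = 10 trials, 7 successes (illustration only).
n, successes = 10, 7

theta = np.linspace(0.001, 0.999, 999)
likelihood = theta**successes * (1 - theta)**(n - successes)   # L(theta | s)

# Likelihood region C(s) = {theta : L(theta | s) > c}; here c is chosen as a
# fraction of the maximum likelihood, an arbitrary illustrative threshold.
c = 0.15 * likelihood.max()
region = theta[likelihood > c]
print(f"C(s) ≈ [{region.min():.2f}, {region.max():.2f}]")
```

Because the Bernoulli likelihood is unimodal, the region here happens to be a single interval, which is consistent with the remark above that this need not hold in general.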
EXERCISE 6.1.3
Suppose that the lifelengths (in thousands of hours) of light bulbs are distributed Exp(θ), where θ > 0 is unknown.
- If we observe x̄ = 5.2 for a sample of 20 light bulbs, record a representative likelihood function.
- Why is it that we only need to observe the sample average to obtain a representative likelihood?

s = (x1, …, xn), for which we observed only x̄. We assume that these are realizations of X1, …, Xn such that Xi ~ Exp(θ), with θ > 0 unknown.
Joint PDF: fθ(x1, …, xn) = ∏ fθ(xi) = ∏ θ·e^(−θxi) = θⁿ·exp{−θ·∑xi} = θⁿ·exp{−θ·n·x̄}.
M = {fθ : θ ∈ (0, +∞)}.
The likelihood function of M is given by L(θ | s = (x1, …, xn)) = θⁿ·exp{−θ·n·x̄}.
The likelihood function depends on s only through x̄, which means that if we have two distinct samples s and s' such that x̄ = x̄', the likelihood conditionally on s coincides with the one conditionally on s'. That is, x̄ is a sufficient statistic for the model M.

DEFINITION: a function T, defined on the sample space S, is called a sufficient statistic for a model M if, whenever T(s1) = T(s2), then L(· | s1) = c·L(· | s2) for some constant c > 0.

MAXIMUM LIKELIHOOD ESTIMATOR
We have seen that we might be interested in carrying out different types of inference on the parameter θ of a statistical model, for example:
- point estimate (provide our estimate for the true value of θ);
- interval estimate (provide an interval likely to contain the true value of θ);
- assessing the plausibility of a given value θ0.
Next we see how to formally do this by using the likelihood function.

POINT ESTIMATE FOR θ
Given the observation of s = (x1, …, xn), if we want to estimate θ it seems reasonable to choose θ̂(s) such that L(θ̂(s) | s) ≥ L(θ | s) for every θ in Ω. That is, for every s, we choose the value (or, better, a value) which maximises the likelihood function: θ̂(s) is such that the observed data s are more likely under the distribution f(θ̂(s)) than under any other distribution fθ in the model.

DEFINITION: MAXIMUM LIKELIHOOD ESTIMATOR (MLE)
The function θ̂ : S → Ω satisfying L(θ̂(s) | s) ≥ L(θ | s) for any θ in Ω is called the maximum likelihood estimator. The value taken by θ̂ at s, that is θ̂(s), is called the maximum likelihood estimate.
REMARKS
- the ML estimator is a random variable;
- the ML estimate is a number;
- we say "a" MLE in the definition, as the maximum might not be unique.

INVARIANCE TO REPARAMETERIZATION
Models might admit different (equivalent) parametrizations. For example, let M1 = {fθ : θ ∈ Θ1}. Consider a one-to-one function ψ : Θ1 → Θ2 and consider the model M2 = {g_γ : γ ∈ Θ2}, where γ = ψ(θ). If θ̂(s) is an MLE for model M1, then γ̂(s) = ψ(θ̂(s)) is an MLE for M2.

HOW TO COMPUTE THE MLE
If the likelihood function is continuously differentiable (i.e. the derivative exists for every θ ∈ Ω and it is continuous), we can use calculus to find the MLE: we look for θ̂(s) which maximizes L(θ | s) for every s ∈ S.
OBSERVATION: since the logarithm is an increasing function, θ̂(s) maximises L(θ | s) if and only if θ̂(s) maximises log(L(θ | s)). Maximizing the log is more convenient: often fθ is given by a product ∏ fθ(xi), and since log(a·b) = log(a) + log(b), it is easier to compute the derivative of a sum than the derivative of a product.
METHOD: we call score function the derivative of the log-likelihood,
S(θ | s) = d/dθ ℓ(θ | s), where ℓ(θ | s) = log L(θ | s).
We look for critical points of ℓ(θ | s) by solving the score equation S(θ | s) = 0. Each value we find as a solution of the score equation is a candidate to be a point of maximum. This is confirmed if d/dθ S(θ | s) < 0, that is, if the second derivative of ℓ(θ | s) is negative there. If we find two or more points of maximum, we choose the global maximum, that is, the one at which ℓ(θ | s) takes the largest value: the global maximum is an MLE for the model.
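As a concrete use of the score-equation method, a small sketch for Exercise 6.1.3 above (Exp(θ) lifelengths, n = 20, x̄ = 5.2): it evaluates the log-likelihood on a grid and checks that the maximizer agrees with the analytic solution θ̂ = 1/x̄ obtained by setting the score n/θ − n·x̄ to zero.

```python
import numpy as np

n, xbar = 20, 5.2                      # Exercise 6.1.3: sample size and sample mean

theta = np.linspace(0.01, 1.0, 10_000)
loglik = n * np.log(theta) - theta * n * xbar      # ℓ(θ|s) = n·log θ − θ·n·x̄

theta_grid_mle = theta[np.argmax(loglik)]
theta_analytic = 1 / xbar                          # score n/θ − n·x̄ = 0  ⇒  θ̂ = 1/x̄
print(f"grid maximizer ≈ {theta_grid_mle:.4f}, analytic MLE = {theta_analytic:.4f}")
```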
EXAMPLE: LOCATION NORMAL MODEL
Data: s = (x1, x2, x3), where x1 = 3.2, x2 = 3.6, x3 = 4; assume σ² = 1, so X1, …, Xn ~ N(θ, σ²) = N(θ, 1) with θ unknown.
L(θ | s) ∝ exp{−(n/2)·(x̄ − θ)²}
ℓ(θ | s) = log L(θ | s) = −(n/2)·(x̄ − θ)² (up to an additive constant)
S(θ | s) = n·(x̄ − θ) = 0 for θ = x̄, so x̄ is a critical point. It is a maximum because the second derivative (−n) is negative; hence θ̂ = x̄ (here x̄ = 3.6).

EXAMPLE: UNIFORM MODEL ON [0, θ]
Statistical model: M = {fθ : θ > 0}, where
fθ(x) = 1/θ if x ∈ [0, θ], 0 otherwise.
L(θ | s = (x1, …, xn)) = ∏ fθ(xi) = (1/θⁿ)·∏ 1[0,θ](xi),
which is positive if and only if θ ≥ x(n), the largest observation.
Is it differentiable? Is it a "nice" function? NO: we cannot use the standard techniques (look at the figure):
L(θ | s) = 0 if θ < x(n), and L(θ | s) > 0 if θ ≥ x(n).
We need to look for the maximum where L(θ | s) is positive, that is, in [x(n), +∞). In [x(n), +∞) the derivative of L(θ | s) is given by
d/dθ (1/θⁿ) = −n·(1/θ^(n+1)) < 0.
Since the first derivative is negative in [x(n), +∞), we can conclude that L(θ | s) is decreasing there: the maximum is obtained at the left extreme of the interval, so θ̂ = X(n) is the estimator and θ̂(s) = x(n) is the estimate.

What if θ is not a scalar but a vector? Things are trickier, as we need to maximise a multivariate function (techniques from multivariate calculus can be used); we will see an example of this situation in the next lesson, but only a "lucky" one.

ASSESSING THE ACCURACY OF AN ESTIMATOR
The MLE of a parameter θ is a function θ̂ : S → Ω which maximises the likelihood. More generally, any function T : S → Ω can be considered as an estimator of θ. How accurate is θ̂ as an estimator of θ? And, more generally, how accurate is T as an estimator of θ?
OBSERVATION: recall that θ̂, and more generally T, are functions of the observations s = (x1, …, xn). When we think of the observations as random variables X1, …, Xn, then θ̂ and T are random variables too. When we introduce a measure of accuracy of T as an estimator of θ, we need to take into account two things:
1. T is a random variable;
2. θ is not known (we do not know the true parameter value which generated the data).

DEFINITION: MEAN SQUARED ERROR
The mean squared error (MSE) of T as an estimator of θ ∈ ℝ is given by MSEθ(T) = Eθ[(T − θ)²] for every θ in Ω.
Is it a sensible choice as a measure of accuracy? YES, because if T is accurate, then we expect it to take values close to θ:
- (T − θ) will tend to be small;
- (T − θ)² will tend to be small and positive;
so MSEθ(T) will tend to be small, and it is positive by definition.
ISSUE: MSEθ(T) is a quantity which depends on θ. That means it might be that MSEθ(T) is small for some θ (T is accurate there) and large for other θ (T is not accurate there). Is such an instrument useful at all? YES, for example:
- if T1 and T2 are two estimators of θ such that MSEθ(T1) < MSEθ(T2) for all θ ∈ Ω, we can say that T1 is more accurate than T2;
- if MSEθ(T1) < MSEθ(T2) for some θ ∈ Ω and MSEθ(T1) > MSEθ(T2) for other θ ∈ Ω, we cannot conclude that one is more accurate than the other.
Another option is to replace θ with its estimate θ̂, that is, to consider MSE(θ̂)(T) as a measure of accuracy of T. PRO: it returns a single number. CON: if θ̂ is not accurate, we propagate the error.
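To make the MSE comparison concrete, a small simulation sketch comparing two estimators of θ in the Uniform[0, θ] model above: the MLE X(n) and the moment-based estimator 2·X̄ (unbiased, since E[X] = θ/2). The true θ and the sample size are arbitrary illustration choices, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

theta_true = 4.0      # illustrative true parameter (not from the notes)
n = 15                # illustrative sample size
reps = 50_000

samples = rng.uniform(0, theta_true, size=(reps, n))
t1 = samples.max(axis=1)          # MLE: X(n)
t2 = 2 * samples.mean(axis=1)     # moment estimator: 2 * sample mean (unbiased)

for name, t in [("X(n)", t1), ("2*Xbar", t2)]:
    mse = np.mean((t - theta_true) ** 2)
    print(f"{name:7s}  E[T]={t.mean():.3f}  MSE={mse:.4f}")
```

For this configuration the (biased) MLE has the smaller MSE, which illustrates that the comparison involves both bias and variance, as the decomposition MSE = variance + bias² used below makes explicit.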
EXAMPLE: LOCATION NORMAL MODEL (ACCURACY OF THE MLE)
Observations: s = (x1, …, xn). Assumption: the observations are realisations of i.i.d. random variables X1, …, Xn with normal distribution with unknown mean θ and known variance σ0².
Consider the MLE x̄ for the unknown mean θ: what is its MSE?
MSEθ(X̄) = E[(X̄ − θ)²].
We already know that Var(θ̂) = Var(X̄) = σ0²/n and E[θ̂] = E[X̄] = θ, so
MSEθ(X̄) = E[(X̄ − θ)²] = E[(X̄ − E[X̄])²] = Var(X̄) = σ0²/n.
It is a lucky case, because MSEθ(X̄) = σ0²/n does not depend on the unknown parameter θ.
OBSERVATION: the accuracy of X̄ (as an estimator of the mean of a location normal model) depends on:
1. the population variance σ²;
2. the sample size n.

MLE OF σ² IN THE LOCATION-SCALE NORMAL MODEL
Here θ (the mean) is replaced by its estimate θ̂ = x̄, and we want to maximize L(σ² | s).
1. Log-likelihood: ℓ(σ² | s) = −(n/2)·log(σ²) − (n−1)s²/(2σ²) (up to an additive constant).
2. Score function: S(σ² | s) = −(n/2)·(1/σ²) + (n−1)s²/(2σ⁴).
3. Score equation: multiplying both sides by 2σ⁴ gives −n·σ² + (n−1)s² = 0, so σ̂² = ((n−1)/n)·s² is the critical point.
Is the critical point a maximum? We need to check that the second derivative is negative there (it is).
OBSERVATION: σ̂² = ((n−1)/n)·s² = ((n−1)/n)·(1/(n−1))·∑(xi − x̄)² = (1/n)·∑(xi − x̄)².
OBSERVATION: the sample variance S² = (1/(n−1))·∑(xi − x̄)² is an unbiased estimator of σ²: E[S²] = σ². The MLE σ̂² = (1/n)·∑(xi − x̄)² is not unbiased: E[σ̂²] = ((n−1)/n)·σ² ≠ σ².

DEFINITION OF UNBIASED ESTIMATOR
We say that T is an unbiased estimator of θ if Eθ[T] = θ for every θ in Ω.

UNBIASED ESTIMATORS AND STANDARD ERROR
Being unbiased is an appealing property for an estimator T, but it is not the only property we should check:
- the decomposition of the MSE into the sum of the variance and the square of the bias tells us we should also look at the variance of T → if T is unbiased, it is important to look at Var(T) (the smaller, the better);
- a related quantity is the standard error of the estimator: SEθ(T) = √Varθ(T).

EXAMPLE 6.3.2 – BERNOULLI MODEL
s = (x1, …, xn), realisations of X1, …, Xn i.i.d. Bern(θ), with θ unknown.
- Find the ML estimator of θ.
- How accurate is the ML estimator of θ?
M = {fθ : θ ∈ (0,1)}, where fθ is the distribution of a Bern(θ).
L(θ | s = (x1, …, xn)) = θ^(n·x̄)·(1−θ)^(n(1−x̄)).
1. Log-likelihood: ℓ(θ | s) = n·x̄·log(θ) + n·(1−x̄)·log(1−θ).
2. Score function: S(θ | s) = n·x̄·(1/θ) − n·(1−x̄)·(1/(1−θ)).
3. Score equation: n·x̄/θ − n·(1−x̄)/(1−θ) = 0; multiplying by θ(1−θ) gives θ = x̄.
The second-derivative check confirms it is a maximum: θ̂ = x̄ is the MLE for θ in a Bernoulli model.

ASSESSING THE ACCURACY
RECALL: if X ~ Bern(θ), then µ = E[X] = θ and σ² = Var[X] = θ(1−θ).
E[θ̂] = E[X̄] = θ → the MLE is an unbiased estimator of θ.
SE(θ̂) = SE(X̄) = √Var(X̄) = √(σ²/n) = √(θ(1−θ)/n).

EXAMPLE 6.3.3
n = 1000 families are interviewed and asked if they are willing to participate in a new recycling project: 790 families say YES, 210 say NO.
s = (x1, …, xn), where n = 1000 and Xi ~ Bern(θ); θ = unknown probability that a family picked at random is in favour of the new project.
MLE: θ̂ = x̄ = 790/1000 = 0.79.
MSEθ(θ̂) = Var(θ̂) + (bias)² = θ(1−θ)/n → SE(θ̂) = √(θ̂(1−θ̂)/n) ≈ 0.0129.
Is this small or big? In order to evaluate it we can compute the relative error RE = SE(θ̂)/θ̂ ≈ 0.016.

Z-CONFIDENCE INTERVALS (LOCATION NORMAL MODEL)
Let z be such that P(−z ≤ Z ≤ z) = γ for a standard normal Z. Then
γ = P(−z ≤ (X̄ − θ)/(σ/√n) ≤ z) = P(θ − z·σ/√n ≤ X̄ ≤ θ + z·σ/√n) = P(X̄ − z·σ/√n ≤ θ ≤ X̄ + z·σ/√n),
so the RANDOM INTERVAL X̄ ± z·σ/√n contains θ with probability γ.
COMMENTS: in the location normal model we computed the intervals by using the distribution of a standard normal random variable Z, so these intervals are also called z-confidence intervals with level of confidence γ. The same principle (recognising the distribution of the estimator, or approximating the distribution of the estimator) is used to build confidence intervals for other models. For example, the estimator for the unknown mean θ in a location-scale normal model (where the variance σ² is unknown too) has a Student t-distribution; based on this, one can build a t-confidence interval.
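A sketch reproducing Example 6.3.3 above (790 out of 1000 families in favour): point estimate, standard error, relative error, and a z-confidence interval. The 95% level used at the end is just an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

n, yes = 1000, 790
theta_hat = yes / n                                   # MLE: 0.79
se = sqrt(theta_hat * (1 - theta_hat) / n)            # ≈ 0.0129
rel_err = se / theta_hat                              # ≈ 0.016

gamma = 0.95                                          # illustrative confidence level
z = NormalDist().inv_cdf((1 + gamma) / 2)             # 1.96
ci = (theta_hat - z * se, theta_hat + z * se)

print(f"theta_hat={theta_hat:.3f}  SE={se:.4f}  RE={rel_err:.3f}")
print(f"{int(gamma * 100)}% z-interval: ({ci[0]:.3f}, {ci[1]:.3f})")
```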
STUDENT T DISTRIBUTION
- It has heavier tails than the normal, so we get a larger interval for the unknown mean θ: by estimating σ² we add some error, which makes the uncertainty about θ larger.
- When n goes to ∞, the Student-t approaches the standard normal N(0,1).

HYPOTHESIS TESTING
This notion is strictly related to the notion of confidence interval.
- Suppose that we want to test a new medical treatment: a sensible way to proceed is to consider the effect of the treatment null unless there is strong evidence that the treatment is effective.
- In a court case: a person is considered innocent and is convicted only if there is strong evidence that she is guilty.
- An e-mail spam filter considers an e-mail legitimate and sends it to the spam folder only if there is strong evidence that it is junk mail.
From a statistical point of view, we assume there is a theory according to which a specific parameter θ of a statistical model takes a given value θ0, that is θ = θ0. We call this the null hypothesis (borrowing the name from the medical field) and denote it by H0.

Typical question in statistics: given the observation of s = (x1, …, xn), is there evidence against the null hypothesis?
Example: a new treatment is proposed, and we want to assess whether it leads to some change in blood pressure. We consider a group of individuals who are given the treatment and measure their blood pressure: s = (x1, …, xn).
- We assume the null hypothesis θ = θ0 is true, that is, the treatment has no effect (in this case the mean blood pressure θ of the treated individuals is the same as the one, known and equal to θ0, in the population).
- We assess whether, under this assumption, the data we observed are "surprising". If they are surprising enough, we use this as evidence against H0.
In order to decide whether the data are surprising enough, we compute a probability, called the p-value, which can be interpreted as a measure of surprise:
- SMALL p-value ⇒ data are surprising under H0 ⇒ evidence against H0;
- LARGE p-value ⇒ data are NOT surprising under H0 ⇒ no evidence against H0.
NB:
- the p-value is not the probability that the null hypothesis is true;
- the test is asymmetric: it tells us whether there is evidence against H0, but it does not tell us whether H0 is true.

EXAMPLE 6.3.9: LOCATION NORMAL MODEL
We observe s = (x1, …, xn) from N(θ, σ0²), with σ0² known. We consider the null hypothesis H0: θ = θ0. Does s provide evidence against H0? Compute the p-value.
H0: θ = θ0 (NULL HYPOTHESIS), H1: θ ≠ θ0 (ALTERNATIVE HYPOTHESIS).
RECALL: we work under the assumption that H0 is true and we check whether, under this assumption, the data we observe are surprising. If H0 is true, then θ = θ0 and X1, …, Xn are i.i.d. N(θ0, σ0²). We base the test statistic on X̄ (why X̄? because the hypothesis is on the unknown mean θ):
X̄ ~ N(θ0, σ0²/n)  ⇒  Z = (X̄ − θ0)/(σ0/√n) ~ N(0,1).
We call Z the TEST STATISTIC. We observed s = (x1, …, xn), which means we can compute a realization x̄ of X̄ and hence a realization z of Z. Is the value of z surprising? What should we compute to decide whether z is surprising or not?
p-value = P(|Z| ≥ |z|) = probability that, under the assumption that H0 is true, the test statistic takes values as extreme as the observed z.
p-value = P(|Z| ≥ |z|) = 2·P(Z ≥ |z|) = 2·(1 − P(Z ≤ |z|)) = 2·(1 − Φ(|z|)).
This is the p-value for the location normal model.
Comment: as for the z-confidence interval, we used the properties of a standard normal random variable Z. That is why we call this test a z-test.
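A minimal two-sided z-test sketch implementing the p-value formula above. The numbers fed in at the bottom are hypothetical, chosen only to show the call; they are not results from the notes.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(xbar: float, theta0: float, sigma0: float, n: int) -> float:
    """p-value for H0: theta = theta0 vs H1: theta != theta0, with sigma0 known."""
    z = (xbar - theta0) / (sigma0 / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers, only to illustrate the call (not taken from the notes):
print(z_test_two_sided(xbar=26.9, theta0=25.0, sigma0=2.0, n=10))
```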
LESSON 11

EXAMPLE 6.3.10: APPLICATION OF THE Z-TEST
n = 10 observations x1, …, x10 are simulated from N(26, 4).
Usual values for the cut-offs are 0.05 and 0.01. If the cut-off is α, we reject H0 if the p-value is smaller than α, and we say that the test has statistical significance equal to α.
When performing a hypothesis test we can make two types of error:
1. type I error: we reject H0 when H0 is true;
2. type II error: we do not reject H0 when H0 is false.
Rejecting H0 when the p-value is below the cut-off α keeps the probability of a type I error at most α: choosing a small cut-off α is a way to control type I errors, BUT with a trade-off: small cut-offs lead to more type II errors.
NB: there is an equivalence between the p-value for the hypothesis H0: θ = θ0 and a confidence interval for θ. The p-value is smaller than α for H0: θ = θ0 IF AND ONLY IF θ0 does not fall within a (1−α)-confidence interval for θ: testing H0 with a test of level of significance α is equivalent to checking whether θ0 falls inside a confidence interval for θ with level of confidence 1−α. (Graphically: one compares the observed x̄ with the distribution of X̄ when H0: θ = θ0 is true.)

BACK TO THE BERNOULLI MODEL EXAMPLE
We set α = 0.05. The p-value is 0.423 > 0.05 = α, so we do not reject H0. Equivalently, we build a confidence interval of level 1 − α = 0.95 for θ:
x̄ ± z(α/2)·√(x̄(1−x̄)/n) = 0.54 ± 0.097 = (0.44, 0.63).
Since 0.5 belongs to the interval, we do not reject H0.

ONE-SIDED TESTS
So far we have considered null hypotheses of the type H0: θ = θ0, for which the alternative hypothesis is H1: θ ≠ θ0: in this case we talk about a two-sided test. In some situations we might want to consider hypotheses of the type H0: θ ≥ θ0 vs H1: θ < θ0, or H0: θ ≤ θ0 vs H1: θ > θ0: in this case we talk about a one-sided test (right-sided or left-sided, depending on the direction of the alternative hypothesis). The logic we follow is always the same.

EXAMPLE 6.3.14: LOCATION NORMAL MODEL
We observe s = (x1, …, xn) from N(θ, σ0²), with σ0² known, and we consider a one-sided null hypothesis, for example H0: θ ≤ θ0. Does s provide evidence against H0? We compute the same test statistic Z = (X̄ − θ0)/(σ0/√n) ~ N(0,1).

RIGHT-SIDED TEST: H0: θ ≤ θ0, H1: θ > θ0.
p-value = P(Z ≥ z) = P(Z ≥ (x̄ − θ0)/(σ0/√n)) = 1 − Φ(z).
- If p-value < α → we reject H0.
- If p-value > α → we do not reject H0.

LEFT-SIDED TEST: H0: θ ≥ θ0, H1: θ < θ0.
p-value = P(Z ≤ z) = P(Z ≤ (x̄ − θ0)/(σ0/√n)) = Φ(z).
- If p-value < α → we reject H0.
- If p-value > α → we do not reject H0.

EXAMPLE
H0: θ ≥ θ0 = 7.6, H1: θ < θ0 = 7.6, with x̄ = 7.2, n = 16, σ = 1.2 and X1, …, Xn i.i.d. N(θ, σ²).
p-value = P(Z ≤ z) = Φ(z), with z = (7.2 − 7.6)/(1.2/√16) = −1.33, so p-value = Φ(−1.33) ≈ 0.091: if H0 is true, the probability of observing data as extreme or more extreme than the ones we observed is 9.1%.
p-value = 0.091 > 0.05 → we do not reject H0.
p-value = 0.091 > 0.01 → we do not reject H0.
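A sketch of the left-sided z-test in the last example (x̄ = 7.2, θ0 = 7.6, σ = 1.2, n = 16), reproducing the p-value ≈ 0.091.

```python
from math import sqrt
from statistics import NormalDist

xbar, theta0, sigma, n = 7.2, 7.6, 1.2, 16

z = (xbar - theta0) / (sigma / sqrt(n))     # = -1.33
p_value = NormalDist().cdf(z)               # left-sided test: P(Z <= z) = Phi(z)

print(f"z = {z:.2f}, p-value = {p_value:.3f}")   # ≈ 0.091
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: reject H0? {p_value < alpha}")
```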