CSE 6740 Lecture 9: Choosing the Right Loss Function for Estimation Theory - Prof. Alexander Gray

A lecture note from Georgia Tech's CSE 6740 course on estimation theory. It discusses the importance of choosing the right loss function for estimators, focusing on robustness and comparisons between estimators such as the MLE and L2E. The lecture also covers the Huber estimator and influence functions.

CSE 6740 Lecture 9
What Loss Function Should I Use? (Estimation Theory)
Alexander Gray, agray@cc.gatech.edu
Georgia Institute of Technology

Quiz Answers

1. For a Bayesian, parameters are drawn from probability distributions. True.
2. Even a flat prior has some effect on an estimator. True.
3. The effect of the prior on the estimator diminishes as N → ∞. True.

Robustness

In the (approximate) words of [Huber, 1981], any statistical procedure should possess the following desirable features:
- It has reasonably good efficiency under the assumed model.
- It is robust in the sense that small deviations from the model assumptions impair its performance only slightly.
- Somewhat larger deviations from the model do not cause a catastrophe.

MLE vs. L2E

Let's revisit L2 estimation (L2E), which we used for KDE. If f is the true density and f̂_θ is an estimate with parameters θ, the L2 error or L2 distance is

$$L_2(\theta) = \int \left(\hat f_\theta(x) - f(x)\right)^2 dx = \int \hat f_\theta^2(x)\,dx - 2\int \hat f_\theta(x) f(x)\,dx + \int f^2(x)\,dx. \qquad (1)$$

Note that the third term can be ignored for the purpose of comparing different estimators.

MLE vs. L2E

Given a dataset, we wish to find the parameters which minimize the (estimated) L2 risk

$$\widehat{E}\left[L_2(\theta)\right] = \int \hat f_\theta^2(x)\,dx - \frac{2}{N}\sum_{i=1}^N \hat f_\theta(x_i). \qquad (2)$$

The term $\int \hat f_\theta^2(x)\,dx$ can be thought of as a kind of built-in regularization term, which acts to penalize spikes or overly large densities (due to, say, overlapped components in a mixture), and the second term as a goodness-of-fit term.

Robustness

Let X ∼ N(µ, σ²). The value which minimizes squared-error, or L2, loss, $\arg\min_\theta E(X - \theta)^2$, is the mean of X:

$$\frac{d}{d\theta} E(X - \theta)^2 = 0 \iff \theta = EX. \qquad (5)$$

The value which minimizes absolute, or L1, loss, $\arg\min_\theta E|X - \theta|$, is the median:

$$\frac{d}{d\theta} E|X - \theta| = 0 \iff \theta = m, \qquad (6)$$

where m is the median of X, i.e. P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2. (If X is continuous, $\int_{-\infty}^m f(x)\,dx = \int_m^\infty f(x)\,dx = 1/2$.)

Loss Functions

[Figure: plot of the loss functions]

Robustness: Efficiency

Let's consider the performance of the sample mean $\bar X$.
1. Efficiency. We know it has variance $V(\bar X) = \sigma^2/N$, which is the Cramér-Rao lower bound.

M-Estimators

Consider any loss function of the form

$$\sum_{i=1}^N \Phi(x_i - \theta). \qquad (10)$$

An estimator which minimizes such a loss function, an M-estimator, is the solution to

$$\sum_{i=1}^N \varphi(x_i - \theta) = 0, \qquad (11)$$

where $\varphi = \Phi'$. An example is the MLE.

M-Estimators

Recall that the MLE is asymptotically normal around the true parameter. So is any M-estimator θ̂, as well as being consistent, and we can compute its asymptotic variance. Thus we can compute its asymptotic relative efficiency (ARE) with respect to the optimal variance (which the MLE achieves):

$$\mathrm{ARE}(\hat\theta, \theta^*) = \frac{\left(E_{\theta^*}\!\left[\varphi(X - \theta^*)\, l'(\theta^* \mid X)\right]\right)^2}{E_{\theta^*}\!\left[\varphi(X - \theta^*)^2\right]\, E_{\theta^*}\!\left[l'(\theta^* \mid X)^2\right]} \le 1, \qquad (12)$$

where θ* is the true parameter. For example, the ARE of the sample median with respect to the sample mean is 2/π ≈ 0.64 for Gaussian data. This is the price of robustness.

Huber Estimator

Consider an M-estimator called the Huber estimator, which minimizes

$$\sum_{i=1}^N \Phi(x_i - \theta), \qquad (13)$$

where

$$\Phi(t) = \begin{cases} \frac{1}{2} t^2 & \text{if } |t| \le c \\[2pt] c|t| - \frac{1}{2} c^2 & \text{if } |t| > c \end{cases} \qquad (14)$$

and the constant c must be chosen. Φ(t) acts like t² for |t| ≤ c and like |t| for |t| > c, and is continuous and differentiable. Indeed, its behavior is a compromise between the mean and the median.
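To make the Huber estimator concrete, here is a minimal sketch (not from the lecture) that computes the Huber location M-estimate on contaminated Gaussian data and compares it with the sample mean and median; the tuning constant c = 1.345 and the use of scipy.optimize are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_Phi(t, c=1.345):
    """Huber's loss Phi(t): quadratic for |t| <= c, linear beyond c."""
    a = np.abs(t)
    return np.where(a <= c, 0.5 * t**2, c * a - 0.5 * c**2)

def huber_location(x, c=1.345):
    """Location M-estimate: argmin_theta sum_i Phi(x_i - theta)."""
    return minimize_scalar(lambda theta: np.sum(huber_Phi(x - theta, c))).x

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=200)
x[:10] = 50.0                          # gross outliers: a deviation from the model

print("mean  :", np.mean(x))           # dragged toward the outliers
print("median:", np.median(x))         # bounded influence, stays near 5
print("huber :", huber_location(x))    # a compromise between the two
```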
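As a concrete illustration of equation (2), the sketch below (assumptions: a single-Gaussian model family, scipy for optimization, and 10% contamination, none of which are prescribed by the lecture) fits a Gaussian by minimizing the estimated L2 risk and compares the result with the MLE (the sample mean and standard deviation); for a Gaussian density the regularization term has the closed form $\int \hat f_\theta^2\,dx = 1/(2\sigma\sqrt{\pi})$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def l2e_risk(params, x):
    """Estimated L2 risk of a Gaussian fit: int f_theta^2 - (2/N) sum_i f_theta(x_i)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                        # keep sigma positive
    fit_term = norm.pdf(x, mu, sigma).mean()         # (1/N) sum_i f_theta(x_i)
    penalty = 1.0 / (2.0 * sigma * np.sqrt(np.pi))   # int f_theta^2 dx for a Gaussian
    return penalty - 2.0 * fit_term

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 180), rng.normal(8, 1, 20)])  # 10% outliers

res = minimize(l2e_risk, x0=[np.median(x), 0.0], args=(x,), method="Nelder-Mead")
mu_l2e, sigma_l2e = res.x[0], np.exp(res.x[1])
print("MLE :", x.mean(), x.std())    # both pulled toward the contamination
print("L2E :", mu_l2e, sigma_l2e)    # stays close to the main component
```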
Influence Function: Median

(The influence function U_T(x) of an estimator T measures the effect on T of an infinitesimal contamination of the data at the point x.) If T computes the median m, we have

$$U_T(x) = \begin{cases} \dfrac{1}{2 f(m)} & \text{if } x > m \\[4pt] -\dfrac{1}{2 f(m)} & \text{otherwise.} \end{cases} \qquad (19)$$

So unlike the mean, the median has a bounded influence function.

Influence Function: Any M-Estimator

For an M-estimator θ̂ that is the solution to $\sum_i \varphi(x_i - \theta) = 0$, where X ∼ f, the influence function of θ̂ is

$$U_{\hat\theta}(x) = \frac{\varphi(x - \theta^*)}{\int \varphi'(t - \theta^*)\, f(t)\,dt} = \frac{\varphi(x - \theta^*)}{E_{\theta^*}\!\left[\varphi'(X - \theta^*)\right]}. \qquad (20)$$

The expected square of the influence function gives the asymptotic variance of θ̂, i.e.

$$\sqrt{N}\,(\hat\theta - \theta^*) \rightsquigarrow N\!\left(0,\; E_{\theta^*}\!\left[U_{\hat\theta}^2(X)\right]\right). \qquad (21)$$

Comparing Estimators

Given a loss function, which estimator should we use? Decision theory.

Comparing Risk

We'll look at two one-number summaries of the risk function. The maximum risk is

$$R_{\max}(\hat\theta) = \sup_\theta R(\theta, \hat\theta). \qquad (23)$$

The Bayes risk is

$$r(f, \hat\theta) = \int R(\theta, \hat\theta)\, f(\theta)\, d\theta, \qquad (24)$$

where f(θ) is a prior for θ.

Decision Rules

Recall that a decision rule is another name for an estimator, and that a decision rule which minimizes the Bayes risk is called a Bayes rule or Bayes estimator, i.e. θ̂_f is a Bayes rule with respect to the prior f if

$$r(f, \hat\theta_f) = \inf_{\tilde\theta} r(f, \tilde\theta). \qquad (25)$$

An estimator which minimizes the maximum risk is called a minimax rule, i.e. θ̂ is minimax if

$$\sup_\theta R(\theta, \hat\theta) = \inf_{\tilde\theta} \sup_\theta R(\theta, \tilde\theta), \qquad (26)$$

where the infimum is over all estimators θ̃.

Minimax Rules

Finding minimax rules, or showing that something is minimax, is hairy. However, there is at least one easy way: some Bayes rules are minimax. Let θ̂_f be the Bayes rule for some prior f, so that $r(f, \hat\theta_f) = \inf_{\tilde\theta} r(f, \tilde\theta)$. Suppose that

$$R(\theta, \hat\theta_f) \le r(f, \hat\theta_f) \quad \forall\, \theta. \qquad (27)$$

Then θ̂_f is minimax and f is called a least favorable prior. A simple consequence of this is that if a Bayes rule has constant risk R(θ, θ̂_f) = c for some c, it is minimax.

Admissibility

Of course, any estimator which is dominated by another at all values of θ is undesirable. We say an estimator θ̂ is inadmissible if there exists another rule θ̃ such that

$$R(\theta, \tilde\theta) \le R(\theta, \hat\theta) \quad \forall\, \theta \qquad (33)$$

and

$$R(\theta, \tilde\theta) < R(\theta, \hat\theta) \quad \text{for at least one } \theta. \qquad (34)$$

Otherwise, θ̂ is admissible.

Many Normal Means

The many normal means problem is a prototype problem which can be shown to be equivalent to general nonparametric regression or density estimation. For this problem, many of our positive results regarding maximum likelihood no longer hold. Let Y_i ∼ N(θ_i, σ²/N), i = 1, ..., N. Let Y = (Y_1, ..., Y_N) denote the data and θ = (θ_1, ..., θ_N) denote the unknown parameters. Our model class is

$$F = \left\{ f : \int \left(f''(x)\right)^2 dx \le c^2 \right\} \qquad (35)$$

for some c > 0. Note that there are as many parameters as observations.

MLE is Not Optimal Here

The MLE for this problem is θ̂ = Y = (Y_1, ..., Y_N). Under the loss function $L(\hat\theta, \theta) = \sum_{i=1}^N (\hat\theta_i - \theta_i)^2$, the risk of the MLE is R(θ̂, θ) = σ². It can be shown that the minimax risk is approximately σ²c²/(σ² + c²), and that there is an estimator θ̃ which achieves this risk. Since the MLE's risk is constant, this means there exists an estimator with smaller risk than the MLE at every θ, i.e. the MLE is inadmissible. In practice the difference in risk can be significant. So in high-dimensional or nonparametric problems, the MLE is not an optimal estimator. There is also a robustness argument against the MLE in nonparametric settings.
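The risk claims on this last slide can be checked by simulation. The sketch below is a Monte Carlo check that the MLE θ̂ = Y has risk close to σ² under the sum-of-squares loss; the linear shrinkage rule it compares against uses an oracle factor computed from the true θ, so it is only a benchmark showing that lower risk is attainable at this θ, not the lecture's minimax estimator θ̃.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, trials = 200, 1.0, 2000
theta = rng.normal(0, 0.05, N)                      # true means (small signal)
c = theta @ theta / (theta @ theta + sigma**2)      # oracle shrinkage factor (uses true theta)

risk_mle = risk_shrink = 0.0
for _ in range(trials):
    y = theta + rng.normal(0, sigma / np.sqrt(N), N)   # Y_i ~ N(theta_i, sigma^2 / N)
    risk_mle += np.sum((y - theta) ** 2)               # MLE theta_hat = Y
    risk_shrink += np.sum((c * y - theta) ** 2)        # oracle linear shrinkage c * Y

print("MLE risk       ~", risk_mle / trials)       # close to sigma^2 = 1
print("shrinkage risk ~", risk_shrink / trials)    # noticeably smaller at this theta
```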
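Returning to the influence functions discussed above, the following sketch (not from the lecture) approximates U_T(x) with the finite-sample sensitivity curve N·(T(x_1, ..., x_{N-1}, x) − T(x_1, ..., x_{N-1})), showing numerically that the mean's influence grows without bound in x while the median's levels off; the sample and the grid of contamination points are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(size=99)
grid = np.linspace(-10.0, 10.0, 9)      # locations x of a single added observation

def sensitivity_curve(stat, sample, grid):
    """N * (T(x_1,...,x_{N-1}, x) - T(x_1,...,x_{N-1})): a finite-sample proxy for U_T(x)."""
    base = stat(sample)
    n = sample.size + 1
    return np.array([n * (stat(np.append(sample, x)) - base) for x in grid])

print("x grid :", grid)
print("mean   :", sensitivity_curve(np.mean, sample, grid))    # grows linearly in x: unbounded
print("median :", sensitivity_curve(np.median, sample, grid))  # levels off: bounded influence
```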