Parameter Estimation, Hypothesis Testing - Principles of Data Mining | CMSC 828G

Material Type: Notes; Professor: Getoor; Class: ADV TOPC INFO PROC; Subject: Computer Science; University: University of Maryland

CMSC 828G Principles of Data Mining
Lecture #8

• Today's Reading:
  – HMS, chapters 4, 5
• Today's Lecture:
  – Parameter estimation, cont.
  – Hypothesis testing
  – Data mining: the component view
• Upcoming Due Dates:
  – H1 due now

Statistical Inference

• Statistical inference: inferring properties of an unknown distribution from data generated by that distribution.
• Estimate the parameters of the model from the data.
• Use the likelihood function:

  Model → Data: probability
  Data → Model: statistical inference

Parameter Estimation

• Maximum likelihood estimation, cont.
• Bayesian estimation

Likelihood Function

• Let D = {x(1), …, x(n)} be independently sampled from the same distribution p(x|θ): 'independent and identically distributed', iid.
• The likelihood function L(θ | x(1), …, x(n)) captures the probability of the data as a function of θ:

  L(θ, D) = L(θ | x(1), …, x(n)) = p(x(1), …, x(n) | θ) = ∏_{i=1}^n p(x(i) | θ)

Likelihood

• L(X, θ) for a parameter θ and data X is the relative probability of getting the data under the different possible parameters θ.
• We must remember that it is not a probability distribution.
• For a probability distribution, θ is fixed and you consider all possible values of the random variable X.
• For the likelihood function, X = x is fixed and you consider how the probability of getting this x changes as you change θ.

Bayesian Estimation

• Frequentist approach: the parameters of a population are fixed but unknown, and the data is a random sample from that population.
• Bayesian approach: the data is known, and the parameters are the random variables. θ has a distribution of possible values, and the observed data provide evidence for different values.

Bayesian Estimation, cont.

• Before observing the data, we have a distribution over possible values for θ, called the prior distribution p(θ).
• By analyzing the data, we can update our beliefs to take the observed data into account. This yields a distribution over possible values for θ given D, called the posterior distribution p(θ|D).
• We use Bayes' theorem to obtain the posterior:

  p(θ|D) = p(D|θ) p(θ) / p(D)

Bayesian Estimation, cont.

• The posterior gives a distribution over parameter values. If we would like a single value (a point estimate, as in the MLE case), we can use the same principle: choose the value of θ that maximizes the posterior.
• This is called the maximum a posteriori (MAP) method:

  θ̂_MAP = argmax_θ p(θ|D)

• MLE is a special case of MAP with a 'flat' prior.

Example

• Beta prior: p(θ) ∝ θ^(α−1) (1−θ)^(β−1)
• Likelihood for the Binomial model (r successes in n trials): L(θ|D) = θ^r (1−θ)^(n−r)
• Posterior:

  p(θ|D) ∝ p(D|θ) p(θ) = θ^r (1−θ)^(n−r) · θ^(α−1) (1−θ)^(β−1) = θ^(r+α−1) (1−θ)^(n−r+β−1)

• The Beta prior acts like a Binomial sample with α − 1 prior successes and β − 1 prior failures; we can think of the Beta as having an equivalent sample size of α + β − 2.
• The posterior is a Beta distribution with parameters r + α and n − r + β.
• Beta is the CONJUGATE PRIOR for the BINOMIAL.
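To make the Beta-Binomial example concrete, here is a minimal Python sketch (not part of the original notes): the coin-flip data set and the prior hyperparameters α = β = 2 are made up for illustration. It evaluates the likelihood on a grid, computes the MLE r/n, and reads the MAP estimate off the Beta posterior.

```python
import numpy as np

# Illustrative iid Bernoulli(theta) sample: 1 = success, 0 = failure.
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n, r = len(data), int(data.sum())

# Likelihood for the Binomial model: L(theta | D) = theta^r * (1 - theta)^(n - r).
def likelihood(theta):
    return theta**r * (1 - theta)**(n - r)

# MLE: the maximizer of the likelihood; for this model it is r/n.
theta_mle = r / n
grid = np.linspace(0.01, 0.99, 99)
assert abs(grid[np.argmax(likelihood(grid))] - theta_mle) < 0.01

# Beta(alpha, beta) prior; by conjugacy the posterior is Beta(r + alpha, n - r + beta).
alpha, beta = 2.0, 2.0                       # illustrative hyperparameters
post_a, post_b = r + alpha, n - r + beta

# MAP estimate: the mode of the Beta posterior (valid when post_a, post_b > 1).
theta_map = (post_a - 1) / (post_a + post_b - 2)

print(f"MLE = {theta_mle:.3f}, MAP = {theta_map:.3f}")   # MLE = 0.700, MAP = 0.667
```

With a flat prior (α = β = 1) the posterior is proportional to the likelihood, so the MAP estimate coincides with the MLE, as the slide notes.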
Bayesian Approach to Prediction

• A fully Bayesian approach is characterized by maintaining a distribution over models.
• In order to make a prediction about a new data point x(n+1) not in our training data set D, we average over all possible values of θ, weighted by the posterior probability p(θ|D):

  p(x(n+1) | D) = ∫ p(x(n+1), θ | D) dθ = ∫ p(x(n+1) | θ) p(θ | D) dθ

• Of course, computationally this is much more challenging than the maximum likelihood approach… MCMC.

Classical Hypothesis Testing

• Two hypotheses: H0 (the null hypothesis) and H1 (the alternative hypothesis).
• The outcome of a hypothesis test is 'reject H0' or 'do not reject H0'.
• Example:
  – H0: there is no difference in taste between Coke and Pepsi, against H1: there is a difference.
• The hypotheses are often statements about population parameters such as the expected value and variance:
  – For example, H0 might be that the expected height of ten-year-old boys in the Scottish population is not different from that of ten-year-old girls.
• A hypothesis might also be a statement about the distributional form of a characteristic of interest:
  – For example, the height of ten-year-old boys is normally distributed within the Scottish population.

Z-test Example

Testing for Independence

• Two nominal variables X = x1, …, xr and Y = y1, …, ys.
• H0: X and Y are independent.
• Let n(xi)/n be the proportion of observations with X = xi, and similarly n(yj)/n for Y = yj.
• If X and Y are independent, then we expect the count n(xi, yj) to be about n(xi) n(yj) / n.
• Number the possible combinations 1, …, t.
• Let Ek denote the expected number of occurrences of combination k and Ok the observed number of occurrences:

  X² = Σ_{k=1}^t (Ok − Ek)² / Ek

• If H0 is true, X² follows a chi-squared distribution, with (r − 1)(s − 1) degrees of freedom for an r × s table.

Example

• A = age of mother and B = birth weight:

           ≤ 2500g   > 2500g   Total
  ≤ 20          10        40      50
  > 20          15       135     150
  Total         25       175     200

• Rejection rule: reject if X² > 3.841, with df = (2−1)(2−1) = 1.
• Expected counts:

  E00 = 50·25/200 = 6.25      E01 = 50·175/200 = 43.75
  E10 = 150·25/200 = 18.75    E11 = 150·175/200 = 131.25

• Test statistic:

  X² = (10−6.25)²/6.25 + (40−43.75)²/43.75 + (15−18.75)²/18.75 + (135−131.25)²/131.25 = 3.428

• Since 3.428 < 3.841, we do not reject H0.
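The worked example above can be verified with a short Python sketch (again not from the original notes; SciPy is assumed only for looking up the critical value):

```python
import numpy as np
from scipy.stats import chi2

# Observed counts from the slide: rows = mother's age (<= 20, > 20),
# columns = birth weight (<= 2500g, > 2500g).
observed = np.array([[10, 40],
                     [15, 135]])

n = observed.sum()                       # 200
row_totals = observed.sum(axis=1)        # [50, 150]
col_totals = observed.sum(axis=0)        # [25, 175]

# Expected counts under independence: E_ij = n(x_i) * n(y_j) / n.
expected = np.outer(row_totals, col_totals) / n

x2 = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)   # (2-1)(2-1) = 1
critical = chi2.ppf(0.95, df)            # 3.841 at the 5% level

print(f"X^2 = {x2:.3f}, critical value = {critical:.3f}")
# X^2 ≈ 3.43 < 3.841, so H0 (independence) is not rejected.
```

For a ready-made version, scipy.stats.chi2_contingency(observed, correction=False) returns the same statistic; note that by default it applies Yates' continuity correction to 2×2 tables.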
Component View

• Data mining algorithm components:
  – task
  – structure or model
  – score function
  – search or optimization method
  – data management technique

Task

• visualization, classification, clustering, regression, anomaly detection

Structure

• The functional form of the model or pattern we are fitting to the data.
• E.g., linear regression model, hierarchical clustering model, neural network, Bayesian network, association rule.

Data Management Technique

• The methods used for storing, indexing, and retrieving data.
• Many statistical and machine learning algorithms do not specify any data management technique, essentially assuming that the data set fits in main memory, so random access to any data point is free.
• However, for massive data sets that exceed the capacity of main memory, accessing data is orders of magnitude slower, and the physical location of the data and the manner in which it is accessed can be critical to algorithm efficiency.

3 Algorithms

                              CART                             Backpropagation                  Apriori
  Task                        classification and regression    regression                       rule pattern discovery
  Structure                   decision tree                    neural network                   association rules
  Score Function              cross-validated loss function    squared error                    support/accuracy
  Search Method               greedy search over structure     gradient descent on parameters   breadth-first with pruning
  Data Management Technique   unspecified                      unspecified                      linear scans

Next Time

• Reading:
  – HMS, chapters 5, 6 (cont.)
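A hypothetical sketch of how the component view in the table above could be encoded as data; the MiningAlgorithm class and its field names are invented for illustration, while the values come from the table.

```python
from dataclasses import dataclass

@dataclass
class MiningAlgorithm:
    """One column of the component-view table: a hypothetical container type."""
    task: str
    structure: str
    score_function: str
    search_method: str
    data_management: str

# The three algorithms from the table, one instance per column.
ALGORITHMS = {
    "CART": MiningAlgorithm(
        task="classification and regression",
        structure="decision tree",
        score_function="cross-validated loss function",
        search_method="greedy search over structure",
        data_management="unspecified",
    ),
    "Backpropagation": MiningAlgorithm(
        task="regression",
        structure="neural network",
        score_function="squared error",
        search_method="gradient descent on parameters",
        data_management="unspecified",
    ),
    "Apriori": MiningAlgorithm(
        task="rule pattern discovery",
        structure="association rules",
        score_function="support/accuracy",
        search_method="breadth-first with pruning",
        data_management="linear scans",
    ),
}

for name, algo in ALGORITHMS.items():
    print(f"{name}: {algo.task} via {algo.search_method}")
```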