
Gain Ratio - Data Warehousing and Data Mining - Book Summary Part 05 - Computer Science (Database course notes)

Topics: Gain Ratio, Gini Index, Binary Split, Discrete-Valued Attributes, Information Gain, Tree Pruning, Overfitting, Prepruning, Bayes Classification Methods, Bayesian Classifiers, Naïve Bayesian Classification. Authors: Johann Gamper, Mouna Kacimi.

2nd approach: Gain Ratio

Problem of the information gain approach
- Biased towards tests with many outcomes (attributes having a large number of values).
- E.g., an attribute acting as a unique identifier produces a large number of partitions (1 tuple per partition).
- Each resulting partition D_j is pure, so Info(D_j) = 0, and the information gain is maximized.

Extension to information gain
- C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio.
- It overcomes the bias of information gain by applying a kind of normalization to it, using a "split information" value.

The split information value represents the potential information generated by splitting the training data set D into v partitions, corresponding to the v outcomes of a test on attribute A:

    SplitInfo_A(D) = - sum_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

- High SplitInfo: the partitions have more or less the same size (uniform).
- Low SplitInfo: a few partitions hold most of the tuples (peaks).

The gain ratio is defined as

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

The attribute with the maximum gain ratio is selected as the splitting attribute.
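To make the SplitInfo and GainRatio formulas above concrete, here is a minimal Python sketch (added for illustration, not part of the original notes). The helper names entropy and gain_ratio are hypothetical, and the input is assumed to be a list of dicts, one per training tuple:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): entropy of the class label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, class_attr):
    """Gain ratio of splitting `rows` (list of dicts) on categorical `attr`."""
    n = len(rows)
    info_d = entropy([r[class_attr] for r in rows])

    # Partition D into D_j, one partition per value of the attribute.
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr], []).append(r[class_attr])

    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())   # Info_A(D)
    gain = info_d - info_a                                               # Gain(A)
    split_info = -sum(len(p) / n * log2(len(p) / n)                      # SplitInfo_A(D)
                      for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0                  # GainRatio(A)
```

Called on the 14-tuple customer table shown in the Gini(income) example below, e.g. gain_ratio(rows, 'income', 'buys_computer'), it returns the normalized gain; an identifier-like attribute such as RID gets a large SplitInfo, which is exactly the penalty the gain ratio introduces.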
Binary Split: Continuous-Valued Attributes

- D: a data partition. Consider an attribute A with continuous values.
- To determine the best binary split on A:
  - What to examine? Each possible split point: the midpoint between each pair of (sorted) adjacent values is taken as a possible split point.
  - How to examine? For each split point, compute the weighted sum of the impurity of the two resulting partitions (D1: A <= split-point, D2: A > split-point):

    Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

- The split point that gives the minimum Gini index for attribute A is selected as its split point.

Binary Split: Discrete-Valued Attributes

- D: a data partition. Consider an attribute A with v outcomes {a1, ..., av}.
- To determine the best binary split on A:
  - What to examine? The partitions resulting from all possible subsets of {a1, ..., av}. Each subset S defines a binary test on attribute A of the form "A ∈ S?". There are 2^v possible subsets; excluding the full set and the empty set (which do not split the data) leaves 2^v - 2 candidate subsets.
  - How to examine? For each subset, compute the weighted sum of the impurity of the two resulting partitions:

    Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

- The subset that gives the minimum Gini index for attribute A is selected as its splitting subset.

Gini(income)

Training set D (class attribute: buys_computer):

RID  age          income  student  credit_rating  class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle-aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle-aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle-aged  medium  no       excellent      yes
13   middle-aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

Compute the Gini index of the training set D: 9 tuples are in class yes and 5 in class no:

    Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Using attribute income, there are three values: low, medium and high. Choosing the subset {low, medium} results in two partitions:
- D1 (income ∈ {low, medium}): 10 tuples
- D2 (income ∈ {high}): 4 tuples
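The following short Python sketch (an illustration added here, not part of the original notes) reproduces the numbers above: the Gini index of D and the weighted Gini index for the binary split income ∈ {low, medium} vs. {high}:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_i p_i^2 over the class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# (income, buys_computer) pairs taken from the 14-tuple training set above.
data = [("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
        ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
        ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
        ("high", "yes"), ("medium", "no")]

labels = [c for _, c in data]
print(round(gini(labels), 3))                         # Gini(D) = 0.459

# Binary split on income: D1 = {low, medium}, D2 = {high}.
d1 = [c for v, c in data if v in ("low", "medium")]   # 10 tuples
d2 = [c for v, c in data if v == "high"]              # 4 tuples
gini_income = len(d1) / len(data) * gini(d1) + len(d2) / len(data) * gini(d2)
print(round(gini_income, 3))                          # weighted Gini for this split, about 0.443
```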
2.2.3 Tree Pruning

Problem: overfitting
- Many branches of the decision tree will reflect anomalies in the training data due to noise or outliers.
- Poor accuracy for unseen samples.

Solution: pruning
- Remove the least reliable branches.

[Figure: an unpruned decision tree over tests A1-A5 and its pruned version, in which unreliable subtrees are replaced by leaves labelled Class A or Class B.]

Tree Pruning Approaches

1st approach: prepruning
- Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold.
- Statistical significance, information gain, or the Gini index are used to assess the goodness of a split.
- Upon halting, the node becomes a leaf; the leaf may hold the most frequent class among the subset tuples.
- Problem: it is difficult to choose an appropriate threshold.

2nd approach: postpruning
- Remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees.
- A subtree at a given node is pruned by replacing it with a leaf; the leaf is labeled with the most frequent class.
- Example: the cost complexity pruning algorithm.
  - The cost complexity of a tree is a function of the number of leaves and the error rate (percentage of tuples misclassified by the tree).
  - At each node N, compute the cost complexity of the subtree at N and the cost complexity of the subtree at N if it were to be pruned.
  - If pruning results in a smaller cost, prune the subtree at N.
  - Use a set of data different from the training data to decide which is the "best pruned tree".

Chapter 2: Classification & Prediction

2.1 Basic Concepts of Classification and Prediction
2.2 Decision Tree Induction
  2.2.1 The Algorithm
  2.2.2 Attribute Selection Measures
  2.2.3 Tree Pruning
  2.2.4 Scalability and Decision Tree Induction
2.3 Bayes Classification Methods
  2.3.1 Naïve Bayesian Classification
  2.3.2 Note on Bayesian Belief Networks
2.4 Rule Based Classification
2.5 Lazy Learners
2.6 Prediction
2.7 How to Evaluate and Improve Classification

2.3 Bayes Classification Methods

What are Bayesian classifiers?
- Statistical classifiers.
- They predict class membership probabilities: the probability of a given tuple belonging to a particular class.
- Based on Bayes' theorem.

Characteristics
- Comparable performance with decision tree and selected neural network classifiers.

Bayesian classifiers
- Naïve Bayesian classifiers: assume that the effect of an attribute value on a given class is independent of the values of the other attributes.
- Bayesian belief networks: graphical models that allow the representation of dependencies among subsets of attributes.

Bayes' Theorem in the Classification Context

- X is a data tuple; in Bayesian terms it is considered the "evidence".
- H is some hypothesis, e.g. that X belongs to a specified class C.

    P(H|X) = P(X|H) P(H) / P(X)

- P(H|X) is the posterior probability of H conditioned on X.
- P(X) is the prior probability of X.
- Example: predict whether a customer will buy a computer or not. Customers are described by two attributes: age and income. X is a 35-year-old customer with an income of 40k, and H is the hypothesis that the customer will buy a computer.
  - P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.
  - P(X) is the probability that a person from our set of customers is 35 years old and earns 40k.

Naïve Bayesian Classification

- D: a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional vector X = (x1, ..., xn), holding n measurements of n attributes A1, ..., An.
- Classes: suppose there are m classes C1, ..., Cm.

Principle
- Given a tuple X, the classifier predicts that X belongs to the class having the highest posterior probability conditioned on X.
- Predict that tuple X belongs to class Ci if and only if

    P(Ci|X) > P(Cj|X)   for 1 <= j <= m, j ≠ i

- Maximize P(Ci|X): find the maximum a posteriori hypothesis.

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

- P(X) is constant for all classes, thus maximize P(X|Ci) P(Ci).
- To maximize P(X|Ci) P(Ci), we need the class prior probabilities:
  - If they are not known, assume P(C1) = P(C2) = ... = P(Cm) and maximize P(X|Ci) alone.
  - Otherwise, the class prior probabilities can be estimated as P(Ci) = |Ci,D| / |D|.
- Assume class conditional independence to reduce the computational cost of P(X|Ci); given X = (x1, ..., xn):

    P(X|Ci) = prod_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

- The probabilities P(x1|Ci), ..., P(xn|Ci) can be estimated from the training tuples.
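As an illustration of this principle (added here, not part of the original notes), the following Python sketch estimates the priors P(Ci) and the conditional probabilities P(xk|Ci) as relative frequencies from a training set and scores a new tuple. The names are illustrative and no smoothing is applied, matching the plain frequency estimates used in the worked example that follows:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, class_attr):
    """Estimate P(Ci) and P(attr=value | Ci) by relative frequency from `rows` (list of dicts)."""
    class_counts = Counter(r[class_attr] for r in rows)
    priors = {c: n / len(rows) for c, n in class_counts.items()}

    cond = defaultdict(Counter)            # (class, attr) -> Counter of attribute values
    for r in rows:
        for attr, value in r.items():
            if attr != class_attr:
                cond[(r[class_attr], attr)][value] += 1

    def p(value, attr, c):                 # P(attr=value | class=c)
        return cond[(c, attr)][value] / class_counts[c]

    return priors, p

def predict(x, priors, p):
    """Return the class maximizing P(X|Ci) * P(Ci) under the independence assumption."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= p(value, attr, c)
        scores[c] = score
    return max(scores, key=scores.get), scores
```

Training this on the 14-tuple customer table and calling predict({'age': 'youth', 'income': 'medium', 'student': 'yes', 'credit_rating': 'fair'}, priors, p) reproduces the scores derived step by step below (about 0.028 for yes and 0.007 for no).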
Example

- Training set: the 14-tuple customer table from the Gini(income) example above (class attribute: buys_computer).
- The tuple to classify is X = (age=youth, income=medium, student=yes, credit_rating=fair).
- Maximize P(X|Ci) P(Ci) for i = 1, 2.

First step: compute P(Ci). The prior probability of each class can be computed from the training tuples:

    P(buys_computer=yes) = 9/14 = 0.643
    P(buys_computer=no)  = 5/14 = 0.357

Second step: compute P(X|Ci) using the following conditional probabilities:

    P(age=youth | buys_computer=yes)          = 2/9 = 0.222
    P(age=youth | buys_computer=no)           = 3/5 = 0.600
    P(income=medium | buys_computer=yes)      = 4/9 = 0.444
    P(income=medium | buys_computer=no)       = 2/5 = 0.400
    P(student=yes | buys_computer=yes)        = 6/9 = 0.667
    P(student=yes | buys_computer=no)         = 1/5 = 0.200
    P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667
    P(credit_rating=fair | buys_computer=no)  = 2/5 = 0.400

    P(X | buys_computer=yes) = P(age=youth|yes) × P(income=medium|yes) × P(student=yes|yes) × P(credit_rating=fair|yes) = 0.044
    P(X | buys_computer=no)  = P(age=youth|no) × P(income=medium|no) × P(student=yes|no) × P(credit_rating=fair|no) = 0.019

Third step: compute P(X|Ci) P(Ci) for each class:

    P(X | buys_computer=yes) P(buys_computer=yes) = 0.044 × 0.643 = 0.028
    P(X | buys_computer=no)  P(buys_computer=no)  = 0.019 × 0.357 = 0.007

The naïve Bayesian classifier predicts buys_computer=yes for tuple X.

2.3.2 Bayesian Belief Networks

- A Bayesian belief network allows subsets of the variables to be conditionally independent.
- It is a graphical model of causal relationships:
  - It represents dependencies among the variables.
  - It gives a specification of the joint probability distribution.
- Nodes: random variables. Links: dependencies.
- In the example graph, X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P. The graph has no loops or cycles.

Example

The conditional probability table (CPT) for the variable LungCancer, whose parents are FamilyHistory (FH) and Smoker (S):

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
    LC      0.8       0.5        0.7        0.1
    ~LC     0.2       0.5        0.3        0.9

[Figure: belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay and Dyspnea.]

- The CPT shows the conditional probability of LungCancer for each possible combination of values of its parents.
- Derivation of the probability of a particular combination of values (x1, ..., xn) of X from the CPTs:

    P(x1, ..., xn) = prod_{i=1}^{n} P(xi | Parents(Yi))

  where Yi is the network node corresponding to the value xi.

Training Bayesian Networks

Several scenarios:
- Given both the network structure and all variables observable: learn only the CPTs.
- Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning.
- Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
- Unknown structure, all hidden variables: no good algorithms are known for this purpose.
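As a small illustration of the joint-probability factorization (added here, not part of the original notes), the Python sketch below multiplies CPT entries for the FamilyHistory/Smoker/LungCancer fragment of the network above. The priors for FamilyHistory and Smoker (0.2 and 0.3) are assumed values for illustration only; the notes specify just the LungCancer CPT:

```python
# Joint probability as a product of CPT entries:
# P(x1, ..., xn) = prod_i P(xi | Parents(Yi)).
p_fh = {True: 0.2, False: 0.8}            # P(FamilyHistory): assumed for illustration
p_s = {True: 0.3, False: 0.7}             # P(Smoker): assumed for illustration
p_lc = {                                  # P(LungCancer | FamilyHistory, Smoker), from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh: bool, s: bool, lc: bool) -> float:
    """P(FH=fh, S=s, LC=lc) = P(fh) * P(s) * P(lc | fh, s)."""
    p_lc_given_parents = p_lc[(fh, s)] if lc else 1.0 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given_parents

# e.g. probability of a smoker with family history who develops lung cancer
print(joint(fh=True, s=True, lc=True))    # 0.2 * 0.3 * 0.8 = 0.048
```

The full network in the figure adds the Emphysema, PositiveXRay and Dyspnea nodes, each with its own CPT conditioned on its parents, and the joint distribution factorizes in the same way.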