Classification and Prediction - Data Warehousing and Data Mining - Book Summary Part 04 - Computer Science (Databases lecture notes)

Classification

Type: Lecture notes
Academic year: 2011/2012
Uploaded on 12/03/2012 by millyg

Chapter 2: Classification & Prediction

2.1 Basic Concepts of Classification and Prediction
  2.1.1 Definition
  2.1.2 Classification vs. Prediction
  2.1.3 Classification Steps
  2.1.4 Issues of Classification and Prediction
2.2 Decision Tree Induction
  2.2.1 The Algorithm
  2.2.2 Attribute Selection Measures
  2.2.3 Tree Pruning
  2.2.4 Scalability and Decision Tree Induction
2.3 Bayes Classification Methods
2.4 Rule-Based Classification
2.5 Lazy Learners
2.6 Prediction
2.7 How to Evaluate and Improve Classification

2.1.1 Definition

Classification is also called supervised learning.

Supervision: the training data (observations, measurements, etc.) are used to learn a classifier. The training data are labeled data.

Training data:

  Age  Income  Class label
  27   28K     Budget-Spender
  35   36K     Big-Spender
  65   45K     Budget-Spender

New (unlabeled) data are classified using the model learned from the training data. For example, the unlabeled tuple (Age 29, Income 25K) is passed to the classifier, which returns the class label Budget-Spender (with confidence 0.8).

Principle:
- Construct models (functions) based on some training examples
- Describe and distinguish classes or concepts for future prediction
- Predict some unknown class labels

2.1.3 Classification Steps (2/2)

Step 2: Model Usage

Before using the model, we first need to test its accuracy.

Measuring model accuracy: to measure the accuracy of a model we need test data. Test data are similar in structure to the training data (labeled data).

How to test? The known label of each test sample is compared with the classification produced by the model. For example, given the test data

  Age  Income  Class label
  25   30K     Budget-Spender
  40   50K     Big-Spender

the tuple (25, 30K) is fed to the classifier, which returns Budget-Spender, matching the known label.

The accuracy rate is the percentage of test set samples that are correctly classified by the model.

Important: the test data should be independent of the training set, otherwise over-fitting will occur.

Using the model: if the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Model Construction

Training data:

  Age  Income  Class label
  27   28K     Budget-Spender
  35   36K     Big-Spender
  38   28K     Budget-Spender
  65   45K     Budget-Spender
  20   18K     Budget-Spender
  75   40K     Budget-Spender
  28   50K     Big-Spender
  40   60K     Big-Spender
  60   65K     Big-Spender

The classification algorithm learns the following classifier from the training data:
- If age < 30 and income < 30K, then Budget-Spender
- If age < 30 and income > 30K, then Big-Spender
- If 30 < age < 60 and income > 30K, then Big-Spender
- If 30 < age < 60 and income < 30K, then Budget-Spender
- If age > 60, then Budget-Spender

Model Usage

1) Test the classifier on the test data and measure its accuracy:

  Age  Income  Class label
  27   28K     Budget-Spender
  25   36K     Big-Spender
  70   45K     Budget-Spender
  40   35K     Big-Spender

2) If the accuracy is acceptable, classify unlabeled data. The unlabeled tuples below are classified as follows:

  Age  Income  Assigned class label
  18   28K     Budget-Spender
  37   40K     Big-Spender
  60   45K     Budget-Spender
  40   36K     Big-Spender

Summary of section 2.1.1
- Classification predicts class labels
- Numeric prediction models continuous-valued functions
- Two steps of classification: 1) training, 2) testing and using
- Data cleaning and evaluation are the main issues of classification and prediction
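To make the two steps concrete, here is a minimal Python sketch (not part of the original notes) that hard-codes the rules listed under Model Construction as a classify function, measures the accuracy rate on the labeled test tuples, and then labels the unseen tuples. The function name, the 0.75 acceptance threshold and the income-in-thousands convention are my own assumptions.

# Minimal sketch of the two classification steps: test the model, then use it.
# The rules below are the ones listed in the Model Construction example;
# income is expressed in thousands (28 means 28K).

def classify(age, income):
    """Rule-based classifier learned from the training data."""
    if age < 30:
        return "Budget-Spender" if income < 30 else "Big-Spender"
    if age < 60:
        return "Big-Spender" if income > 30 else "Budget-Spender"
    return "Budget-Spender"          # age >= 60

# Step 2a: measure accuracy on labeled test data (independent of the training set).
test_data = [(27, 28, "Budget-Spender"), (25, 36, "Big-Spender"),
             (70, 45, "Budget-Spender"), (40, 35, "Big-Spender")]
correct = sum(classify(a, i) == label for a, i, label in test_data)
accuracy = correct / len(test_data)
print(f"accuracy rate: {accuracy:.0%}")

# Step 2b: if the accuracy is acceptable, classify unlabeled tuples.
if accuracy >= 0.75:                 # the acceptance threshold is an arbitrary choice
    for age, income in [(18, 28), (37, 40), (60, 45), (40, 36)]:
        print(age, income, classify(age, income))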
2.2 Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples.

A decision tree is a flowchart-like tree structure:
- Internal nodes (non-leaf nodes) denote a test on an attribute
- Branches represent outcomes of the tests
- Leaf nodes (terminal nodes) hold class labels
- The root node is the topmost node

Example: a decision tree indicating whether a customer is likely to purchase a computer. The root tests age?. For youth the tree tests student? (no leads to class no, yes leads to class yes); for middle-aged the leaf is yes; for senior the tree tests credit-rating? (fair leads to class no, excellent leads to class yes). Class label yes means the customer is likely to buy a computer; class label no means the customer is unlikely to buy a computer.

2.2.1 The Algorithm

Principle:
- Basic algorithm (adopted by ID3, C4.5 and CART): a greedy algorithm
- The tree is constructed in a top-down, recursive, divide-and-conquer manner

Iterations:
- At the start, all the training tuples are at the root
- Tuples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Stopping conditions:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left

Example

Training data:

  RID  age          student  credit-rating  Class: buys_computer
  1    youth        yes      fair           yes
  2    youth        yes      fair           yes
  3    youth        yes      fair           no
  4    youth        no       fair           no
  5    middle-aged  no       excellent      yes
  6    senior       yes      fair           no
  7    senior       yes      excellent      yes

Step 1: the root tests age?, which partitions the tuples into three branches:
- youth: RIDs 1 (yes), 2 (yes), 3 (no), 4 (no)
- middle-aged: RID 5 (yes)
- senior: RIDs 6 (no), 7 (yes)

Step 2: all tuples in the middle-aged branch belong to the same class, so that branch becomes the leaf yes.

Step 3: the youth branch is split on student?. RID 4 (student = no) gives the leaf no; RIDs 1, 2 and 3 (student = yes) are labeled yes by majority voting, since they share the same age, student and credit-rating values and cannot be partitioned further.

Step 4: the senior branch is split on credit-rating?: the fair branch holds RID 6 (class no) and the excellent branch holds RID 7 (class yes), giving the leaves no and yes respectively.

The resulting tree is the purchase-a-computer tree described above.
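The greedy top-down divide-and-conquer procedure and its three stopping conditions can be sketched as follows. This is an illustrative skeleton of my own, not code from the notes: tuples are assumed to be dictionaries with a "class" key, and choose_attribute stands in for an attribute selection measure such as the information gain introduced below.

from collections import Counter

def majority_class(tuples):
    """Most common class label among the tuples (used for majority voting)."""
    return Counter(t["class"] for t in tuples).most_common(1)[0][0]

def build_tree(tuples, attributes, choose_attribute, parent_majority=None):
    """Top-down recursive divide-and-conquer decision tree induction."""
    # Stopping condition: there are no samples left.
    if not tuples:
        return parent_majority
    classes = {t["class"] for t in tuples}
    # Stopping condition: all samples belong to the same class.
    if len(classes) == 1:
        return classes.pop()
    # Stopping condition: no remaining attributes -> majority voting.
    if not attributes:
        return majority_class(tuples)
    # Greedy step: pick the best attribute according to the supplied measure.
    best = choose_attribute(tuples, attributes)
    node = {"test": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in {t[best] for t in tuples}:
        subset = [t for t in tuples if t[best] == value]
        node["branches"][value] = build_tree(subset, remaining, choose_attribute,
                                             majority_class(tuples))
    return node

With the seven training tuples above encoded as dictionaries, calling build_tree(tuples, ["age", "student", "credit-rating"], choose_attribute) with a suitable selection measure returns a nested dictionary representing the induced tree.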
Before Describing Information Gain: Entropy & Bits

You are watching a set of independent random samples of X. X has 4 possible values:
P(X=A) = P(X=B) = P(X=C) = P(X=D) = 1/4
You get a string of symbols, e.g. ACBABBCDADDC...
To transmit the data over a binary link you can encode each symbol with 2 bits (A=00, B=01, C=10, D=11), so you need 2 bits per symbol.

(This and the next examples are from
http://www.cs.cmu.edu/~guestrin/Class/10701-S06/Handouts/recitations/recitation-decision_trees-adaboost-02-09-2006.ppt)

Fewer bits, example 1: now someone tells you the probabilities are not equal:
P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8
Now it is possible to find a coding that uses only 1.75 bits on average. How? E.g., Huffman coding.

Fewer bits, example 2: suppose there are three equally likely values:
P(X=A) = 1/3, P(X=B) = 1/3, P(X=C) = 1/3
Naive coding (A=00, B=01, C=10) uses 2 bits per symbol. Can you find a coding that uses 1.6 bits per symbol? In theory it can be done with 1.58496 bits.

1st approach: Information Gain

Notation: D is the current partition; N is the node that holds the tuples of partition D.

Select the attribute with the highest information gain (based on the work by Shannon on information theory). This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions. The information gain approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple tree is found.

Step 1: compute the expected information (entropy) of D, Info(D)

The expected information needed to classify a tuple in D is given by:

  Info(D) = -sum_{i=1}^{m} p_i log2(p_i)

where m is the number of classes and p_i is the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D| (the proportion of tuples of each class). A log function to base 2 is used because the information is encoded in bits.

Info(D) is the average amount of information needed to identify the class label of a tuple in D. It is also known as entropy.

Info(D): Example

  RID  age          income  student  credit-rating  class: buys_computer
  1    youth        high    no       fair           no
  2    youth        high    no       excellent      no
  3    middle-aged  high    no       fair           yes
  4    senior       medium  no       fair           yes
  5    senior       low     yes      fair           yes
  6    senior       low     yes      excellent      no
  7    middle-aged  low     yes      excellent      yes
  8    youth        medium  no       fair           no
  9    youth        low     yes      fair           yes
  10   senior       medium  yes      fair           yes
  11   youth        medium  yes      excellent      yes
  12   middle-aged  medium  no       excellent      yes
  13   middle-aged  high    yes      fair           yes
  14   senior       medium  no       excellent      no

Here m = 2 (the number of classes), N = 14 (the number of tuples), 9 tuples are in class yes and 5 tuples are in class no. The entropy Info(D) of the current partition D is:

  Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits

The Information Gain Approach, step by step:

Step 1: compute the expected information (entropy) Info(D) of the current partition.

Step 2: compute Info_A(D), the amount of information we would still need to arrive at an exact classification after partitioning D using attribute A:

  Info_A(D) = sum_{j=1}^{v} (|D_j| / |D|) x Info(D_j)

where D_1, ..., D_v are the partitions produced by the v values of A.

Step 3: compute the information gain obtained by branching on A:

  Gain(A) = Info(D) - Info_A(D)

Information gain is the expected reduction in the information requirements caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
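As a quick check of Step 1, the short Python sketch below (my own, not from the notes) computes the bits-per-symbol values of the coding examples and the Info(D) = 0.940 bits of the 14-tuple partition from the same entropy formula.

from math import log2

def entropy(probabilities):
    """Expected information in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Coding examples: equally likely symbols need 2 bits, skewed ones fewer.
print(entropy([1/4, 1/4, 1/4, 1/4]))   # 2.0 bits per symbol
print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits per symbol
print(entropy([1/3, 1/3, 1/3]))        # ~1.58496 bits per symbol

# Info(D) for the 14-tuple partition: 9 tuples of class yes, 5 of class no.
info_D = entropy([9/14, 5/14])
print(round(info_D, 3))                # 0.940 bits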
Info_age(D): Example

Using the 14-tuple table above:

1) Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits

2) Partitioning on age gives youth (2 yes, 3 no), middle-aged (4 yes, 0 no) and senior (3 yes, 2 no), so

   Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 bits

3) Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151 and Gain(credit_rating) = 0.048.

Attribute age has the highest gain, so it is chosen as the splitting attribute.

Note on Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
- The point with the minimum expected information requirement for A is selected as the split point for A

Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
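The following Python sketch (again my own, reusing the entropy helper from the previous snippet) reproduces Info_age(D) and Gain(age) for the 14-tuple table and lists the candidate mid-point splits for a continuous attribute; the numeric values used for the continuous attribute are invented for illustration.

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def info(counts):
    """I(c1, ..., cm): expected information for a partition with the given class counts."""
    n = sum(counts)
    return entropy([c / n for c in counts])

# Class counts (yes, no) per value of age in the 14-tuple table:
partitions = {"youth": (2, 3), "middle-aged": (4, 0), "senior": (3, 2)}
N = 14

info_D = info((9, 5))                                              # 0.940 bits
info_age = sum(sum(c) / N * info(c) for c in partitions.values())  # 0.694 bits
gain_age = info_D - info_age                                       # about 0.246
print(round(info_age, 3), round(gain_age, 3))

# Continuous-valued attribute: candidate split points are the midpoints
# between adjacent sorted values (these values are made up).
values = sorted([21, 25, 30, 42, 58])
split_points = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(split_points)

For each candidate split point, Info_A(D) would be computed over the two resulting subsets D1 (A <= split-point) and D2 (A > split-point), and the point giving the minimum expected information requirement would be kept.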