Clustering-Data Warehouse-Lecture Slides, Slides of Data Warehousing

Topics included in this course are: Data Warehousing Concepts; Design and Development; Extraction, Transformation and Loading; OLAP Technology; Data Mining Techniques (Classification, Clustering and Decision Trees); and Advanced Topics. This lecture covers: Clustering, Grouping, Records, Observations, Tasks, Classes, Homogeneity, Unsupervised, Algorithm, Segment

Typology: Slides

2011/2012

Uploaded on 08/08/2012 by sharib_sweet

2 Clustering Task

– Clustering refers to grouping records, observations, or tasks into classes of similar objects
– A cluster is a collection of records that are similar to one another
– Records in one cluster are dissimilar to records in other clusters
– Clustering is an unsupervised data mining task; therefore, no target variable is specified
– Clustering algorithms segment records so as to maximize homogeneity within each subgroup
– Similarity to records outside the cluster is minimized

3 Clustering Task (cont'd)

– For example, Claritas, Inc. provides demographic profiles of geographic areas, according to zip code
– Its PRIZM segmentation system clusters zip codes in terms of lifestyle types
– Recall the clusters identified for 90210 (Beverly Hills, CA):
– Cluster 01: Blue Blood Estates – "Established executives, professionals, and 'old money' heirs that live in America's wealthiest suburbs..."
– Cluster 10: Bohemian Mix
– Cluster 02: Winner's Circle
– Cluster 07: Money and Brains
– Cluster 08: Young Literati

6 Clustering Task (cont'd)

• Measuring Similarity
– Euclidean distance measures the distance between records:
  $d_{\text{Euclidean}}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
  where $x = x_1, x_2, \ldots, x_m$ and $y = y_1, y_2, \ldots, y_m$ represent the attribute values of two records
– Other distance measurements include city-block distance and Minkowski distance:
  $d_{\text{City-Block}}(x, y) = \sum_i |x_i - y_i|$
  $d_{\text{Minkowski}}(x, y) = \Big( \sum_i |x_i - y_i|^q \Big)^{1/q}$

7 Clustering Task (cont'd)

– The "different from" function measures the distance between categorical attribute values:
  $\text{different}(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$
– Substitute different(x_i, y_i) for each categorical attribute in the Euclidean distance function
– Normalizing the data enhances the performance of clustering algorithms
– Use min-max normalization or Z-score standardization:
  $\text{Min-Max: } X^* = \dfrac{X - \min(X)}{\max(X) - \min(X)} \qquad \text{Z-Score: } X^* = \dfrac{X - \text{mean}(X)}{\text{SD}(X)}$
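To make these similarity measures and normalization formulas concrete, here is a minimal NumPy sketch; the function names are our own illustration, not code from the slides.

```python
import numpy as np

def euclidean(x, y):
    # d_Euclidean(x, y) = sqrt(sum_i (x_i - y_i)^2)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def city_block(x, y):
    # d_City-Block(x, y) = sum_i |x_i - y_i|
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

def minkowski(x, y, q):
    # d_Minkowski(x, y) = (sum_i |x_i - y_i|^q)^(1/q)
    # q = 1 gives city-block distance, q = 2 gives Euclidean distance
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def different(xi, yi):
    # "Different from" function for one categorical attribute
    return 0 if xi == yi else 1

def min_max_normalize(X):
    # X* = (X - min(X)) / (max(X) - min(X)); rescales X to [0, 1]
    X = np.asarray(X, dtype=float)
    return (X - X.min()) / (X.max() - X.min())

def z_score(X):
    # X* = (X - mean(X)) / SD(X); rescales X to mean 0, SD 1
    X = np.asarray(X, dtype=float)
    return (X - X.mean()) / X.std()

print(euclidean((1, 3), (2, 1)))        # 2.236...
print(city_block((1, 3), (2, 1)))       # 3.0
print(minkowski((1, 3), (2, 1), q=2))   # 2.236..., same as Euclidean
print(min_max_normalize([10, 20, 40]))  # [0.  0.333...  1.]
```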
8 Clustering Task (cont'd)

– Clustering identifies groups of highly similar records
– Algorithms construct clusters where the between-cluster variation (BCV) is large compared to the within-cluster variation (WCV)
– Analogous to the concept behind analysis of variance
[Figure: illustration of between-cluster variation vs. within-cluster variation]

11 k-Means Clustering (cont'd)

– The k-Means algorithm terminates when the centroids no longer change
– That is, for the k clusters $C_1, C_2, \ldots, C_k$, all records "owned" by each cluster remain in that cluster
– A convergence criterion may also cause termination, for example, no significant reduction in SSE:
  $\text{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$
  where $p$ is each data point in cluster $i$ and $m_i$ represents the centroid of cluster $i$

12 Example of k-Means Clustering at Work

– Assume k = 2 to cluster the following data points:
  a = (1, 3), b = (3, 3), c = (4, 3), d = (5, 3), e = (1, 2), f = (4, 2), g = (1, 1), h = (2, 1)
– Step 1: k = 2 specifies the number of clusters to partition
– Step 2: Randomly assign k = 2 cluster centers, for example m1 = (1, 1) and m2 = (2, 1)
• First Iteration
– Step 3: For each record, find the nearest cluster center
– Euclidean distances from the points to m1 and m2:

  Point            |  a   |  b   |  c   |  d   |  e   |  f   |  g   |  h
  Distance from m1 | 2.00 | 2.83 | 3.61 | 4.47 | 1.00 | 3.16 | 0.00 | 1.00
  Distance from m2 | 2.24 | 2.24 | 2.83 | 3.61 | 1.41 | 2.24 | 1.00 | 0.00
  Cluster          |  C1  |  C2  |  C2  |  C2  |  C1  |  C2  |  C1  |  C2

13 Example of k-Means Clustering at Work (cont'd)

– Cluster C1 (center m1) contains {a, e, g} and cluster C2 (center m2) contains {b, c, d, f, h}
– With cluster membership assigned, the SSE can now be calculated:
  $\text{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 = 2.00^2 + 2.24^2 + 2.83^2 + 3.61^2 + 1.00^2 + 2.24^2 + 0^2 + 0^2 = 36$
– Recall that clusters are constructed so that the between-cluster variation (BCV) is large compared to the within-cluster variation (WCV)
– The ratio BCV/WCV is therefore expected to increase over successive iterations:
  $\dfrac{\text{BCV}}{\text{WCV}} = \dfrac{d(m_1, m_2)}{\text{SSE}} = \dfrac{1}{36} = 0.0278$
  where $d(m_1, m_2)$ serves as a surrogate for BCV and SSE as a surrogate for WCV

16 Example of k-Means Clustering at Work (cont'd)

– The cluster centroids are updated to m1 = (1.25, 1.75) and m2 = (4, 2.75)
– After the second iteration, the cluster centroids move only slightly
[Figure: scatter plot of points a-h with the updated centroids after the second iteration]

17 Example of k-Means Clustering at Work (cont'd)

• Third (Final) Iteration
– Repeat the procedure for Steps 3-4
– For each record, find the nearest cluster center, m1 = (1.25, 1.75) or m2 = (4, 2.75)
– SSE = 6.23, and BCV/WCV = 0.4703
– Again, BCV/WCV has increased compared to the previous value of 0.3346
– This time, no records shift cluster membership
– The centroids remain unchanged; therefore, the algorithm terminates (a runnable sketch of this full run follows the summary below)

18 Example of k-Means Clustering at Work (cont'd)

• Summary
– k-Means is not guaranteed to find the global minimum SSE; instead, a local minimum is found
– Invoking the algorithm with a variety of initial cluster centers improves the probability of achieving the global minimum
– One approach places the first cluster center at a random point, with the remaining centers placed far from the previous centers (Moore)
– What is an appropriate value for k? Choosing k is a potential problem when applying k-Means, although the analyst may have a priori knowledge of k
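To tie the worked example together, here is a minimal k-Means sketch in Python (our own illustration, not code from the slides) that reproduces the run above from the same eight points and initial centers:

```python
import numpy as np

def k_means(points, centers, max_iter=100):
    # Plain k-Means: assign each record to its nearest center (Step 3),
    # recompute each centroid (Step 4), and stop once no record
    # changes cluster, i.e., the centroids no longer move.
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Euclidean distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # convergence: cluster membership is unchanged
        labels = new_labels
        for i in range(len(centers)):
            if np.any(labels == i):  # guard against an empty cluster
                centers[i] = points[labels == i].mean(axis=0)
    sse = np.sum((points - centers[labels]) ** 2)  # surrogate for WCV
    bcv = np.linalg.norm(centers[0] - centers[1])  # surrogate for BCV (k = 2)
    return centers, labels, sse, bcv

# Points a-h and the initial centers m1, m2 from the worked example
pts = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
centers, labels, sse, bcv = k_means(pts, [(1, 1), (2, 1)])
print(centers)    # [[1.25 1.75] [4.   2.75]], the final centroids
print(sse)        # 6.25 (the slides report 6.23 because they square rounded distances)
print(bcv / sse)  # ~0.47, matching the final BCV/WCV ratio
```

Note that the convergence test compares cluster assignments rather than centroid coordinates, mirroring the slides' "no records shift cluster membership" criterion.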
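The summary's advice about trying multiple initial centers is what library implementations automate. A hedged sketch, assuming scikit-learn is available (its KMeans re-runs the algorithm n_init times and keeps the lowest-SSE solution, exposed as inertia_):

```python
import numpy as np
from sklearn.cluster import KMeans

pts = np.array([(1, 3), (3, 3), (4, 3), (5, 3),
                (1, 2), (4, 2), (1, 1), (2, 1)], dtype=float)

# n_init=10 restarts k-Means from 10 random initializations and keeps
# the run with the lowest SSE, reducing the risk of a poor local minimum
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.cluster_centers_)  # expected: the same two final centroids (order may vary)
print(km.inertia_)          # expected: 6.25, the final SSE
```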