Clustering-Data Warehouse-Lecture Slides, Slides of Data Warehousing

Topics included in this course are: Data Warehousing Concepts; Design and Development; Extraction, Transformation and Loading; OLAP Technology; Data Mining Techniques (Classification, Clustering and Decision Trees); and Advanced Topics. This lecture covers: Clustering, Grouping, Records, Observations, Tasks, Classes, Homogeneity, Unsupervised, Algorithm, Segment

Typology: Slides

2011/2012

Uploaded on 08/08/2012 by sharib_sweet

2 Clustering Task

– Clustering refers to grouping records, observations, or tasks into classes of similar objects
– A cluster is a collection of records that are similar to one another
– Records in one cluster are dissimilar to records in other clusters
– Clustering is an unsupervised data mining task; therefore, no target variable is specified
– Clustering algorithms segment records so as to maximize homogeneity within each subgroup
– Similarity to records outside the cluster is minimized

3 Clustering Task (cont'd)

– For example, Claritas, Inc. provides demographic profiles of geographic areas, according to zip code
– Its PRIZM segmentation system clusters zip codes in terms of lifestyle types
– Recall the clusters identified for 90210 (Beverly Hills, CA):
– Cluster 01: Blue Blood Estates – "Established executives, professionals, and 'old money' heirs that live in America's wealthiest suburbs..."
– Cluster 10: Bohemian Mix
– Cluster 02: Winner's Circle
– Cluster 07: Money and Brains
– Cluster 08: Young Literati

6 Clustering Task (cont'd)

• Measuring Similarity
– Euclidean distance measures the distance between records:
  $d_{\text{Euclidean}}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
  where $x = x_1, x_2, \ldots, x_m$ and $y = y_1, y_2, \ldots, y_m$ represent the attribute values of two records
– Other distance measurements include city-block distance and Minkowski distance:
  $d_{\text{City-Block}}(x, y) = \sum_i |x_i - y_i|$
  $d_{\text{Minkowski}}(x, y) = \Big( \sum_i |x_i - y_i|^q \Big)^{1/q}$

7 Clustering Task (cont'd)

– The "different from" function measures the distance between categorical attribute values:
  $\text{different}(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}$
– Substitute different(x_i, y_i) for each categorical attribute in the Euclidean distance function
– Normalizing the data enhances the performance of clustering algorithms
– Use min-max normalization or Z-score standardization:
  $\text{Min-Max: } X^* = \dfrac{X - \min(X)}{\max(X) - \min(X)} \qquad \text{Z-Score: } X^* = \dfrac{X - \text{mean}(X)}{\text{SD}(X)}$
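To make these similarity measures and normalization formulas concrete, here is a minimal NumPy sketch; the function names are our own illustration, not code from the slides.

```python
import numpy as np

def euclidean(x, y):
    # d_Euclidean(x, y) = sqrt(sum_i (x_i - y_i)^2)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def city_block(x, y):
    # d_City-Block(x, y) = sum_i |x_i - y_i|
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

def minkowski(x, y, q):
    # d_Minkowski(x, y) = (sum_i |x_i - y_i|^q)^(1/q)
    # q = 1 gives city-block distance, q = 2 gives Euclidean distance
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def different(xi, yi):
    # "Different from" function for one categorical attribute
    return 0 if xi == yi else 1

def min_max_normalize(X):
    # X* = (X - min(X)) / (max(X) - min(X)); rescales X to [0, 1]
    X = np.asarray(X, dtype=float)
    return (X - X.min()) / (X.max() - X.min())

def z_score(X):
    # X* = (X - mean(X)) / SD(X); rescales X to mean 0, SD 1
    X = np.asarray(X, dtype=float)
    return (X - X.mean()) / X.std()

print(euclidean((1, 3), (2, 1)))        # 2.236...
print(city_block((1, 3), (2, 1)))       # 3.0
print(minkowski((1, 3), (2, 1), q=2))   # 2.236..., same as Euclidean
print(min_max_normalize([10, 20, 40]))  # [0.  0.333...  1.]
```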
8 Clustering Task (cont'd)

– Clustering identifies groups of highly similar records
– Algorithms construct clusters where the between-cluster variation (BCV) is large compared to the within-cluster variation (WCV)
– Analogous to the concept behind analysis of variance
[Figure: illustration of between-cluster variation vs. within-cluster variation]

11 k-Means Clustering (cont'd)

– The k-Means algorithm terminates when the centroids no longer change
– That is, for the k clusters $C_1, C_2, \ldots, C_k$, all records "owned" by each cluster remain in that cluster
– A convergence criterion may also cause termination, for example, no significant reduction in SSE:
  $\text{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2$
  where $p$ is each data point in cluster $i$ and $m_i$ represents the centroid of cluster $i$

12 Example of k-Means Clustering at Work

– Assume k = 2 to cluster the following data points:
  a = (1, 3), b = (3, 3), c = (4, 3), d = (5, 3), e = (1, 2), f = (4, 2), g = (1, 1), h = (2, 1)
– Step 1: k = 2 specifies the number of clusters to partition
– Step 2: Randomly assign k = 2 cluster centers, for example m1 = (1, 1) and m2 = (2, 1)
• First Iteration
– Step 3: For each record, find the nearest cluster center
– Euclidean distances from the points to m1 and m2:

  Point            |  a   |  b   |  c   |  d   |  e   |  f   |  g   |  h
  Distance from m1 | 2.00 | 2.83 | 3.61 | 4.47 | 1.00 | 3.16 | 0.00 | 1.00
  Distance from m2 | 2.24 | 2.24 | 2.83 | 3.61 | 1.41 | 2.24 | 1.00 | 0.00
  Cluster          |  C1  |  C2  |  C2  |  C2  |  C1  |  C2  |  C1  |  C2

13 Example of k-Means Clustering at Work (cont'd)

– Cluster C1 (center m1) contains {a, e, g} and cluster C2 (center m2) contains {b, c, d, f, h}
– With cluster membership assigned, the SSE can now be calculated:
  $\text{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 = 2.00^2 + 2.24^2 + 2.83^2 + 3.61^2 + 1.00^2 + 2.24^2 + 0^2 + 0^2 = 36$
– Recall that clusters are constructed so that the between-cluster variation (BCV) is large compared to the within-cluster variation (WCV)
– The ratio BCV/WCV is therefore expected to increase over successive iterations:
  $\dfrac{\text{BCV}}{\text{WCV}} = \dfrac{d(m_1, m_2)}{\text{SSE}} = \dfrac{1}{36} = 0.0278$
  where $d(m_1, m_2)$ serves as a surrogate for BCV and SSE as a surrogate for WCV

16 Example of k-Means Clustering at Work (cont'd)

– The cluster centroids are updated to m1 = (1.25, 1.75) and m2 = (4, 2.75)
– After the second iteration, the cluster centroids move only slightly
[Figure: scatter plot of points a-h with the updated centroids after the second iteration]

17 Example of k-Means Clustering at Work (cont'd)

• Third (Final) Iteration
– Repeat the procedure for Steps 3-4
– For each record, find the nearest cluster center, m1 = (1.25, 1.75) or m2 = (4, 2.75)
– SSE = 6.23, and BCV/WCV = 0.4703
– Again, BCV/WCV has increased compared to the previous value of 0.3346
– This time, no records shift cluster membership
– The centroids remain unchanged; therefore, the algorithm terminates (a runnable sketch of this full run follows the summary below)

18 Example of k-Means Clustering at Work (cont'd)

• Summary
– k-Means is not guaranteed to find the global minimum SSE; instead, a local minimum is found
– Invoking the algorithm with a variety of initial cluster centers improves the probability of achieving the global minimum
– One approach places the first cluster center at a random point, with the remaining centers placed far from the previous centers (Moore)
– What is an appropriate value for k? Choosing k is a potential problem when applying k-Means, although the analyst may have a priori knowledge of k
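To tie the worked example together, here is a minimal k-Means sketch in Python (our own illustration, not code from the slides) that reproduces the run above from the same eight points and initial centers:

```python
import numpy as np

def k_means(points, centers, max_iter=100):
    # Plain k-Means: assign each record to its nearest center (Step 3),
    # recompute each centroid (Step 4), and stop once no record
    # changes cluster, i.e., the centroids no longer move.
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Euclidean distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # convergence: cluster membership is unchanged
        labels = new_labels
        for i in range(len(centers)):
            if np.any(labels == i):  # guard against an empty cluster
                centers[i] = points[labels == i].mean(axis=0)
    sse = np.sum((points - centers[labels]) ** 2)  # surrogate for WCV
    bcv = np.linalg.norm(centers[0] - centers[1])  # surrogate for BCV (k = 2)
    return centers, labels, sse, bcv

# Points a-h and the initial centers m1, m2 from the worked example
pts = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
centers, labels, sse, bcv = k_means(pts, [(1, 1), (2, 1)])
print(centers)    # [[1.25 1.75] [4.   2.75]], the final centroids
print(sse)        # 6.25 (the slides report 6.23 because they square rounded distances)
print(bcv / sse)  # ~0.47, matching the final BCV/WCV ratio
```

Note that the convergence test compares cluster assignments rather than centroid coordinates, mirroring the slides' "no records shift cluster membership" criterion.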
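The summary's advice about trying multiple initial centers is what library implementations automate. A hedged sketch, assuming scikit-learn is available (its KMeans re-runs the algorithm n_init times and keeps the lowest-SSE solution, exposed as inertia_):

```python
import numpy as np
from sklearn.cluster import KMeans

pts = np.array([(1, 3), (3, 3), (4, 3), (5, 3),
                (1, 2), (4, 2), (1, 1), (2, 1)], dtype=float)

# n_init=10 restarts k-Means from 10 random initializations and keeps
# the run with the lowest SSE, reducing the risk of a poor local minimum
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.cluster_centers_)  # expected: the same two final centroids (order may vary)
print(km.inertia_)          # expected: 6.25, the final SSE
```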