
Huffman Codes and Association Rules: Learning Data Compression and Market Basket Analysis, Slides of Database Management Systems (DBMS)

An in-depth explanation of Huffman codes and association rules. It covers the construction of Huffman codes through a worked example, then discusses association rules, their generation from frequent itemsets, and the difference between classification and association rules.

Typology: Slides

2012/2013

Uploaded on 05/06/2013

anuragini 🇮🇳


Partial preview of the text

Lecture 45: Huffman Codes and Association Rules (II)

Huffman Code Example
• Given:
    A B C D E
    3 1 2 4 6
• Sorting the symbols from smallest to largest frequency gives:
    B C A D E
    1 2 3 4 6

Huffman Code Example – Step 3
• Merging the two lowest-weight nodes again combines the subtree BC (weight 3) with A (weight 3) into a subtree of weight 6.

Huffman Code Example – Step 4
• From the B C A D E ordering, the remaining nodes are D (4), E (6) and the merged subtree of A with BC (6); depending on how the tie between the two weight-6 nodes is broken, several equivalent trees arise. [Tree diagrams not reproduced in this preview.]

Huffman Code Example – Step 5
• Continuing the merges from the previous step produces weight-10 subtrees (for example, D with the A/BC subtree), which are finally combined with the remaining weight-6 node into the complete Huffman tree. [Tree diagrams not reproduced in this preview.]

Example
• Items = {milk, coke, pepsi, beer, juice}.
• Support threshold = 3 baskets.
    B1 = {m, c, b}      B2 = {m, p, j}
    B3 = {m, b}         B4 = {c, j}
    B5 = {m, p, b}      B6 = {m, c, b, j}
    B7 = {c, b, j}      B8 = {b, c}
• Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.

Association Rules
• Association rule R: Itemset1 => Itemset2
  – Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty.
  – Meaning: if a transaction includes Itemset1, then it also includes Itemset2.
• Examples
  – A, B => E, C
  – A => B, C

Example
    B1 = {m, c, b}      B2 = {m, p, j}
    B3 = {m, b}         B4 = {c, j}
    B5 = {m, p, b}      B6 = {m, c, b, j}
    B7 = {c, b, j}      B8 = {b, c}
• An association rule: {m, b} => c.
  – Confidence = 2/4 = 50%: of the four baskets containing {m, b} (B1, B3, B5, B6), two (B1, B6) also contain c.

(Bottleneck of candidate generation)
» Multiple database scans are costly.
» Mining long patterns needs many passes of scanning and generates lots of candidates.
» To find the frequent itemset i1 i2 ... i100:
  – # of scans: 100
  – # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30
» Bottleneck: candidate generation and test.
» Can we avoid candidate generation?

Problems with Association Rules
• For 1000 items there are 2^1000 itemsets.
• Each k-itemset gives 2^k rules.
• Huge data sets (hundreds of gigabytes).
• But: only rules with high support and confidence might be interesting...
• Conclusion: restrict the search to rules with high support and confidence.

From Frequent Itemsets to Association Rules
• Q: Given the frequent set {A, B, E}, what are the possible association rules?
  – A => B, E
  – A, B => E
  – A, E => B
  – B => A, E
  – B, E => A
  – E => A, B
  – __ => A, B, E (empty rule), or true => A, B, E

Association Rules Example
    TID | List of items
     1  | A, B, E
     2  | B, D
     3  | B, C
     4  | A, B, D
     5  | A, C
     6  | B, C
     7  | A, C
     8  | A, B, C, E
     9  | A, B, C
• Q: Given the frequent set {A, B, E}, which association rules have minsup = 2 and minconf = 50%?
  – A, B => E : conf = 2/4 = 50%
  – A, E => B : conf = 2/2 = 100%
  – B, E => A : conf = 2/2 = 100%
  – E => A, B : conf = 2/2 = 100%
• Don't qualify:
  – A => B, E : conf = 2/6 = 33% < 50%
  – B => A, E : conf = 2/7 = 28% < 50%
  – __ => A, B, E : conf = 2/9 = 22% < 50%

Find Strong Association Rules
• A rule has the parameters minsup and minconf:
  – sup(R) >= minsup and conf(R) >= minconf
• Problem:
  – Find all association rules with the given minsup and minconf.
• First, find all frequent itemsets.

Finding Frequent Itemsets
• Start by finding one-item sets (easy).
• Q: How?
• A: Simply count the frequencies of all items.

Finding Association Rules
• A typical question: "find all association rules with support ≥ s and confidence ≥ c."
  – Note: the "support" of an association rule is the support of the set of items it mentions.
• Hard part: finding the high-support (frequent) itemsets.
  – Checking the confidence of association rules involving those sets is relatively easy.
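The Huffman construction at the top of this preview (A–E with frequencies 3, 1, 2, 4, 6) can be checked mechanically. The following is a minimal Python sketch of the standard heap-based construction, not code from the slides; with this particular tie-breaking it yields A = 00, B = 010, C = 011, D = 10, E = 11, consistent with the merges in Steps 3–5.

    import heapq

    def huffman_codes(freqs):
        """Build a Huffman code from {symbol: frequency} by repeatedly
        merging the two lowest-weight subtrees."""
        # Heap entries are (weight, tiebreak, {symbol: code-so-far});
        # the unique tiebreak keeps the dicts from ever being compared.
        heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freqs.items()))]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            w1, _, low = heapq.heappop(heap)    # lowest weight gets bit 0
            w2, _, high = heapq.heappop(heap)   # next lowest gets bit 1
            merged = {s: "0" + c for s, c in low.items()}
            merged.update({s: "1" + c for s, c in high.items()})
            heapq.heappush(heap, (w1 + w2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    # The slides' example: B(1) and C(2) merge first, then A(3) with BC(3), etc.
    print(huffman_codes({"A": 3, "B": 1, "C": 2, "D": 4, "E": 6}))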
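Support and confidence, as defined in the slides above, are mechanical to compute once the baskets are in memory. Below is a minimal Python sketch (the helper names support and confidence are ours, not the slides') that reproduces the frequent itemsets and the {m, b} => c rule from the eight-basket example.

    from itertools import combinations

    # Baskets B1-B8 (m = milk, c = coke, p = pepsi, b = beer, j = juice).
    baskets = [
        {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
        {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
    ]

    def support(itemset, baskets):
        """Number of baskets containing every item of itemset."""
        return sum(1 for b in baskets if itemset <= b)

    def confidence(lhs, rhs, baskets):
        """Fraction of the baskets containing lhs that also contain rhs."""
        return support(lhs | rhs, baskets) / support(lhs, baskets)

    # Frequent 1- and 2-itemsets at the support threshold of 3 baskets.
    items = sorted(set().union(*baskets))
    for k in (1, 2):
        for combo in combinations(items, k):
            if support(set(combo), baskets) >= 3:
                print(set(combo), "support =", support(set(combo), baskets))

    # The slides' rule {m, b} => c: confidence = 2/4 = 50%.
    print("conf({m, b} => c) =", confidence({"m", "b"}, {"c"}, baskets))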
Naïve Algorithm
• A simple way to find frequent pairs:
  – Read the file once, counting in main memory the occurrences of each pair.
  – Expand each basket of n items into its n(n-1)/2 pairs.
• Fails if the number of items squared exceeds main memory.

(Enumerating subsets)
» Subsets of I can be enumerated systematically.
» I = {a, b, c, d}:
    ab  ac  ad  bc  bd  cd
    abc abd acd bcd
    abcd

Example Database
    Database D:
    TID | Items
    100 | 1 3 4
    200 | 2 3 5
    300 | 1 2 3 5
    400 | 2 5

    Itemset supports (* marks frequent itemsets):
    {1}: 2*     {2}: 3*    {3}: 3*    {5}: 3*
    {1 2}: 1    {1 3}: 2*  {1 4}: 1   {1 5}: 1
    {2 3}: 2*   {2 5}: 3*  {3 4}: 1   {3 5}: 2*
    {1 3 4}: 1  {1 3 5}: 1  {2 3 5}: 2*

(Definitions)
» Each DB transaction is a set of items (called an itemset).
» A k-itemset is an itemset of cardinality k.
» An association rule is A => B, such that A and B are itemsets with an empty intersection, containing items that co-occur in transactions.

Association Rule Mining Process
1. Find all frequent itemsets based on minimum support.
2. Identify strong association rules from the frequent itemsets.

Rule Representations
» Types of values: Boolean, quantitative (binned)
» Number of dimensions/predicates
» Levels of abstraction/type hierarchy

Candidates and Large Itemsets (s = 30%, α = 50%)
    Candidates                                       Large itemsets
    {Beer}, {Bread}, {Jelly},                        {Beer}, {Bread}, {Milk},
      {Milk}, {PeanutButter}                           {PeanutButter}
    {Beer, Bread}, {Beer, Milk},                     {Bread, PeanutButter}
      {Beer, PeanutButter}, {Bread, Milk},
      {Bread, PeanutButter}, {Milk, PeanutButter}

Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have minimum support.
  – A subset of a frequent itemset must also be a frequent itemset, i.e. if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
• Use the frequent itemsets to generate association rules.

Mining Frequent Itemsets: Basic Idea
• Naïve algorithm: count the frequency of all possible subsets of I in the database D.
  – Too expensive, since there are 2^m such itemsets for m items.
• The Apriori principle: any subset of a frequent itemset must be frequent.
• Method based on the Apriori principle:
  – First count the 1-itemsets, then the 2-itemsets, then the 3-itemsets, and so on.
  – When counting (k+1)-itemsets, only consider those (k+1)-itemsets for which all subsets of length k were determined to be frequent in the previous step.

Apriori Algorithm
    L1 = {large 1-itemsets}                      // count item frequency
    for (k = 2; L(k-1) ≠ {}; k++) do begin
        Ck = apriori-gen(L(k-1))                 // new candidates
        for all transactions t ∈ D do begin
            Ct = subset(Ck, t)                   // candidates in transaction
            for all candidates c ∈ Ct do
                c.count++                        // determine support
        end
        Lk = {c ∈ Ck | c.count ≥ minsup}         // create new set
    end
    Answer = ∪k Lk

How to Generate Candidates?
• Suppose the items in L(k-1) are listed in an order.
• Step 1: self-join L(k-1)
    insert into Ck
    select p.item_1, p.item_2, ..., p.item_(k-1), q.item_(k-1)
    from L(k-1) p, L(k-1) q
    where p.item_1 = q.item_1, ..., p.item_(k-2) = q.item_(k-2), p.item_(k-1) < q.item_(k-1)
• Step 2: pruning
    for all itemsets c in Ck do
        for all (k-1)-subsets s of c do
            if (s is not in L(k-1)) then delete c from Ck
• Example
  – L3 = {(1 2 3), (1 2 4), (1 3 4), (1 3 5), (2 3 4)}
  – Candidates after the join step: {(1 2 3 4), (1 3 4 5)}
  – In the pruning step: delete (1 3 4 5), because (3 4 5) ∉ L3
  – C4 = {(1 2 3 4)}
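The join and prune steps above translate directly into a few lines of Python. This sketch keeps itemsets as sorted tuples (a representation we chose for the example; the slides do not prescribe one) and reproduces the L3 to C4 example.

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
        L_prev = set(L_prev)
        # Join step: merge two (k-1)-itemsets sharing their first k-2 items.
        joined = {p + (q[-1],) for p in L_prev for q in L_prev
                  if p[:-1] == q[:-1] and p[-1] < q[-1]}
        # Prune step: drop candidates having any infrequent (k-1)-subset.
        return {c for c in joined
                if all(s in L_prev for s in combinations(c, k - 1))}

    # The slide's example: (1 3 4 5) is pruned, leaving C4 = {(1, 2, 3, 4)}.
    L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
    print(apriori_gen(L3, 4))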
(Apriori walkthrough on database D)
• Suppose a user-defined minimum support of 2 transactions (50%).
    Database D:
    TID | Items
    100 | 1 3 4
    200 | 2 3 5
    300 | 1 2 3 5
    400 | 2 5
• Writing the items as A–E, the (k = 1) supports are {A} 50%, {B} 75%, {C} 75%, {D} 25%, {E} 75%; {D} is dropped as infrequent.
• The surviving (k = 3) itemset is {B, C, E}; the (k = 4) candidate set is empty.
• n items imply O(2^n) computational complexity.
[The slide's full table of per-pass candidate and large itemsets is not reproduced in this preview.]

Apriori Algorithm
• support(A => B) = (# of tuples containing both A and B) / (total # of tuples)
• Min support = 2.
[The slide's 1-, 2- and 3-itemset support-count tables are not reproduced in this preview.]

Solution Procedure
• Step 3: the 2-itemsets include {I1, I2} (Beer, Diaper), {I1, I6} (Beer, Milk) and {I2, I3} (Diaper, Baby powder), with supports such as 4/9.
• Step 4: L2 is not null, so Step 2 is repeated; the join yields {I1, I2, I3} (Beer, Diaper, Baby powder).
• Step 5: min_sup = 40%, min_conf = 10%.

Results
• Some rules are believable, like Baby powder => Diaper.
• Some rules need additional analysis, like Milk => Beer.
• Some rules are unbelievable, like Diaper => Beer.
• Note: this example could contain unreal results.

The Apriori Algorithm: Example
    TID  | List of Items
    T100 | I1, I2, I5
    T200 | I2, I4
    T300 | I2, I3
    T400 | I1, I2, I4
    T500 | I1, I3
    T600 | I2, I3
    T700 | I1, I3
    T800 | I1, I2, I3, I5
    T900 | I1, I2, I3
• Consider a database, D, consisting of 9 transactions.
• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).
• Let the minimum confidence required be 70%.
• We first find the frequent itemsets using the Apriori algorithm.
• Then, association rules are generated using min. support and min. confidence.

Step 1: Generating 1-itemset Frequent Pattern
    Itemset | Sup. Count
    {I1}    | 6
    {I2}    | 7
    {I3}    | 6
    {I4}    | 2
    {I5}    | 2
• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1 (scan D for the count of each candidate).
• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support (compare each candidate's support count with the minimum support count); here L1 = C1.
[Step 2, which derives L2 from the candidate 2-itemsets, is not part of this preview.]

Step 3: Generating 3-itemset Frequent Pattern
    Itemset      | Sup. Count
    {I1, I2, I3} | 2
    {I1, I2, I5} | 2
• The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
• In order to find C3, we compute L2 Join L2.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
• Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck.

Step 3: Generating 3-itemset Frequent Pattern [Cont.]
• Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
• For example, take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
• Now take {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2, hence it is not frequent, violating the Apriori property; thus we remove {I2, I3, I5} from C3.
• Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the join result during pruning.
• Now the transactions in D are scanned to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
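Steps 1–3 (and Step 4, which follows) are one specialization of the generic Apriori loop from the pseudocode earlier. As a cross-check, here is a compact Python sketch, our own scaffolding rather than the slides' code, that runs on the nine transactions T100–T900 and reproduces L1, L2 and L3 = {{I1, I2, I3}, {I1, I2, I5}}.

    from itertools import combinations

    def apriori(transactions, min_count):
        """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
        # Pass 1: count the 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                counts[(item,)] = counts.get((item,), 0) + 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result, k = dict(frequent), 2
        while frequent:
            prev = set(frequent)
            # Candidate generation: join, then prune (apriori-gen).
            joined = {p + (q[-1],) for p in prev for q in prev
                      if p[:-1] == q[:-1] and p[-1] < q[-1]}
            candidates = {c for c in joined
                          if all(s in prev for s in combinations(c, k - 1))}
            # One scan of the database counts the surviving candidates.
            counts = {c: sum(1 for t in transactions if set(c) <= t)
                      for c in candidates}
            frequent = {c: n for c, n in counts.items() if n >= min_count}
            result.update(frequent)
            k += 1
        return result

    D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
         {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
         {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
    for itemset in sorted(apriori(D, 2), key=lambda c: (len(c), c)):
        print(itemset)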
Step 4: Generating 4-itemset Frequent Pattern
• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned, since its subset {I2, I3, I5} is not frequent.
• Thus C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.
• What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating Association Rules from Frequent Itemsets [Cont.]
  – R4: I1 => I2 ^ I5
    • Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%; R4 is rejected.
  – R5: I2 => I1 ^ I5
    • Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%; R5 is rejected.
  – R6: I5 => I1 ^ I2
    • Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%; R6 is selected.
• In this way, we have found three strong association rules.

Example
[Diagram not reproduced: the lattice of candidate rules derived from the large itemset ABCDE (e.g. ACDE => B, ABCE => D), contrasting the simple rule-generation algorithm with the fast algorithm.]

Example: pass 1
• The first pass over transactions T001–T010 counts the 1-itemsets:
    Itemset | Count
    {A}     | 6
    {B}     | 7
    {C}     | 6
    {D}     | 2
    {E}     | 2
• Itemset {F} is infrequent.
[The transaction table is not fully reproduced in this preview.]

Example: pass 4
• Generate candidates C4 from L3:
    Itemset   | Count
    {A, B, C} | 2
    {A, B, E} | 2
• With minsup = 20%, C4 and L4 are empty, so the algorithm stops.

Classification Examples
• Teachers classify students' grades as A, B, C, D, or F.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Identify individuals with credit risks.
• Speech recognition.
• Pattern recognition.
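To close the loop on Step 5 above: generating every candidate rule from a frequent itemset is a small enumeration of its non-empty proper subsets. The sketch below is our own illustration (the support counts are copied from the worked example above); with min_conf = 70% it keeps exactly the three strong rules found there.

    from itertools import combinations

    # Support counts from the nine-transaction example above.
    support_count = {
        ("I1",): 6, ("I2",): 7, ("I5",): 2,
        ("I1", "I2"): 4, ("I1", "I5"): 2, ("I2", "I5"): 2,
        ("I1", "I2", "I5"): 2,
    }

    def rules_from(itemset, min_conf):
        """Yield (lhs, rhs, conf) for every rule lhs => rhs meeting min_conf."""
        items = tuple(sorted(itemset))
        for r in range(1, len(items)):            # every non-empty proper subset
            for lhs in combinations(items, r):
                rhs = tuple(i for i in items if i not in lhs)
                conf = support_count[items] / support_count[lhs]
                if conf >= min_conf:
                    yield lhs, rhs, conf

    for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, 0.70):
        print(lhs, "=>", rhs, f"conf = {conf:.0%}")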