Lecture 45
Huffman Codes and Association Rules (II)
Docsity.com
Huffman Code Example
• Given symbols with frequencies:
  A B C D E
  3 1 2 4 6
• Sorting by increasing frequency gives:
  B C A D E
  1 2 3 4 6
(Step 2, not shown here, merges the two smallest nodes, B (1) and C (2), into a node BC of weight 3.)

Huffman Code Example – Step 3
• Doing another merge combines A (3) with the node BC (3) into a subtree ABC of weight 6.

Huffman Code Example – Step 4
• From the queue BC A D E, the subtrees D (4), E (6), and ABC (6) remain; since E and ABC tie at weight 6, several equivalent orderings of the queue are possible (D E ABC, D ABC E, and so on).

Huffman Code Example – Step 5
• Merging the two smallest subtrees, D (4) and one of the weight-6 subtrees, gives a subtree of weight 10; a final merge with the remaining weight-6 subtree yields the root, of weight 16.

Example
• Items = {milk, coke, pepsi, beer, juice}.
• Support threshold = 3 baskets.
  B1 = {m, c, b}    B2 = {m, p, j}
  B3 = {m, b}       B4 = {c, j}
  B5 = {m, p, b}    B6 = {m, c, b, j}
  B7 = {c, b, j}    B8 = {b, c}
• Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.

Association Rules
• An association rule R: Itemset1 => Itemset2
  – Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty
  – Meaning: if a transaction includes Itemset1, then it also includes Itemset2
• Examples
  – A, B => E, C
  – A => B, C

Example
(Baskets B1–B8 as above.)
• An association rule: {m, b} => c.
  – Confidence = 2/4 = 50%: four baskets (B1, B3, B5, B6) contain {m, b}, and two of them (B1, B6) also contain c.

» Multiple database scans are costly
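The support and confidence computations above can be reproduced with a few lines of Python (a minimal sketch of ours; the list of baskets is from the example, but the helper names are not part of the lecture):

```python
from itertools import combinations

# The eight baskets from the example (m=milk, c=coke, p=pepsi, b=beer, j=juice)
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset):
    """Number of baskets containing every item of `itemset`."""
    return sum(1 for basket in baskets if set(itemset) <= basket)

def confidence(lhs, rhs):
    """conf(lhs => rhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

# Singletons and pairs meeting the support threshold of 3 baskets
items = sorted({i for basket in baskets for i in basket})
frequent = [s for k in (1, 2) for s in combinations(items, k) if support(s) >= 3]

print(frequent)
print(confidence({"m", "b"}, {"c"}))  # 0.5, i.e. the 50% from the slide
```

Running this lists exactly the frequent itemsets given on the slide and confirms the 50% confidence of {m, b} => c.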
» Mining long patterns needs many passes of scanning and generates lots of candidates
» To find the frequent itemset {i1, i2, ..., i100}:
  > # of scans: 100
  > # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
» Bottleneck: candidate generation and test
» Can we avoid candidate generation?
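The candidate count above follows from summing binomial coefficients; a quick check in Python (our own snippet, not from the slides):

```python
from math import comb

# Candidate itemsets over 100 items: C(100,1) + C(100,2) + ... + C(100,100)
n_candidates = sum(comb(100, k) for k in range(1, 101))

assert n_candidates == 2**100 - 1   # the closed form from the slide
print(f"{n_candidates:.2e}")        # ≈ 1.27e+30
```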
Problems with Association Rules
• For 1000 items there are 2^1000 itemsets
• Each k-itemset gives 2^k rules
• Huge data sets (hundreds of gigabytes)
But:
• Only rules with high support and confidence might be interesting ...
Conclusion:
• Focus the search on rules with high support & confidence
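Both counts above can be checked by enumeration (our snippet; we read "2^k rules per k-itemset" as one rule per antecedent subset, counting the degenerate splits with an empty antecedent or empty consequent):

```python
from itertools import combinations

# 2**1000 itemsets over 1000 items: a 302-digit number
assert len(str(2**1000)) == 302

def splits(itemset):
    """All antecedent/consequent splits S => (itemset - S) of an itemset."""
    k = len(itemset)
    return [(set(s), set(itemset) - set(s))
            for r in range(k + 1) for s in combinations(sorted(itemset), r)]

# A k-itemset gives 2**k such splits
print(len(splits({"A", "B", "E"})))  # 8 == 2**3
```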
From Frequent Itemsets to Association Rules • Q: Given frequent set {A,B,E}, what are possible association rules? – A => B, E – A, B => E – A, E => B – B => A, E – B, E => A – E => A, B – __ => A,B,E (empty rule), or true => A,B,E Docsity.com Association Rules Example: • Q: Given frequent set {A,B,E}, what association rules have minsup = 2 and minconf= 50% ? A, B => E : conf=2/4 = 50% A, E => B : conf=2/2 = 100% B, E => A : conf=2/2 = 100% E => A, B : conf=2/2 = 100% Don’t qualify A =>B, E : conf=2/6 =33%< 50% B => A, E : conf=2/7 = 28% < 50% __ => A,B,E : conf: 2/9 = 22% < 50% TID List of items 1 A, B, E 2 B, D 3 B, C 4 A, B, D 5 A, C 6 B, C 7 A, C 8 A, B, C, E 9 A, B, C Docsity.com Find Strong Association Rules • A rule has the parameters minsup and minconf: – sup(R) >= minsup and conf (R) >= minconf • Problem: – Find all association rules with given minsup and minconf • First, find all frequent itemsets Docsity.com Finding Frequent Itemsets • Start by finding one-item sets (easy) • Q: How? • A: Simply count the frequencies of all items Docsity.com Finding Association Rules • A typical question: “find all association rules with support ≥ s and confidence ≥ c.” – Note: “support” of an association rule is the support of the set of items it mentions. • Hard part: finding the high-support (frequent ) itemsets. – Checking the confidence of association rules involving those sets is relatively easy. Docsity.com Naïve Algorithm • A simple way to find frequent pairs is: – Read file once, counting in main memory the occurrences of each pair. • Expand each basket of n items into its n (n -1)/2 pairs. • Fails if #items-squared exceeds main memory. Docsity.com » Subsets of J can be enumerated
systematically
» J={a, b,c, d} =
ab ac ad be bd cd
abe abd acd bed
N\
abed
® Docsity.com
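This level-by-level enumeration can be sketched with the standard library (our illustration, not from the slides):

```python
from itertools import combinations

def enumerate_subsets(items):
    """Yield every non-empty subset of `items`, level by level (size 1, 2, ...)."""
    for k in range(1, len(items) + 1):
        yield from (frozenset(s) for s in combinations(items, k))

levels = list(enumerate_subsets("abcd"))
print(len(levels))  # 15 == 2**4 - 1 non-empty subsets
```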
Example
Database:
  TID  Items
  100  1 3 4
  200  2 3 5
  300  1 2 3 5
  400  2 5

L1 (frequent 1-itemsets):
  Itemset  Support
  {1}      2
  {2}      3
  {3}      3
  {5}      3

C2 (candidate 2-itemsets; * = frequent):
  Itemset  Support
  {1 2}    1
  {1 3}*   2
  {1 4}    1
  {1 5}    1
  {2 3}*   2
  {2 5}*   3
  {3 4}    1
  {3 5}*   2

C3 (candidate 3-itemsets; * = frequent):
  Itemset   Support
  {1 3 4}   1
  {1 3 5}   1
  {2 3 5}*  2
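The support counts in these tables can be verified directly (a small sketch of ours; the database is the one above):

```python
from itertools import combinations

# The four transactions of the example database
db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db.values() if set(itemset) <= t)

# Frequent itemsets at minimum support 2, level by level (the starred entries)
items = sorted(set().union(*db.values()))
for k in (1, 2, 3):
    frequent = [c for c in combinations(items, k) if support(c) >= 2]
    print(k, frequent)
```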
» Each DB transaction is a set of items (called an itemset).
» A k-itemset is an itemset of cardinality k.
» An association rule is A => B, such that A and B are itemsets with empty intersection, containing items that co-occur in transactions.
Association Rule Mining Process
1. Find all frequent itemsets based on minimum support
2. Identify strong association rules from the frequent itemsets

Rule Representations
» Types of values: Boolean, quantitative (binned)
» Number of dimensions/predicates
» Levels of abstraction/type hierarchy
Candidates vs. Large Itemsets (s = 30%, α = 50%)

Candidates:
  {Beer}, {Bread}, {Jelly}, {Milk}, {PeanutButter}
Large itemsets:
  {Beer}, {Bread}, {Milk}, {PeanutButter}

Candidates:
  {Beer, Bread}, {Beer, Milk}, {Beer, PeanutButter},
  {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}
Large itemsets:
  {Bread, PeanutButter}
Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have minimum support
  – A subset of a frequent itemset must also be a frequent itemset
    • i.e. if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.
Mining Frequent Itemsets: Basic Idea
• Naive algorithm: count the frequency of all possible subsets of I in the database D
  – Too expensive, since there are 2^|I| such itemsets
• The Apriori principle:
  Any subset of a frequent itemset must be frequent
• Method based on the Apriori principle:
  – First count the 1-itemsets, then the 2-itemsets, then the 3-itemsets, and so on
  – When counting (k+1)-itemsets, only consider those (k+1)-itemsets where all subsets of length k have been determined as frequent in the previous step
Apriori Algorithm
  L1 = {large 1-itemsets}                      // count item frequency
  for (k = 2; L(k-1) ≠ {}; k++) do begin
      Ck = apriori-gen(L(k-1))                 // new candidates
      for all transactions t ∈ D do begin
          Ct = subset(Ck, t)                   // candidates contained in t
          for all candidates c ∈ Ct do
              c.count++                        // determine support
          end
      end
      Lk = {c ∈ Ck | c.count ≥ minsup}         // create new set
  end
  Answer = ∪k Lk
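The pseudocode above can be rendered as a compact runnable sketch (ours, not from the lecture; `apriori_gen` here approximates the join step by unioning pairs of (k-1)-itemsets and then pruning, rather than the ordered SQL-style join of the next slide):

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Generate k-item candidates, pruning those with an infrequent (k-1)-subset."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

def apriori(db, minsup):
    """Level-wise frequent-itemset mining over a list of transaction sets."""
    freq = {frozenset([i]) for t in db for i in t}
    freq = {c for c in freq if sum(c <= t for t in db) >= minsup}
    answer, k = set(freq), 2
    while freq:
        candidates = apriori_gen(freq, k)
        freq = {c for c in candidates if sum(c <= t for t in db) >= minsup}
        answer |= freq
        k += 1
    return answer

# The four-transaction database from the earlier example
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, minsup=2)
print(sorted(map(sorted, result)))  # ends with the 3-itemset [2, 3, 5]
```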
How to Generate Candidates?
Suppose the items in L(k-1) are listed in an order.
• Step 1: self-joining L(k-1)
    insert into Ck
    select p.item1, p.item2, ..., p.item(k-1), q.item(k-1)
    from L(k-1) p, L(k-1) q
    where p.item1 = q.item1, ..., p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)
• Step 2: pruning
    for all itemsets c in Ck do
        for all (k-1)-subsets s of c do
            if (s is not in L(k-1)) then delete c from Ck
Example
• L3 = {(1 2 3), (1 2 4), (1 3 4), (1 3 5), (2 3 4)}
• Candidates after the join step: {(1 2 3 4), (1 3 4 5)}
• In the pruning step: delete (1 3 4 5) because (3 4 5) ∉ L3
• C4 = {(1 2 3 4)}
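The join and prune steps on this slide can be checked with a short snippet (ours, independent of the slides):

```python
from itertools import combinations

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}

# Join: pairs agreeing on the first k-2 items, ordered on the last item
joined = {p[:-1] + (p[-1], q[-1]) for p in L3 for q in L3
          if p[:-1] == q[:-1] and p[-1] < q[-1]}

# Prune: keep candidates whose every (k-1)-subset is in L3
C4 = {c for c in joined if all(s in L3 for s in combinations(c, 3))}

print(joined)  # {(1, 2, 3, 4), (1, 3, 4, 5)}
print(C4)      # {(1, 2, 3, 4)}
```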
Example
• Min. support: 2 transactions (a user-defined minimum support of 50%).
• Database D, with the items relabeled A–E (1 = A, ..., 5 = E):
  TID  Items
  100  A C D
  200  B C E
  300  A B C E
  400  B E
• k = 1: the supports are {A} 50%, {B} 75%, {C} 75%, {D} 25%, {E} 75%; {D} is pruned.
• k = 2: the frequent 2-itemsets are {A C}, {B C}, {B E}, {C E}.
• k = 3: the only frequent 3-itemset is {B C E}.
• k = 4: no candidates remain, and the algorithm stops.
• In general, n items imply O(2^n) candidate itemsets, i.e. exponential computational complexity.
Apriori Algorithm

  support(A => B) = (# tuples containing both A and B) / (total # of tuples)

With minimum support = 2 on the nine-transaction database used in the worked example below:

  1-Itemsets  Sup-count    2-Itemsets  Sup-count    3-Itemsets    Sup-count
  {I1}        6            {I1, I2}    4            {I1, I2, I3}  2
  {I2}        7            {I1, I3}    4            {I1, I2, I5}  2
  {I3}        6            {I1, I5}    2
  {I4}        2            {I2, I3}    4
  {I5}        2            {I2, I4}    2
                           {I2, I5}    2
Solution Procedure
• Step 2: generate candidate 2-itemsets and scan for their supports:
  C2:
  {I1, I2}  Beer, Diaper
  {I1, I6}  Beer, Milk
  {I2, I3}  Diaper, Baby powder
• Step 3: keep the candidates meeting minimum support as L2; e.g. {I2, I3} (Diaper, Baby powder) has support 4/9.

Solution Procedure
• Step 4: L2 is not null, so repeat Step 2:
  C3:
  {I1, I2, I3}  Beer, Diaper, Baby powder

Solution Procedure
• Step 5: generate association rules from the frequent itemsets with
  min_sup = 40%, min_conf = 10%.

Results
• Some rules are believable, like Baby powder => Diaper.
• Some rules need additional analysis, like Milk => Beer.
• Some rules are unbelievable, like Diaper => Beer.
• Note: this example could contain unreal results.
The Apriori Algorithm: Example
• Consider a database, D, consisting of 9 transactions:
  TID   List of Items
  T100  I1, I2, I5
  T200  I2, I4
  T300  I2, I3
  T400  I1, I2, I4
  T500  I1, I3
  T600  I2, I3
  T700  I1, I3
  T800  I1, I2, I3, I5
  T900  I1, I2, I3
• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We have to first find the frequent itemsets using the Apriori algorithm.
• Then, association rules will be generated using min. support & min. confidence.

Step 1: Generating 1-itemset Frequent Pattern
• Scan D for the count of each candidate:
  Itemset  Sup. Count
  {I1}     6
  {I2}     7
  {I3}     6
  {I4}     2
  {I5}     2
• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 = C1.

Step 3: Generating 3-itemset Frequent Pattern
• The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
• In order to find C3, we compute L2 Join L2.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
• Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to large Ck.
• After pruning and scanning D for the counts, L3 = {{I1, I2, I3}: 2, {I1, I2, I5}: 2}.

Step 3: Generating 3-itemset Frequent Pattern [Cont.]
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How? For example, let's take {I1, I2, I3}.
The 2-item subsets of it are {I1, I2}, {I1, I3} & {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. The 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}. But {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation for pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Step 4: Generating 4-itemset Frequent Pattern
• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {{I2, I3, I5}} is not frequent.
• Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.
• What's next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support & minimum confidence).

Step 5: Generating Association Rules from Frequent Itemsets [Cont.]
– R4: I1 => I2 ^ I5
  • Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%
  • R4 is rejected.
– R5: I2 => I1 ^ I5
  • Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%
  • R5 is rejected.
– R6: I5 => I1 ^ I2
  • Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%
  • R6 is selected.
In this way, we have found three strong association rules.

[Figure: lattice of candidate rules generated from the large itemset {A, B, C, D, E}, contrasting the simple rule-generation algorithm with the fast algorithm that only keeps rules meeting minsup.]
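The confidence computations of Step 5 can be reproduced with a short snippet (our sketch; the transactions are those of the example database, while the helper names are ours):

```python
from itertools import combinations

# The nine transactions T100-T900 of the example database D
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]

def sc(itemset):
    """Support count of an itemset."""
    return sum(1 for t in db if set(itemset) <= t)

def rules(frequent, minconf):
    """All rules lhs => rhs from one frequent itemset that meet minconf."""
    out = []
    items = sorted(frequent)
    for k in range(1, len(items)):
        for lhs in combinations(items, k):
            conf = sc(frequent) / sc(lhs)
            if conf >= minconf:
                out.append((set(lhs), set(items) - set(lhs), conf))
    return out

strong = rules({"I1", "I2", "I5"}, minconf=0.7)
for lhs, rhs, conf in strong:
    print(lhs, "=>", rhs, f"{conf:.0%}")  # the three strong rules, all at 100%
```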
Example: pass 1
• Min. support = 20% of the 10 transactions T001–T010, i.e. 2 transactions.
• Scanning the transactions gives L1:
  Itemset  Count
  {A}      6
  {B}      7
  {C}      6
  {D}      2
  {E}      2
• Itemset {F} is infrequent.

Example: pass 4
• L3:
  Itemset    Count
  {A, B, C}  2
  {A, B, E}  2
• Generating candidates from L3: the join yields {A, B, C, E}, which is pruned because its subsets {A, C, E} and {B, C, E} are not frequent.
• C4 and L4 are empty, so the algorithm stops.
Classification Examples
• Teachers classify students' grades as A, B, C, D, or F.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Identify individuals with credit risks.
• Speech recognition
• Pattern recognition