Data Mining (CS 145) Midterm Cheat Sheet by Patricia Xiao

Models

Model                    Data Type   Task Type
Linear Regression        Vector      Prediction
Logistic Regression      Vector      Classification
Decision Tree            Vector      Classification
SVM                      Vector      Classification
NN                       Vector      Classification
KNN                      Vector      Classification
K-means                  Vector      Clustering
Hierarchical Clustering  Vector      Clustering
DBSCAN                   Vector      Clustering
Mixture Models           Vector      Clustering

Basic Concepts

• Dispersion: quartiles and interquartile range (Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 − Q1; outliers lie more than 1.5 IQR beyond Q1 or Q3). Five-number summary: min, Q1, median, Q3, max.
• Bias: E(f̂(x)) − f(x); variance: Var(f̂(x)) = E[(f̂(x) − E(f̂(x)))²]; E[(f̂(x) − f(x) − ε)²] = bias² + variance + noise, with E(ε) = 0, Var(ε) = σ². High bias → underfitting; high variance → overfitting.
• Model evaluation and selection: k-fold cross-validation; AIC = 2k − 2 ln(L̂) and BIC = k ln(n) − 2 ln(L̂) (k parameters, n objects); stepwise feature selection (forward: add features; backward: start from the full model and remove).
• Generalized linear model (GLM): exponential family, p(y; η) = b(y) exp(ηᵀT(y) − a(η)); linear decision boundary.
• Bagging: Bootstrap Aggregating (multiple bootstrap datasets → multiple classifiers → combine the classifiers).
• Kernel: K(xᵢ, xⱼ) = Φ(xᵢ)ᵀΦ(xⱼ)
• Chain rule: ∂J/∂x = (∂J/∂y)(∂y/∂x)
• Minkowski distance (lₕ): d(x, y) = (Σᵢ |xᵢ − yᵢ|^h)^(1/h); l₁ = Manhattan, l₂ = Euclidean, l∞ = supremum; the triangle inequality holds: d(i, j) ≤ d(i, k) + d(k, j).
• Confusion matrix: True/False refers to correctness, Positive/Negative to the predicted label.
• Multi-class classification: All-vs-All (AVA) works better than One-vs-All (OVA).

Formula

1. σ² = E[(X − E(X))²] = E(X²) − E²(X)
2. ‖α‖² = αᵀα, where α is a vector
3. (AB)ᵀ = BᵀAᵀ, (AB)⁻¹ = B⁻¹A⁻¹
4. ∂(Ax)/∂x = A, ∂(AX)/∂X = Aᵀ
5. ∂(xᵀAx)/∂x = xᵀ(A + Aᵀ), ∂(XᵀAᵀAX)/∂X = 2AᵀAX
6. X ∼ N(µ, σ²) ⇒ f(X = x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
7. σ′(x) = σ(x)(1 − σ(x))
8. log(aᵇ) = b log a, log(ab) = log a + log b
9. For classifiers fᵢ(x): Var(Σᵢ fᵢ(x)/t) = Var(fᵢ(x))/t
10. a · b = Σ aᵢbᵢ = ‖a‖‖b‖ cos(a, b)
11. For a plane with normal n and any vector x lying in the plane: n · x = 0
12. Covariance: σ(X₁, X₂) = E[(X₁ − µ₁)(X₂ − µ₂)]

Tools

1. mean − mode ≈ 3 × (mean − median); mode (peak) ∼ median ∼ mean (∼ tail)
2. Z-score (normalization): z = (x − µ)/σ; robust version uses the mean absolute deviation s_f: z_if = (x_if − mean_f)/s_f; nominal attributes: dummy variable(s); ordinal: z = (r − 1)/(M − 1), where the rank r and the number of states M start from 1.
3. Logistic / sigmoid function: σ(x) = 1/(1 + e⁻ˣ)
4. Entropy: H(Y) = −Σᵢ₌₁ᵐ pᵢ log(pᵢ); conditional entropy: H(Y|X) = Σₓ p(x) H(Y|X = x)
5. Cross-entropy loss: H(q, p) = −Σₖ qₖ log(pₖ)
6. The Lagrange multiplier α is used to solve the quadratic program (e.g. in SVM).
7. Soft margin (points may violate the margin at a cost): minimize Φ(w) = ½wᵀw ⇒ Φ(w) = ½wᵀw + C Σ ζᵢ, and the constraint yᵢ(wᵀxᵢ + b) ≥ 1 becomes yᵢ(wᵀxᵢ + b) ≥ 1 − ζᵢ (ζᵢ ≥ 0); the form of the SVM solution is unchanged.
8. ROC (Receiver Operating Characteristic): TP rate (y-axis) vs. FP rate (x-axis); score = area under the curve.
9. Dendrogram: the hierarchical clustering tree; cut it to obtain clusters.

Linear Regression

y = xᵀβ, with bias term xᵢ₀ = 1; in matrix form X is n × (p + 1), y is n × 1, β is (p + 1) × 1; y is continuous.
OLS (Ordinary Least Squares): J(β) = (1/2n)(Xβ − y)ᵀ(Xβ − y) = (1/2n)(βᵀXᵀXβ − yᵀXβ − βᵀXᵀy + yᵀy)
Closed-form solution: set ∂J/∂β = 0 ⇒ β̂ = (XᵀX)⁻¹Xᵀy
Gradient descent: β^(t+1) := β^(t) − η∆
Batch GD (converges): ∆ = ∂J/∂β = Σᵢ xᵢ(xᵢᵀβ − yᵢ)/n
Stochastic GD (n updates per pass): ∆ = −(yᵢ − xᵢᵀβ^(t))xᵢ
Probabilistic interpretation (MLE, maximum likelihood estimation): L(β) = Πᵢ p(yᵢ|xᵢ, β) = Πᵢ N(yᵢ; xᵢᵀβ, σ²) = Πᵢ (1/√(2πσ²)) exp{−(yᵢ − xᵢᵀβ)²/(2σ²)}
To make XᵀX invertible: add λ Σⱼ₌₁ᵖ βⱼ² to Σᵢ(yᵢ − xᵢᵀβ)² (ridge regression, i.e. linear regression with an l₂ penalty).
Non-linear correlation: create new terms, e.g. x².
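The closed-form solution and the batch-GD update above are easy to sanity-check numerically. A minimal NumPy sketch (not part of the original sheet; the synthetic data, learning rate η = 0.1, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # bias term x_0 = 1
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)          # continuous target

# Closed form: beta_hat = (X^T X)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on J(beta) = (1/2n) ||X beta - y||^2
beta = np.zeros(p + 1)
eta = 0.1                                    # learning rate (assumed)
for _ in range(2000):
    delta = X.T @ (X @ beta - y) / n         # gradient: sum_i x_i (x_i^T beta - y_i) / n
    beta -= eta * delta

print(np.allclose(beta, beta_closed, atol=1e-3))  # both recover roughly beta_true
```

Stochastic GD would apply the same update using a single (xᵢ, yᵢ) per step instead of the averaged sum.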
Logistic Regression

Generalized linear model (GLM).
P(Y = 1|x, β) = σ(xᵀβ) = e^(xᵀβ)/(1 + e^(xᵀβ)); P(Y = 0|x, β) = 1 − σ(xᵀβ) = 1/(1 + e^(xᵀβ)); Y|x, β ∼ Bernoulli(σ(xᵀβ))
MLE: L = Πᵢ pᵢ^(yᵢ)(1 − pᵢ)^(1−yᵢ), where pᵢ = P(Y = 1|xᵢ, β)
Equivalently, maximize the log-likelihood L = Σᵢ (yᵢ xᵢᵀβ − log(1 + e^(xᵢᵀβ)))
Gradient ascent: βnew = βold + η ∂L(β)/∂β
Newton–Raphson update: βnew = βold − (∂²L(β)/∂β∂βᵀ)⁻¹ ∂L(β)/∂β
Cross-entropy loss (p = prediction, q = ground truth; (q₀, q₁) = (1, 0) when y = 0, (q₀, q₁) = (0, 1) when y = 1; (p₀, p₁) = (P(Y = 0), P(Y = 1))): H(q, p) = −y xᵀβ + log(1 + e^(xᵀβ))

EM Algorithm

A framework for approaching maximum likelihood.
p(xᵢ, zᵢ = Cⱼ) = wⱼ fⱼ(xᵢ), p(xᵢ) = Σⱼ wⱼ fⱼ(xᵢ)
p(D) = Πᵢ p(xᵢ) = Πᵢ Σⱼ wⱼ fⱼ(xᵢ), so log p(D) = Σᵢ log(Σⱼ wⱼ fⱼ(xᵢ))
E (expectation) step: assign objects to clusters, wᵢⱼ^(t+1) = p(zᵢ = j|θⱼ^(t), xᵢ) ∝ p(xᵢ|zᵢ = j, θⱼ^(t)) p(zᵢ = j) = fⱼ(xᵢ) wⱼ
M (maximization) step: find the new parameters w.r.t. the conditional distribution p(zᵢ = j|θⱼ^(t), xᵢ): θ^(t+1) = argmax_θ Σᵢ Σⱼ wᵢⱼ^(t+1) log L(xᵢ, zᵢ = j|θ)

Decision Tree

Notation: m = number of classes |y| in D, v = number of values of attribute A.
Expected information needed to classify a tuple in D: Info(D) = −Σᵢ₌₁ᵐ pᵢ log₂(pᵢ)
Information after splitting on A: Info_A(D) = Σⱼ₌₁ᵛ (|Dⱼ|/|D|) × Info(Dⱼ)
Information gain (ID3): Gain(A) = Info(D) − Info_A(D); info gain is biased towards multi-valued attributes.
SplitInfo_A(D) = −Σⱼ₌₁ᵛ (|Dⱼ|/|D|) × log₂(|Dⱼ|/|D|)
Gain ratio (C4.5): GainRatio(A) = Gain(A)/SplitInfo_A(D); gain ratio is biased towards unbalanced splits.
Gini(D) = 1 − Σⱼ₌₁ᵐ pⱼ² (impurity); Gini_A(D) = Σⱼ₌₁ᵛ (|Dⱼ|/|D|) Gini(Dⱼ)
Gini (CART): ∆Gini(A) = Gini(D) − Gini_A(D); the Gini index is also biased towards multi-valued attributes.
Stopping criteria: all samples in the same class; no attributes left; no samples left (use majority voting).
Avoid overfitting: pre-/post-pruning, random forests.
Classification → prediction: replace the majority vote at a leaf with e.g. the average, turning the tree into a regression tree; Var(Dⱼ) = Σ_{y∈Dⱼ}(y − ȳ)²/|Dⱼ|; look for the split with the lowest weighted average variance Var_A(D) = Σⱼ₌₁ᵛ (|Dⱼ|/|D|) × Var(Dⱼ)
A different view: each leaf is a box (axis-aligned region) in the feature space.
A random forest is an ensemble (bagging) of trees: good at classification, handles large and missing data, not as good at prediction (regression), and lacks interpretability.
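For concreteness, here is a small NumPy sketch of the three split measures above (information gain, gain ratio, Gini reduction) for a single nominal attribute; the helper names and the toy outlook/play data are hypothetical, not from the original sheet:

```python
import numpy as np

def entropy(labels):
    """Info(D) = -sum_i p_i log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    """Gini(D) = 1 - sum_j p_j^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def split_scores(attr, labels):
    """Info gain (ID3), gain ratio (C4.5), and Gini reduction (CART) for one attribute."""
    n = len(labels)
    info_a, split_info, gini_a = 0.0, 0.0, 0.0
    for v in np.unique(attr):
        mask = attr == v
        w = mask.sum() / n                      # |D_j| / |D|
        info_a += w * entropy(labels[mask])
        split_info -= w * np.log2(w)
        gini_a += w * gini(labels[mask])
    gain = entropy(labels) - info_a
    return gain, gain / split_info, gini(labels) - gini_a

# Toy example (hypothetical): does "outlook" separate play = yes/no?
outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast"])
play    = np.array(["no",    "no",    "yes",  "no",   "yes",      "yes"])
print(split_scores(outlook, play))   # (info gain, gain ratio, delta Gini)
```

At each node, the attribute with the largest Gain (ID3), GainRatio (C4.5), or ∆Gini (CART) would be chosen for the split.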
SVM

y = sign(w · x + b); the separating hyperplane is w · x + b = 0.
SVM searches for the maximum marginal hyperplane.
To maximize the margin ρ = 2/‖w‖, use Lagrange multipliers α: L(w, b, α) = ½wᵀw − Σᵢ₌₁ᴺ αᵢ(yᵢ(wᵀxᵢ + b) − 1)
∂L/∂w = w − Σᵢ₌₁ᴺ αᵢyᵢxᵢ = 0, ∂L/∂b = −Σᵢ₌₁ᴺ αᵢyᵢ = 0
Solution: w = Σ αᵢyᵢxᵢ, b = yₖ − wᵀxₖ for any support vector xₖ
f(x) = wᵀx + b = Σ αᵢyᵢxᵢᵀx + b, with default threshold 0
Linear vs. non-linear SVM: kernels. Non-linear decision boundary: f(x) = wᵀΦ(x) + b = Σ αᵢyᵢK(xᵢ, x) + b
Scalability: CF-tree, hierarchical micro-clustering, selective declustering (decluster the micro-clusters that could contain support vectors; a support cluster is one whose centroid lies on a support vector).

Perceptron (Single Unit)

xᵢ --wᵢ--> Σ (+b) --f--> o
Input vector x, weight vector w, bias b; the weighted sum passes through the activation function f to produce the output o.

Backpropagation (BP)

Stochastic GD + the chain rule. Special case: sigmoid activation + square loss, 2 layers.
Notation: i, j, k index the input, hidden, and output layers; O is a unit's output, T the true value.
Errₖ = Oₖ(1 − Oₖ)(Tₖ − Oₖ), Errⱼ = Oⱼ(1 − Oⱼ) Σₖ Errₖ wⱼₖ
wᵢⱼ := wᵢⱼ + η Errⱼ Oᵢ and wⱼₖ := wⱼₖ + η Errₖ Oⱼ; θⱼ := θⱼ + η Errⱼ and θₖ := θₖ + η Errₖ
∂J/∂wᵢⱼ = (∂J/∂Oₖ)(∂Oₖ/∂Oⱼ)(∂Oⱼ/∂wᵢⱼ) = −Σₖ [(Tₖ − Oₖ)][Oₖ(1 − Oₖ)wⱼₖ][Oⱼ(1 − Oⱼ)Oᵢ]

Neural Network (NN)

nlayers = nhidden + noutput (the output layer counts as 1; the input layer is not counted).
Feed-forward, non-linear regression, capable of approximating any continuous function. Backpropagation is used for learning.
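The BP update rules above can be written out directly. Below is a one-step NumPy sketch for a single hidden layer with sigmoid units and square loss; the layer sizes, random weights, input, target, and learning rate are made-up illustrative values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2             # layer sizes (assumed)
eta = 0.5                                 # learning rate (assumed)

O_i = rng.random(n_in)                    # input vector (the O_i in the notation above)
T   = np.array([1.0, 0.0])                # true output T_k

W_ij = rng.normal(scale=0.1, size=(n_in, n_hid))   # input -> hidden weights w_ij
W_jk = rng.normal(scale=0.1, size=(n_hid, n_out))  # hidden -> output weights w_jk
theta_j = np.zeros(n_hid)                 # hidden biases
theta_k = np.zeros(n_out)                 # output biases

# Feed-forward
O_j = sigmoid(O_i @ W_ij + theta_j)
O_k = sigmoid(O_j @ W_jk + theta_k)

# Error terms (sigmoid + square loss)
Err_k = O_k * (1 - O_k) * (T - O_k)
Err_j = O_j * (1 - O_j) * (W_jk @ Err_k)  # sum_k Err_k * w_jk

# Updates
W_jk += eta * np.outer(O_j, Err_k)        # w_jk += eta * Err_k * O_j
W_ij += eta * np.outer(O_i, Err_j)        # w_ij += eta * Err_j * O_i
theta_k += eta * Err_k
theta_j += eta * Err_j
```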
k-Nearest Neighbors (kNN)

Lazy learning (as opposed to eager learning), instance-based.
Consider the k nearest neighbors; use majority voting (classification) or the average (prediction); the votes can be distance-weighted.
Curse of dimensionality: the influence of noise grows; remove irrelevant features and select a proper k.
Proximity refers to similarity or dissimilarity. It always applies to binary values; nominal attributes can use simple matching or be represented by a series of binary variables; ordinal attributes: take the rank and normalize, z_if = (r_if − 1)/(M_f − 1).
For binary variables, dissimilarity can be measured by (|(0,1)| + |(1,0)|)/all for symmetric variables, by (|(0,1)| + |(1,0)|)/(all − |(0,0)|) for asymmetric variables, or by the Jaccard coefficient (similarity) |(1,1)|/(all − |(0,0)|).
Mixed-type attributes: a weighted combination. Another method: cosine similarity cos(d₁, d₂).

Evaluation: Classification

Holdout method; cross-validation (k-fold); LOO (leave-one-out).
Confusion matrix: True/False Positive/Negative.
Accuracy = (TP + TN)/All
Error rate = (FP + FN)/All
Sensitivity = TP/P (P = TP + FN)
Specificity = TN/N (N = FP + TN)
Precision = TP/P′ (P′ = TP + FP)
Recall = TP/P = Sensitivity
F1 / F-score = 2 × Precision × Recall / (Precision + Recall)
Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall) (weights R : P = β : 1)
ROC curve: TP rate (y-axis) vs. FP rate (x-axis); score = area under the curve; TPR = TP/P, FPR = FP/N.

Clustering

K-means: J = Σⱼ₌₁ᵏ Σᵢ wᵢⱼ‖xᵢ − cⱼ‖². Assign wᵢⱼ = 1 for the centroid cⱼ closest to xᵢ; recompute each centroid as the mean of its cluster; stop when nothing changes. O(tkn). Works for continuous, convex-shaped data; sensitive to noise.
K-modes: mean → mode, for categorical data. K-medoids: representative objects, e.g. PAM.
Hierarchical: bottom-up Agglomerative Nesting (AGNES) merges the two closest clusters until everything ends up in one cluster; top-down DIANA (Divisive Analysis). O(n²).
Cluster distance: single link = minimum element-wise distance; complete link = maximum; average = average over element pairs; centroid; medoid (central object).
DBSCAN: set Eps ε and MinPts. Neighborhood N(q) = {p ∈ D | dist(p, q) ≤ ε}; q is a core point if |N(q)| ≥ MinPts. p is directly density-reachable from q if q is a core point and p ∈ N(q); density-reachable if q → p₂ → · · · → p; density-connected if o → · · · → p and o → · · · → q. A cluster is a maximal set of density-connected points; points belonging to no cluster are noise. DFS; O(n log n) with a spatial index, otherwise O(n²).
Mixture models: soft clustering (wᵢⱼ ∈ [0, 1] rather than wᵢⱼ ∈ {0, 1}); joint probability of object i and cluster Cⱼ: p(xᵢ, zᵢ = Cⱼ) = wⱼfⱼ(xᵢ); fit with the EM algorithm.
Gaussian Mixture Model (GMM) ⊃ k-means. Generative model: for each object, pick a cluster Z, then sample a value from X|Z ∼ N(µ_Z, σ²_Z). Overall likelihood: L(D|θ) = Πᵢ Σⱼ wⱼ p(xᵢ|µⱼ, σⱼ²)
E-step: wᵢⱼ^(t+1) = wⱼ^(t) p(xᵢ|µⱼ^(t), (σⱼ²)^(t)) / Σₖ wₖ^(t) p(xᵢ|µₖ^(t), (σₖ²)^(t))
M-step (1-d case): µⱼ^(t+1) = Σᵢ wᵢⱼ^(t+1) xᵢ / Σᵢ wᵢⱼ^(t+1), (σⱼ²)^(t+1) = Σᵢ wᵢⱼ^(t+1)(xᵢ − µⱼ^(t+1))² / Σᵢ wᵢⱼ^(t+1), wⱼ^(t+1) = Σᵢ wᵢⱼ^(t+1)/n
Why does EM work? The E-step finds a tight lower bound L of ℓ at θold; the M-step finds θnew maximizing that lower bound, so ℓ(θnew) ≥ L(θnew) ≥ L(θold) = ℓ(θold).

Evaluation: Clustering

Extrinsic (supervised) vs. intrinsic (unsupervised).
purity(C, Ω) = (1/N) Σₖ maxⱼ |cₖ ∩ ωⱼ| (C = clustering output, Ω = ground truth)
Normalized Mutual Information: NMI(C, Ω) = I(C, Ω)/√(H(C)H(Ω))
I(C, Ω) = Σₖ Σⱼ P(cₖ ∩ ωⱼ) log [P(cₖ ∩ ωⱼ)/(P(cₖ)P(ωⱼ))] = Σₖ Σⱼ (|cₖ ∩ ωⱼ|/N) log (N|cₖ ∩ ωⱼ|/(|cₖ||ωⱼ|))
H(Ω) = −Σⱼ P(ωⱼ) log P(ωⱼ) = −Σⱼ (|ωⱼ|/N) log(|ωⱼ|/N)
Precision and recall: count pairs of objects in the same/different class and the same/different cluster.
Selecting k: plot the squared loss against k (larger k gives a smaller cost) and look for the knee point; use the BIC penalty; use cross-validation.
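A short NumPy sketch of the purity and NMI formulas above, computed from a cluster-label array and a ground-truth array; the function names and example labels are illustrative only:

```python
import numpy as np

def purity(clusters, truth):
    """purity(C, Omega) = (1/N) * sum_k max_j |c_k intersect omega_j|."""
    N = len(truth)
    total = 0
    for c in np.unique(clusters):
        members = truth[clusters == c]
        total += np.bincount(members).max()   # largest overlap with any true class
    return total / N

def nmi(clusters, truth):
    """NMI(C, Omega) = I(C, Omega) / sqrt(H(C) * H(Omega))."""
    N = len(truth)

    def H(labels):
        p = np.bincount(labels) / N
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    I = 0.0
    for c in np.unique(clusters):
        for w in np.unique(truth):
            n_cw = np.sum((clusters == c) & (truth == w))
            if n_cw > 0:
                I += (n_cw / N) * np.log(N * n_cw / ((clusters == c).sum() * (truth == w).sum()))
    return I / np.sqrt(H(clusters) * H(truth))

clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # clustering output C
truth    = np.array([0, 0, 1, 1, 1, 1, 2, 2])   # ground truth Omega
print(purity(clusters, truth), nmi(clusters, truth))
```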