Uploaded on 10/23/2020

Data Mining (CS 145) Midterm Cheat Sheet by Patricia Xiao

Models

Model                     Data Type   Task Type
Linear Regression         Vector      Prediction
Logistic Regression       Vector      Classification
Decision Tree             Vector      Classification
SVM                       Vector      Classification
NN                        Vector      Classification
KNN                       Vector      Classification
K-means                   Vector      Clustering
Hierarchical clustering   Vector      Clustering
DBSCAN                    Vector      Clustering
Mixture Models            Vector      Clustering

Basic Concepts

• Dispersion: quartiles and interquartile range (Q1 = 25%, Q3 = 75%, IQR = Q3 − Q1; outlier: more than 1.5 IQR below Q1 or above Q3). Five-number summary: min, Q1, median, Q3, max.
• Bias: $E(\hat f(x)) - f(x)$. Variance: $Var(\hat f(x)) = E[(\hat f(x) - E(\hat f(x)))^2]$. $E[(\hat f(x) - f(x) - \epsilon)^2] = \text{bias}^2 + \text{variance} + \text{noise}$, with $E(\epsilon) = 0$, $Var(\epsilon) = \sigma^2$. Bias → underfit; variance → overfit.
• Model Evaluation and Selection: K-fold cross validation; AIC ($2k - 2\ln\hat L$) and BIC ($k\ln n - 2\ln\hat L$), with $k$ parameters and $n$ objects; stepwise feature selection (forward: start empty and add features; backward: start from the full model and remove features).
• Generalized Linear Model (GLM): exponential family, $p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))$; linear decision boundary.
• Bagging: Bootstrap Aggregating (multiple datasets → multiple classifiers → combine classifiers).
• Kernel: $K(x_i, x_j) = \Phi(x_i)^T\Phi(x_j)$.
• Chain rule: $\partial J/\partial x = (\partial J/\partial y)(\partial y/\partial x)$.
• Minkowski distance ($l_h$): $d(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^h\right)^{1/h}$; $l_1$ Manhattan, $l_2$ Euclidean, $l_\infty$ supremum. The triangle inequality applies: $d(i, j) \le d(i, k) + d(k, j)$.
• Confusion Matrix: True / False for the correctness of the prediction, Positive / Negative for the predicted result.
• Multi-class classification: All-vs-All (AVA) is better than One-vs-All (OVA).

Formula

1. $\sigma^2 = E[(X - E(X))^2] = E(X^2) - E^2(X)$
2. $\|\alpha\|^2 = \alpha^T\alpha$, where $\alpha$ is a vector.
3. $(AB)^T = B^T A^T$, $(AB)^{-1} = B^{-1}A^{-1}$
4. $\partial(Ax)/\partial x = A$, $\partial(AX)/\partial X = A^T$
5. $\partial(x^T A x)/\partial x = x^T(A + A^T)$, $\partial(X^T A^T A X)/\partial X = 2A^T A X$
6. $X \sim N(\mu, \sigma^2) \Rightarrow f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
7. $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ (sigmoid)
8. $\log(a^b) = b\log a$, $\log(ab) = \log a + \log b$
9. Classifiers $f_i(x)$, $i = 1, \dots, t$ (independent, equal variance): $var\left(\sum_i f_i(x)/t\right) = var(f_i(x))/t$
10. $a \cdot b = \sum_i a_i b_i = \|a\|\,\|b\|\cos(a, b)$
11. For a plane with normal $n$ and any vector $x$ in the plane: $n \cdot x = 0$
12. Covariance: $\sigma(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)]$

Tools

1. mean − mode = 3 × (mean − median); mode (peak) ∼ median ∼ mean (∼ tail).
2. Z-score (normalization): $Z = (x - \mu)/\sigma$. Robust variant: use the mean absolute deviation $s_f$, $z_{if} = (x_{if} - \text{mean}_f)/s_f$. Nominal: dummy variable(s). Ordinal: $(r - 1)/(M - 1)$, where $r$ and $M$ start from 1.
3. Logistic / Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
4. Entropy: $H(Y) = -\sum_{i=1}^{m} p_i\log p_i$. Conditional entropy: $H(Y|X) = \sum_x p(x)\,H(Y|X = x)$
5. Cross entropy loss: $H(q, p) = -\sum_k q_k\log p_k$
6. Lagrange multipliers $\alpha$ are used to solve Quadratic Programming problems (e.g. SVM).
7. Soft margin (allow margin violations at a cost): the objective $\Phi(w) = \frac{1}{2}w^T w$ becomes $\Phi(w) = \frac{1}{2}w^T w + C\sum_i \zeta_i$, and the constraint $y_i(w^T x_i + b) \ge 1$ becomes $y_i(w^T x_i + b) \ge 1 - \zeta_i$ ($\zeta_i \ge 0$); the form of the SVM solution is unchanged.
8. ROC (Receiver Operating Characteristic): TP rate (y-axis) against FP rate (x-axis); score = area under the curve.
9. Dendrogram: the tree of the cluster hierarchy; cut it at a chosen level to obtain clusters.

Linear Regression

Per object, $y_i = x_i^T\beta$ with bias term $x_{i0} = 1$. In matrix form $y = X\beta$, where $X$ is an $n \times (p + 1)$ matrix, $y$ an $n \times 1$ vector, and $\beta$ a $(p + 1) \times 1$ vector. $y$ is continuous.
OLS (Ordinary Least Squares): $J(\beta) = \frac{1}{2n}(X\beta - y)^T(X\beta - y) = \frac{1}{2n}(\beta^T X^T X\beta - y^T X\beta - \beta^T X^T y + y^T y)$
Closed-form solution: $\partial J/\partial\beta = 0 \Rightarrow \hat\beta = (X^T X)^{-1}X^T y$
Gradient Descent: $\beta^{(t+1)} := \beta^{(t)} - \eta\Delta$
Batch GD (until convergence): $\Delta = \partial J/\partial\beta = \sum_i x_i(x_i^T\beta - y_i)/n$
Stochastic GD (one object at a time, $n$ updates per pass): $\Delta = -(y_i - x_i^T\beta^{(t)})\,x_i$
Probabilistic interpretation (using MLE, Maximum Likelihood Estimation): $L(\beta) = \prod_i p(y_i|x_i, \beta) = \prod_i N(y_i;\, x_i^T\beta, \sigma^2) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\right\}$
Making $X^T X$ invertible: add $\lambda\sum_{j=1}^{p}\beta_j^2$ to $\sum_i (y_i - x_i^T\beta)^2$ (Ridge Regression, i.e. linear regression with an $l_2$ penalty).
Non-linear correlation: create new terms, e.g. $x^2$.

Logistic regression is a generalized linear model (GLM).
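As a rough illustration of the batch gradient descent update from the linear regression formulas above, $\beta^{(t+1)} := \beta^{(t)} - \eta\Delta$ with $\Delta = \sum_i x_i(x_i^T\beta - y_i)/n$, here is a minimal pure-Python sketch. The function names, the learning rate `eta=0.05`, the step count, and the toy data are assumptions chosen for illustration and are not part of the cheat sheet.

```python
def predict(beta, x):
    """Linear prediction x^T beta for one object (x[0] is the bias term 1)."""
    return sum(b * xj for b, xj in zip(beta, x))


def batch_gradient(X, y, beta):
    """Delta = sum_i x_i (x_i^T beta - y_i) / n, the gradient of J(beta)."""
    n = len(X)
    grad = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        residual = predict(beta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += xij * residual / n
    return grad


def fit_batch_gd(X, y, eta=0.05, steps=5000):
    """Run a fixed number of batch GD steps (eta and steps are assumed values)."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = batch_gradient(X, y, beta)
        beta = [b - eta * g for b, g in zip(beta, grad)]
    return beta


# Toy data generated from y = 1 + 2x, with the bias column x_i0 = 1.
X = [[1.0, float(x)] for x in range(5)]
y = [1.0 + 2.0 * row[1] for row in X]
beta_hat = fit_batch_gd(X, y)
```

On this noise-free toy data the estimate should approach the generating coefficients (intercept 1, slope 2); the closed-form solution $\hat\beta = (X^T X)^{-1}X^T y$ would give the same answer directly.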
Logistic Regression

$P(Y = 1|X, \beta) = \sigma(X^T\beta) = \frac{e^{X^T\beta}}{1 + e^{X^T\beta}}$
$P(Y = 0|X, \beta) = 1 - \sigma(X^T\beta) = \frac{1}{1 + e^{X^T\beta}}$
$Y|X, \beta \sim \text{Bernoulli}(\sigma(X^T\beta))$
MLE: $L = \prod_i p_i^{y_i}(1 - p_i)^{1 - y_i}$, where $p_i = P(Y = 1|x_i, \beta)$.
Equivalent to maximizing the log-likelihood $L(\beta) = \sum_i \left(y_i x_i^T\beta - \log(1 + e^{x_i^T\beta})\right)$.
Gradient ascent: $\beta^{new} = \beta^{old} + \eta\,\frac{\partial L(\beta)}{\partial\beta}$
Newton-Raphson update: $\beta^{new} = \beta^{old} - \left(\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T}\right)^{-1}\frac{\partial L(\beta)}{\partial\beta}$
Cross entropy loss ($p$ for prediction, $q$ for ground truth; $(q_0, q_1)|_{y=0} = (1, 0)$, $(q_0, q_1)|_{y=1} = (0, 1)$, $(p_0, p_1) = (P(Y = 0), P(Y = 1))$): $H(q, p) = -y\,x^T\beta + \log(1 + e^{x^T\beta})$

EM Algorithm

A framework to approach maximum likelihood.
$p(x_i, z_i = C_j) = w_j f_j(x_i)$, $p(x_i) = \sum_j w_j f_j(x_i)$
$p(D) = \prod_i p(x_i) = \prod_i \sum_j w_j f_j(x_i)$
$\log p(D) = \sum_i \log\left(\sum_j w_j f_j(x_i)\right)$
E (expectation) step assigns objects to clusters: $w_{ij}^{t+1} = p(z_i = j|\theta_j^t, x_i) \propto p(x_i|z_i = j, \theta_j^t)\,p(z_i = j) = f_j(x_i)\,w_j$
M (maximization) step finds the new clustering w.r.t. the conditional distribution $p(z_i = j|\theta_j^t, x_i)$: $\theta^{t+1} = \arg\max_\theta \sum_i \sum_j w_{ij}^{t+1}\log L(x_i, z_i = j|\theta)$

Decision Tree

Notation: $m$ is the number of classes in $D$, $v$ is the number of values of attribute $A$.
Expected information needed to classify a tuple in $D$: $Info(D) = -\sum_{i=1}^{m} p_i\log_2 p_i$
Information after splitting on $A$: $Info_A(D) = \sum_{j=1}^{v}\frac{|D_j|}{|D|}\,Info(D_j)$
Information gain (ID3): $Gain(A) = Info(D) - Info_A(D)$. Information gain is biased towards multivalued attributes.
$SplitInfo_A(D) = -\sum_{j=1}^{v}\frac{|D_j|}{|D|}\log_2\frac{|D_j|}{|D|}$
Gain ratio (C4.5): $GainRatio(A) = Gain(A)/SplitInfo_A(D)$. Gain ratio is biased towards unbalanced splits.
Gini impurity: $Gini(D) = 1 - \sum_{j=1}^{m} p_j^2$, $Gini_A(D) = \sum_{j=1}^{v}\frac{|D_j|}{|D|}\,Gini(D_j)$
Gini (CART): $\Delta Gini(A) = Gini(D) - Gini_A(D)$. The Gini index is also biased towards multivalued attributes.
Stop when: all samples belong to the same class; no attributes remain; no samples are left (use majority voting).
Avoid overfitting: pre-/post-pruning, random forest.
Classification → Prediction: replace the majority vote at a leaf node by e.g. the average, which turns the tree into a regression tree. $Var(D_j) = \sum_{y \in D_j}(y - \bar y)^2/|D_j|$; look for the split with the lowest weighted average variance $Var_A(D) = \sum_{j=1}^{v}\frac{|D_j|}{|D|}\,Var(D_j)$.
A different view: leaf = box in the plane.
Random forest: a set of trees (ensemble, bagging); good at classification, handles large and missing data; not good at prediction, lacks interpretability.

SVM

$y = \text{sign}(w^T x + b)$; separating hyperplane $w^T x + b = 0$.
SVM searches for the Maximum Marginal Hyperplane, i.e. it maximizes the margin $\rho = \frac{2}{\|w\|}$.
With Lagrange multipliers $\alpha$: $L(w, b, \alpha) = \frac{1}{2}w^T w - \sum_{i=1}^{N}\alpha_i\left(y_i(w^T x_i + b) - 1\right)$
$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{N}\alpha_i y_i x_i = 0$, $\frac{\partial L}{\partial b} = -\sum_{i=1}^{N}\alpha_i y_i = 0$
Solution: $w = \sum_i \alpha_i y_i x_i$, $b = y_k - w^T x_k$ (for any support vector $x_k$).
$f(x) = w^T x + b = \sum_i \alpha_i y_i x_i^T x + b$, default threshold 0.
Linear vs. non-linear SVM: kernel. Non-linear decision boundary: $f(x) = w^T\Phi(x) + b = \sum_i \alpha_i y_i K(x_i, x) + b$
Scalability: CF-Tree, hierarchical micro-clusters, selective declustering (decluster the clusters that could be support clusters; a support cluster is a cluster whose centroid is a support vector).

Perceptron (Single Unit)

$x_i \xrightarrow{w_i} \sum(+b) \xrightarrow{f} o$
Input vector $x$ and weight vector $w$ give a weighted sum plus bias $b$, which passes through the activation function $f$ to produce the output $o$.

Backpropagation (BP)

Stochastic GD + chain rule.
Special case: sigmoid + square loss, 2 layers. Let $i$, $j$, $k$ index the input, hidden, and output layers, with $O$ for output and $T$ for true value.
$Err_k = O_k(1 - O_k)(T_k - O_k)$, $Err_j = O_j(1 - O_j)\sum_k Err_k w_{jk}$
$w_{ij} = w_{ij} + \eta\,Err_j O_i$, $w_{jk} = w_{jk} + \eta\,Err_k O_j$, $\theta_j = \theta_j + \eta\,Err_j$, $\theta_k = \theta_k + \eta\,Err_k$
$\frac{\partial J}{\partial w_{ij}} = \sum_k \frac{\partial J}{\partial O_k}\frac{\partial O_k}{\partial O_j}\frac{\partial O_j}{\partial w_{ij}} = -\sum_k \left[(T_k - O_k)\right]\left[O_k(1 - O_k)w_{jk}\right]\left[O_j(1 - O_j)O_i\right]$

Neural Network (NN)

$n_{layers} = n_{hidden} + n_{output}$ (with $n_{output} = 1$).
Feed-forward, non-linear regression, capable of approximating any continuous function. Backpropagation is used for learning.

k-Nearest Neighbors (kNN)

Lazy learning (instead of eager), instance-based.
Consider the $k$ nearest neighbors; use majority voting or the average (could be distance-weighted).
Curse of dimensionality: influence of noise. Get rid of irrelevant features; select a proper $k$.
Proximity refers to similarity or dissimilarity.
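The kNN description above (find the $k$ nearest neighbors, then take a majority vote) can be sketched in a few lines of pure Python. This is only an illustrative sketch: the Euclidean distance, the function names, `k=3`, and the toy points are assumptions, and no distance weighting is applied.

```python
import math
from collections import Counter


def euclidean(a, b):
    """l2 (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Lazy learner: no training step, just vote among the k closest objects."""
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups labelled "a" and "b".
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

label_near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
label_far_corner = knn_predict(train_X, train_y, (5.5, 5.5))
```

For a regression-style prediction, the vote would be replaced by the average of the neighbors' target values.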
k-Nearest Neighbors (kNN): proximity measures

Proximity measures always apply to binary attributes. For nominal attributes, use simple matching, or represent a non-binary attribute by a series of binary ones. For ordinal attributes, use the rank, normalized as $z_{if} = \frac{r_{if} - 1}{M_f - 1}$.
With $|(a, b)|$ the number of attributes on which the two objects take values $a$ and $b$, and "all" the total number of attributes:
symmetric binary variables: $\frac{|(0,1)| + |(1,0)|}{\text{all}}$
asymmetric binary variables: $\frac{|(0,1)| + |(1,0)|}{\text{all} - |(0,0)|}$, or the Jaccard coefficient (a similarity) $\frac{|(1,1)|}{\text{all} - |(0,0)|}$
Mixed-type attributes: weighted combination.
Another method: cosine similarity $\cos(d_1, d_2)$.

Evaluation: Classification

Holdout method; cross-validation (k-fold); leave-one-out (LOO).
Confusion Matrix: True / False, Positive / Negative.
Accuracy = (TP + TN) / All
Error Rate = (FP + FN) / All
Sensitivity = TP / P (P = TP + FN)
Specificity = TN / N (N = FP + TN)
Precision = TP / P' (P' = TP + FP)
Recall = TP / P = Sensitivity
F1 (F-score): $F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$
$F_\beta = \frac{(1 + \beta^2) \times Precision \times Recall}{\beta^2 \times Precision + Recall}$ (recall is weighted $\beta$ times as much as precision)
ROC curve: TP rate (y-axis) against FP rate (x-axis); score = area under the curve. TPR = TP / P, FPR = FP / N.

Clustering

K-means: $J = \sum_{j=1}^{k}\sum_i w_{ij}\|x_i - c_j\|^2$. Assign $w_{ij} = 1$ for the center $c_j$ closest to each $x_i$; update each center to the centroid of its assigned points; stop when nothing changes. Cost $O(tkn)$. Suited to continuous, convex-shaped data; sensitive to noise.
K-modes: mean → mode, for categorical data.
K-medoids: representative objects, e.g. PAM (s).
Hierarchical: bottom-up Agglomerative Nesting (AGNES) merges the two closest clusters until a single cluster remains; top-down DIANA (Divisive Analysis). $O(n^2)$.
Cluster distance: single link (minimum element-wise distance); complete link (maximum); average (average distance over element pairs); centroid; medoid (central object).
DBSCAN: set Eps $\epsilon$ and MinPts. Neighborhood: $N_\epsilon(q) = \{p \in D \mid dist(p, q) \le \epsilon\}$. Core point: $|N_\epsilon(q)| \ge MinPts$. $p$ is directly density-reachable from $q$ if $q$ is a core point and $p \in N_\epsilon(q)$; density-reachable if there is a chain $q \to p_2 \to \cdots \to p$; $p$ and $q$ are density-connected if $o \to \cdots \to p$ and $o \to \cdots \to q$ for some $o$. Cluster: a maximal set of density-connected points. Points that belong to no cluster are noise. DFS: $O(n\log n)$ with a spatial index, otherwise $O(n^2)$.
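The K-means loop described above (assign each $x_i$ to its closest center, move each center to the centroid of its points, stop when assignments no longer change) might look like the following pure-Python sketch. Initializing the centers with the first $k$ points is an assumption made here to keep the example deterministic; the function names and toy data are likewise illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd-style K-means; returns (centers, assignment)."""
    centers = [tuple(p) for p in points[:k]]  # assumed init: first k points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: w_ij = 1 for the closest center c_j.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no change, so the objective J cannot improve further
        assignment = new_assignment
        # Update step: each center becomes the centroid of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(points, k=2)
```

With these six points the loop separates the three points near the origin from the three points near (10, 10) after a couple of iterations. Different initial centers can lead to different local optima in general.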
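Clustering: mixture models

Mixture Model: soft clustering ($w_{ij} \in [0, 1]$ rather than $w_{ij} \in \{0, 1\}$). Joint probability of object $i$ and cluster $C_j$: $p(x_i, z_i = C_j) = w_j f_j(x_i)$, fitted with the EM algorithm.
Gaussian Mixture Model (GMM): generalizes k-means. Generative model: for each object, pick a cluster $Z$, then sample the value from $X|Z \sim N(\mu_Z, \sigma_Z^2)$.
Overall likelihood function: $L(D|\theta) = \prod_i \sum_j w_j\,p(x_i|\mu_j, \sigma_j^2)$
E-step: $w_{ij}^{t+1} = \frac{w_j^t\,p(x_i|\mu_j^t, (\sigma_j^2)^t)}{\sum_k w_k^t\,p(x_i|\mu_k^t, (\sigma_k^2)^t)}$
M-step (1-d case): $\mu_j^{t+1} = \frac{\sum_i w_{ij}^{t+1}x_i}{\sum_i w_{ij}^{t+1}}$, $(\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^{t+1}(x_i - \mu_j^{t+1})^2}{\sum_i w_{ij}^{t+1}}$, $w_j^{t+1} = \frac{\sum_i w_{ij}^{t+1}}{n}$
Why does EM work? The E-step finds a tight lower bound $L$ of $\ell$ at $\theta^{old}$; the M-step finds $\theta^{new}$ that maximizes the lower bound: $\ell(\theta^{new}) \ge L(\theta^{new}) \ge L(\theta^{old}) = \ell(\theta^{old})$

Evaluation: Clustering

Extrinsic (supervised) vs. intrinsic (unsupervised).
$purity(C, \Omega) = \frac{1}{N}\sum_k \max_j |c_k \cap \omega_j|$ ($C$: clustering output, $\Omega$: ground truth)
Normalized Mutual Information: $NMI(C, \Omega) = \frac{I(C, \Omega)}{\sqrt{H(C)\,H(\Omega)}}$
$I(C, \Omega) = \sum_k \sum_j P(c_k \cap \omega_j)\log\frac{P(c_k \cap \omega_j)}{P(c_k)\,P(\omega_j)} = \sum_k \sum_j \frac{|c_k \cap \omega_j|}{N}\log\frac{N\,|c_k \cap \omega_j|}{|c_k|\,|\omega_j|}$
$H(\Omega) = -\sum_j P(\omega_j)\log P(\omega_j) = -\sum_j \frac{|\omega_j|}{N}\log\frac{|\omega_j|}{N}$
Precision and recall: computed over pairs of objects (same vs. different class, same vs. different cluster).
Selecting $k$: plot square loss against $k$ (larger $k$ gives smaller cost) and look for knee points; BIC penalizes model size; cross validation.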
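To make the extrinsic measures above concrete, the following pure-Python sketch computes purity and NMI from two label lists using the count-based forms of the formulas. The function names and the toy labelings are assumptions for illustration; the sketch also assumes both labelings contain at least two distinct labels, since otherwise an entropy is zero and NMI is undefined.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """(1/N) * sum_k max_j |c_k ∩ w_j|."""
    joint = Counter(zip(clusters, classes))
    best = {}
    for (c, _), count in joint.items():
        best[c] = max(best.get(c, 0), count)
    return sum(best.values()) / len(clusters)


def entropy(labels):
    """H = -sum_j (|w_j|/N) log(|w_j|/N)."""
    n = len(labels)
    return -sum((cnt / n) * math.log(cnt / n) for cnt in Counter(labels).values())


def mutual_information(clusters, classes):
    """I = sum_k sum_j (|c_k ∩ w_j|/N) log(N |c_k ∩ w_j| / (|c_k| |w_j|))."""
    n = len(clusters)
    c_count = Counter(clusters)
    w_count = Counter(classes)
    joint = Counter(zip(clusters, classes))
    return sum(
        (cnt / n) * math.log(n * cnt / (c_count[c] * w_count[w]))
        for (c, w), cnt in joint.items()
    )


def nmi(clusters, classes):
    """I(C, Omega) / sqrt(H(C) H(Omega)); assumes both entropies are non-zero."""
    return mutual_information(clusters, classes) / math.sqrt(
        entropy(clusters) * entropy(classes)
    )


# Toy labelings: one object of class "b" ends up in cluster 0.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
example_purity = purity(clusters, classes)
example_nmi = nmi(clusters, classes)
```

A clustering that matches the ground truth exactly (up to renaming of cluster ids) gives purity 1 and NMI 1, while the imperfect toy clustering above gives a purity of 5/6.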
