Data Mining (CS 145) Midterm Cheat Sheet by Patricia Xiao

Models

Model                    Data Type   Task Type
Linear Regression        Vector      Prediction
Logistic Regression      Vector      Classification
Decision Tree            Vector      Classification
SVM                      Vector      Classification
NN                       Vector      Classification
KNN                      Vector      Classification
K-means                  Vector      Clustering
Hierarchical Clustering  Vector      Clustering
DBSCAN                   Vector      Clustering
Mixture Models           Vector      Clustering

Basic Concepts

• Dispersion: quartiles and interquartile range (Q1 = 25th percentile, Q3 = 75th percentile, IQR = Q3 − Q1; outliers lie more than 1.5 IQR beyond Q1 or Q3). Five-number summary: min, Q1, median, Q3, max.
• Bias: E(f̂(x)) − f(x); variance: Var(f̂(x)) = E[(f̂(x) − E(f̂(x)))²]; E[(f̂(x) − f(x) − ε)²] = bias² + variance + noise, with E(ε) = 0, Var(ε) = σ². High bias → underfitting; high variance → overfitting.
• Model evaluation and selection: k-fold cross-validation; AIC = 2k − 2 ln(L̂) and BIC = k ln(n) − 2 ln(L̂) (k parameters, n objects); stepwise feature selection (forward: add features; backward: start from the full model and remove).
• Generalized linear model (GLM): exponential family, p(y; η) = b(y) exp(ηᵀT(y) − a(η)); linear decision boundary.
• Bagging: Bootstrap Aggregating (multiple bootstrap datasets → multiple classifiers → combine the classifiers).
• Kernel: K(xᵢ, xⱼ) = Φ(xᵢ)ᵀΦ(xⱼ)
• Chain rule: ∂J/∂x = (∂J/∂y)(∂y/∂x)
• Minkowski distance (lₕ): d(x, y) = (Σᵢ |xᵢ − yᵢ|^h)^(1/h); l₁ = Manhattan, l₂ = Euclidean, l∞ = supremum; the triangle inequality holds: d(i, j) ≤ d(i, k) + d(k, j).
• Confusion matrix: True/False refers to correctness, Positive/Negative to the predicted label.
• Multi-class classification: All-vs-All (AVA) works better than One-vs-All (OVA).

Formula

1. σ² = E[(X − E(X))²] = E(X²) − E²(X)
2. ‖α‖² = αᵀα, where α is a vector
3. (AB)ᵀ = BᵀAᵀ, (AB)⁻¹ = B⁻¹A⁻¹
4. ∂(Ax)/∂x = A, ∂(AX)/∂X = Aᵀ
5. ∂(xᵀAx)/∂x = xᵀ(A + Aᵀ), ∂(XᵀAᵀAX)/∂X = 2AᵀAX
6. X ∼ N(µ, σ²) ⇒ f(X = x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
7. σ′(x) = σ(x)(1 − σ(x))
8. log(aᵇ) = b log a, log(ab) = log a + log b
9. For classifiers fᵢ(x): Var(Σᵢ fᵢ(x)/t) = Var(fᵢ(x))/t
10. a · b = Σ aᵢbᵢ = ‖a‖‖b‖ cos(a, b)
11. For a plane with normal n and any vector x lying in the plane: n · x = 0
12. Covariance: σ(X₁, X₂) = E[(X₁ − µ₁)(X₂ − µ₂)]

Tools

1. mean − mode ≈ 3 × (mean − median); mode (peak) ∼ median ∼ mean (∼ tail)
2. Z-score (normalization): z = (x − µ)/σ; robust version uses the mean absolute deviation s_f: z_if = (x_if − mean_f)/s_f; nominal attributes: dummy variable(s); ordinal: z = (r − 1)/(M − 1), where the rank r and the number of states M start from 1.
3. Logistic / sigmoid function: σ(x) = 1/(1 + e⁻ˣ)
4. Entropy: H(Y) = −Σᵢ₌₁ᵐ pᵢ log(pᵢ); conditional entropy: H(Y|X) = Σₓ p(x) H(Y|X = x)
5. Cross-entropy loss: H(q, p) = −Σₖ qₖ log(pₖ)
6. The Lagrange multiplier α is used to solve the quadratic program (e.g. in SVM).
7. Soft margin (points may violate the margin at a cost): minimize Φ(w) = ½wᵀw ⇒ Φ(w) = ½wᵀw + C Σ ζᵢ, and the constraint yᵢ(wᵀxᵢ + b) ≥ 1 becomes yᵢ(wᵀxᵢ + b) ≥ 1 − ζᵢ (ζᵢ ≥ 0); the form of the SVM solution is unchanged.
8. ROC (Receiver Operating Characteristic): TP rate (y-axis) vs. FP rate (x-axis); score = area under the curve.
9. Dendrogram: the hierarchical clustering tree; cut it to obtain clusters.

Linear Regression

y = xᵀβ, with bias term xᵢ₀ = 1; in matrix form X is n × (p + 1), y is n × 1, β is (p + 1) × 1; y is continuous.
OLS (Ordinary Least Squares): J(β) = (1/2n)(Xβ − y)ᵀ(Xβ − y) = (1/2n)(βᵀXᵀXβ − yᵀXβ − βᵀXᵀy + yᵀy)
Closed-form solution: set ∂J/∂β = 0 ⇒ β̂ = (XᵀX)⁻¹Xᵀy
Gradient descent: β^(t+1) := β^(t) − η∆
Batch GD (converges): ∆ = ∂J/∂β = Σᵢ xᵢ(xᵢᵀβ − yᵢ)/n
Stochastic GD (n updates per pass): ∆ = −(yᵢ − xᵢᵀβ^(t))xᵢ
Probabilistic interpretation (MLE, maximum likelihood estimation): L(β) = Πᵢ p(yᵢ|xᵢ, β) = Πᵢ N(yᵢ; xᵢᵀβ, σ²) = Πᵢ (1/√(2πσ²)) exp{−(yᵢ − xᵢᵀβ)²/(2σ²)}
To make XᵀX invertible: add λ Σⱼ₌₁ᵖ βⱼ² to Σᵢ(yᵢ − xᵢᵀβ)² (ridge regression, i.e. linear regression with an l₂ penalty).
Non-linear correlation: create new terms, e.g. x².
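The closed-form solution and the batch-GD update above are easy to sanity-check numerically. A minimal NumPy sketch (not part of the original sheet; the synthetic data, learning rate η = 0.1, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # bias term x_0 = 1
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)          # continuous target

# Closed form: beta_hat = (X^T X)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on J(beta) = (1/2n) ||X beta - y||^2
beta = np.zeros(p + 1)
eta = 0.1                                    # learning rate (assumed)
for _ in range(2000):
    delta = X.T @ (X @ beta - y) / n         # gradient: sum_i x_i (x_i^T beta - y_i) / n
    beta -= eta * delta

print(np.allclose(beta, beta_closed, atol=1e-3))  # both recover roughly beta_true
```

Stochastic GD would apply the same update using a single (xᵢ, yᵢ) per step instead of the averaged sum.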
Logistic Regression

Generalized linear model (GLM).
P(Y = 1|x, β) = σ(xᵀβ) = e^(xᵀβ)/(1 + e^(xᵀβ)); P(Y = 0|x, β) = 1 − σ(xᵀβ) = 1/(1 + e^(xᵀβ)); Y|x, β ∼ Bernoulli(σ(xᵀβ))
MLE: L = Πᵢ pᵢ^(yᵢ)(1 − pᵢ)^(1−yᵢ), where pᵢ = P(Y = 1|xᵢ, β)
Equivalently, maximize the log-likelihood L = Σᵢ (yᵢ xᵢᵀβ − log(1 + e^(xᵢᵀβ)))
Gradient ascent: βnew = βold + η ∂L(β)/∂β
Newton–Raphson update: βnew = βold − (∂²L(β)/∂β∂βᵀ)⁻¹ ∂L(β)/∂β
Cross-entropy loss (p = prediction, q = ground truth; (q₀, q₁) = (1, 0) when y = 0, (q₀, q₁) = (0, 1) when y = 1; (p₀, p₁) = (P(Y = 0), P(Y = 1))): H(q, p) = −y xᵀβ + log(1 + e^(xᵀβ))

EM Algorithm

A framework for approaching maximum likelihood.
p(xᵢ, zᵢ = Cⱼ) = wⱼ fⱼ(xᵢ), p(xᵢ) = Σⱼ wⱼ fⱼ(xᵢ)
p(D) = Πᵢ p(xᵢ) = Πᵢ Σⱼ wⱼ fⱼ(xᵢ), so log p(D) = Σᵢ log(Σⱼ wⱼ fⱼ(xᵢ))
E (expectation) step: assign objects to clusters, wᵢⱼ^(t+1) = p(zᵢ = j|θⱼ^(t), xᵢ) ∝ p(xᵢ|zᵢ = j, θⱼ^(t)) p(zᵢ = j) = fⱼ(xᵢ) wⱼ
M (maximization) step: find the new parameters w.r.t. the conditional distribution p(zᵢ = j|θⱼ^(t), xᵢ): θ^(t+1) = argmax_θ Σᵢ Σⱼ wᵢⱼ^(t+1) log L(xᵢ, zᵢ = j|θ)

Decision Tree

Notation: m = number of classes |y| in D, v = number of values of attribute A.
Expected information needed to classify a tuple in D: Info(D) = −Σᵢ₌₁ᵐ pᵢ log₂(pᵢ)
Information after splitting on A: Info_A(D) = Σⱼ₌₁ᵛ (|Dⱼ|/|D|) × Info(Dⱼ)
Information gain (ID3): Gain(A) = Info(D) − Info_A(D); info gain is biased towards multi-valued attributes.
SplitInfo_A(D) = −Σⱼ₌₁ᵛ (|Dⱼ|/|D|) × log₂(|Dⱼ|/|D|)
Gain ratio (C4.5): GainRatio(A) = Gain(A)/SplitInfo_A(D); gain ratio is biased towards unbalanced splits.
Gini(D) = 1 − Σⱼ₌₁ᵐ pⱼ² (impurity); Gini_A(D) = Σⱼ₌₁ᵛ (|Dⱼ|/|D|) Gini(Dⱼ)
Gini (CART): ∆Gini(A) = Gini(D) − Gini_A(D); the Gini index is also biased towards multi-valued attributes.
Stopping criteria: all samples in the same class; no attributes left; no samples left (use majority voting).
Avoid overfitting: pre-/post-pruning, random forests.
Classification → prediction: replace the majority vote at a leaf with e.g. the average, turning the tree into a regression tree; Var(Dⱼ) = Σ_{y∈Dⱼ}(y − ȳ)²/|Dⱼ|; look for the split with the lowest weighted average variance Var_A(D) = Σⱼ₌₁ᵛ (|Dⱼ|/|D|) × Var(Dⱼ)
A different view: each leaf is a box (axis-aligned region) in the feature space.
A random forest is an ensemble (bagging) of trees: good at classification, handles large and missing data, not as good at prediction (regression), and lacks interpretability.
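For concreteness, here is a small NumPy sketch of the three split measures above (information gain, gain ratio, Gini reduction) for a single nominal attribute; the helper names and the toy outlook/play data are hypothetical, not from the original sheet:

```python
import numpy as np

def entropy(labels):
    """Info(D) = -sum_i p_i log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    """Gini(D) = 1 - sum_j p_j^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def split_scores(attr, labels):
    """Info gain (ID3), gain ratio (C4.5), and Gini reduction (CART) for one attribute."""
    n = len(labels)
    info_a, split_info, gini_a = 0.0, 0.0, 0.0
    for v in np.unique(attr):
        mask = attr == v
        w = mask.sum() / n                      # |D_j| / |D|
        info_a += w * entropy(labels[mask])
        split_info -= w * np.log2(w)
        gini_a += w * gini(labels[mask])
    gain = entropy(labels) - info_a
    return gain, gain / split_info, gini(labels) - gini_a

# Toy example (hypothetical): does "outlook" separate play = yes/no?
outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast"])
play    = np.array(["no",    "no",    "yes",  "no",   "yes",      "yes"])
print(split_scores(outlook, play))   # (info gain, gain ratio, delta Gini)
```

At each node, the attribute with the largest Gain (ID3), GainRatio (C4.5), or ∆Gini (CART) would be chosen for the split.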
SVM

y = sign(w · x + b); the separating hyperplane is w · x + b = 0.
SVM searches for the maximum marginal hyperplane.
To maximize the margin ρ = 2/‖w‖, use Lagrange multipliers α: L(w, b, α) = ½wᵀw − Σᵢ₌₁ᴺ αᵢ(yᵢ(wᵀxᵢ + b) − 1)
∂L/∂w = w − Σᵢ₌₁ᴺ αᵢyᵢxᵢ = 0, ∂L/∂b = −Σᵢ₌₁ᴺ αᵢyᵢ = 0
Solution: w = Σ αᵢyᵢxᵢ, b = yₖ − wᵀxₖ for any support vector xₖ
f(x) = wᵀx + b = Σ αᵢyᵢxᵢᵀx + b, with default threshold 0
Linear vs. non-linear SVM: kernels. Non-linear decision boundary: f(x) = wᵀΦ(x) + b = Σ αᵢyᵢK(xᵢ, x) + b
Scalability: CF-tree, hierarchical micro-clustering, selective declustering (decluster the micro-clusters that could contain support vectors; a support cluster is one whose centroid lies on a support vector).

Perceptron (Single Unit)

xᵢ --wᵢ--> Σ (+b) --f--> o
Input vector x, weight vector w, bias b; the weighted sum passes through the activation function f to produce the output o.

Backpropagation (BP)

Stochastic GD + the chain rule. Special case: sigmoid activation + square loss, 2 layers.
Notation: i, j, k index the input, hidden, and output layers; O is a unit's output, T the true value.
Errₖ = Oₖ(1 − Oₖ)(Tₖ − Oₖ), Errⱼ = Oⱼ(1 − Oⱼ) Σₖ Errₖ wⱼₖ
wᵢⱼ := wᵢⱼ + η Errⱼ Oᵢ and wⱼₖ := wⱼₖ + η Errₖ Oⱼ; θⱼ := θⱼ + η Errⱼ and θₖ := θₖ + η Errₖ
∂J/∂wᵢⱼ = (∂J/∂Oₖ)(∂Oₖ/∂Oⱼ)(∂Oⱼ/∂wᵢⱼ) = −Σₖ [(Tₖ − Oₖ)][Oₖ(1 − Oₖ)wⱼₖ][Oⱼ(1 − Oⱼ)Oᵢ]

Neural Network (NN)

nlayers = nhidden + noutput (the output layer counts as 1; the input layer is not counted).
Feed-forward, non-linear regression, capable of approximating any continuous function. Backpropagation is used for learning.
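The BP update rules above can be written out directly. Below is a one-step NumPy sketch for a single hidden layer with sigmoid units and square loss; the layer sizes, random weights, input, target, and learning rate are made-up illustrative values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2             # layer sizes (assumed)
eta = 0.5                                 # learning rate (assumed)

O_i = rng.random(n_in)                    # input vector (the O_i in the notation above)
T   = np.array([1.0, 0.0])                # true output T_k

W_ij = rng.normal(scale=0.1, size=(n_in, n_hid))   # input -> hidden weights w_ij
W_jk = rng.normal(scale=0.1, size=(n_hid, n_out))  # hidden -> output weights w_jk
theta_j = np.zeros(n_hid)                 # hidden biases
theta_k = np.zeros(n_out)                 # output biases

# Feed-forward
O_j = sigmoid(O_i @ W_ij + theta_j)
O_k = sigmoid(O_j @ W_jk + theta_k)

# Error terms (sigmoid + square loss)
Err_k = O_k * (1 - O_k) * (T - O_k)
Err_j = O_j * (1 - O_j) * (W_jk @ Err_k)  # sum_k Err_k * w_jk

# Updates
W_jk += eta * np.outer(O_j, Err_k)        # w_jk += eta * Err_k * O_j
W_ij += eta * np.outer(O_i, Err_j)        # w_ij += eta * Err_j * O_i
theta_k += eta * Err_k
theta_j += eta * Err_j
```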
k-Nearest Neighbors (kNN)

Lazy learning (as opposed to eager learning), instance-based.
Consider the k nearest neighbors; use majority voting (classification) or the average (prediction); the votes can be distance-weighted.
Curse of dimensionality: the influence of noise grows; remove irrelevant features and select a proper k.
Proximity refers to similarity or dissimilarity. It always applies to binary values; nominal attributes can use simple matching or be represented by a series of binary variables; ordinal attributes: take the rank and normalize, z_if = (r_if − 1)/(M_f − 1).
For binary variables, dissimilarity can be measured by (|(0,1)| + |(1,0)|)/all for symmetric variables, by (|(0,1)| + |(1,0)|)/(all − |(0,0)|) for asymmetric variables, or by the Jaccard coefficient (similarity) |(1,1)|/(all − |(0,0)|).
Mixed-type attributes: a weighted combination. Another method: cosine similarity cos(d₁, d₂).

Evaluation: Classification

Holdout method; cross-validation (k-fold); LOO (leave-one-out).
Confusion matrix: True/False Positive/Negative.
Accuracy = (TP + TN)/All
Error rate = (FP + FN)/All
Sensitivity = TP/P (P = TP + FN)
Specificity = TN/N (N = FP + TN)
Precision = TP/P′ (P′ = TP + FP)
Recall = TP/P = Sensitivity
F1 / F-score = 2 × Precision × Recall / (Precision + Recall)
Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall) (weights R : P = β : 1)
ROC curve: TP rate (y-axis) vs. FP rate (x-axis); score = area under the curve; TPR = TP/P, FPR = FP/N.

Clustering

K-means: J = Σⱼ₌₁ᵏ Σᵢ wᵢⱼ‖xᵢ − cⱼ‖². Assign wᵢⱼ = 1 for the centroid cⱼ closest to xᵢ; recompute each centroid as the mean of its cluster; stop when nothing changes. O(tkn). Works for continuous, convex-shaped data; sensitive to noise.
K-modes: mean → mode, for categorical data. K-medoids: representative objects, e.g. PAM.
Hierarchical: bottom-up Agglomerative Nesting (AGNES) merges the two closest clusters until everything ends up in one cluster; top-down DIANA (Divisive Analysis). O(n²).
Cluster distance: single link = minimum element-wise distance; complete link = maximum; average = average over element pairs; centroid; medoid (central object).
DBSCAN: set Eps ε and MinPts. Neighborhood N(q) = {p ∈ D | dist(p, q) ≤ ε}; q is a core point if |N(q)| ≥ MinPts. p is directly density-reachable from q if q is a core point and p ∈ N(q); density-reachable if q → p₂ → · · · → p; density-connected if o → · · · → p and o → · · · → q. A cluster is a maximal set of density-connected points; points belonging to no cluster are noise. DFS; O(n log n) with a spatial index, otherwise O(n²).
Mixture models: soft clustering (wᵢⱼ ∈ [0, 1] rather than wᵢⱼ ∈ {0, 1}); joint probability of object i and cluster Cⱼ: p(xᵢ, zᵢ = Cⱼ) = wⱼfⱼ(xᵢ); fit with the EM algorithm.
Gaussian Mixture Model (GMM) ⊃ k-means. Generative model: for each object, pick a cluster Z, then sample a value from X|Z ∼ N(µ_Z, σ²_Z). Overall likelihood: L(D|θ) = Πᵢ Σⱼ wⱼ p(xᵢ|µⱼ, σⱼ²)
E-step: wᵢⱼ^(t+1) = wⱼ^(t) p(xᵢ|µⱼ^(t), (σⱼ²)^(t)) / Σₖ wₖ^(t) p(xᵢ|µₖ^(t), (σₖ²)^(t))
M-step (1-d case): µⱼ^(t+1) = Σᵢ wᵢⱼ^(t+1) xᵢ / Σᵢ wᵢⱼ^(t+1), (σⱼ²)^(t+1) = Σᵢ wᵢⱼ^(t+1)(xᵢ − µⱼ^(t+1))² / Σᵢ wᵢⱼ^(t+1), wⱼ^(t+1) = Σᵢ wᵢⱼ^(t+1)/n
Why does EM work? The E-step finds a tight lower bound L of ℓ at θold; the M-step finds θnew maximizing that lower bound, so ℓ(θnew) ≥ L(θnew) ≥ L(θold) = ℓ(θold).

Evaluation: Clustering

Extrinsic (supervised) vs. intrinsic (unsupervised).
purity(C, Ω) = (1/N) Σₖ maxⱼ |cₖ ∩ ωⱼ| (C = clustering output, Ω = ground truth)
Normalized Mutual Information: NMI(C, Ω) = I(C, Ω)/√(H(C)H(Ω))
I(C, Ω) = Σₖ Σⱼ P(cₖ ∩ ωⱼ) log [P(cₖ ∩ ωⱼ)/(P(cₖ)P(ωⱼ))] = Σₖ Σⱼ (|cₖ ∩ ωⱼ|/N) log (N|cₖ ∩ ωⱼ|/(|cₖ||ωⱼ|))
H(Ω) = −Σⱼ P(ωⱼ) log P(ωⱼ) = −Σⱼ (|ωⱼ|/N) log(|ωⱼ|/N)
Precision and recall: count pairs of objects in the same/different class and the same/different cluster.
Selecting k: plot the squared loss against k (larger k gives a smaller cost) and look for the knee point; use the BIC penalty; use cross-validation.
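A short NumPy sketch of the purity and NMI formulas above, computed from a cluster-label array and a ground-truth array; the function names and example labels are illustrative only:

```python
import numpy as np

def purity(clusters, truth):
    """purity(C, Omega) = (1/N) * sum_k max_j |c_k intersect omega_j|."""
    N = len(truth)
    total = 0
    for c in np.unique(clusters):
        members = truth[clusters == c]
        total += np.bincount(members).max()   # largest overlap with any true class
    return total / N

def nmi(clusters, truth):
    """NMI(C, Omega) = I(C, Omega) / sqrt(H(C) * H(Omega))."""
    N = len(truth)

    def H(labels):
        p = np.bincount(labels) / N
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    I = 0.0
    for c in np.unique(clusters):
        for w in np.unique(truth):
            n_cw = np.sum((clusters == c) & (truth == w))
            if n_cw > 0:
                I += (n_cw / N) * np.log(N * n_cw / ((clusters == c).sum() * (truth == w).sum()))
    return I / np.sqrt(H(clusters) * H(truth))

clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # clustering output C
truth    = np.array([0, 0, 1, 1, 1, 1, 2, 2])   # ground truth Omega
print(purity(clusters, truth), nmi(clusters, truth))
```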