Data Analysis - Unsupervised Statistical Learning (PCA, CA, Model-Based Clustering), Lecture Notes (Analisi dei Dati)

This document presents the tools for Unsupervised Statistical Learning, with explanations, formulas and examples. Contents: 1) Univariate Statistical Modelling, 2) Basics of Matrices, 3) Basics of Multivariate Statistics, 4) Principal Component Analysis, 5) Cluster Analysis, 6) Cluster Validation, 7) Model-Based Clustering.

Type: lecture notes (Appunti)

2021/2022


Data Analysis

SLIDE 1 – Univariate Statistical Modelling

Random variable
A random variable (rv) is a variable whose numerical values are determined by the outcome of a random experiment. A rv can be: 1) discrete, if it can take no more than a countable number of values; 2) continuous, if it can take any value in an interval.

1) Discrete random variables

Probability (mass) function: the probability mass function (pmf) p(x) of a discrete rv X expresses the probability that X takes the value x, as a function of x. The pmf p : R → [0, 1] is defined as p(x) = P(X = x).

Support: the support Sx of a discrete rv X is defined as the set of possible values of X, that is Sx = {x ∈ R : p(x) > 0}.

Properties: the pmf satisfies the following properties: 1) p(x) > 0 for x ∈ Sx; 2) p(x) = 0 for x ∉ Sx; 3) ∑_{x∈Sx} p(x) = 1.

Cumulative distribution function: the cumulative distribution function (cdf) F(x) of a discrete rv X expresses the probability that X does not exceed the value x, as a function of x. The cdf F : R → [0, 1] is defined as F(x) = P(X ≤ x) = ∑_{t∈Sx, t≤x} p(t), where the notation indicates that the summation is over all possible values t ∈ Sx that are less than or equal to x.

Properties: the cdf satisfies the following properties: 1) 0 ≤ F(x) ≤ 1 for every x ∈ R; 2) if x0 and x1 are two numbers such that x0 < x1, then F(x0) ≤ F(x1).

Expectation: the expectation (also called mean or expected value) of a discrete rv X is defined as E(X) = μX = ∑_{x∈Sx} x p(x), where the summation is over all possible values x ∈ Sx.

Generalization: let X be a discrete rv with pmf p(x). Moreover, let g(X) be some function of X. The expected value of g(X) is defined as E[g(X)] = ∑_{x∈Sx} g(x) p(x).

Variance: the expectation of the squared discrepancy about the mean, (X − μX)², is called the variance, commonly denoted by σX², and it is given by E[(X − μX)²] = σX² = ∑_{x∈Sx} (x − μX)² p(x).

2) Continuous random variables

Probability density function: let X be a continuous rv. The probability density function (pdf) of X is a function f : R → R0+ with the following properties: 1) f(x) ≥ 0 for any x ∈ R; 2) P(a ≤ X ≤ b) = ∫_a^b f(x) dx for any a, b ∈ R; 3) ∫_{−∞}^{+∞} f(x) dx = 1. If X is a continuous rv, then the probability of a single point x0 is null, that is, P(X = x0) = 0.

Cumulative distribution function: the cumulative distribution function F(x) of a continuous rv X is F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt. It easily follows that f(x) = dF(x)/dx.

Expectation: the expectation of a continuous rv X is defined as E(X) = μX = ∫_{−∞}^{+∞} x dF(x) = ∫_{−∞}^{+∞} x f(x) dx.

Generalization: let X be a continuous rv with pdf f(x). Moreover, let g(X) be some function of X. The expected value of g(X) is defined as E[g(X)] = ∫_{−∞}^{+∞} g(x) f(x) dx.

Count data: the Poisson distribution
The Poisson rv X takes the values x = 0, 1, 2, . . . and has pmf p(x) = λ^x e^(−λ) / x!, with λ > 0. For this distribution, E(X) = Var(X) = λ.
[Figure: examples of pmf for a Poisson rv, with λ = 2, λ = 10 and λ = 20.]

The normal distribution (dnorm())
The normal distribution, with mean μ ∈ R and variance σ² > 0, i.e. ϑ = (μ, σ²), is the gold standard for continuous random variables.
Curiosity: the normal distribution is sometimes informally called the bell curve.
The normal rv has pdf f(x; μ, σ²) = (1/√(2πσ²)) exp{−(1/2)((x − μ)/σ)²}. While μ governs the location of the bell, σ² is related to the concentration of the bell around the mode (note that mean, median and mode coincide for the normal pdf).
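To connect these distributions with R (the software used throughout these notes), a minimal sketch evaluating the Poisson pmf and the normal pdf; the values λ = 2, μ = 4 and σ = 1 simply echo the figure captions above, and dpois()/dnorm() are the standard base-R density functions.

x <- 0:15
dpois(x, lambda = 2)                                 # Poisson pmf: lambda^x * exp(-lambda) / x!
sum(dpois(x, lambda = 2))                            # close to 1: the pmf sums to 1 over the support
curve(dnorm(x, mean = 4, sd = 1), from = 0, to = 8)  # bell-shaped normal pdf centred at mu = 4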
[Figure: examples of pdf for a normal rv, with μ = 4 and σ = 0.5, σ = 1, σ = 1.5.]

The gamma distribution
[The pdf shown on the slide is not reproduced here.] Its parameters satisfy α > 0 and β > 0, and Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx is the gamma function.
[Figure: examples of pdf for a gamma rv with β = 1.]

The lognormal distribution
The lognormal rv has pdf f(x; μ, σ) = (1/(xσ√(2π))) exp{−(ln x − μ)²/(2σ²)}, for x > 0, with μ ∈ R and σ > 0. Its moments are E(X) = e^(μ+σ²/2) and Var(X) = (e^(σ²) − 1) e^(2μ+σ²).
[Figure: examples of pdf for a lognormal rv, with (μ = 0, σ = 1), (μ = 0, σ = 0.5) and (μ = 1, σ = 0.3).]

Random sample
A random sample of size n is the set X1, . . . , Xi, . . . , Xn of rv's associated to n independent and identically distributed (iid) observations of the rv X.

Observed sample: an observed sample of size n is the set x1, . . . , xi, . . . , xn constituting the realizations of the rv's X1, . . . , Xi, . . . , Xn through the n sample units.

Sample (joint) distribution: the sample (joint) distribution is the joint distribution fn (or pn) of X1, . . . , Xi, . . . , Xn. If X is a continuous rv with pdf f(x; ϑ), the joint density of the sample is given by fn(x1, . . . , xi, . . . , xn; ϑ) = ∏_{i=1}^{n} f(xi; ϑ).

Unfortunately, an estimator with MSE(T) uniformly smaller (with respect to ϑ) than that of any other estimator does not exist. Our goal: set up "desirable" conditions that T must fulfil, so as to restrict the range of admissible estimators; hopefully there will be one better than the others!

Unbiased estimators (the ones we will study)
Unbiased estimator: the estimator T of τ(ϑ) is unbiased if E(T) = τ(ϑ), for every ϑ. The "central value of T", that is E(T), coincides with τ(ϑ). This property allows us to single out a subset of estimators, the unbiased estimators, of much interest. On top of this, given that their MSE coincides with their variance, to choose among unbiased estimators it is enough to look for the lowest variance.

Bias of an estimator
If E(T) ≠ τ(ϑ), we say that the estimator is biased.
Bias of an estimator: the bias of an estimator T of τ(ϑ) is the difference bias(T) = E(T) − τ(ϑ).
If E(T) > τ(ϑ), the estimator T is positively biased, with bias(T) = E(T) − τ(ϑ) > 0: T tends to overestimate τ(ϑ). If E(T) < τ(ϑ), the estimator T is negatively biased, with bias(T) = E(T) − τ(ϑ) < 0: T tends to underestimate τ(ϑ).

Cramér-Rao lower bound (related to unbiased estimators)
Within the class of unbiased estimators, the most important feature is the variance. Under certain regularity conditions it is possible to show that the variance of any unbiased estimator is greater than, or equal to, a quantity which is the lower bound of the variance of unbiased estimators. An unbiased estimator T is efficient if its variance reaches the Cramér-Rao lower bound.

Quadratic consistency (related to unbiased estimators)
Until now we have considered the sample size as fixed. It is also interesting to investigate how estimators behave as n grows. In particular, we want large samples to deliver more accurate estimates than smaller samples would do (meaning that, as the sample size grows, the quality of the estimator improves and the MSE tends to 0). This request can be formalized through quadratic consistency.
Note: to stress that estimators somewhat depend on the sample size, we will add a subscript n and refer to the estimator as Tn.
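As a complement to the bias and MSE definitions above, a small Monte Carlo sketch (not part of the original notes; the population values μ = 4, σ = 2 and the sample size n = 30 are chosen purely for illustration) checking that the sample mean is an unbiased estimator of μ and that its MSE is close to σ²/n:

set.seed(123)
mu <- 4; sigma <- 2; n <- 30
est <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))  # 10000 simulated values of the estimator
mean(est) - mu       # estimated bias: close to 0, as expected for an unbiased estimator
mean((est - mu)^2)   # estimated MSE: close to sigma^2 / n = 0.133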
Quadratic consistency: the estimator Tn of τ(ϑ) satisfies quadratic consistency if lim_{n→∞} MSE(Tn) = 0, for every ϑ.

Remark 1: quadratic consistency implies that: 1) lim_{n→∞} Var(Tn) = 0; 2) lim_{n→∞} bias(Tn) = 0, which corresponds to asymptotic unbiasedness. Both implications are derived from the decomposition of the MSE.

Remark 2: within the class of asymptotically unbiased estimators, looking for an asymptotically optimal estimator boils down to finding the Tn whose variance vanishes asymptotically.

Strategies for estimator construction (still within the unbiased framework)
1) Analogy: estimates are computed on the sample, which is treated as an analog of the population.
2) Method of moments: estimates are computed by equating the population moments (mean, variance, etc.) with the sample moments.
3) Method of maximum likelihood: in the maximum likelihood (ML) method, estimates are computed by maximizing the probability, i.e. the likelihood, of obtaining the available sample. That is, we look for the value of ϑ which maximizes the likelihood of obtaining the observed sample.

ML: pros and cons
✓ The ML method delivers asymptotically normal and efficient estimators.

The invariance property of ML estimators
We are often interested not just in the model parameters, but also in some functions of them.

Asymptotic properties of ML estimators
ML estimators are widely appreciated as they enjoy some very nice asymptotic properties.

Goodness-of-fit tests
There are 3 main statistical tests to evaluate the goodness of fit of a statistical (theoretical) model (which we put under the null hypothesis H0) to the empirical distribution (namely, the distribution of the observed sample); in other words, there are 3 types of test that can be carried out to assess how well a hypothesized distribution describes an observed data set:
1) Pearson's chi-square test;
2) Kolmogorov-Smirnov test;
3) Likelihood-ratio test.
Note: the null hypothesis is a hypothesis which is assumed true until there is evidence to the contrary. The alternative hypothesis is the hypothesis opposite to the null one, and it is accepted only if there is strong evidence in its favour.

1) Pearson's chi-square test
In Pearson's chi-square goodness-of-fit test the sample data of size n, if of a continuous type, are divided into s intervals (or classes, or bins) A1, . . . , As. Then the numbers of points n1, . . . , ns that fall into the intervals A1, . . . , As are compared with the expected numbers of points n̂1, . . . , n̂s (under the null model) in those intervals. The null hypothesis H0 assigns the probabilities π1, . . . , πs to A1, . . . , As.

Hypotheses: the null and alternative hypotheses are H0 : (p1 = π1) ∩ · · · ∩ (ps = πs) vs H1 : (p1 ≠ π1) ∪ · · · ∪ (ps ≠ πs), where p1, . . . , ps represent the true but unknown probabilities of A1, . . . , As.

Remark: it is natural to assume that the rejection of H0 depends on the discrepancy between n1, . . . , ns and n̂1, . . . , n̂s.

Remarks:
• The distribution of the χ² statistic is asymptotic and evaluated under the null hypothesis.
• Let q be the number of unknown parameters of the null model. If these parameters need to be estimated from the sample data, then the degrees of freedom of the asymptotic χ² distribution reduce from s − 1 to s − 1 − q.

The critical region of level α is defined as C = {χ² : χ² > c}, where c = χ²_{(s−1); 1−α} is the quantile of order 1 − α of the χ²_{(s−1)} distribution.
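A minimal R sketch of Pearson's test for a simple null hypothesis (the observed counts and the probabilities π below are hypothetical, and no parameter is estimated, so q = 0 and the reference distribution has s − 1 = 3 degrees of freedom):

observed <- c(18, 30, 32, 20)          # n_1, ..., n_s over s = 4 classes
p0       <- c(0.20, 0.30, 0.30, 0.20)  # pi_1, ..., pi_s assigned by H0
chisq.test(x = observed, p = p0)       # chi-square statistic, df = 3, and p-value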
2) Kolmogorov-Smirnov test
The Kolmogorov-Smirnov test (KS test) is a goodness-of-fit test that can be used with two different purposes.
One-sample KS test: it is used to compare the empirical cumulative distribution function F(x) with the theoretical cumulative distribution function ϕ(x) under the null. The KS statistic quantifies the distance between F(x) and ϕ(x) as the largest absolute difference between them, sup_x |F(x) − ϕ(x)|.

SLIDE 2 – Basics of Matrices

Triangular matrix: in the mathematical discipline of linear algebra, a triangular matrix is a special kind of square matrix. A square matrix is called lower triangular if all the entries above the main diagonal are equal to 0. Similarly, a square matrix is called upper triangular if all the entries below the main diagonal are equal to 0.

Diagonal matrix: a matrix in which all the entries outside the main diagonal are equal to 0 (it does not matter if some of the diagonal entries are themselves zero).

Eigenvalues and eigenvectors: an eigenvalue is usually denoted by λ (lambda). In Italian, eigenvector = autovettore and eigenvalue = autovalore.

SLIDE 3 – Basics of Multivariate Statistics

• (Linear) correlation: ρij = σij / √(σii σjj), where σii = Var(Xi).
Note: if Xi and Xj are (stochastically) independent, then ρij = 0, but not vice versa.

Extension to the multivariate case
Let X be a d-dimensional random vector.
1) Mean vector (colMeans() applied to the data matrix): the mean vector of X is µ = E(X), the d × 1 vector collecting the expectations E(X1), . . . , E(Xd).
2) Covariance matrix (cov()): the variance-covariance matrix of X (often just called the covariance matrix), denoted by Cov(X), is a d × d symmetric matrix defined as Cov(X) = E[(X − µ)(X − µ)′]. So, Cov(X) is a d × d symmetric matrix with the variances on the main diagonal and the covariances in the off-diagonal entries.
Property: for any constant m × d matrix A and constant m × 1 vector c, one has Cov(c + AX) = A Cov(X) A′.
3) Correlation matrix: it is a d × d symmetric matrix, denoted by R, collecting the correlations ρij (with ones on the main diagonal).
Note: covariance (Σ) and correlation (R) matrices are symmetric. They are also positive definite, unless some rv's in X are linear combinations of the other rv's; in that case, covariance and correlation matrices are positive semidefinite.

Moments of linear combinations
• Let X and Y be two rv's. Moreover, let Z = a + bX + cY, where a, b and c are constants. The expected value and the variance of Z are E(Z) = E(a + bX + cY) = a + bE(X) + cE(Y) and Var(Z) = Var(a + bX + cY) = b²Var(X) + c²Var(Y) + 2bc Cov(X, Y).
• Let X be a d-dimensional random vector with mean vector µ and covariance matrix Σ. Moreover, let a be a vector of d constants. Then Z = a′X is a rv with mean and variance given by E(Z) = a′µ and Var(Z) = a′Σa.

Multivariate Normal Distribution
Multivariate case: joint pdf (dmvnorm() in the mvtnorm package). The random vector X = (X1, . . . , Xd)′ is said to have a multivariate normal distribution, and we compactly write X ∼ N(µ, Σ), if X has (joint) density f(x; µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp{−(1/2)(x − µ)′Σ⁻¹(x − µ)}, where µ is the d × 1 mean vector and Σ is the d × d covariance matrix.
Univariate case: pdf (dnorm()). As a special case, when d = 1, we obtain the well-known pdf of the univariate normal distribution.
Remark: why does Σ need to be positive definite? Since the rv Y = ∑_{j=1}^{d} aj Xj may be written as Y = a′X, and Var(Y) = Var(a′X) = a′Σa (a quadratic form), it is natural to require that Σ is positive definite. All it means is that every non-zero linear combination of the components of X has a positive variance.
[Figures: contour and perspective plots of bivariate normal densities (d = 2) with diagonal and non-diagonal Σ.]
[Slides on further properties of the multivariate normal, with an example, are not reproduced here.]

Multivariate Data
Data Matrix: the starting point of any analysis of multivariate data is a data matrix, that is, an n × d matrix X collecting the observations/measurements on n units and d variables X1, . . . , Xd. This kind of analysis is referred to as unsupervised because we lack an output that can supervise our analysis.
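To make these quantities concrete in R, a short sketch computing the mean vector, the covariance and correlation matrices of a data matrix, and evaluating a multivariate normal density with dmvnorm(); the use of the built-in iris measurements as the data matrix is purely illustrative.

X <- as.matrix(iris[, 1:4])   # n x d data matrix (d = 4 numeric variables)
colMeans(X)                   # mean vector
cov(X)                        # d x d covariance matrix S
cor(X)                        # d x d correlation matrix R
# joint density of a N(mu, Sigma) evaluated at its own mean vector
# (requires the mvtnorm package mentioned in the notes)
mvtnorm::dmvnorm(colMeans(X), mean = colMeans(X), sigma = cov(X))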
Unsupervised statistical learning searches for structure in unlabeled data (data without an output). We can seek to understand the relationships between the d variables or between the n observations. The goal of unsupervised learning is to find homogeneous subgroups (clusters) and to find patterns (usually through dimensionality reduction).

SLIDE 4 – Principal Component Analysis (PCA)

The aim of principal component analysis (PCA) is to summarize and to visualize the information in a data set containing statistical units described by multiple correlated quantitative variables. Each variable could be considered as a different dimension. Issue: if we have more than 3 variables, it could be very difficult to visualize the underlying multi-dimensional hyperspace.

Principal components: PCA is used to extract the important information from a multivariate data set and to express this information as a set of few new variables called principal components (PCs). These new variables correspond to linear combinations of the original ones. The number of PCs is less than or equal to the number of original variables.

Variation as a source of information: the information in a given data set corresponds to the total variation (variability) it contains. The goal of PCA is to identify directions (or PCs) along which the variation in the data is maximal. In other words, PCA reduces the dimensionality of multivariate data to two or three PCs, which can be visualized graphically, with minimal loss of information.

In Plot 1A below, an example of bivariate data is represented in the Cartesian coordinate system. The dimension reduction is achieved by identifying the principal directions, called PCs, along which the data vary as much as possible. PCA assumes that the directions with the largest variances are the most "important" (i.e., the most principal).
• The PC1 axis is the first principal direction, along which the sample shows the largest variation.
• The PC2 axis is the second most important direction, and it is orthogonal to the PC1 axis.
The dimensionality of our two-dimensional data can be reduced to a single dimension by projecting each sample point onto the first PC (Plot 1B).

The eigenvalue as a measure of variation: the amount of variance retained by each PC is measured by the so-called eigenvalue (autovalore). PCA is particularly useful when the available variables are highly correlated. Correlation indicates that there is redundancy in the data. Due to this redundancy, PCA can be used to reduce the original variables into a smaller number of new variables (i.e. the PCs) explaining most of the variance in the original variables.

Underlying idea: PCA is a statistical procedure that uses an orthogonal transformation to convert a set of n observations of possibly correlated numerical variables¹ X1, . . . , Xd into a set of n values of new, linearly uncorrelated variables Y1, . . . , Yd called principal components (PCs). This transformation is defined in such a way that the first PC has the largest possible variance (that is, it accounts for as much of the variability of X1, . . . , Xd as possible), and each succeeding PC in turn has the highest possible variance under the constraint that it is orthogonal to the preceding PCs.
¹ Variables are called features in unsupervised learning and, more generally, in machine learning!
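A quick numerical check of this idea on simulated correlated data (not part of the notes, which construct PCs via the eigen-decomposition of the covariance matrix; prcomp() performs the same orthogonal transformation):

set.seed(1)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)    # two correlated variables
pc <- prcomp(cbind(x1, x2), center = TRUE, scale. = FALSE)
pc$sdev^2            # variances of Y1 and Y2: decreasing, and summing to the total variance
round(cor(pc$x), 3)  # the PC scores are uncorrelated
pc$rotation          # loadings: orthogonal directions of maximal variance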
Usefulness: PCA is typically used as a dimensionality reduction technique, aiming to reduce the number of variables (the number of columns of X) from d to q, with q < d, while allowing most of the variability of X1, . . . , Xd to be explained using fewer variables Y1, . . . , Yq.

To be noted:
• Each PC is a linear combination of all the centered variables X̃1, . . . , X̃d.

2) Second principal component
After the first PC Y1 of the original variables has been determined, we can find the second PC Y2, which is the linear combination of X̃1, . . . , X̃d that has maximal variance out of all linear combinations that are uncorrelated (i.e. orthogonal, geometrically speaking) with Y1. The second PC takes the form Y2 = ϕ12 X̃1 + · · · + ϕd2 X̃d.

Remaining PCs: this proceeds until all PCs are computed.

Important (PCA and eigen-decomposition): it can be shown, using techniques from linear algebra, that the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix S is the set of loadings that explains the greatest proportion of the variability.

PCA: an example
The population size (pop) and ad spending (ad) for n = 100 different cities are shown as purple circles. The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.

Geometrical interpretation: the green solid line (first principal component) is the direction along which there is the greatest variability in the data. That is, if we project the n = 100 observations onto this line, the resulting projected observations would have the largest possible variance; projecting the observations onto any other line would yield projected observations with lower variance.

Remark: the first PC vector defines the line that is as close as possible to the data according to the squared Euclidean distance.

What does projecting a point mean? Projecting a point onto a line simply involves finding the location on the line which is closest (with respect to the considered distance) to the point.

Is it possible to summarize this first principal component mathematically? Yes, it is. The first PC is a new variable, say Y1, with the formulation Y1 = ϕ11 (pop − mean(pop)) + ϕ21 (ad − mean(ad)), where ϕ11 = 0.839 and ϕ21 = 0.544 are said to be the component loadings.
Intuitively: under the constraint ϕ11² + ϕ21² = 1, this particular linear combination yields the highest variance, i.e. Y1 is defined such that Var(Y1) is maximized.

Principal component scores: for the generic unit i, related to the pair/point (popi, adi) in the original space, we have the following value along the y1-axis of the PC space: yi1 = ϕ11 (popi − mean(pop)) + ϕ21 (adi − mean(ad)). The value yi1 is called the principal component score of unit i.

The first PC line minimizes the sum of the squared perpendicular distances (the dashed line segments in the figure) between each point and the line. The crosses on the line represent the projection of each point onto the first PC line. In the figure, the right-hand panel has been rotated, as usual, so that the first PC direction coincides with the x-axis.

Remarks:
1) It is possible to show that the first PC score for the i-th observation is the distance, in the x-direction, of the i-th cross from zero (the blue dot)!
2) The PC space spanned by (Y1, Y2) is not a "deformed" version of the original space spanned by (pop, ad): it is simply a roto-translation!
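A tiny sketch of the score computation just described; the loadings are the ones quoted in the text, while the (pop, ad) values for the three cities are hypothetical:

phi1 <- c(pop = 0.839, ad = 0.544)          # loadings of the first PC (from the text)
pop  <- c(30, 48, 55); ad <- c(9, 23, 28)   # hypothetical data for three cities
y1   <- phi1["pop"] * (pop - mean(pop)) + phi1["ad"] * (ad - mean(ad))
y1                                          # first PC scores: one single-number summary per city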
Interpretation: for example, the point in the bottom-left corner of the left-hand panel of the figure has a large negative PC score (yi1 = −26.1), while the point in the top-right corner has a large positive score (yi1 = 18.7). Thus, the first PC score can be seen as a single-number summary of the joint behavior of pop and ad spending for each city. By looking at the left-hand panel of the figure:
• a negative y1 score roughly indicates a city with below-average population size and below-average ad spending;
• a positive y1 score roughly indicates a city with above-average population size and above-average ad spending.
So far we have looked only at the first PC. In general, one can construct up to d PCs!

Second PC: the second PC Y2 is a linear combination of the variables that is uncorrelated with Y1 and has the largest variance subject to this constraint.
What does uncorrelation mean? Uncorrelation between Y1 and Y2 amounts to imposing that the second principal component direction be orthogonal to the first one! In our example, the second PC Y2 is given by Y2 = ϕ12 (pop − mean(pop)) + ϕ22 (ad − mean(ad)), with loadings (ϕ12, ϕ22) orthogonal to (ϕ11, ϕ21), i.e. (ϕ12, ϕ22) = (0.544, −0.839) up to a sign flip.

Computing PCs
Assume we have an n × d data matrix X. Since we are only interested in variance, we can just work on the centered data, i.e. we work with X̃.

Properties of PCs

Choosing the number of PCs
When dealing with a real-data situation, we can derive PCs starting either from the sample correlation matrix R or from the sample covariance matrix S, depending on whether the data have been standardized or not.
Goal: the goal is dimensionality reduction, that is, reducing from d to q the number of variables.
Question: how do we select q so that we reduce the dimensionality of our data while still retaining as much variability as possible?
Answer(s): there are several criteria to do so. They are all heuristic rules rather than rigorous methodologies; the reason is that PCA is completely distribution free. We can choose q based on:
1. Cumulative proportion of variance explained: extract the first q PCs, which together explain a substantial proportion of the total variance;
2. Kaiser's rule: omit the PCs containing less information than the mean information per PC;
3. Scree plot: retain as many components as come before the significant jump (elbow) on the scree plot.

1) Cumulative proportion of explained variance
Retain as many PCs as needed to explain at least 80% of the total variance. The proportion of variance explained by the m-th principal component, say PVEm, is PVEm = λm / ∑_{j=1}^{d} λj = λm / tr(S) when we deal with the sample covariance matrix S, and PVEm = λm / d when we deal with standardized variables, that is, with the sample correlation matrix R (whose eigenvalues sum to d).
The criterion: choose the smallest q such that the cumulative PVE of the first q PCs is at least 0.80.

2) Kaiser's rule
This rule was first proposed by Kaiser as a selection criterion for PCs derived from a correlation matrix R. The rule suggests retaining as many principal components as there are eigenvalues of R larger than 1. The rule can be generalized to PCs computed via the eigen-decomposition of S.
The criterion: retain the PCs whose eigenvalue is larger than the average eigenvalue (which equals 1 when working with R).

3) Scree plot
This rule, which can be applied to PCs extracted from either S or R, is a graphical criterion based on the so-called scree plot. This is simply a plot of the λm values against the corresponding m, m = 1, . . . , d.
How should the scree plot be used? The scree plot suggests selecting q, on the x-axis, as the value to the left of the point where the curve becomes flat, namely to the left of the point at which the elbow occurs. (A short R sketch of these three criteria is given right below.)

Preliminaries for the biplot...
To be noted: we can always compute correlations between the original variables and the PCs!
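Putting the three criteria above into practice, a minimal sketch on standardized data; USArrests (the data set used in the later examples) appears here only for illustration, and working on scale()d data means the eigenvalues are those of R:

eig <- eigen(cov(scale(USArrests)))$values     # eigenvalues lambda_1 >= ... >= lambda_d
pve <- eig / sum(eig)                          # proportion of variance explained by each PC
cumsum(pve)                                    # criterion 1: smallest q with cumulative PVE >= 0.80
eig > mean(eig)                                # criterion 2 (Kaiser): eigenvalues above the average (= 1 for R)
plot(eig, type = "b", xlab = "Component m", ylab = "Eigenvalue")   # criterion 3: scree plot, look for the elbow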
Two practical issues arise when performing PCA on real data (here, the USArrests data).

1) First issue: different means
Each original variable may have a different mean. It is usually beneficial for each variable to be centered at zero for PCA, because it makes comparing each principal component to the mean straightforward.

2) Second issue: different scales
The variance of Assault is about 6945, while the variance of Murder is only 18.97. The Assault data are not necessarily more variable: they are simply on a different scale relative to Murder! Remember that the variance is not a relative measure of variability.
Standardizing (scaling) each variable will fix the issue. However, keep in mind that there may be instances where scaling is not desirable. An example would be if every variable in the data set had the same units and the analyst wished to capture this difference in variance in his or her results. Since Murder, Assault and Rape are all measured as occurrences per 100,000 people, this may be reasonable, depending on how you want to interpret the results. But since UrbanPop is measured as a percentage of the total population, it would not make sense to compare the variability of UrbanPop to that of Murder, Assault and Rape.
Important: the key thing to remember is that PCA is influenced by the magnitude of each variable; therefore, the results obtained when we perform PCA will also depend on whether the variables have been individually scaled. Because it is undesirable for the PCs obtained to depend on an arbitrary choice of scaling, we typically scale each variable to have standard deviation one before we perform PCA.

If we perform PCA on the unscaled variables, then the first PC loading vector ϕ1 will have a very large loading for Assault, since that variable has by far the highest variance. The right-hand plot displays the first two PCs without scaling the variables to have standard deviation one. As predicted, ϕ1 places almost all of its weight on Assault, while the second PC loading vector ϕ2 places almost all of its weight on UrbanPop. Comparing this to the left-hand plot, we see that scaling does indeed have a substantial effect on the results obtained.

Looking at the arrows
Overall, we see that the crime-related variables (Murder, Assault and Rape) are located close to each other, and that UrbanPop is far from the other three. This indicates that the crime-related variables are correlated with each other (states with high murder rates tend to have high assault and rape rates) and that the UrbanPop variable is less correlated with the other three.

Computing the PCs
To calculate the principal components, we start by using the cov() function to calculate the covariance matrix S (or the correlation matrix R, if the data have been standardized), followed by the eigen() command to calculate the eigenvectors and eigenvalues of S. eigen() produces an object that contains both the ordered eigenvalues ($values) and the corresponding eigenvector matrix ($vectors). Just as an example, we take the first two sets of loadings and store them in the matrix called phi.

Principal component loadings
Warning: the eigenvectors calculated in any software package are unique only up to a sign flip.
Suggestion: choose the sign of each eigenvector so as to make the interpretation of the corresponding PC easier. By default, the eigenvectors returned by R point in the negative direction for this example; we would prefer them to point in the positive direction, because this leads to a more logical interpretation of the graphical results, as we will see shortly. To use the positive-pointing vectors, we multiply the default loadings by -1.
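A sketch of these steps on the standardized USArrests data (phi is the loadings matrix mentioned above, and the sign flip is the one just described):

df.scaled <- scale(USArrests)         # standardized data (mean 0, sd 1)
res.eigen <- eigen(cov(df.scaled))    # $values = ordered eigenvalues, $vectors = eigenvectors
phi <- -res.eigen$vectors[, 1:2]      # first two loading vectors, multiplied by -1
rownames(phi) <- colnames(df.scaled)
colnames(phi) <- c("PC1", "PC2")
phi                                   # loadings of PC1 and PC2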
The sets of loadings for the first principal component (PC1) and the second principal component (PC2) are the two columns of the matrix phi.

Remark: the elbow point is not so clear in this example; anyway, according to the cumulative PVE rule, the reduction from d = 4 original variables to 2 PCs, while still explaining 86.7% of the variability, is a good compromise.

SLIDE 5 – Cluster Analysis

Cluster Analysis (CA), or simply clustering, is one of the most important statistical methods for discovering knowledge in multidimensional data. The goal of CA is to identify patterns (or groups, or clusters) of similar units within a data set X. In the literature, CA is also referred to as "unsupervised machine learning":
unsupervised → because we are not guided by a priori ideas about the underlying clusters which, for this reason, are often referred to as "latent groups";
learning → because the machine algorithm "learns" how to cluster.

Some very general considerations
Clustering observations means partitioning them into distinct groups such that:
• observations within the same group are similar;
• observations from different (between) groups are as different as possible from each other.
Preliminary requirement: to make CA concrete, we must define what it means for two or more observations to be similar or different.

Comparison with PCA: both clustering and PCA seek to simplify the data via a small number of summaries, but their mechanisms are different:
o PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance;
o clustering looks for homogeneous subgroups among the observations.

The logic

Examples:
• finding homogeneous groups of users of mobile phones;
• finding personality types based on questionnaire data;
• looking for families, species, or groups of animals or plants;
• looking for search types based on Google search histories;
• based on observed choices, finding customer typologies in order to introduce new products;
• arranging the insured persons of an insurance company into groups (risk classes).

Clustering distance/dissimilarity measures
As for PCA, the starting point of CA is an n × d data matrix X.
Clustering units: in CA we are interested in the units (rows of X).
Where is the information for unit i? With reference to unit i, its information is contained in the d-dimensional vector xi (a row of X), with i = 1, . . . , n.
The classification of observations (rows of X) into groups requires some method for computing the distance or the dissimilarity between each pair of observations.
Dissimilarity or distance matrix: the result of this computation, for each pair of units, yields the so-called dissimilarity or distance matrix, denoted as D hereafter.
Important: the choice of the dissimilarity or distance measure is a critical step in clustering. It defines how the similarity of two elements xi and xj, vectors of length d, is calculated, and it will influence the shape of the clusters.
But...
→ Many alternative dissimilarity or distance measures may be applied to the same data, depending on the purposes and the premises of the analysis.
→ Cluster analysis requires either:
• an explicit definition of quantitative (dis)similarity measures,
• or an implicit definition of these measures through models, methods, etc.
[Figure illustrating the triangle inequality:] the top example shows a case where z is much less than the sum x + y of the other two sides, and the bottom example shows a case where the side z is only slightly less than x + y.

• City-block (or Manhattan) distance: the city-block (or Manhattan) distance is defined as the sum of the absolute differences across dimensions, that is d(xi, xj) = ∑_{h=1}^{d} |xih − xjh|.
Properties of the city-block (Manhattan) distance: the properties stated for the Euclidean distance also hold for this measure.
Robustness: the effect of single large differences (outliers) is dampened/reduced (since they are not squared).
Remark: the choice of the distance is very important, as it has a strong influence on the clustering results. For most common clustering software, the default distance measure is the Euclidean distance.

• Minkowski distance: the Minkowski distance of order p is defined as d(xi, xj) = (∑_{h=1}^{d} |xih − xjh|^p)^(1/p).
To be noted: the Minkowski distance can be considered as a generalization of both the Euclidean distance (obtained as a special case when p = 2) and the Manhattan distance (obtained as a special case when p = 1).

• Weighted Euclidean distance: if we have some idea of the relative importance that should be assigned to each variable, then we can weight the variables and obtain the weighted Euclidean distance d(xi, xj) = √(∑_{h=1}^{d} wh (xih − xjh)²).

2) Binary variables
A binary variable assumes only two values: 1 (positive/present) or 0 (negative/absent).
• Symmetric binary variables: a binary variable is symmetric if both of its values are equally valuable, that is, there is no preference on which outcome should be coded as 1.
Example: the binary variable X = "is evergreen?" for a plant has the possible states "loses leaves in winter" and "does not lose leaves in winter". Both are equally valuable and carry the same weight when a dissimilarity measure is computed.
• Asymmetric binary variables: a binary variable is asymmetric if the two outcomes are not equally important.
Example: an example of an asymmetric binary variable is the presence or absence of a relatively rare attribute, such as X = "is color-blind" for a human being. While you can say that two people who are color-blind have something in common, you cannot say that two people who are not color-blind have something in common. The most important outcome is usually coded as 1 (present) and the other as 0 (absent). The agreement of two 1's (a present-present match, or positive match) is more significant than the agreement of two 0's (an absent-absent match, or negative match). Usually, the negative match is treated as irrelevant.
To know: when binary variables are employed, the comparison of one data unit with another can only be in terms of whether the data units score the same or differently on the variables. If a variable is defined as an asymmetric binary variable and two data units score the same but fall into the absent category, the absent-absent match is excluded from the computation of the dissimilarity measure.

Dissimilarity measures
Problem: when dealing with binary (or categorical) variables it is not possible to define true distance measures. Can we define alternative dissimilarity measures that still do their job?
Answer: indeed, there are dissimilarity measures for binary data which:
• are close to zero when xi and xj are similar;
• increase when the differences between two units are larger;
• are symmetric;
• are equal to zero when considering the dissimilarity of a unit with itself.
To be noted: nevertheless, these dissimilarity measures generally do not satisfy the triangle inequality.
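For binary data, base R's dist() offers method = "binary", which ignores absent-absent matches in the spirit of asymmetric binary variables; a tiny illustration with made-up 0/1 data for three units:

xb <- rbind(u1 = c(1, 0, 1, 1, 0),
            u2 = c(1, 1, 0, 1, 0),
            u3 = c(0, 0, 0, 1, 0))
dist(xb, method = "binary")   # proportion of mismatching variables among those where at least one unit scores 1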
Contingency table for binary data: all the relevant information needed to calculate the dissimilarity between two units, with respect to the available d binary variables, is displayed in a 2 × 2 contingency table cross-classifying, over the d variables, how many times both units score 1, only one of them scores 1, or both score 0.

5) Variables of mixed types
A data matrix X may contain variables of different (mixed) types: continuous, symmetric binary, asymmetric binary, nominal and ordinal.
Problem: what dissimilarity measure can we use in this case?
Answer: Gower's index. We can use the Gower index, where the dissimilarity between xi and xj is measured as a weighted average of the dissimilarities dijh computed variable by variable, that is d(xi, xj) = ∑_{h=1}^{d} wh dijh / ∑_{h=1}^{d} wh; for binary and nominal variables, dijh = 0 if the two units take the same value on Xh and dijh = 1 otherwise.

Distance matrix computation
(Notation: i and j index the units, h indexes the variables.)

R functions and packages
There are several R functions for computing distances between pairs of observations:
• dist() (in the stats package): accepts only numeric data as input;
• daisy() (in the cluster package): able to handle other variable types as well (e.g. nominal, ordinal, (a)symmetric binary). In that case, the Gower coefficient will automatically be used as the metric.
All these functions compute distances between the rows of the data.

Computing the Euclidean distance
To compute the Euclidean distance, we can use the dist() function as follows:

> # Computing Euclidean distance
> dist.eucl <- dist(df.scaled, method = "euclidean")

Allowed values for the option method include one of: "euclidean", "maximum", "manhattan", "canberra", "binary" and "minkowski".

To make it easier to see the distance information generated by the dist() function, you can reformat the distance vector into a matrix using the as.matrix() function.

> # Reformat as a matrix
> # Subset the first 5 columns and rows and round the values
> round(as.matrix(dist.eucl)[1:5, 1:5], 2)
           New Mexico Iowa Indiana Arizona Tennessee
New Mexico       0.00 4.11    2.48    1.09      1.50
Iowa             4.11 0.00    1.78    3.96      3.47
Indiana          2.48 1.78    0.00    2.50      1.86
Arizona          1.09 3.96    2.50    0.00      2.32
Tennessee        1.50 3.47    1.86    2.32      0.00

In this symmetric matrix, each value represents the distance between two units. The values on the diagonal represent the distance between a unit and itself (which is zero). If we want to compute pairwise distances between variables instead, we must start by transposing the data, so as to have the variables in the rows of the data set, before using the dist() function. The function t() is used for transposing the data.

Computing distances for mixed data

Visualizing distance matrices
The color level is proportional to the value of the dissimilarity between observations: red indicates high similarity (i.e. low dissimilarity), while blue indicates low similarity.
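Relating to the mixed-type case above, a minimal daisy() sketch (the small data frame is hypothetical; with non-numeric columns present, Gower's index is used):

library(cluster)
dfm <- data.frame(income = c(31, 48, 27),                    # continuous
                  smoker = factor(c("yes", "no", "no")),     # binary, coded as a factor
                  degree = factor(c("BSc", "MSc", "PhD"),
                                  levels = c("BSc", "MSc", "PhD"),
                                  ordered = TRUE))           # ordinal
daisy(dfm, metric = "gower")   # pairwise Gower dissimilarities between the 3 units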
Remark: for any two observations, we can look for the point in the tree where the branches containing those two observations are first fused. The height of this fusion, as measured on the vertical axis, indicates how different (distant, or dissimilar) the two observations are! Stated differently, we draw conclusions about the similarity of two observations based on the location on the vertical axis where the fusion first occurs.

1) Single linkage (or nearest neighbor) method
Let A and B be two clusters of size nA and nB, respectively. According to the single linkage (or nearest neighbor) method, the distance between A and B is defined as the smallest of the nA × nB distances between the units in A and the units in B, that is d(A, B) = min {d(xi, xj) : i ∈ A, j ∈ B}.

2) Complete linkage (or furthest neighbor) method
According to the complete linkage (or furthest neighbor) method, the distance between A and B is defined as the greatest of the nA × nB distances between the units in A and the units in B, that is d(A, B) = max {d(xi, xj) : i ∈ A, j ∈ B}.

3) Average linkage method
According to the average linkage method, the distance between A and B is defined as the average of the nA × nB distances between the units in A and the units in B, that is d(A, B) = (1/(nA nB)) ∑_{i∈A} ∑_{j∈B} d(xi, xj).

All the linkage methods discussed so far work on the distances between units belonging to two different clusters.

Comparison of linkage methods
• Because of the chain effect, the single linkage method is sensitive to outliers.
• Thanks to the chain effect, the single linkage method has the advantage of handling groups with shapes different from the hyper-spherical ones, which are instead guaranteed by the complete linkage method.
• The complete linkage method is less susceptible to noise and outliers, can break large clusters, and favors hyper-spherical shapes.
• The average linkage method is an intermediate approach between the single and complete linkage approaches.

Worked example: 6 two-dimensional points
Coordinates of the 6 points:
Point    x      y
p1       0.40   0.53
p2       0.22   0.38
p3       0.35   0.32
p4       0.26   0.19
p5       0.08   0.41
p6       0.45   0.30

Euclidean distance matrix D (rounded to two decimals):
      p1     p2     p3     p4     p5     p6
p1    0      0.24   0.22   0.37   0.34   0.23
p2    0.24   0      0.15   0.20   0.14   0.25
p3    0.22   0.15   0      0.15   0.28   0.11
p4    0.37   0.20   0.15   0      0.29   0.22
p5    0.34   0.14   0.28   0.29   0      0.39
p6    0.23   0.25   0.11   0.22   0.39   0

Complete linkage iterations: points p3 and p6 have the smallest Euclidean distance (0.11), so they are merged into one cluster and the distances to this new cluster are updated. For example, looking at the first row of D, and due to the complete linkage method, the distance from point p1 to the cluster {p3, p6} is 0.23 (the distance from p1 to p6, the larger of the two). Next, points p2 and p5 have the smallest complete-link distance (0.14) and are merged; then p4 joins {p3, p6} at height 0.22, p1 joins {p2, p5} at height 0.34, and finally {p1, p2, p5} and {p3, p4, p6} are merged at height 0.39, yielding one all-inclusive cluster.

Single linkage iterations: starting from the same distance matrix, the merges proceed analogously but using the smallest between-cluster distance at each step; the last fusion, joining p1 with the cluster containing all the other points, occurs at height 0.22.

[Figures: nested cluster diagrams and the corresponding hierarchical tree diagrams (dendrograms) for the single and complete linkage solutions.]

A reproduction of this example in R is sketched right below.
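A sketch reproducing the 6-point example with hclust(); because the coordinates are rounded to two decimals on the slides, the recomputed distances and fusion heights may differ from the printed ones by about ±0.01:

pts <- matrix(c(0.40, 0.53,
                0.22, 0.38,
                0.35, 0.32,
                0.26, 0.19,
                0.08, 0.41,
                0.45, 0.30),
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0("p", 1:6), c("x", "y")))
D <- dist(pts)                                 # Euclidean distance matrix
round(as.matrix(D), 2)
hc.single   <- hclust(D, method = "single")    # single linkage
hc.complete <- hclust(D, method = "complete")  # complete linkage
round(hc.single$height, 2)                     # fusion heights, single linkage
round(hc.complete$height, 2)                   # fusion heights, complete linkage
plot(hc.complete)                              # hierarchical tree diagram (dendrogram)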
Internal measures: Cohesion and Separation
The total sum of squares (TSS) is the sum of:
• BSS, a measure of cluster separation, i.e. a measure of how well separated the clusters are (in terms of variability between their means);
• WSS, a measure of cluster cohesion, i.e. a measure of how closely related the units within a cluster are.
Remarks:
• An optimal partition should have a high BSS and a low WSS value;
• a high BSS implies a low WSS, and vice versa, since TSS does not depend on the partition.

The peculiarities of Ward's minimum deviance method are:
• as in the centroid linkage method, it assumes that a cluster is represented by its centroid;
• an objective function is explicitly defined: at each step, the groups whose merge involves the minimum increase in WSS, i.e. the groups that ensure the greatest possible cluster cohesion, are joined together;
• as usual, the classification process starts by considering n clusters (i.e. clusters containing a single point); in this case WSS = 0;
• each merge leads to an increase of WSS;
• it does not require the preliminary calculation of a distance matrix D;
• it is possible to show that the minimization of the increase in WSS at each step of the hierarchical procedure is equivalent to using the distance of the centroid linkage method, i.e. the squared Euclidean distance between the centroids x̄A and x̄B of the two groups, multiplied by a quantity which is a function of the numbers of units in the two groups.

Agglomerative hierarchical methods – to sum up
Among the hierarchical methods, average linkage and Ward's method have been shown to perform better than the other procedures.

Example: USArrests
The USArrests data are contained in the datasets package for R. As said before, average linkage and Ward's method are generally preferred.
Remark: note that the only difference between ward.D and ward.D2 is the distance matrix to be given as input to hclust(). In detail, hclust(dist(x)^2, method = "ward.D") is equivalent to hclust(dist(x), method = "ward.D2"). So, method = "ward.D2" is typically preferred!

Dendrogram
The dendrogram can be produced in R using the base function plot(res.hc), where res.hc is the output of hclust(). Here, we'll use the function fviz_dend() (in the factoextra package) to produce a beautiful dendrogram. The result of the cuts can be visualized easily using the same function fviz_dend():

> # Cut the tree in 4 groups and color by groups
> fviz_dend(res.hc, k = 4,            # Cut in four groups
+           cex = 0.5,                # label size
+           k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
+           color_labels_by_k = TRUE, # color labels by groups
+           rect = TRUE               # Add rectangle around groups
+ )

[Figure: cluster dendrogram with the four groups highlighted.]

We can visualize these clustering results in the original space, via the matrix of pairwise scatterplots, using the pairs() function.

> # Matrix of scatterplots on the original variables
> pairs(df, gap = 0, pch = grp,
+       col = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07")[grp])

[Figure: scatterplot matrix of Murder, Assault, UrbanPop and Rape, with points marked and colored by cluster.]

Using the function fviz_cluster() (in the factoextra package), we can also visualize the results in the scatter plot (space) of the first 2 PCs (since d > 2).

> # Scatter plot of the first 2 PCs
> fviz_cluster(list(data = df, cluster = grp),
+              palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
+              ellipse.type = "convex",   # concentration ellipse
+              repel = TRUE,              # avoid label overplotting (slow)
+              show.clust.cent = FALSE,
+              ggtheme = theme_minimal())
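The plotting calls above assume that the objects df, res.hc and grp already exist; a minimal sketch of the assumed preparatory steps (standardization, Ward clustering on Euclidean distances, and a cut into k = 4 groups):

library(factoextra)
df     <- scale(USArrests)                       # standardized data matrix
res.hc <- hclust(dist(df, method = "euclidean"), # Euclidean distance matrix
                 method = "ward.D2")             # Ward linkage
grp    <- cutree(res.hc, k = 4)                  # cut the dendrogram into 4 clusters
table(grp)                                       # cluster sizes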
[Figure: cluster plot in the space of the first two principal components.]

The cluster package makes it easier to perform cluster analysis in R. It provides the functions agnes() and diana() for computing agglomerative and divisive clustering, respectively. These functions perform all the necessary steps for us: we don't need to execute the scale(), dist() and hclust() functions separately.

> library("cluster")
>
> # AGglomerative NESting (Hierarchical clustering)
> res.agnes <- agnes(x = USArrests,           # data matrix
+                    stand = TRUE,            # standardize the data
+                    metric = "euclidean",    # metric for distance matrix
+                    method = "ward")         # linkage method
>
> # DIvisive ANAlysis clustering
> res.diana <- diana(x = USArrests,           # data matrix
+                    stand = TRUE,            # standardize the data
+                    metric = "euclidean")    # metric for distance matrix

After running agnes() and/or diana(), we can use the function fviz_dend() (in the factoextra package) to visualize the output:

> # Visualize the output
> fviz_dend(res.agnes, cex = 0.6, k = 4)

[Figure: cluster dendrogram produced by agnes(), with the state names on the leaves.]

K-means clustering. Example: USArrests
Data: the USArrests data are contained in the datasets package for R.
Important: the data must contain only continuous variables, as the K-means algorithm uses variable means. As we don't want the K-means result to depend on an arbitrary variable unit, we start by scaling the data using the R function scale().
Required R packages and functions: the standard R function for K-means clustering is kmeans() (within the stats package), whose simplified format is kmeans(x, centers, iter.max = 10, nstart = 1).

Estimating the optimal number of clusters: K-means clustering requires the user to specify the number of clusters K to be generated.
Fundamental question: how do we choose K? Different methods can be considered for this purpose. A simple solution: compute the K-means clustering using different values of K and, for each K, compute the total WSS; then plot WSS against K. The location of a bend (knee or elbow) in the plot is generally considered an indicator of the appropriate number of clusters. The R function fviz_nbclust() (in the factoextra package) provides a way to put this idea into practice.

Computing K-means clustering
As the K-means algorithm starts from K randomly selected centroids, it is always recommended to use the set.seed() function in order to set a seed for R's random number generator. The aim is to make the results reproducible, so that the reader of the article/report will obtain exactly the same results as those shown. As the final result of K-means clustering is sensitive to the random starting assignments, we specify nstart = 25. This means that R will try 25 different random starting assignments and then select the best result, i.e. the one with the lowest within-cluster variation (WSS).
Important: the default value is nstart = 1, but it is strongly recommended to compute K-means clustering with a large value of nstart, such as 25 or 50, in order to obtain a more stable result.
It is possible to compute the mean of each variable by cluster, using the original data, via the aggregate() function; the whole workflow is sketched below.
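A compact sketch of the K-means workflow just described (the functions are those named in the notes; the choice of K = 4 clusters here is for illustration only):

library(factoextra)
df <- scale(USArrests)                           # 1. standardize the data
fviz_nbclust(df, kmeans, method = "wss")         # 2. elbow plot: total WSS against K
set.seed(123)                                    # 3. seed for reproducibility
km.res <- kmeans(df, centers = 4, nstart = 25)   # 4. K-means with 25 random starts
aggregate(USArrests, by = list(cluster = km.res$cluster), FUN = mean)   # 5. variable means by cluster (original units)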
K-means clustering: advantages and disadvantages
Advantages: K-means clustering is a very simple and fast algorithm, and it can efficiently deal with very large data sets.
Disadvantages:
1) K-means requires the analyst to choose the appropriate number of clusters (K) in advance.
2) The final results obtained are sensitive to the initial random selection of the cluster centers. Why is this a problem? Because, for every different run of the algorithm on the same data set, you may choose a different set of initial centers. This may lead to different clustering results on different runs of the algorithm.
3) K-means is sensitive to outliers.

2) K-medoids clustering (PAM algorithm)
Differently from K-means, in K-medoids clustering each cluster is represented by one of the data points in the cluster. These points are named cluster medoids. The term medoid refers to the unit within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal: it corresponds to the most centrally located point in the cluster. These units (one per cluster) can be considered as representative examples of the members of that cluster.
Comparison with the mean used in K-means clustering: in K-means clustering, each cluster is represented by the mean value of all the data points in the cluster.
Advantage: K-medoids clustering is not limited to numerical variables.
Robustness: K-medoids is a robust alternative to K-means clustering. This means that the algorithm is less sensitive to noise and outliers, compared to K-means, because it uses medoids as cluster centers instead of means (used in K-means).

PAM algorithm
The most common K-medoids clustering method is the PAM (Partitioning Around Medoids) algorithm. The PAM algorithm is based on:
1) the search for K representative medoids among the observations of the data set;
2) after finding a set of K medoids, the construction of the clusters by assigning each observation to the nearest medoid;
3) the computation of the objective function.
Objective function: the objective function is the sum of the dissimilarities of all the units to their nearest medoid, that is ∑_{k=1}^{K} ∑_{i∈Ck} dij, where dij is the distance between unit i, assigned to cluster Ck, and the medoid j of that cluster.
Goal: the goal is to find the K representative units which minimize the sum of the dissimilarities of the observations to their closest representative unit (medoid).
Choice of the dissimilarity measure: the PAM algorithm works with a dissimilarity matrix D. To compute D we can use, for example:
1) the Euclidean distance, that is, the root sum of squares of the differences;
2) the Manhattan distance, that is, the sum of the absolute differences, which should give more robust results for data containing outliers.

Example: USArrests
Data: the USArrests data are contained in the datasets package for R. As we don't want the PAM algorithm to depend on an arbitrary variable unit, we start by scaling the data using the R function scale(), as sketched below.
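A minimal sketch of this scaling step together with a subsequent call to pam() (in the cluster package); the choice k = 2 is purely illustrative:

library(cluster)
df <- scale(USArrests)                           # standardize the data, as described above
pam.res <- pam(df, k = 2, metric = "euclidean")  # PAM with Euclidean dissimilarities
pam.res$medoids                                  # the K representative units (medoids)
table(pam.res$clustering)                        # cluster sizes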