Outliers/Anomalies

Detecting anomalies is critical to any business, whether to identify faults or to act proactively.

Vasanth C bhagawat @ AMCEC

What is an Anomaly/Outlier?

In statistics, outliers are data points that do not belong to a certain population. An outlier is an abnormal observation that lies far away from other values and diverges from otherwise well-structured data. For example, the outlier is easy to spot in this list: [20, 24, 22, 19, 29, 18, 4300, 30, 18]

Another reason to detect anomalies is that, when preparing datasets for machine learning models, it is important to find the outliers and either remove them or analyze them to understand why they occurred in the first place.

Two types of outliers can be considered:
1. Valid observations (e.g., salary of boss is $1 million)
2. Invalid observations (e.g., age is 300 years)

Both are univariate outliers in the sense that they are outlying on one dimension. However, some outliers stay hidden when each dimension of the data is viewed on its own. Multivariate outliers are observations that are outlying in multiple dimensions. Figure 2.3 gives an example of two outlying observations considering both the dimensions of income and age.
[Figure 2.3: Income and Age]
Therefore, any data point that lies more than 3 standard deviations away from the mean is very likely to be anomalous, i.e., an outlier.

Method 2: Boxplots

Box plots are a graphical depiction of numerical data through their quartiles. They are a very simple but effective way to visualize outliers. Think of the lower and upper whiskers as the boundaries of the data distribution. Any data points that fall above the upper whisker or below the lower whisker can be considered outliers or anomalous.

Boxplot Anatomy:

The concept of the Interquartile Range (IQR) is used to build boxplot graphs. The IQR is a statistical measure of dispersion and data variability that is obtained by dividing the dataset into quartiles. In simple words, any set of observations is divided into four intervals based on the values of the data and how they compare to the entire dataset. The quartiles are the three points that divide the data into these four intervals, and the IQR is the distance between the third and the first quartile.

Z-Score

Z-scores can also be used to find outliers. A z-score tells how many standard deviations a number is from the mean. A positive z-score means that the number is above the mean, and a negative z-score means that it is below the mean.

Using z-scores: If we use z-scores to find possible outliers, the data set must be symmetric (bell shaped), and an outlier is a value with a z-score less than -3 or greater than 3.

z = (x - μ) / σ

where x is the data value, μ represents the average of the variable (mean), and σ its standard deviation.

1. Given that a data set is bell shaped (normal distribution) and has a mean of 64.2 and a standard deviation of 10.37, determine whether a data value of 78 is an outlier.
2. Given the same information as above, test whether the data value of 98 is an outlier.
3. Given the data set 3, 5, 7, 9, 11, find the z-score of 5.
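The three exercises above, and the boxplot rule, can be checked with a few lines of Python. This is only a minimal sketch, not part of the original slides. The function names (z_score, is_z_outlier, iqr_outliers) are hypothetical. Exercise 3 assumes the population standard deviation; with the sample standard deviation the answer would be about -0.63 instead. The whisker limits assume the common convention of 1.5 times the IQR beyond the first and third quartiles, which the slides do not state explicitly.

```python
import statistics


def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std


def is_z_outlier(x, mean, std, threshold=3.0):
    """Flag x when its z-score is below -threshold or above +threshold."""
    return abs(z_score(x, mean, std)) > threshold


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the boxplot whiskers.

    Assumption: whiskers sit k * IQR beyond Q1 and Q3 (k = 1.5 by convention).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Exercises 1 and 2: mean 64.2, standard deviation 10.37.
z78 = z_score(78, 64.2, 10.37)  # about 1.33, so 78 is not an outlier
z98 = z_score(98, 64.2, 10.37)  # about 3.26, so 98 is an outlier

# Exercise 3: z-score of 5 in the data set 3, 5, 7, 9, 11.
data = [3, 5, 7, 9, 11]
mu = statistics.mean(data)       # 7
sigma = statistics.pstdev(data)  # population standard deviation (assumption)
z5 = z_score(5, mu, sigma)       # about -0.71

# Boxplot rule applied to the example list from the introduction.
sample = [20, 24, 22, 19, 29, 18, 4300, 30, 18]
flagged = iqr_outliers(sample)   # [4300]
```

On a list as short as the nine-value example, the z-score of 4300 stays below 3 because the single extreme value inflates the standard deviation. There the boxplot rule is the one that flags it.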
Some analytical techniques (e.g., decision trees, neural networks, Support Vector Machines (SVMs)) are fairly robust with respect to outliers. Others (e.g., linear/logistic regression) are more sensitive to them. How to deal with an outlier depends heavily on whether it represents a valid or an invalid observation. For invalid observations (e.g., age is 300 years), one could treat the outlier as a missing value using any of the schemes discussed in the previous section. For valid observations (e.g., income is $1 million), other schemes are needed. A popular scheme is truncation/capping/winsorizing. This imposes both a lower and an upper limit on a variable, and any values below or above these limits are brought back to them.
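The capping scheme described above can be sketched in Python as follows. This is an illustrative sketch only: the function name winsorize, the income values, and the limits 1300 and 3000 are all made up for the example. How the limits are chosen in practice (for instance from the mean and standard deviation, or from percentiles) is left open here.

```python
def winsorize(values, lower, upper):
    """Bring every value below `lower` or above `upper` back to that limit."""
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    return [min(max(v, lower), upper) for v in values]


# Made-up incomes with one valid but extreme observation.
incomes = [1200, 1500, 1800, 2100, 1_000_000]
capped = winsorize(incomes, lower=1300, upper=3000)
print(capped)  # [1300, 1500, 1800, 2100, 3000]
```

The extreme income is kept as an observation, but its influence on an outlier-sensitive model such as linear regression is limited to the chosen upper limit.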