
Big data analytics presentation, Study notes of Computer Science

Video transcript of the presentation. For more details, refer to the classroom session.

Typology: Study notes

2019/2020

Uploaded on 07/03/2020

vasanth-bhagawat


Outliers/Anomalies

Detecting anomalies is critical to any business, whether for identifying faults or for acting proactively.

Vasanth C bhagawat @ AMCEC

What is an Anomaly/Outlier?

In statistics, outliers are data points that do not belong to a certain population. An outlier is an abnormal observation that lies far away from the other values and diverges from otherwise well-structured data. For example, the outlier (4300) is easy to spot in this list:

[20,24,22,19,29,18,4300,30,18]

Another reason to detect anomalies is that, when preparing datasets for machine learning models, it is important to find all the outliers and either remove them or analyze them to understand why they are there in the first place.

Two types of outliers can be considered:

1. Valid observations (e.g., salary of boss is $1 million)
2. Invalid observations (e.g., age is 300 years)

Both are univariate outliers in the sense that they are outlying on one dimension. However, outliers can be hidden in unidimensional views of the data. Multivariate outliers are observations that are outlying in multiple dimensions. Figure 2.3 gives an example of two outlying observations considering both the dimensions of income and age.

[Figure 2.3: Income and Age]

Therefore, any data point that lies more than 3 standard deviations away from the mean is very likely to be anomalous or an outlier.

Method 2: Boxplots

Box plots are a graphical depiction of numerical data through their quantiles. They are a very simple but effective way to visualize outliers. Think of the lower and upper whiskers as the boundaries of the data distribution.
Any data points that fall above or below the whiskers can be considered outliers or anomalous.

Boxplot Anatomy

The concept of the Interquartile Range (IQR) is used to build boxplot graphs. The IQR measures statistical dispersion and data variability by dividing the dataset into quartiles. In simple words, any set of observations is divided into four defined intervals based on the values of the data and how they compare to the entire dataset. The quartiles are the three points that divide the data into these four intervals.

Z-Score

Z-scores can also be used to find outliers. A z-score tells how many standard deviations a number is from the mean. A positive z-score means that the number is above the mean, and a negative z-score means that it is below the mean.

z = (x - μ) / σ

where μ represents the average of the variable (mean) and σ its standard deviation.

Using z-scores: if we use the z-score to find possible outliers, the data set must be symmetric (bell shaped). An outlier is then a value with a z-score below -3 or above 3.

1. Given that a data set is bell shaped (normal distribution) and has a mean of 64.2 and a standard deviation of 10.37, determine whether a data value of 78 is an outlier.
2. Given the same information as above, test whether a data value of 98 is an outlier.
3. Given the data set 3, 5, 7, 9, 11, find the z-score of 5.

Some analytical techniques (e.g., decision trees, neural networks, Support Vector Machines (SVMs)) are fairly robust with respect to outliers. Others (e.g., linear/logistic regression) are more sensitive to them. How an outlier should be treated depends heavily on whether it represents a valid or an invalid observation. For invalid observations (e.g., age is 300 years), one could treat the outlier as a missing value using any of the schemes discussed in the previous section. For valid observations (e.g., income is $1 million), other schemes are needed.
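The three z-score exercises earlier in this section can be checked numerically. The following is a minimal Python sketch, not part of the original slides. The helper names `z_score` and `is_outlier` are hypothetical, and for exercise 3 it assumes the population standard deviation (`statistics.pstdev`), which the slides do not specify; using the sample standard deviation would give a different value.

```python
import statistics


def z_score(x, mean, std):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / std


def is_outlier(x, mean, std, threshold=3.0):
    """Flag x when its z-score is below -threshold or above +threshold."""
    return abs(z_score(x, mean, std)) > threshold


# Exercises 1 and 2: mean and standard deviation are given in the text.
MEAN, STD = 64.2, 10.37
z78 = z_score(78, MEAN, STD)
z98 = z_score(98, MEAN, STD)

# Exercise 3: mean and standard deviation are computed from the data.
# Assumption: population standard deviation (divide by n, not n - 1).
data = [3, 5, 7, 9, 11]
z5 = z_score(5, statistics.mean(data), statistics.pstdev(data))

if __name__ == "__main__":
    print(f"z(78) = {z78:.2f}, outlier: {is_outlier(78, MEAN, STD)}")
    print(f"z(98) = {z98:.2f}, outlier: {is_outlier(98, MEAN, STD)}")
    print(f"z(5)  = {z5:.2f}")
```

Under these assumptions, 78 has a z-score of about 1.33 and is not an outlier, 98 has a z-score of about 3.26 and is an outlier, and the value 5 has a z-score of about -0.71.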
A popular scheme is truncation/capping/winsorizing. Here, one imposes both a lower and an upper limit on a variable, and any values below or above these limits are brought back to them.
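As an illustration of capping, the sketch below applies it to the example list from the start of these notes. The slides do not say how the limits are chosen, so this sketch assumes the common boxplot convention of placing them 1.5 × IQR below the first quartile and above the third quartile. The function names `iqr_fences` and `winsorize`, the multiplier `k=1.5`, and the `"inclusive"` quartile method are all assumptions made for this example.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles.

    Assumption: k = 1.5 is the usual boxplot whisker convention;
    the original notes do not state a specific multiplier.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(values, lower, upper):
    """Bring every value below/above the limits back to those limits."""
    return [min(max(v, lower), upper) for v in values]


# Example list from the notes; 4300 is the obvious outlier.
data = [20, 24, 22, 19, 29, 18, 4300, 30, 18]
lower, upper = iqr_fences(data)
capped = winsorize(data, lower, upper)

if __name__ == "__main__":
    print("limits:", lower, upper)
    print("capped:", capped)
```

With these choices the quartiles are 19 and 29, the limits are 4 and 44, and only the value 4300 is pulled back to the upper limit. The observation stays in the dataset, which suits valid but extreme values.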