Big data analytics presentation, Study notes of Computer Science

Visvesvaraya Technological University Computer Science

Video transcript of the presentation. For more details, see the classroom.

Typology: Study notes

2019/2020

Uploaded on 07/03/2020

vasanth-bhagawat 🇮🇳


Outliers/Anomalies

Detecting anomalies is critical to any business, whether to identify faults or to act proactively.

Vasanth C bhagawat @ AMCEC

What is an Anomaly/Outlier?

In statistics, outliers are data points that don’t belong to a certain population. An outlier is an abnormal observation that lies far away from other values and diverges from otherwise well-structured data. For example, you can clearly see the outlier (4300) in this list: [20, 24, 22, 19, 29, 18, 4300, 30, 18]

Another reason to detect anomalies is that, when preparing datasets for machine learning models, it is important to find all the outliers and either remove them or analyze them to understand why they are there in the first place.

Two types of outliers can be considered:

1. Valid observations (e.g., salary of boss is $1 million)
2. Invalid observations (e.g., age is 300 years)

Both are univariate outliers in the sense that they are outlying on one dimension. However, outliers can be hidden in unidimensional views of the data. Multivariate outliers are observations that are outlying in multiple dimensions. Figure 2.3 gives an example of two outlying observations considering both the dimensions of income and age.

[Figure 2.3: Income and Age]

Therefore, any data point that lies more than 3 standard deviations away from the mean is very likely to be anomalous, i.e. an outlier.

Method 2: Boxplots

Box plots are a graphical depiction of numerical data through their quantiles. They are a very simple but effective way to visualize outliers. Think of the lower and upper whiskers as the boundaries of the data distribution. Any data points that fall above or below the whiskers can be considered outliers or anomalous.

Boxplot Anatomy

The concept of the Interquartile Range (IQR) is used to build boxplot graphs. The IQR is a statistical measure of dispersion and data variability, obtained by dividing the dataset into quartiles. In simple words, any set of observations is divided into four defined intervals based on the values of the data and how they compare to the entire dataset. The quartiles are the three points that divide the data into these four intervals.

Z-Score

Z-scores can also be used to find outliers. A z-score tells how many standard deviations a number is from the mean. A positive z-score means that the number is above the mean, and a negative z-score means it is below the mean.

Using z-scores: if we use the z-score to find possible outliers, the data set must be symmetric (bell-shaped). An outlier is then a value with a z-score below -3 or above 3. The z-score of a value x is

z = (x - μ) / σ

where μ represents the average of the variable (mean) and σ its standard deviation.

1. Given that a data set is bell-shaped (normal distribution) and has a mean of 64.2 and a standard deviation of 10.37, determine whether a data value of 78 is an outlier.
2. Given the same information as above, test whether the data value 98 is an outlier.
3. Given the data set 3, 5, 7, 9, 11, find the z-score of 5.

Some analytical techniques (e.g., decision trees, neural networks, Support Vector Machines (SVMs)) are fairly robust with respect to outliers. Others (e.g., linear/logistic regression) are more sensitive to them. How to treat an outlier depends heavily on whether it represents a valid or an invalid observation. For invalid observations (e.g., age is 300 years), one could treat the outlier as a missing value using any of the schemes discussed in the previous section. For valid observations (e.g., income is $1 million), other schemes are needed.
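The three z-score exercises above can be worked through with a short Python sketch. The helper names `z_score` and `is_outlier` are illustrative and do not come from the slides. Exercise 3 assumes the population standard deviation (`statistics.pstdev`), since the slides do not say which variant is intended.

```python
from statistics import fmean, pstdev


def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


def is_outlier(z, threshold=3.0):
    """Apply the rule from the slides: a z-score below -3 or above 3 is an outlier."""
    return abs(z) > threshold


# Exercises 1 and 2: the mean and standard deviation are given.
z_78 = z_score(78, 64.2, 10.37)
z_98 = z_score(98, 64.2, 10.37)

# Exercise 3: the mean and standard deviation are computed from the data.
# Assumption: population standard deviation (divide by n, not n - 1).
data = [3, 5, 7, 9, 11]
z_5 = z_score(5, fmean(data), pstdev(data))

print(f"z(78) = {z_78:.2f}, outlier: {is_outlier(z_78)}")
print(f"z(98) = {z_98:.2f}, outlier: {is_outlier(z_98)}")
print(f"z(5)  = {z_5:.2f}")
```

Under these assumptions, 78 gives z ≈ 1.33 (not an outlier), 98 gives z ≈ 3.26 (an outlier by the 3-standard-deviation rule), and 5 gives z ≈ -0.71. With the sample standard deviation (`statistics.stdev`) the third result would be about -0.63.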
A popular scheme for valid outliers is truncation (also called capping or winsorizing): one imposes both a lower and an upper limit on a variable, and any values below or above these limits are brought back to them.
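As a minimal sketch of capping, the Python below winsorizes the example list from the start of these notes. The slides do not say how the limits are chosen. Using the boxplot-style fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR is an assumption (a common convention), as are the helper names `iqr_limits` and `winsorize` and the choice of the inclusive quartile method.

```python
from statistics import quantiles


def iqr_limits(values, k=1.5):
    """Return (lower, upper) limits at Q1 - k*IQR and Q3 + k*IQR.

    Assumption: k = 1.5 is the usual boxplot whisker convention,
    not a value stated in the slides.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def winsorize(values, lower, upper):
    """Bring any value below `lower` or above `upper` back to that limit."""
    return [min(max(v, lower), upper) for v in values]


data = [20, 24, 22, 19, 29, 18, 4300, 30, 18]
lower, upper = iqr_limits(data)
capped = winsorize(data, lower, upper)

print("limits:", lower, upper)
print("capped:", capped)
```

Here the limits come out at 4.0 and 44.0, so 4300 is brought back to 44.0 and every other value is unchanged. Other limits, such as fixed percentiles or the mean plus or minus 3 standard deviations, fit the same `winsorize` function.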
