Cloud computing is a technology that enables users to access computing resources, such as, Assignments of Mathematics

Karachi Institute Of Economics And Technology (KIET), Mathematics

Cloud computing is a technology that enables users to access computing resources, such as servers, storage, databases, and software, over the internet. It allows businesses to use these resources on a pay-as-you-go basis, making it a cost-effective solution for companies of all sizes.

Typology: Assignments

2017/2018

Uploaded on 04/18/2023

muhammad-muneeb-10 🇵🇰


DATA SCIENCE

Semester: ______________________    Date of Experiment: ____________________
Student name: __________________    Faculty Signature: ______________________
Student ID: ____________________

Lab05 DATA CLEANING

PLOs and Bloom's Taxonomy
PLO1 - Engineering Knowledge: C1 - Recall
PLO2 - Problem Analysis: C3 - Apply
PLO5 - Modern Tool Usage: C3 - Apply
PLO8 - Ethics: P2 - Set

LAB TASK PERFORMANCE

CLO1 (PLO1, 10%) Recall: The associated concepts of Programming Language.
    Excellent (75-100%): Complete understanding of programming; actively participates during lecture.
    Average (50-75%): Complete understanding of programming; participates less actively during lecture.
    Poor (<50%): Student lacks a clear understanding of programming concepts; unable to read and interpret them.
    Marks: ____

CLO2 (PLO2, 40%) Problem Analysis: Problem identification, analysis/literature review, resulting in meaningful conclusions.
    Excellent (75-100%): Completely identifies the problem in question through efficient analysis; produces near-exact results.
    Average (50-75%): Partially identifies the problem in question and, with academic support, produces the required results.
    Poor (<50%): Does not identify the problem; needs more than the usual support to analyze the problem and produce results.
    Marks: ____

CLO5 (PLO5, 40%) Tools Utilization: Apply and discover different-level functions of Jupyter Notebook and Colab.
    Excellent (75-100%): Accurately implements the functions of Jupyter Notebook or Colab and obtains the correct output as required by the given tasks.
    Average (50-75%): Implements the functions of Jupyter Notebook or Colab with minor errors that lead to output slightly different from what the task requires.
    Poor (<50%): Not able to implement the functions of Jupyter Notebook or Colab and does not understand how the required output and task are achieved.
    Marks: ____

CLO6 (PLO8, 10%) Lab Safety: Properly handle lab infrastructure and safety precautions.
    Excellent (75-100%): Properly handles lab equipment and obeys safety measures.
    Average (50-75%): Moderate level of lab handling and safety measures.
    Poor (<50%): Minor or no safety measures have been considered.
    Marks: ____

Total Marks: 10

KARACHI INSTITUTE OF ECONOMICS AND TECHNOLOGY
Department of Software Engineering
College of Engineering

Objective
In this lab, you will learn how to effectively clean and preprocess data using statistical measures such as mean, median, mode, and variance, and by identifying outliers. These techniques are essential for accurate and reliable data analysis and visualization.

Requirements for This Lab:
1. Jupyter Notebook or Google Colab
2. Pandas library

Prerequisites
Basic knowledge of Python and data analysis.

Datasets: Netflix Movies and TV Shows, and coronavirus data. The datasets can be found on Kaggle.

Data cleaning
Data cleaning is the process of detecting and correcting or removing errors, inconsistencies, and inaccuracies in data, to ensure that it is accurate, complete, and reliable for analysis.

Preprocessing
Preprocessing is the step in data analysis where data is cleaned, transformed, and prepared for analysis. It involves tasks such as data cleaning, data integration, data transformation, and data reduction.

Mean
The mean is a statistical measure that represents the average value of a set of data. It is calculated by summing all the values in the data set and dividing the sum by the number of values.

Median
The median is a statistical measure that represents the middle value of a set of data. It is the value that separates the data into two equal halves when the data set is ordered from smallest to largest.

Mode
The mode is a statistical measure that represents the most frequently occurring value in a set of data.

    # To find the number of rows and columns in the dataset, use the .shape attribute
    df_netflix_2019.shape

    # Check the data type of each column
    df_netflix_2019.dtypes

    # Count duplicated rows in the dataset
    df_netflix_2019.duplicated().sum()

3. Identify Missing Data
Next, we check whether there are any missing values in the dataset and handle them accordingly. We can use the isnull() function to check for missing values, and the fillna() function to fill them in with appropriate values.

    # Count missing values in each column
    df_netflix_2019.isnull().sum()

    # Percentage of rows missing in each column
    for column in df_netflix_2019.columns:
        percentage = df_netflix_2019[column].isnull().mean()
        print(f'{column}: {round(percentage*100,2)}%')

4. Dealing with Missing Data
There are different ways of dealing with missing data. The correct approach depends heavily on the data and on the goals of your project.

Remove a column or row with .drop or .dropna (optionally guided by an .isnull check)

    # Drop the whole 'director' column (returns a new DataFrame)
    df_netflix_2019.drop('director', axis=1)

    # Drop only the rows where 'director' is missing;
    # reassign the result or pass inplace=True to keep the change
    df_netflix_2019.dropna(subset=['director'])

    # Plotting imports needed from here on
    import matplotlib.pyplot as plt
    import seaborn as sns

    # df_movie and its 'minute' column come from a step not shown here
    fig, ax = plt.subplots(nrows=1, ncols=1)
    plt.hist(df_movie['minute'])
    fig.tight_layout()

Using boxplots to identify outliers within numeric data

    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax = sns.boxplot(x=df_movie['minute'])
    fig.tight_layout()

    # Summary statistics (count, mean, std, min, quartiles, max)
    df_movie['minute'].describe()

Using bars to identify outliers within categorical data

    fig = df_netflix_2019['rating'].value_counts().plot.bar().get_figure()
    fig.tight_layout()

6. Dealing with Outliers
Using the operators & and | to filter out outliers
In this case, we filter out outliers based on the values revealed by the boxplot. Each comparison must be wrapped in parentheses, because & and | bind more tightly than < and >.

    # Outliers: rows below 43 or above 158 minutes
    df_movie[(df_movie['minute'] < 43) | (df_movie['minute'] > 158)]

    # Filter the outliers out
    df_movie = df_movie[(df_movie['minute'] > 43) & (df_movie['minute'] < 158)]

    # Re-check the boxplot after filtering
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax = sns.boxplot(x=df_movie['minute'])
    fig.tight_layout()

    # Re-check the histogram after filtering
    fig, ax = plt.subplots(nrows=1, ncols=1)
    plt.hist(df_movie['minute'])
    fig.tight_layout()

7. Save the cleaned dataset
Finally, we use the to_csv() method on the df_netflix_2019 DataFrame to save the cleaned dataset as a new CSV file named 'cleaned_netflix_titles.csv' in the current working directory. The index=False parameter ensures that the row index is not included in the output CSV file.

    # Save the cleaned dataset as a new CSV file
    df_netflix_2019.to_csv('cleaned_netflix_titles.csv', index=False)

    # Dropping column as being string data

    # Check for outliers

TYPE OF DISTRIBUTION DATA FOLLOWS
This is the most important step for checking the data and performing the final steps of cleaning. We first check which type of distribution the data follows, and then apply the matching normalization/standardization technique to deal with the outliers. The data shows that the column "Active" follows an exponential distribution (we can tell with the help of the mean, median, mode, etc.), so we normalize the data using exponential normalization. Finally, we save the result to a separate file using the .to_csv method.

Conclusion:
In this lab, we learned how to clean and preprocess the Netflix and coronavirus datasets. We went through various data cleaning techniques such as handling missing values, removing duplicates, and transforming and normalizing/standardizing data to make it more useful for analysis. We also learned that data cleaning methods cannot be applied uniformly to all types of data. Different datasets have different types of data, features, and patterns, and therefore require different cleaning techniques. Before applying any cleaning method, it is crucial to check and understand the data thoroughly to identify its characteristics, such as data types, missing values, outliers, inconsistencies, and errors.

Lab Exercises
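Section 3 mentions fillna() for filling in missing values, while section 4 only shows how to drop them. The following is a minimal sketch of the filling approach, assuming a small made-up DataFrame; the column names, the "Unknown" placeholder, and the choice of the median as fill value are illustrative assumptions, not taken from the Netflix dataset.

```python
import pandas as pd

# Small made-up frame standing in for the real dataset (assumption)
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Ann", None, "Bob", None],
    "minutes": [90.0, None, 110.0, 100.0],
})

# Percentage of missing values per column, as in the loop in section 3
missing_pct = df.isnull().mean() * 100

# Work on a copy so the original frame stays untouched
df_filled = df.copy()

# Fill a text column with a placeholder label
df_filled["director"] = df_filled["director"].fillna("Unknown")

# Fill a numeric column with its median, which is less sensitive to outliers than the mean
median_minutes = df_filled["minutes"].median()
df_filled["minutes"] = df_filled["minutes"].fillna(median_minutes)

print(missing_pct.round(2))
print(df_filled)
```

Whether to fill or drop depends on how much data is missing and on what the column is used for; filling keeps the row count intact but introduces values that were never observed.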
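In section 6 the outlier cut-offs (43 and 158 minutes) are read off the boxplot by eye. As a hedged sketch, the same kind of limits can be computed with the 1.5 x IQR rule that boxplot whiskers conventionally follow. The values below are made up for illustration and are not the handout's data.

```python
import pandas as pd

# Made-up durations in minutes (assumption, not the real data)
minutes = pd.Series([88, 90, 92, 95, 95, 100, 104, 110, 12, 240], name="minute")

# Central tendency as defined at the start of the lab
mean_val = minutes.mean()
median_val = minutes.median()
mode_val = minutes.mode().iloc[0]

# Quartiles and interquartile range
q1 = minutes.quantile(0.25)
q3 = minutes.quantile(0.75)
iqr = q3 - q1

# Limits 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Parentheses are required: | and & bind more tightly than < and >
is_outlier = (minutes < lower) | (minutes > upper)
outliers = minutes[is_outlier]
cleaned = minutes[~is_outlier]

print(f"mean={mean_val}, median={median_val}, mode={mode_val}")
print(f"lower={lower}, upper={upper}")
print("outliers:", outliers.tolist())
```

Note how the two extreme values pull the mean away from the median, while the median and mode stay put; that gap is itself a quick hint that outliers are present.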
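The handout does not show the code used to normalize the "Active" column, so the following is only one possible sketch, assuming a log transform followed by min-max scaling, which is a common choice for strongly right-skewed data. The numbers are made up, and this is not necessarily the "exponential normalizing" the lab has in mind.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed counts standing in for the "Active" column (assumption)
active = pd.Series([3, 5, 8, 12, 20, 35, 60, 150, 900, 4000], name="Active")

# A mean far above the median is a quick sign of right skew
mean_val = active.mean()
median_val = active.median()

# log1p computes log(1 + x), so zero counts are handled safely
active_log = np.log1p(active)

# Min-max scale the transformed values to the range [0, 1]
active_scaled = (active_log - active_log.min()) / (active_log.max() - active_log.min())

print(f"mean={mean_val:.1f}, median={median_val:.1f}")
print(f"skew before={active.skew():.2f}, after={active_log.skew():.2f}")
```

The transform compresses the large values rather than deleting them, so no rows are lost; whether that is preferable to filtering depends on whether the extreme values are errors or genuine observations.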
