Docsity


Cloud computing is a technology that enables users to access computing resources, such as servers, storage, databases, and software, over the internet. It allows businesses to use these resources on a pay-as-you-go basis, making it a cost-effective solution for companies of all sizes.

Typology: Assignments

2017/2018

Uploaded on 04/18/2023

muhammad-muneeb-10 🇵🇰

5 documents


Partial preview of the text

DATA SCIENCE

Semester: ______________________    Date of Experiment: ____________________
Student name: __________________    Faculty Signature: ______________________
Student ID: ____________________

Lab05
DATA CLEANING

PLOs                            Bloom’s Taxonomy
PLO1 - Engineering Knowledge    C1 - Recall
PLO2 - Problem Analysis         C3 - Apply
PLO5 - Modern Tool Usage        C3 - Apply
PLO8 - Ethics                   P2 - Set

LAB TASK PERFORMANCE
CLO’s | Aspects of Assessments | Excellent (75-100%) | Average (50-75%) | Poor (<50%) | Marks

CLO1 / PLO1 / 10%
  Aspect: Recall. The associated concepts of Programming Language.
  Excellent (75-100%): Complete understanding of Programming / actively participate during lecture.
  Average (50-75%): Complete understanding of Programming / less actively participate during lecture.
  Poor (<50%): Student lacks clear understanding of concepts of Programming / Unable to read and interpret it.

CLO2 / PLO2 / 40%
  Aspect: Problem Analysis. Problem identification, analysis / literature review, resulting in meaningful conclusions.
  Excellent (75-100%): Completely identifies the problem in question through efficient analysis / produces near to exact results.
  Average (50-75%): Partially identifies the problem in question and with academic support produces the required results.
  Poor (<50%): Lack of identification of the problem, needing more than par support to analyze the problem and production of results.

CLO5 / PLO5 / 40%
  Aspect: Tools Utilization. Apply and discover different level functions of Jupyter Notebook, Colab.
  Excellent (75-100%): Accurately implement the functions of Jupyter Notebook or Colab, tools and obtain the correct output as per requirement / given tasks.
  Average (50-75%): Implement the functions of Jupyter Notebook or Colab, tools with minor errors that will lead to a slightly different output as per given in a task.
  Poor (<50%): Not able to implement the functions of Jupyter Notebook or Colab, tools and don’t understand how required output and task is achieved.

CLO6 / PLO8 / 10%
  Aspect: Lab Safety. Properly handle lab infrastructure / safety precautions.
  Excellent (75-100%): Properly handle lab equipment & obey safety measures.
  Average (50-75%): Moderate level of lab handling and safety measures.
  Poor (<50%): Minor or no safety measures have been considered.

Total Marks: 10

KARACHI INSTITUTE OF ECONOMICS AND TECHNOLOGY
Department of Software Engineering
College of Engineering

Objective
In this lab, you will learn how to clean and preprocess data using statistical measures such as mean, median, mode, and variance, and how to identify outliers. These techniques are essential for accurate and reliable data analysis and visualization.

Requirement for This Lab:
1. Jupyter Notebook or Google Colab
2. Pandas library

Prerequisites
Basic knowledge of Python and data analysis.

Dataset: Netflix Movies and TV Shows and Corona Virus data. The datasets can be found on Kaggle.

Data cleaning
Data cleaning is the process of detecting and correcting or removing errors, inconsistencies, and inaccuracies in data, to ensure that it is accurate, complete, and reliable for analysis.

Preprocessing
Preprocessing is the step in data analysis where data is cleaned, transformed, and prepared for analysis. It involves tasks such as data cleaning, data integration, data transformation, and data reduction.

Mean
The mean is a statistical measure that represents the average value of a set of data. It is calculated by summing all the values in the data set and dividing the sum by the number of values.

Median
The median is a statistical measure that represents the middle value of a set of data. It is the value that separates the data into two equal halves when the data set is ordered from smallest to largest.

Mode
The mode is a statistical measure that represents the most frequently occurring value in a set of data.

    # To find the number of rows and columns the dataset contains, use the .shape attribute
    df_netflix_2019.shape

    # Check the data type of each column
    df_netflix_2019.dtypes

    # Count duplicated rows in the dataset
    df_netflix_2019.duplicated().sum()

3. Identify Missing Data
Next, we will check whether there are any missing values in the dataset and handle them accordingly. We can use the isnull() function to check for missing values, and the fillna() function to fill them in with appropriate values.

    # Count missing values in each column
    df_netflix_2019.isnull().sum()

    # Percentage of rows missing in each column
    for column in df_netflix_2019.columns:
        percentage = df_netflix_2019[column].isnull().mean()
        print(f'{column}: {round(percentage*100, 2)}%')

4. Dealing with Missing Data
There are different ways of dealing with missing data. The correct approach depends heavily on the data and on the goals of your project.

Remove a column or row with .drop, .dropna or .isnull

    # Drop the 'director' column (returns a new DataFrame; the original is unchanged)
    df_netflix_2019.drop('director', axis=1)

    # Drop rows where 'director' is missing (add inplace=True to modify the DataFrame itself)
    df_netflix_2019.dropna(subset=['director'])

    # Plotting libraries used by the figures below
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Histogram of the 'minute' column
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.hist(df_movie['minute'])
    fig.tight_layout()

Using boxplots to identify outliers within numeric data

    fig, ax = plt.subplots(nrows=1, ncols=1)
    sns.boxplot(x=df_movie['minute'], ax=ax)
    fig.tight_layout()

    # Summary statistics (count, mean, std, min, quartiles, max)
    df_movie['minute'].describe()

Using bars to identify outliers within categorical data

    fig = df_netflix_2019['rating'].value_counts().plot.bar().get_figure()
    fig.tight_layout()

6. Dealing with Outliers
Using operators & | to filter out outliers
In this case, we’re going to filter out outliers based on the values revealed by the boxplot.

    # Outliers: each comparison needs its own parentheses, because | binds more tightly than < and >
    df_movie[(df_movie['minute'] < 43) | (df_movie['minute'] > 158)]

    # Filter the outliers out
    df_movie = df_movie[(df_movie['minute'] > 43) & (df_movie['minute'] < 158)]

    # Boxplot after filtering
    fig, ax = plt.subplots(nrows=1, ncols=1)
    sns.boxplot(x=df_movie['minute'], ax=ax)
    fig.tight_layout()

    # Histogram after filtering
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.hist(df_movie['minute'])
    fig.tight_layout()

7. Save the cleaned dataset
Finally, we use the to_csv() method on the df_netflix_2019 DataFrame to save the cleaned dataset as a new CSV file named 'cleaned_netflix_titles.csv' in the current working directory. The index=False parameter ensures that the row index is not included in the output CSV file.

    # Save the cleaned dataset as a new CSV file
    df_netflix_2019.to_csv('cleaned_netflix_titles.csv', index=False)

    # Dropping column as being string data
    # Check for outliers

TYPE OF DISTRIBUTION DATA FOLLOWS
This is the most important step in checking the data before performing the final cleaning steps. We first check which type of distribution the data follows, and then apply the matching normalization/standardization technique to deal with the outliers. The data shows that the column “Active” follows an exponential distribution (we can tell with the help of the mean, median, mode, etc.), so we normalize the data using exponential normalization. Lastly, we save the result into a separate file using the “.to_csv” method.

Conclusion:
In this lab, we learned how to clean and preprocess the Netflix and corona virus datasets. We went through various data cleaning techniques, such as handling missing values, removing duplicates, and transforming and normalizing/standardizing data to make it more useful for analysis. We also learned that data cleaning methods cannot be applied uniformly to all types of data. Different datasets have different types of data, features, and patterns, and therefore require different cleaning techniques. Before applying any cleaning method, it is crucial to first check and understand the data thoroughly to identify its unique characteristics, such as the data types, missing values, outliers, inconsistencies, and errors.

Lab Exercises
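
Before attempting the exercises, it may help to see the lab's main steps (removing duplicates, measuring and filling missing values, filtering outliers) run end to end on a small made-up DataFrame. The sketch below is illustrative only: the data values, the helper names `missing_percentages` and `iqr_bounds`, and the choice of the 1.5 × IQR rule for the outlier cutoffs are assumptions and not part of the handout, which reads its cutoffs (43 and 158) off the boxplot instead.

```python
import pandas as pd


def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values in each column, rounded to 2 decimals."""
    return (df.isnull().mean() * 100).round(2)


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper cutoffs from the k * IQR rule (assumed rule, hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up rows standing in for the Netflix titles (not the real dataset)
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "F"],
    "director": ["X", None, "Y", "X", None, "Z", "Z"],
    "minute": [90.0, 95.0, 100.0, None, 105.0, 400.0, 400.0],
})

# 1. Remove exact duplicate rows (the second "F" row)
df = df.drop_duplicates().reset_index(drop=True)

# 2. Inspect how much data is missing in each column
print(missing_percentages(df))

# 3. Fill missing values: median for the numeric column, mode for the categorical one
df["minute"] = df["minute"].fillna(df["minute"].median())
df["director"] = df["director"].fillna(df["director"].mode()[0])

# 4. Keep only the rows whose duration lies inside the IQR bounds;
#    the 400-minute row falls outside them and is dropped
low, high = iqr_bounds(df["minute"])
df_clean = df[(df["minute"] >= low) & (df["minute"] <= high)]
print(df_clean)
```

Filling with the median rather than the mean is a deliberate choice here: the 400-minute value would pull the mean upward, while the median is barely affected by a single extreme value.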


