Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

data wrangling cheat sheet, Cheat Sheet of Computational and Statistical Data Analysis

Schematic and complete data wrangling cheat sheet with R and Python

Typology: Cheat Sheet

2018/2019
On special offer
30 Points
Discount

Limited-time offer


Uploaded on 09/02/2019

nicoth
nicoth 🇺🇸

4.3

(20)

18 documents

Partial preview of the text

Download data wrangling cheat sheet and more Cheat Sheet Computational and Statistical Data Analysis in PDF only on Docsity! Data Wrangling with dplyr and tidyr Cheat Sheet RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com Syntax - Helpful conventions for wrangling dplyr::tbl_df(iris) Converts data to tbl class. tbl’s are easier to examine than data frames. R displays only the data that fits onscreen: dplyr::glimpse(iris) Information dense summary of tbl data. utils::View(iris) View data set in spreadsheet-like display (note capital V). Source: local data frame [150 x 5] Sepal.Length Sepal.Width Petal.Length 1 5.1 3.5 1.4 2 4.9 3.0 1.4 3 4.7 3.2 1.3 4 4.6 3.1 1.5 5 5.0 3.6 1.4 .. ... ... ... Variables not shown: Petal.Width (dbl), Species (fctr) dplyr::%>% Passes object on left hand side as first argument (or . argument) of function on righthand side. "Piping" with %>% makes code more readable, e.g. iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width)) %>% arrange(avg) x %>% f(y) is the same as f(x, y) y %>% f(x, ., z) is the same as f(x, y, z ) Reshaping Data - Change the layout of a data set Subset Observations (Rows) Subset Variables (Columns) F M A Each variable is saved in its own column F M A Each observation is saved in its own row In a tidy data set: & Tidy Data - A foundation for wrangling in R Tidy data complements R’s vectorized operations. R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R. FAM M * A * tidyr::gather(cases, "year", "n", 2:4) Gather columns into rows. tidyr::unite(data, col, ..., sep) Unite several columns into one. dplyr::data_frame(a = 1:3, b = 4:6) Combine vectors into data frame (optimized). dplyr::arrange(mtcars, mpg) Order rows by values of a column (low to high). dplyr::arrange(mtcars, desc(mpg)) Order rows by values of a column (high to low). dplyr::rename(tb, y = year) Rename the columns of a data frame. tidyr::spread(pollution, size, amount) Spread rows into columns. tidyr::separate(storms, date, c("y", "m", "d")) Separate one column into several. wwwwwwA1005A1013A1010A1010 wwp11010 745451009wwp11010745451009 wwp11010 745451009wwp11010745451009 wppw1100 7071104510091 0945wwwww11010101010 wwwwdplyr::filter(iris, Sepal.Length > 7) Extract rows that meet logical criteria. dplyr::distinct(iris) Remove duplicate rows. dplyr::sample_frac(iris, 0.5, replace = TRUE) Randomly select fraction of rows. dplyr::sample_n(iris, 10, replace = TRUE) Randomly select n rows. dplyr::slice(iris, 10:15) Select rows by position. dplyr::top_n(storms, 2, date) Select and order top n entries (by group if grouped data). < Less than != Not equal to > Greater than %in% Group membership == Equal to is.na Is NA <= Less than or equal to !is.na Is not NA >= Greater than or equal to &,|,!,xor,any,all Boolean operators Logic in R - ?Comparison, ?base::Logic dplyr::select(iris, Sepal.Width, Petal.Length, Species) Select columns by name or helper function. Helper functions for select - ?select select(iris, contains(".")) Select columns whose name contains a character string. select(iris, ends_with("Length")) Select columns whose name ends with a character string. select(iris, everything()) Select every column. select(iris, matches(".t.")) Select columns whose name matches a regular expression. select(iris, num_range("x", 1:5)) Select columns named x1, x2, x3, x4, x5. select(iris, one_of(c("Species", "Genus"))) Select columns whose names are in a group of names. select(iris, starts_with("Sepal")) Select columns whose name starts with a character string. select(iris, Sepal.Length:Petal.Width) Select all columns between Sepal.Length and Petal.Width (inclusive). select(iris, -Species) Select all columns except Species. Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0• tidyr 0.2.0 • Updated: 1/15 wwwwwwA1005A1013A1010A1010 devtools::install_github("rstudio/EDAWR") for data sets Summarise Cases group_by(.data, ..., add = FALSE) Returns copy of table 
 grouped by … g_iris <- group_by(iris, Species) ungroup(x, …) Returns ungrouped copy 
 of table. ungroup(g_iris) wwwww Use group_by() to create a "grouped" copy of a table. 
 dplyr functions will manipulate each "group" separately and then combine the results. mtcars %>% group_by(cyl) %>% summarise(avg = mean(mpg)) These apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back). VARIATIONS summarise_all() - Apply funs to every column. summarise_at() - Apply funs to specific columns. summarise_if() - Apply funs to all cols of one type. ww ww summarise(.data, …)
 Compute table of summaries. 
 summarise(mtcars, avg = mean(mpg)) count(x, ..., wt = NULL, sort = FALSE)
 Count number of rows in each group defined by the variables in … Also tally().
 count(iris, Species) RStudio® is a trademark of RStudio, Inc. • CC BY SA RStudio • info@rstudio.com • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tibble")) • dplyr 0.7.0 • tibble 1.2.0 • Updated: 2017-03 Each observation, or case, is in its own row Each variable is in its own column & dplyr functions work with pipes and expect tidy data. In tidy data: pipes x %>% f(y) becomes f(x, y) filter(.data, …) Extract rows that meet logical criteria. filter(iris, Sepal.Length > 7) distinct(.data, ..., .keep_all = FALSE) Remove rows with duplicate values. 
 distinct(iris, Species) sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select fraction of rows. 
 sample_frac(iris, 0.5, replace = TRUE) sample_n(tbl, size, replace = FALSE, weight = NULL, .env = parent.frame()) Randomly select size rows. sample_n(iris, 10, replace = TRUE) slice(.data, …) Select rows by position. slice(iris, 10:15) top_n(x, n, wt) Select and order top n entries (by group if grouped data). top_n(iris, 5, Sepal.Width) Row functions return a subset of rows as a new table. See ?base::logic and ?Comparison for help. > >= !is.na() ! & < <= is.na() %in% | xor() arrange(.data, …) Order rows by values of a column or columns (low to high), use with desc() to order from high to low. arrange(mtcars, mpg) arrange(mtcars, desc(mpg)) add_row(.data, ..., .before = NULL, .after = NULL) Add one or more rows to a table. add_row(faithful, eruptions = 1, waiting = 1) Group Cases Manipulate Cases EXTRACT VARIABLES ADD CASES ARRANGE CASES Logical and boolean operators to use with filter() Column functions return a set of columns as a new vector or table. contains(match) ends_with(match) matches(match) :, e.g. mpg:cyl -, e.g, -Species num_range(prefix, range) one_of(…) starts_with(match) pull(.data, var = -1) Extract column values as a vector. Choose by name or index. pull(iris, Sepal.Length) Manipulate Variables Use these helpers with select (), e.g. select(iris, starts_with("Sepal")) These apply vectorized functions to columns. Vectorized funs take vectors as input and return vectors of the same length as output (see back). mutate(.data, …) 
 Compute new column(s). mutate(mtcars, gpm = 1/mpg) transmute(.data, …)
 Compute new column(s), drop others. transmute(mtcars, gpm = 1/mpg) mutate_all(.tbl, .funs, …) Apply funs to every column. Use with funs(). Also mutate_if().
 mutate_all(faithful, funs(log(.), log2(.))) mutate_if(iris, is.numeric, funs(log(.))) mutate_at(.tbl, .cols, .funs, …) Apply funs to specific columns. Use with funs(), vars() and the helper functions for select().
 mutate_at(iris, vars( -Species), funs(log(.))) add_column(.data, ..., .before = NULL, .after = NULL) Add new column(s). Also add_count(), add_tally(). add_column(mtcars, new = 1:32) rename(.data, …) Rename columns.
 rename(iris, Length = Sepal.Length) MAKE NEW VARIABLES EXTRACT CASES wwww wwww wwww wwww wwww wwww www ww ww wwww ww wwww www wwww summary function vectorized function Data Transformation with dplyr : : CHEAT SHEET A B CA B C select(.data, …) Extract columns as a table. Also select_if(). select(iris, Sepal.Length, Species)www dplyr F M A Data Wrangling with pandas Cheat Sheet http://pandas.pydata.org Syntax – Creating DataFrames Tidy Data – A foundation for wrangling in pandas In a tidy data set: F M A Each variable is saved in its own column & Each observation is saved in its own row Tidy data complements pandas’s vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas. Reshaping Data – Change the layout of a data set M A F* M A* pd.melt(df) Gather columns into rows. df.pivot(columns='var', values='val') Spread rows into columns. pd.concat([df1,df2]) Append rows of DataFrames pd.concat([df1,df2], axis=1) Append columns of DataFrames df.sort_values('mpg') Order rows by values of a column (low to high). df.sort_values('mpg',ascending=False) Order rows by values of a column (high to low). df.rename(columns = {'y':'year'}) Rename the columns of a DataFrame df.sort_index() Sort the index of a DataFrame df.reset_index() Reset index of DataFrame to row numbers, moving index to columns. df.drop(columns=['Length','Height']) Drop columns from DataFrame Subset Observations (Rows) Subset Variables (Columns) a b c 1 4 7 10 2 5 8 11 3 6 9 12 df = pd.DataFrame( {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = [1, 2, 3]) Specify values for each column. df = pd.DataFrame( [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=['a', 'b', 'c']) Specify values for each row. a b c n v d 1 4 7 10 2 5 8 11 e 2 6 9 12 df = pd.DataFrame( {"a" : [4 ,5, 6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = pd.MultiIndex.from_tuples( [('d',1),('d',2),('e',2)], names=['n','v'])) Create DataFrame with a MultiIndex Method Chaining Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code. df = (pd.melt(df) .rename(columns={ 'variable' : 'var', 'value' : 'val'}) .query('val >= 200') ) df[df.Length > 7] Extract rows that meet logical criteria. df.drop_duplicates() Remove duplicate rows (only considers columns). df.head(n) Select first n rows. df.tail(n) Select last n rows. Logic in Python (and pandas) < Less than != Not equal to > Greater than df.column.isin(values) Group membership == Equals pd.isnull(obj) Is NaN <= Less than or equals pd.notnull(obj) Is not NaN >= Greater than or equals &,|,~,^,df.any(),df.all() Logical and, or, not, xor, any, all http://pandas.pydata.org/ This cheat sheet inspired by Rstudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) Written by Irv Lustig, Princeton Consultants df[['width','length','species']] Select multiple columns with specific names. df['width'] or df.width Select single column with specific name. df.filter(regex='regex') Select columns whose name matches regular expression regex. df.loc[:,'x2':'x4'] Select all columns between x2 and x4 (inclusive). df.iloc[:,[1,2,5]] Select columns in positions 1, 2 and 5 (first column is 0). df.loc[df['a'] > 10, ['a','c']] Select rows meeting logical condition, and only the specific columns . regex (Regular Expressions) Examples '\.' Matches strings containing a period '.' 'Length$' Matches strings ending with word 'Length' '^Sepal' Matches strings beginning with the word 'Sepal' '^x[1-5]$' Matches strings beginning with 'x' and ending with 1,2,3,4,5 '^(?!Species$).*' Matches strings except the string 'Species' df.sample(frac=0.5) Randomly select fraction of rows. df.sample(n=10) Randomly select n rows. df.iloc[10:20] Select rows by position. df.nlargest(n, 'value') Select and order top n entries. df.nsmallest(n, 'value') Select and order bottom n entries. Summarize Data Make New Columns Combine Data Sets df['w'].value_counts() Count number of rows with each unique value of variable len(df) # of rows in DataFrame. df['w'].nunique() # of distinct values in a column. df.describe() Basic descriptive statistics for each column (or GroupBy) pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling (see below)) and produce single values for each of the groups. When applied to a DataFrame, the result is returned as a pandas Series for each column. Examples: sum() Sum values of each object. count() Count non-NA/null values of each object. median() Median value of each object. quantile([0.25,0.75]) Quantiles of each object. apply(function) Apply function to each object. min() Minimum value in each object. max() Maximum value in each object. mean() Mean value of each object. var() Variance of each object. std() Standard deviation of each object. df.assign(Area=lambda df: df.Length*df.Height) Compute and append one or more new columns. df['Volume'] = df.Length*df.Height*df.Depth Add single column. pd.qcut(df.col, n, labels=False) Bin column into n buckets. Vector function Vector function pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series). These functions produce vectors of values for each of the columns, or a single Series for the individual Series. Examples: shift(1) Copy with values shifted by 1. rank(method='dense') Ranks with no gaps. rank(method='min') Ranks. Ties get min rank. rank(pct=True) Ranks rescaled to interval [0, 1]. rank(method='first') Ranks. Ties go to first value. shift(-1) Copy with values lagged by 1. cumsum() Cumulative sum. cummax() Cumulative max. cummin() Cumulative min. cumprod() Cumulative product. x1 x2 A 1 B 2 C 3 x1 x3 A T B F D T adf bdf Standard Joins x1 x2 x3 A 1 T B 2 F C 3 NaN x1 x2 x3 A 1.0 T B 2.0 F D NaN T x1 x2 x3 A 1 T B 2 F x1 x2 x3 A 1 T B 2 F C 3 NaN D NaN T pd.merge(adf, bdf, how='left', on='x1') Join matching rows from bdf to adf. pd.merge(adf, bdf, how='right', on='x1') Join matching rows from adf to bdf. pd.merge(adf, bdf, how='inner', on='x1') Join data. Retain only rows in both sets. pd.merge(adf, bdf, how='outer', on='x1') Join data. Retain all values, all rows. Filtering Joins x1 x2 A 1 B 2 x1 x2 C 3 adf[adf.x1.isin(bdf.x1)] All rows in adf that have a match in bdf. adf[~adf.x1.isin(bdf.x1)] All rows in adf that do not have a match in bdf. x1 x2 A 1 B 2 C 3 x1 x2 B 2 C 3 D 4 ydf zdf Set-like Operations x1 x2 B 2 C 3 x1 x2 A 1 B 2 C 3 D 4 x1 x2 A 1 pd.merge(ydf, zdf) Rows that appear in both ydf and zdf (Intersection). pd.merge(ydf, zdf, how='outer') Rows that appear in either or both ydf and zdf (Union). pd.merge(ydf, zdf, how='outer', indicator=True) .query('_merge == "left_only"') .drop(columns=['_merge']) Rows that appear in ydf but not zdf (Setdiff). Group Data df.groupby(by="col") Return a GroupBy object, grouped by values in column named "col". df.groupby(level="ind") Return a GroupBy object, grouped by values in index level named "ind". All of the summary functions listed above can be applied to a group. Additional GroupBy functions: max(axis=1) Element-wise max. clip(lower=-10,upper=10) Trim values at input thresholds min(axis=1) Element-wise min. abs() Absolute value. The examples below can also be applied to groups. In this case, the function is applied on a per-group basis, and the returned vectors are of the length of the original DataFrame. Windows df.expanding() Return an Expanding object allowing summary functions to be applied cumulatively. df.rolling(n) Return a Rolling object allowing summary functions to be applied to windows of length n. size() Size of each group. agg(function) Aggregate group using function. Handling Missing Data df.dropna() Drop rows with any column having NA/null data. df.fillna(value) Replace all NA/null data with value. Plotting df.plot.hist() Histogram for each column df.plot.scatter(x='w',y='h') Scatter chart using pairs of points http://pandas.pydata.org/ This cheat sheet inspired by Rstudio Data Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) Written by Irv Lustig, Princeton Consultants
Docsity logo



Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved