Data Tidying and Visualization using R tidyr and ggplot2, Lecture notes of Applied Computing

These lecture notes introduce the concept of tidy data and explain why it matters in data analysis. They show how to tidy data with the R package tidyr and how to visualize tidy data with ggplot2, using examples on life expectancy and population. Exercises are included so the reader can practice tidying and visualizing data.

Typology: Lecture notes

2021/2022

Uploaded on 07/05/2022

barbara_gr

R Data Wrangling: tidyverse package tidyr
Boriana Pratt, Office of Population Research, Princeton University
Presented as part of Princeton's Research Computing Winter 2021 Bootcamp

[Diagram: messy data becomes tidy data through data tidying (tidyr); tidy data is reshaped by data manipulation (dplyr), turned into graphs by data visualization (ggplot2), and turned into models by modeling; models become tidy models through model tidying (broom), and model visualization (ggplot2) produces model graphs. Diagram is based on David Robinson, 4/11/2015, "broom: an R Package to Convert Statistical Models into Tidy Data Frames".]

Data Example 1

Tidy?

    year  le    le_male  le_female  le_w  le_wmale  le_wfemale  le_b  le_bmale  le_bfemale
    1900  47.3  46.3     48.3       47.6  46.6      48.7        33.0  32.5      33.5
    1901  49.1  47.6     50.6       49.4  48.0      51.0        33.7  32.2      35.3
    1902  51.5  49.8     53.4       51.9  50.2      53.8        34.6  32.9      36.4
    ...
    1994  75.7  72.4     79.0       76.5  73.3      79.6        69.5  64.9      73.9
    1995  75.8  72.5     78.9       76.5  73.4      79.6        69.6  65.2      73.9
    1996  76.1  73.1     79.1       76.8  73.9      79.7        70.2  66.1      74.2
    1997  76.5  73.6     79.4       77.2  74.3      79.9        71.1  67.2      74.7
    1998  76.7  73.8     79.5       77.3  74.5      80.0        71.3  67.6      74.8
    1999  76.7  73.9     79.4       77.3  74.6      79.9        71.4  67.8      74.7

Column headers contain values: male, female, b, w

Data Example 2

Tidy?
    country      continent  year  lifeExp  pop       gdpPercap
    Afghanistan  Asia       1992  41.674   16317921  649.3414
    Afghanistan  Asia       1997  41.763   22227415  635.3414
    Afghanistan  Asia       2002  42.129   25268405  726.7341
    Afghanistan  Asia       2007  43.828   31889923  974.5803
    Albania      Europe     1992  71.581   3326498   2497.4379
    Albania      Europe     1997  72.950   3428038   3193.0546
    Albania      Europe     2002  75.651   3508512   4604.2117
    Albania      Europe     2007  76.423   3600523   5937.0295
    ...
    Zambia       Africa     1992  46.100   8381163   1210.8846
    Zambia       Africa     1997  40.238   9417789   1071.3538
    Zambia       Africa     2002  39.193   10595811  1071.6139
    Zambia       Africa     2007  42.384   11746035  1271.2116
    Zimbabwe     Africa     1992  60.377   10704340  693.4208
    Zimbabwe     Africa     1997  46.809   11404948  792.4500
    Zimbabwe     Africa     2002  39.989   11926563  672.0386
    Zimbabwe     Africa     2007  43.487   12311143  469.7093

Data about two different types of observations are stored in the same table:
- continent is an attribute of country
- lifeExp, pop and gdpPercap are attributes of country-year

Storing the variable continent in a table whose unit of observation is country-year is redundant, which makes the table error-prone and wastes space.

Data Example 3

Tidy?

    country    pop2012  imr  tfr  le  leM  leF  region           area
    Algeria    37.4     24   2.9  73  72   75   Northern Africa  Africa
    Egypt      82.3     24   2.9  72  70   74   Northern Africa  Africa
    Libya      6.5      14   2.6  75  72   77   Northern Africa  Africa
    Morocco    32.6     30   2.3  72  70   74   Northern Africa  Africa
    Sth Sudan  9.4      101  5.4  52  50   53   Northern Africa  Africa
    Sudan      33.5     67   4.2  60  58   62   Northern Africa  Africa
    Tunisia    10.8     20   2.1  75  73   77   Northern Africa  Africa
    Benin      9.4      81   5.4  56  54   58   Western Africa   Africa
    Gambia     1.8      70   4.9  58  57   59   Western Africa   Africa
    Ghana      25.5     47   4.2  64  63   65   Western Africa   Africa
    ...
    Serbia     7.1      7    1.3  74  71   77   Southern Europe  Europe
    Slovenia   2.1      3    1.5  80  76   83   Southern Europe  Europe
    Spain      46.2     3    1.4  82  79   85   Southern Europe  Europe
    Australia  22.0     4    1.9  82  80   84   Oceania          Oceania
    New Zeal.  4.4      5    2.1  81  79   83   Oceania          Oceania

Column name contains value: 2012
Column name contains values: M, F ... use 2nd table with unit of observation country-year-sex

Using ggplot2 to display gender-specific life expectancy (wide table, one layer per sex):

    p <- ggplot(data=subset(w, area=="Africa"),
                aes(x=reorder(factor(country), leF), y=leF))
    p + geom_point(color="red") +
        geom_point(aes(y=leM), color="blue")

Data Example 3 (tidy version)

    country   year  sex     le
    Algeria   2012  male    72
    Algeria   2012  female  75
    ...
    Zambia    2012  male    48
    Zambia    2012  female  49
    Zimbabwe  2012  male    48
    Zimbabwe  2012  female  47

Using ggplot2 to display gender-specific life expectancy (tidy table, sex mapped to color):

    # w is assumed here to hold the tidy country-year-sex table
    p <- ggplot(data=subset(w, area=="Africa"),
                aes(x=reorder(factor(country), le), y=le, color=sex))
    p + geom_point()

Data Example 5

Tidy?

    country       imr  tfr  le  region
    Algeria       24   2.9  73  Northern Africa
    Egypt         24   2.9  72  Northern Africa
    Benin         81   5.4  56  Western Africa
    Burkina Faso  65   6.0  55  Western Africa
    ...
    Albania       18   1.4  75  Southern Europe
    Bosnia-Herz.  5    1.2  76  Southern Europe
    Croatia       4    1.5  77  Southern Europe
    Greece        4    1.5  80  Southern Europe
    Italy         3    1.4  82  Southern Europe

Are multiple variables stored in one column? Should there be a region column (Northern, Western, Southern, ...) and a separate continent column (Africa, Europe, ...)?
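One way to act on the question raised for Data Example 5 is tidyr's separate(), which splits one column into several. The sketch below is only an illustration: it types in three rows from the table, and the data frame names ex5 and ex5_split are made up for this example. It assumes every region value consists of exactly two words; a one-word value such as Oceania in Data Example 3 would not split cleanly this way and would need separate handling.

```r
library(tidyr)

# Three rows copied from the Data Example 5 table (ex5 is a hypothetical name).
ex5 <- data.frame(
  country = c("Algeria", "Benin", "Italy"),
  le      = c(73, 56, 82),
  region  = c("Northern Africa", "Western Africa", "Southern Europe"),
  stringsAsFactors = FALSE
)

# Split "Northern Africa" into region = "Northern" and continent = "Africa".
# Assumption: every value has exactly one space.
ex5_split <- separate(ex5, region, into = c("region", "continent"), sep = " ")
ex5_split
```

After the split, each column holds a single variable, so continent can be used directly for grouping or coloring.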
Data Example 6

Tidy?

file: world_data_1992.csv

    country      continent  lifeExp  pop       gdpPercap
    Afghanistan  Asia       41.674   16317921  649.3414
    Albania      Europe     71.581   3326498   2497.4379
    Algeria      Africa     67.744   26298373  5023.2166
    ...
    Yemen, Rep.  Asia       55.599   13367997  1879.4967
    Zambia       Africa     46.100   8381163   1210.8846
    Zimbabwe     Africa     60.377   10704340  693.4208

file: world_data_2002.csv

    country      continent  lifeExp  pop       gdpPercap
    Afghanistan  Asia       42.129   25268405  726.7341
    Albania      Europe     75.651   3508512   4604.2117
    Algeria      Africa     70.994   31287142  5288.0404
    ...
    Yemen, Rep.  Asia       60.308   18701257  2234.8208
    Zambia       Africa     39.193   10595811  1071.6139
    Zimbabwe     Africa     39.989   11926563  672.0386

Install and Load Packages, Read Data

    install.packages("tidyr")
    install.packages("dplyr")
    install.packages("ggplot2")

    library("tidyr")
    library("dplyr")
    library("ggplot2")

    # as.is=TRUE keeps character columns as character
    usle <- read.csv(file="uslifeexp.csv", header=TRUE, sep=",", as.is=TRUE)
    usle

        year   le le_male le_female le_w le_wmale le_wfemale le_b le_bmale le_bfemale
    1   1900 47.3    46.3      48.3 47.6     46.6       48.7 33.0     32.5       33.5
    2   1901 49.1    47.6      50.6 49.4     48.0       51.0 33.7     32.2       35.3
    3   1902 51.5    49.8      53.4 51.9     50.2       53.8 34.6     32.9       36.4
    4   1903 50.5    49.1      52.0 50.9     49.5       52.5 33.1     31.7       34.6
    ...
    97  1996 76.1    73.1      79.1 76.8     73.9       79.7 70.2     66.1       74.2
    98  1997 76.5    73.6      79.4 77.2     74.3       79.9 71.1     67.2       74.7
    99  1998 76.7    73.8      79.5 77.3     74.5       80.0 71.3     67.6       74.8
    100 1999 76.7    73.9      79.4 77.3     74.6       79.9 71.4     67.8       74.7

Source: National Vital Statistics Report, Vol. 50, No. 6, 21mar2000

Reshape Data

    # stack le_male and le_female into key/value columns; all other columns are kept
    gather(usle, key="sex", value="lifeexp", le_male, le_female)

        year   le le_w le_wmale le_wfemale le_b le_bmale le_bfemale       sex lifeexp
    1   1900 47.3 47.6     46.6       48.7 33.0     32.5       33.5   le_male    46.3
    2   1901 49.1 49.4     48.0       51.0 33.7     32.2       35.3   le_male    47.6
    3   1902 51.5 51.9     50.2       53.8 34.6     32.9       36.4   le_male    49.8
    4   1903 50.5 50.9     49.5       52.5 33.1     31.7       34.6   le_male    49.1
    ...
    197 1996 76.1 76.8     73.9       79.7 70.2     66.1       74.2 le_female    79.1
    198 1997 76.5 77.2     74.3       79.9 71.1     67.2       74.7 le_female    79.4
    199 1998 76.7 77.3     74.5       80.0 71.3     67.6       74.8 le_female    79.5
    200 1999 76.7 77.3     74.6       79.9 71.4     67.8       74.7 le_female    79.4

    # "pipe operator" ... think of "then"
    select(usle, year, le_male, le_female) %>%
      gather(key="sex", value="lifeexp", le_male, le_female) %>%
      arrange(year)

        year       sex lifeexp
    1   1900   le_male    46.3
    2   1900 le_female    48.3
    3   1901   le_male    47.6
    4   1901 le_female    50.6
    ...
    197 1998   le_male    73.8
    198 1998 le_female    79.5
    199 1999   le_male    73.9
    200 1999 le_female    79.4

Rename and Reshape Data

    select(usle, year, le_male, le_female) %>%
      rename(male = le_male, female = le_female) %>%
      gather(key="sex", value="lifeexp", male, female) %>%
      arrange(year)

        year    sex lifeexp
    1   1900   male    46.3
    2   1900 female    48.3
    3   1901   male    47.6
    4   1901 female    50.6
    ...
    197 1998   male    73.8
    198 1998 female    79.5
    199 1999   male    73.9
    200 1999 female    79.4

    # same reshape with pivot_longer(); select() renames the columns in one step
    select(usle, year, male=le_male, female=le_female) %>%
      pivot_longer(cols=c(male, female), names_to="sex", values_to="lifeexp") %>%
      arrange(year)

Reshape Data

    # Exercise 3: display life expectancy for black, white, male, female on same graph
    mfle <- select(usle, year, le_male, le_female) %>%
      rename(male = le_male, female = le_female) %>%
      gather(key="sex", value="lifeexp", male, female) %>%
      arrange(year)

    select(usle, year, le_b, le_w) %>%
      rename(black = le_b, white = le_w) %>%
      gather(key="race", value="lifeexp", black, white) %>%
      arrange(year) %>%
      ggplot(aes(year, lifeexp, color=race)) +
        geom_point() +
        geom_point(data=mfle, aes(color=sex)) +
        scale_color_discrete(name="Group")

    # get life expectancy gap between male and female and also between black and white
    mutate(usle, race_le_gap = le_w - le_b, sex_le_gap = le_female - le_male) %>%
      select(year, race_le_gap, sex_le_gap)

    # are above data tidy?
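The reshape pipelines above read uslifeexp.csv, which is not reproduced in these notes. As a minimal sketch, the first three rows of the printed table can be typed in by hand to check that gather() and pivot_longer() produce the same long table. The names usle_small, long_gather and long_pivot are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# First three rows of the life-expectancy table, male and female columns only.
usle_small <- data.frame(
  year      = c(1900, 1901, 1902),
  le_male   = c(46.3, 47.6, 49.8),
  le_female = c(48.3, 50.6, 53.4)
)

# Older interface: gather() stacks the named columns into key/value pairs.
long_gather <- usle_small %>%
  rename(male = le_male, female = le_female) %>%
  gather(key = "sex", value = "lifeexp", male, female) %>%
  arrange(year)

# Newer interface: pivot_longer() expresses the same reshape.
long_pivot <- usle_small %>%
  rename(male = le_male, female = le_female) %>%
  pivot_longer(cols = c(male, female), names_to = "sex", values_to = "lifeexp") %>%
  arrange(year)

long_gather
long_pivot
```

Both results have one row per year and sex; pivot_longer() returns a tibble, so the two print slightly differently even though the values agree.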
Add New Columns

    # plot both gaps from the wide table: one layer per gap column
    mutate(usle, race_le_gap = le_w - le_b, sex_le_gap = le_female - le_male) %>%
      select(year, race_le_gap, sex_le_gap) %>%
      ggplot(aes(year, race_le_gap)) +
        geom_point(color="red") +
        geom_point(aes(y=sex_le_gap), color="blue") +
        ylab("life expectancy gap")

Reshape Data

    # how to tidy these data?
    mutate(usle, race = le_w - le_b, sex = le_female - le_male) %>%
      select(year, race, sex) %>%
      gather(key="le_gap_type", value="le_gap_years", race, sex)

    # use the tidy data to draw graph?
    mutate(usle, race = le_w - le_b, sex = le_female - le_male) %>%
      select(year, race, sex) %>%
      gather(key="le_gap_type", value="le_gap_years", race, sex) %>%
      ggplot(aes(year, le_gap_years, color=le_gap_type)) +
        geom_point() +
        ylab("life expectancy gap")

Combine Multiple Columns

    # tidyr function that does the opposite of separate: unite
    tidy_le <- select(usle, year, le_wmale, le_wfemale, le_bmale, le_bfemale) %>%
      rename(white_male = le_wmale, white_female = le_wfemale,
             black_male = le_bmale, black_female = le_bfemale) %>%
      gather(key="racesex", value="lifeexp",
             white_male, white_female, black_male, black_female) %>%
      arrange(year, racesex) %>%
      separate(racesex, c("race", "sex"), sep = "_")

    unite(tidy_le, "racesex", c(race, sex), sep = "_")

        year      racesex lifeexp
    1   1900 black_female    33.5
    2   1900   black_male    32.5
    3   1900 white_female    48.7
    4   1900   white_male    46.6
    ...
    397 1999 black_female    74.7
    398 1999   black_male    67.8
    399 1999 white_female    79.9
    400 1999   white_male    74.6

Reverse of gather(): spread()

    # tidyr function opposite of gather: spread (rows -> columns; "long" to "wide")
    unite(tidy_le, "racesex", c(race, sex), sep = "_") %>%
      spread(key="racesex", value="lifeexp")

        year black_female black_male white_female white_male
    1   1900         33.5       32.5         48.7       46.6
    2   1901         35.3       32.2         51.0       48.0
    3   1902         36.4       32.9         53.8       50.2
    4   1903         34.6       31.7         52.5       49.5
    ...
    97  1996         74.2       66.1         79.7       73.9
    98  1997         74.7       67.2         79.9       74.3
    99  1998         74.8       67.6         80.0       74.5
    100 1999         74.7       67.8         79.9       74.6

    # same, then restore the original column names
    unite(tidy_le, "race_sex", c(race, sex), sep = "_") %>%
      spread(key="race_sex", value="lifeexp") %>%
      rename(le_bfemale = black_female, le_bmale = black_male,
             le_wfemale = white_female, le_wmale = white_male)

Reverse of pivot_longer(): pivot_wider()

    # tidyr function opposite of pivot_longer: pivot_wider (rows -> columns; "long" to "wide")
    unite(tidy_le, "racesex", c(race, sex), sep = "_") %>%
      pivot_wider(names_from="racesex", values_from="lifeexp")

    # same, then restore the original column names;
    # names_from must match the column created by unite()
    unite(tidy_le, "race_sex", c(race, sex), sep = "_") %>%
      pivot_wider(names_from="race_sex", values_from="lifeexp") %>%
      rename(le_bfemale = black_female, le_bmale = black_male,
             le_wfemale = white_female, le_wmale = white_male)
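The long-to-wide step can also be undone in a single call. The sketch below is not from the slides: it types in the 1900 and 1901 values from the tables above, widens them with unite() and pivot_wider(), and then uses the names_sep argument of pivot_longer() to split the column names back into race and sex. The names tidy_small, wide and back are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# 1900 and 1901 values copied from the tables above, already in tidy form.
tidy_small <- data.frame(
  year    = c(1900, 1900, 1900, 1900, 1901, 1901, 1901, 1901),
  race    = rep(c("black", "black", "white", "white"), 2),
  sex     = rep(c("female", "male"), 4),
  lifeexp = c(33.5, 32.5, 48.7, 46.6, 35.3, 32.2, 51.0, 48.0),
  stringsAsFactors = FALSE
)

# Long to wide: one column per race_sex combination.
wide <- tidy_small %>%
  unite("race_sex", c(race, sex), sep = "_") %>%
  pivot_wider(names_from = "race_sex", values_from = "lifeexp")

# Wide back to long: names_sep splits "black_female" into race and sex.
back <- wide %>%
  pivot_longer(cols = -year, names_to = c("race", "sex"), names_sep = "_",
               values_to = "lifeexp")

wide
back
```

Splitting the names inside pivot_longer() replaces the separate gather() and separate() steps used to build tidy_le, provided the column names follow a consistent race_sex pattern.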