Creating Effective Visualizations with ggplot2: A Comprehensive Guide, Study Guides, Projects, Research of Statistics

A step-by-step guide on constructing ggplots using R's ggplot2 library. It covers various aspects such as creating a scatterplot, adjusting axis limits, changing colors, and customizing themes. The document also introduces other types of plots like bubble charts and ordered bar charts.

Typology: Study Guides, Projects, Research

2021/2022

Uploaded on 09/27/2022

blueeyes_11

Partial preview of the text

Data Visualization with ggplot2

1. Understanding the ggplot syntax

The syntax for constructing ggplots can be puzzling if you are a beginner or work primarily with base graphics. The main difference is that, unlike base graphics, ggplot works with dataframes and not individual vectors. All the data needed to make the plot is typically contained within the dataframe supplied to ggplot() itself, or it can be supplied to the respective geoms. More on that later.

The second noticeable feature is that you can keep enhancing the plot by adding more layers (and themes) to an existing plot created using the ggplot() function.

Let's initialize a basic ggplot based on the midwest dataset.

    # Setup
    options(scipen=999)  # turn off scientific notation like 1e+06
    library(ggplot2)
    data("midwest", package = "ggplot2")  # load the data
    # midwest <- read.csv("http://goo.gl/G1K41K")  # alt source

    # Init Ggplot
    ggplot(midwest, aes(x=area, y=poptotal))  # area and poptotal are columns in 'midwest'

A blank ggplot is drawn. Even though x and y are specified, there are no points or lines in it. This is because ggplot doesn't assume that you meant a scatterplot or a line chart to be drawn. I have only told ggplot what dataset to use and what columns should be used for the X and Y axes. I haven't explicitly asked it to draw any points.

Also note that the aes() function is used to specify the X and Y axes. That's because any information that is part of the source dataframe has to be specified inside the aes() function.

2. How to Make a Simple Scatterplot

Let's make a scatterplot on top of the blank ggplot by adding points using a geom layer called geom_point.

    library(ggplot2)
    ggplot(midwest, aes(x=area, y=poptotal)) + geom_point()

We got a basic scatterplot, where each point represents a county.
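
Section 1 mentions that the data can also be supplied to the respective geoms rather than to ggplot() itself. The sketch below is one way to see that, assuming the same midwest dataset. The object names p_global and p_layer are illustrative and not part of the original tutorial; layer_data() is ggplot2's helper for inspecting what a layer will draw.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Data and aesthetics declared once in ggplot(); every layer inherits them
p_global <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point()

# Same plot, but data and aesthetics are supplied to the geom itself
p_layer <- ggplot() +
  geom_point(data = midwest, aes(x = area, y = poptotal))

# Both layers end up with the same x/y values, one row per county
d_global <- layer_data(p_global)
d_layer  <- layer_data(p_layer)
```

Supplying data at the layer level is mostly useful when different layers need different dataframes; for a single dataset, declaring it once in ggplot() keeps the call shorter.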
The basic scatterplot, however, lacks some basic components such as a plot title and meaningful axis labels. Moreover, most of the points are concentrated in the bottom portion of the plot, which makes it hard to read. You will see how to rectify these in upcoming steps.

Like geom_point(), there are many such geom layers, which we will see in a subsequent part of this tutorial series. For now, let's just add a smoothing layer using geom_smooth(method='lm'). Since the method is set as lm (short for linear model), it draws the line of best fit.

    library(ggplot2)
    g <- ggplot(midwest, aes(x=area, y=poptotal)) +
      geom_point() +
      geom_smooth(method="lm")  # set se=FALSE to turn off confidence bands

    # Zoom in without deleting the points outside the limits.
    # As a result, the line of best fit is the same as the original plot.
    g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000))  # zooms in
    plot(g1)

Since all points were considered, the line of best fit did not change.

4. How to Change the Title and Axis Labels

I have stored this as g1. Let's add the plot title and labels for the X and Y axes. This can be done in one go using the labs() function with the title, x and y arguments. Another option is to use ggtitle(), xlab() and ylab().

    library(ggplot2)
    g <- ggplot(midwest, aes(x=area, y=poptotal)) +
      geom_point() +
      geom_smooth(method="lm")  # set se=FALSE to turn off confidence bands

    g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000))  # zooms in

    # Add Title and Labels
    g1 + labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

    # or
    g1 + ggtitle("Area Vs Population", subtitle="From midwest dataset") +
      xlab("Area") +
      ylab("Population")

Excellent! So here is the full function call.

    # Full Plot call
    library(ggplot2)
    ggplot(midwest, aes(x=area, y=poptotal)) +
      geom_point() +
      geom_smooth(method="lm") +
      coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) +
      labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

5. How to Change the Color and Size of Points

How to Change the Color and Size To Static?

We can change the aesthetics of a geom layer by modifying the respective geoms. Let's change the color of the points and the line to a static value.

    library(ggplot2)
    ggplot(midwest, aes(x=area, y=poptotal)) +
      geom_point(col="steelblue", size=3) +  # Set static color and size for points
      geom_smooth(method="lm", col="firebrick") +  # change the color of line
      coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
      labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

You can also change the color palette entirely.

    gg + scale_colour_brewer(palette = "Set1")  # change color palette

6. How to Change the X Axis Texts and Ticks Location

How to Change the X and Y Axis Text and its Location?

Alright, now let's see how to change the X and Y axis text and its location. This involves two aspects: breaks and labels.

Step 1: Set the breaks

The breaks should be on the same scale as the X axis variable. Note that I am using scale_x_continuous because the X axis variable is continuous. Had it been a date variable, scale_x_date could be used. Like scale_x_continuous(), an equivalent scale_y_continuous() is available for the Y axis.

    library(ggplot2)

    # Base plot
    gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
      geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
      geom_smooth(method="lm", col="firebrick", size=2) +
      coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
      labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

    # Change breaks
    gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))

Step 2: Change the labels

You can optionally change the labels at the axis ticks. The labels argument takes a vector of the same length as breaks. Let me demonstrate by setting the labels to the letters a to k (though they have no meaning in this context).

    library(ggplot2)

    # Base Plot
    gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
      geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
      geom_smooth(method="lm", col="firebrick", size=2) +
      coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
      labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

    # Change breaks + label
    gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = letters[1:11])

If you need to reverse the scale, use scale_x_reverse().

    gg + theme_classic() + labs(subtitle="Classic Theme")

For more customized and fancy themes, have a look at the ggthemes package and the ggthemr package.

Bubble plot

While a scatterplot lets you compare the relationship between two continuous variables, a bubble chart serves well if you want to understand the relationship within the underlying groups based on (a) a categorical variable (by changing the color) and (b) another continuous variable (by changing the size of points).

In simpler words, bubble charts are more suitable if you have 4-dimensional data where two dimensions are numeric (X and Y), one is categorical (color) and one more is numeric (size).

The bubble chart clearly distinguishes the range of displ between the manufacturers and how the slope of the lines of best fit varies, providing a better visual comparison between the groups.

    # load package and data
    library(ggplot2)
    data(mpg, package="ggplot2")
    # mpg <- read.csv("http://goo.gl/uEeRGu")

    mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]

    # Scatterplot
    theme_set(theme_bw())  # pre-set the bw theme.
    g <- ggplot(mpg_select, aes(displ, cty)) +
      labs(subtitle="mpg: Displacement vs City Mileage", title="Bubble chart")

    g + geom_jitter(aes(col=manufacturer, size=hwy)) +
      geom_smooth(aes(col=manufacturer), method="lm", se=FALSE)

[Figure: Bubble chart, "mpg: Displacement vs City Mileage"; color legend: audi, ford, honda, hyundai]

Histogram

By default, if only one variable is supplied, geom_bar() tries to calculate the count. In order for it to behave like a bar chart, the stat="identity" option has to be set and both x and y values must be provided.

Histogram on a continuous variable

A histogram on a continuous variable can be created using either geom_bar() or geom_histogram(). When using geom_histogram(), you can control the number of bars using the bins option. Alternatively, you can set the range covered by each bin using binwidth. The value of binwidth is on the same scale as the continuous variable on which the histogram is built. Since geom_histogram() lets you control both the number of bins and the binwidth, it is the preferred option for creating histograms on continuous variables.
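
To make the difference between bins and binwidth concrete, here is a minimal sketch using the mpg data. The object names (p_bins, p_width, n_bins, widths) are illustrative assumptions and not part of the original tutorial. It builds one histogram with a fixed number of bins and one with a fixed bin width, then inspects the bins that ggplot2 computed.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

# Fix the NUMBER of bars: ggplot2 derives the bin width from the data range
p_bins <- ggplot(mpg, aes(displ)) + geom_histogram(bins = 5)

# Fix the WIDTH of each bar (in units of displ): the bar count follows from the range
p_width <- ggplot(mpg, aes(displ)) + geom_histogram(binwidth = 0.5)

# layer_data() returns one row per computed bin, with xmin/xmax edges
n_bins <- nrow(layer_data(p_bins))
widths <- with(layer_data(p_width), xmax - xmin)
```

Only one of the two options needs to be given; binwidth is usually the easier one to interpret because it is expressed in the units of the plotted variable.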

    library(ggplot2)
    theme_set(theme_classic())

    # Histogram on a Continuous (Numeric) Variable
    g <- ggplot(mpg, aes(displ)) + scale_fill_brewer(palette = "Spectral")

    g + geom_histogram(aes(fill=class), binwidth = .1, col="black", size=.1) +  # change binwidth
      labs(title="Histogram with Auto Binning", subtitle="Engine Displacement across Vehicle Classes")

    g + geom_histogram(aes(fill=class), bins=5, col="black", size=.1) +  # change number of bins
      labs(title="Histogram with Fixed Bins", subtitle="Engine Displacement across Vehicle Classes")

Density plot

    library(ggplot2)
    theme_set(theme_classic())

    # Plot
    g <- ggplot(mpg, aes(cty))
    g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
      labs(title="Density plot", subtitle="City Mileage Grouped by Number of cylinders", caption="Source: mpg", x="City Mileage", fill="# Cylinders")
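
Returning to the remark in the Histogram section that geom_bar() counts rows by default and needs stat="identity" to plot supplied values, the following is a minimal sketch of the two behaviours. It assumes a small summary table of mean city mileage per manufacturer built from mpg; the names cty_mean, p_count and p_identity are hypothetical and not from the original tutorial.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

# Hypothetical summary table: one row per manufacturer with its mean city mileage
cty_mean <- aggregate(cty ~ manufacturer, data = mpg, FUN = mean)

# Default behaviour: only x is supplied, so geom_bar() counts rows per manufacturer
p_count <- ggplot(mpg, aes(manufacturer)) +
  geom_bar()

# stat = "identity": bar heights are taken directly from the supplied y column
p_identity <- ggplot(cty_mean, aes(x = manufacturer, y = cty)) +
  geom_bar(stat = "identity")

# Inspect what each layer will draw
counts  <- layer_data(p_count)$count
heights <- layer_data(p_identity)$y
```

ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity").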