“ggplot” (technically “ggplot2”) is an R package* that facilitates elegant design of graphics. Even if you are brand new to R, you might have heard about “ggplot”–in fact, for some people it might be the main reason they want to learn R.
ggplot2 is much more ambitious, and in some ways much more challenging, than most other packages because it creates a new “grammar” of graphics and requires you to learn some new syntax. But with practice, this syntax will start to make sense, and it can help you make excellent-quality figures. In addition, there are now many extension packages that allow you to do even more with the ggplot grammar (e.g., make maps with ggmap or display networks with ggraph; see the gallery of ggplot2 extensions for more).
ggplot2 is part of the “tidyverse” suite of packages. There is a separate module on other major aspects of tidyverse, such as tidyr and dplyr.
Super Useful References:
ggplot2 website: https://r-graph-gallery.com/ggplot2-package.html
The ggplot2 book (free online version): https://ggplot2-book.org/index.html
The online ‘tidyverse’ book: https://r4ds.had.co.nz/data-visualisation.html
*** What is a “package” in R? ***
R packages are essentially sets of custom functions that R users have created and compiled, along with help files, vignettes, etc. Many of them are archived at CRAN (the Comprehensive R Archive Network) and can be installed from the R console using the function install.packages(). Many other packages are not archived on CRAN but are available from other sources, such as GitHub. “Installing” a package means that it is downloaded onto your computer. When you are ready to use it, you have to load the package into your R session by running the function library() or require().
You can install ggplot2 on its own by running the command below. Alternatively, install.packages("tidyverse") installs all of the “tidyverse” packages, ggplot2 included.
install.packages("ggplot2")
Note that this simply downloads the packages onto your computer. You only have to do this once on a given computer.
You now have the package downloaded on your computer, but to actually use it, you have to load the package. Here we load just the ggplot2 package (if you prefer, library(tidyverse) loads the entire tidyverse, including ggplot2).
library(ggplot2)
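As a minimal sketch of the "install once, load every session" pattern described above: the check with requireNamespace() is an addition for illustration (it is a base R function, not part of the original module) so that the package is only downloaded when it is missing.

```r
# Install ggplot2 only if it is not already on this computer.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current R session (needed in every new session).
library(ggplot2)

# Check that the package is attached, and which version you have.
is_attached <- "package:ggplot2" %in% search()
is_attached
packageVersion("ggplot2")
```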
ggplot2 uses what is called a layered grammar of graphics
We can break down the layers of any graphic into different components (see this pdf for full explanation):
The data
Mapping: how the variables in the data are converted to “aesthetics” of the figure
The geom, or geometric object: the type of visual object you want to make
Scaling: i.e., how different values of variables are represented
Faceting: i.e., representing subsets of data as subplots
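To make the five components listed above concrete, here is a small sketch that uses each of them once. The choice of petal measurements, the axis breaks, and the object name p_components are assumptions made purely for illustration.

```r
library(ggplot2)

p_components <- ggplot(
  data = iris,                                        # the data
  mapping = aes(x = Petal.Length, y = Petal.Width)    # mapping variables to aesthetics
) +
  geom_point() +                                      # the geom: points
  scale_x_continuous(breaks = 1:7) +                  # scaling: how x values are shown
  facet_wrap(vars(Species))                           # faceting: one subplot per species

p_components
```

Removing any one of the added lines (other than the geom) still gives a valid plot, which is one way to see what each component contributes.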
First, specify the data using the ggplot() function.
Add “aesthetic mapping” (i.e., specify the visual parameters of the graphic) using aes(). This can be set within the ggplot() function if you want the aesthetic to apply by default to all layers you are going to define, or within a geom_ function if you want different layers to have different aesthetics.
Define specific plot components using additional geom_ functions (such as geom_point()). Note that you literally add these components using +.
Layer on any other components with additional geom_ or stat_ functions.
Define scaling of variables if needed (e.g., color palette).
Make adjustments via scales, axes, legends, etc.
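The steps above can be sketched by building a plot one piece at a time and storing the intermediate result in an object. This sketch assumes the built-in mtcars dataset (not used elsewhere in this module) and a linear smoother as the second layer; both are illustrative choices only.

```r
library(ggplot2)

# Step 1: data and default aesthetics (on its own this draws an empty panel).
p_steps <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add a geom layer with `+`.
p_steps <- p_steps + geom_point(size = 2)

# Step 3: add another layer; it inherits the default x and y mapping.
p_steps <- p_steps + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Step 4: adjust labels and overall appearance.
p_steps <- p_steps +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_bw()

p_steps
```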
Scatterplots are used to display the relationships between two continuous variables.
In the “basics of plots” module, we created a scatterplot of sepal lengths and widths from the iris dataset that looked like this:
colorset=rainbow(3) #create a palette of 3 colors
pt.cols=colorset[as.numeric(iris$Species)] #This is now a vector of colors for each point
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col=pt.cols)
Here, we will go step by step through how to recreate this figure in ggplot2.
Start by calling ggplot() with the data and the aesthetic mapping. This will only create a blank plot:
ggplot(data=iris, mapping=aes(x=Sepal.Length, y=Sepal.Width))
Add the points with geom_point():
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_point()
Set the point size within the geom_ function:
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_point(size=2)
Color the points by species (set within the aes() argument):
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=2)
In the base R example plot above, we used a rainbow(3) palette to generate 3 color values. We can do that here using a scale_color_discrete() function. Note: there are lots of different scale_color_ functions, and it might take you a while to get familiar with them.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=2) +
scale_color_discrete(type=rainbow(3))
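As one example of another scale_color_ function, the sketch below uses scale_color_manual() with a named vector, so that each species is tied explicitly to one color. This is an alternative offered for illustration, not the approach used in the rest of this module; the names species_cols and p_manual are made up for this example.

```r
library(ggplot2)

# Named vector: names are the species, values are the three rainbow colors.
species_cols <- setNames(rainbow(3), levels(iris$Species))

p_manual <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 2) +
  scale_color_manual(values = species_cols)

p_manual
```

Naming the values means the color assigned to a species does not depend on the order of the factor levels.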
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=2) +
scale_color_discrete(type=rainbow(3)) +
theme_bw()
Use theme() to remove the grid lines:
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=2) +
scale_color_discrete(type=rainbow(3)) +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Right now, the labels say “Sepal.Length” and “Sepal.Width”. Let’s change the periods into spaces:
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=2) +
scale_color_discrete(type=rainbow(3)) +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
xlab("Sepal Length") +
ylab("Sepal Width")
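The labs() function is another way to set labels, and it can handle the axis titles and the legend title in one call. The sketch below is an illustrative alternative to xlab() and ylab(); the legend title "Iris species" and the object name p_labs are assumptions for this example.

```r
library(ggplot2)

p_labs <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 2) +
  theme_bw() +
  # x and y set the axis titles; color sets the title of the color legend.
  labs(x = "Sepal Length", y = "Sepal Width", color = "Iris species")

p_labs
```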
Next, remove the legend. OK, most of the time you probably should have a legend, but it will be helpful for you to learn how to play around with it. There are several ways to do this, but one way is to set the legend.position argument in the theme() function.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=2) +
scale_color_discrete(type=rainbow(3)) +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none") +
xlab("Sepal Length") +
ylab("Sepal Width")
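For comparison, here is a short sketch of two other ways to work with the legend: moving it rather than removing it, and switching off a single legend with guides(). The object names are made up for this example.

```r
library(ggplot2)

base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 2) +
  theme_bw()

# Keep the legend but move it below the panel.
p_bottom <- base_plot + theme(legend.position = "bottom")

# Remove only the legend for the color aesthetic.
p_noguide <- base_plot + guides(color = "none")

p_bottom
p_noguide
```

The guides() approach is handy when a plot has several legends and you only want to drop one of them.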
Just to note: you can also move the aesthetic mapping to the geom_point() function rather than the ggplot() function. It doesn’t make any difference in this example because you have only one geom function. But it might make a difference when you are doing more complex visualizations.
ggplot(iris) +
geom_point(aes(x=Sepal.Length, y=Sepal.Width, color=Species), size=2) +
scale_color_discrete(type=rainbow(3)) +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none") +
xlab("Sepal Length") +
ylab("Sepal Width")
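Here is a sketch of a case where the placement of the mapping does matter. It assumes a second layer, geom_smooth() with a linear fit, which is not part of the figure being recreated and is added only to show the difference.

```r
library(ggplot2)

# color mapped in ggplot(): every layer inherits it,
# so the smoother fits one line per species.
p_global <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color mapped only in geom_point(): the smoother does not see Species,
# so it fits a single line through all the points.
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_global
p_local
```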
Exporting a plot with the ggsave() function
You can export the last plot you made using the function ggsave(). Enter the file name you want to save it as, including the file extension.
ggsave("scatterplot.png")
You will find the file in your working directory (your R project folder, if you are working in an RStudio project).
You might find that you want to adjust the width and height of the plot. You can set this in inches or another unit (see ?ggsave for details).
ggsave("scatterplot.png", width=8, height=4, units="in")
A better way is to save the plot as an object, and then save it. Here, we will assign the plot with the legend as p and then save it.
p=ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point(size=2) +
scale_color_discrete(type=rainbow(3)) +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
xlab("Sepal Length") +
ylab("Sepal Width")
#display the plot
p
#save the plot
ggsave("scatterplot_w_legend.pdf", plot=p, width=8, height=4, units="in")
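A minimal sketch of why saving from an object is more robust: when the plot argument is given, ggsave() writes that object rather than whichever plot happened to be displayed last. The temporary file path and the object name p_save are assumptions for illustration, used so the example does not write into your project folder.

```r
library(ggplot2)

p_save <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 2)

# A throwaway file path; the .pdf extension tells ggsave() which format to use.
out_file <- tempfile(fileext = ".pdf")

# plot = p_save saves this specific object.
ggsave(out_file, plot = p_save, width = 8, height = 4, units = "in")

# Confirm that the file was written.
file.exists(out_file)
```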
Here is a vignette for other aesthetic specifications: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
Here is the “themes” section in the ggplot2 book: https://ggplot2-book.org/polishing.html
Boxplots and violin plots are used to display the relationship between a categorical variable and a continuous variable.
Boxplots (aka “box-and-whisker plots”) typically display the median, the 25th and 75th percentiles, whiskers that extend to the most extreme data points within 1.5 × IQR (inter-quartile range) of the 25th and 75th percentiles, and outliers beyond the whiskers. Violin plots show the distribution of data for each category as a density plot.
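To see where the parts of a boxplot come from, the sketch below computes them by hand in base R for one species. The variable names are made up for this example, and it uses the default quantile() method.

```r
# Sepal widths for a single species.
x <- iris$Sepal.Width[iris$Species == "setosa"]

# The box: 25th percentile, median, 75th percentile.
q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Fences sit 1.5 * IQR beyond the edges of the box.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme observations that fall inside the fences.
whisker_low <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Observations outside the fences are drawn individually as outliers.
outliers <- x[x < lower_fence | x > upper_fence]

c(whisker_low = whisker_low, q25 = q[1], median = q[2], q75 = q[3],
  whisker_high = whisker_high)
outliers
```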
The typical boxplot
ggplot(iris, aes(x= Species, y=Sepal.Width, fill=Species)) +
geom_boxplot() +
scale_fill_brewer(palette="RdYlBu") +
ylab("Sepal Width")
# Horizontal, notched boxplot: x and y are swapped, so Sepal.Width is on the x-axis
ggplot(iris, aes(y= Species, x=Sepal.Width, fill=Species)) +
geom_boxplot(notch=TRUE) +
scale_fill_brewer(palette="RdYlBu") +
xlab("Sepal Width")
ggplot(iris, aes(x= Species, y=Sepal.Width, fill=Species)) +
geom_violin() +
scale_fill_brewer(palette="Blues") +
ylab("Sepal Width")
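Because geoms are layers, the two plot types can also be combined. The following is a sketch of a narrow boxplot drawn inside each violin; the width, transparency, and fill values are arbitrary choices for illustration.

```r
library(ggplot2)

p_violin_box <- ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_violin(alpha = 0.6) +
  # A narrow white boxplot on top; outlier points are hidden to reduce clutter.
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  scale_fill_brewer(palette = "Blues") +
  ylab("Sepal Width")

p_violin_box
```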
Histograms and density plots are used to visualize the distribution of a continuous variable.
ggplot(iris) +
geom_histogram(aes(x=Sepal.Width)) +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
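The message above is a prompt to choose the bin width yourself. As a sketch, the example below sets binwidth explicitly; the value 0.2 and the boundary at 0 are assumptions chosen for illustration, not recommendations.

```r
library(ggplot2)

p_hist <- ggplot(iris, aes(x = Sepal.Width)) +
  # binwidth is in the units of the x variable; boundary aligns the bin edges.
  geom_histogram(binwidth = 0.2, boundary = 0, color = "black", fill = "grey70") +
  theme_classic()

p_hist
```

With binwidth set, the stat_bin() message no longer appears.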
ggplot(iris, aes(x=Sepal.Width, fill=Species)) +
geom_histogram(alpha=0.7, position='identity', color="black") +
scale_fill_manual(values=c("red", "yellow", "blue")) +
facet_grid(rows=vars(Species)) +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(iris, aes(x=Sepal.Width)) +
geom_density() +
theme_classic()
ggplot(iris, aes(x=Sepal.Width, fill=Species)) +
geom_density(alpha=0.5) +
theme_classic()
A line chart is typically used to display the change in a numerical value across time.
Note: This is not the same as fitting a best-fit line to a set of data.
For this, let’s use a different data set included in base R, called Orange. This dataset shows the change in circumference of 5 orange trees across different ages.
Orange
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
## 7 1 1582 145
## 8 2 118 33
## 9 2 484 69
## 10 2 664 111
## 11 2 1004 156
## 12 2 1231 172
## 13 2 1372 203
## 14 2 1582 203
## 15 3 118 30
## 16 3 484 51
## 17 3 664 75
## 18 3 1004 108
## 19 3 1231 115
## 20 3 1372 139
## 21 3 1582 140
## 22 4 118 32
## 23 4 484 62
## 24 4 664 112
## 25 4 1004 167
## 26 4 1231 179
## 27 4 1372 209
## 28 4 1582 214
## 29 5 118 30
## 30 5 484 49
## 31 5 664 81
## 32 5 1004 125
## 33 5 1231 142
## 34 5 1372 174
## 35 5 1582 177
ggplot(Orange, aes(x=age, y=circumference, color=Tree)) +
geom_line()
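To illustrate the note above about line charts versus best-fit lines, here is a sketch of both on the same data. The axis labels follow the units given in the Orange help page (?Orange); the single linear fit is an assumption added only for contrast.

```r
library(ggplot2)

# Line chart: connects each tree's own observations in order of age.
p_line <- ggplot(Orange, aes(x = age, y = circumference, color = Tree)) +
  geom_line() +
  geom_point() +
  labs(x = "Age (days since 1968-12-31)", y = "Circumference (mm)")

# Best-fit line: one straight line estimated from all points,
# which does not pass through every observation.
p_fit <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_point(aes(color = Tree)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Age (days since 1968-12-31)", y = "Circumference (mm)")

p_line
p_fit
```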