R is amazing for graphics–i.e., visualizing data. Once you learn how to play around with the codes, it is almost infinitely customizable. In fact, R has become a major tool for graphics departments at major journalism outlets such as the BBC. There are also many wonderful resources for how to make compelling and clear visualizations of data (e.g., this book by Claus Wilke)
But let’s not get ahead of ourselves. In this module, we will cover the basics of the plotting functions in the base package of R. There are many additional packages that help you produce fancier plots (e.g., ‘ggplot2’ and ‘lattice’). We may cover ggplot2 later in the course.
There many, many different ways to visualize data (see datavizproject.com). In fact, “data visualization” is an industry in and of itself at this point. What we need is some directory of different kinds of ways to visualize a given set of data.
The directory of visualizations presented in Claus Wilke’s online version of his book is as good as any I’ve seen.
Continuous data are quantitative measurements that can take any value. That is, it can be divided and reduced to finer levels. For example, take weight, height, or length. In R, these types of data would be handled as numerical vectors.
Count data or frequencies are also quantitative measurements, but they cannot be divided into smaller levels. They are counts of whole things. These are typically integers. In R, they would be either numerical vectors or integers.
Categorical data are qualitative data are characteristics, descriptors or labels that are not numerical. In R, these would be stored as factors or character strings.
Depending on the mix of different variables, you can display them in different ways.
We will start with the generic plot()
function. This
function can produce a variety of basic plots. The function can take two
different syntax:
plot(x , y)
– The Cartesian format will take two
variables and plot the first element on the x-axis and the second
element on the y-axis. The plot this produces will depend on the type of
variables (i.e., numeric or factor) that you put in.plot(y~x)
is an alternative syntax (“formula” syntax)
which will produce the same plot as above. In this case, you are
plotting a relationship–i.e., y
as a function of
x
R will automatically detect the type of vector (variable) you are trying to plot.
Try plotting two numerical vectors (i.e., continuous variables) that we create:
a=seq(1,15, 1) #continuous variable 1
b=seq(16,30,1) #continuous variable 2
plot(a, b) #plot two continuous variables
You can see that R automatically detects the two vectors
(a
and b
) as continuous variables and creates
a scatter plot.
Now, let’s create a fictitious factor vector (i.e., discrete
variable) and plot one of the continuous variables against that.
fac=factor(c(rep("A", 5), rep("B", 5), rep("C", 5))) #a factor
plot(fac, a) #plot factor on x-axis and continuous variable on y-axis
Here, R recognizes that we want to plot a factor on the x-axis and continuous variable on y-axis and defaults to a box plot.
You can also plot using a formula syntax, using a
~
:
plot(a~fac) #formula syntax using "~"
Notice that, in the formula syntax, the order of the variables is
different. Here, you are saying “plot a
as a function of
fac
”, so the first variable will show up on the y-axis and
the second variable will be plotted on the x-axis.
If you have a timeseries, you sometimes want to create a line
plot. You can do this simply by designating the plot ‘type’
inside the plot()
function:
t=seq(1,10,1) # time sequence, from 1 to 10
response=c(1,5,4,2,3,9,6,7,8,10) # 10 random numbers
plot(t, response, type="l") #plot a line plot
You can plot the same figure with both lines and points by using
type="b"
for “both”
t=seq(1,10,1) # time sequence, from 1 to 10
response=c(1,5,4,2,3,9,6,7,8,10) # 10 random numbers
plot(t, response, type="b") #plot a line plot
You can plot the same figure using a few other formats (you can look
these up using ?plot()
). Try these out (outputs not
shown):
plot(t, response, type="p") #plot just points
plot(t, response, type="h") #Make a histogram-like plot with thin lines
plot(t, response, type="s") #Make stair steps
Now let’s try making a simple barplot showing frequencies of some categorical variable. Here, we’ll just set up some fictional set of observations of individuals belonging to some groups (say species), “a”, “b” and “c”
groups=c(rep("a", 10), rep("b", 5), rep("c", 2))
We can plot the frequencies of these groups as a barplot:
groups.table=table(groups) #create a frequency table
groups.table #look at it
## groups
## a b c
## 10 5 2
barplot(groups.table) #plot it
Let’s practice plotting data from a dataframe.
First, we need some data to play with. For this exercise, we will use a
dataset called iris
, which is pre-loaded in R:
iris #see all the data
str(iris) #see the structure of the iris dataset
?iris #learn more about the dataset
The basic structure of the dataset is that there are 4 continuous variables (length & width of sepals and Sepals) measured for individuals in 3 different species of iris (Figure 1). This dataset is useful for demonstrating the basics of plotting because it has both continuous and discrete variables.
Let’s try plotting two continuous variables against each other:
plot(iris$Sepal.Length, iris$Sepal.Width)
Compare this with what happens when you plot a continuous variable (sepal length) against a factor (iris species):
plot(iris$Species, iris$Sepal.Width)
You can also use the formula syntax to do the same thing. Try:
plot(iris$Sepal.Width~iris$Sepal.Length) #scatterplot
plot(iris$Sepal.Width~iris$Species) #box plot
One useful feature of the formula syntax is that it allows you to
specify a dataframe you are using with the data=
argument.
Then, you can refer to the column name for that dataframe instead of
specifying both the dataframe and column name with the $
operator. This can simplify the code a little bit:
plot(Sepal.Width~Sepal.Length, data=iris)
R presents many many ways to customize the format of your plots. We will go through several of these, and then we will learn how to find out more.
You can relabel the axes using the arguments xlab=
and
ylab=
:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width")
Now add a plot title using main=
and change re-orient
the y-axis values to be horizontal using las=
:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", main="Length vs. width of sepals in Iris", las=1)
You can change the x- and y-axis ranges by using xlim=
and ylim=
. These arguments each take a vector of two
numbers representing the minimum and maximum values. Let’s make the same
plot, but with different axis limits:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", main="Length vs. width of sepals in Iris", las=1, xlim=c(0,10), ylim=c(0,5))
You can specify the types of points you want to use with the
pch=
argument. Try this:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=4)
Here are what the symbols look like for ‘pch’ values ranging from 1
to 25.
You can manipulate the sizes of points by using the argument
cex=
:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=1, cex=3)
You can specify the colors of the points using the argument
col=
inside the plot()
function. This argument
can take many different types of inputs:
rgb()
functionplot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col=4)
However, using numerical values is pretty limiting. R only uses
values of 1 through 8 to assign colors. Here are those colors:
You can also specify colors using just the names of colors as characters. For example, to make a plot with red dots:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col="red")
There are infinite possibilities for color in R. Many colors have straightforward names (e.g., like “blue”), but there are also lots of crazy color names. Here is one useful page for named colors: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
Here are some examples:
You can use the standard hexadecimal color specification. For example, the hexadecimal code for green is #00FF00:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col="#00FF00")
You can look up hex color codes here: http://www.color-hex.com/
###5.5.4 Creating a color palette: There are several functions to
create color palettes from a template. For example, we can use a
rainbow()
function. You can put the number of colors you
want, and the function will pick three colors that are the most
different from each other within the specified spectrum.
Try it out by plotting 12 different colors:
my.colors=rainbow(12) #set up color palette of rainbow colors with n = 12
plot(1:12, pch=19, cex=2, col=my.colors) #plot dots, each with different colors
Other examples of color palettes include: heat.colors()
,
terrain.colors()
, topo.colors()
and
cm.colors()
my.colors=topo.colors(12) #set up color palette of topo colors with n = 12
plot(1:12, pch=19, cex=2, col=my.colors) #plot dots, each with different colors
Try the same with heat.colors()
,
topo.colors()
, and cm.colors()
.
You can get more color palettes from other R Packages such as RColorBrewer.
There are a couple of ways to create color gradients. First, you can
use the rgb()
function to create custom colors. This
function takes four arguments: intensities of red, green, blue, and the
alpha transparency value. Thus, you can create any color with any
opacity. Here are some examples of colors that you can produce with the
rgb()
function.
Let’s try plotting the figure with enlarged points and transparent color:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=1, cex=3, col=rgb(1, 0, 0, 0.5))
par()
The par()
function is used before the
plot()
function to set up aspects of the plotting region.
For example, you can set the plotting region to be two columns and plot
two different plots using mfrow=
argument inside the
par()
function.
par(mfrow=c(1,2)) #set up a plotting region with one row and two columns:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col="#00FF00") #first plot
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col="tomato") #secon plot
plot()
and
par()
functionsSo we have learned some ways to customize plots in the sections
above. As you might imagine, there are many, many more ways to customize
plots. Most of the graphical parameters you need can be found by looking
in the help file for the function par()
. Take a look at
this help file:
?par
Many of the arguments presented in this help file can be called
within the plot()
function or the par()
function. However, as noted near the beginning of the “Details” section,
there are several arguments that can only be called from the
par()
function, and there are also several argumnets that
produce slightly different results based on whether they are called
within the plot()
or par()
function.
I have included an Appendix to this module that
contains a table of all graphical parameters I know of and their
usage.
So far, we have plotted all items within a plot using the same
symbols. Howevever, we may often want to display more information in a
figure. For example, the iris
dataset contains information
about three different species. How about we plot data from different
species in different colors?
Let’s think about how we might do this. Recall that the dataframe has a
column for species:
head(iris$Species) #recall that head() shows the first three elements of a vector
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
So what we want to do is convert this information to colors. There are a couple of ways to do this. First, we can convert species into numbers, and then use that to assign colors… like this:
as.numeric(iris$Species) #convert species into numbers.
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3
So, we can use this trick to assign different colors to dots for different species.
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col=as.numeric(iris$Species))
You can look above to where we discussed colors and see that col=1 is black, col=2 is red, and col=3 is green.
However, you may want to assign your own custom color sets to the
plots. If so, what you need to do is come up with three colors that you
want to use, say from the topo.colors()
function. Then, we
can use as.numeric(iris$Species)
(remember: these values
range from 1 to 3) as the index to point to the color that will
correspond to the species. Try this:
colorset=rainbow(3) #create a palette of 3 colors
pt.cols=colorset[as.numeric(iris$Species)] #This is now a vector of colors for each point
Now, you can use this new set of colors, pt.cols
, for
the plot.
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col=pt.cols)
Of course, if you start using different colors for different
information, you need to add a legend. You can do that with the
legend()
function. Take a quick look at the help file by
using ?legend
. This function has arguments to assign the
location (either using a keyword or x-y coordinates) within the plotting
region where you want to put the legend, what the legends are, and
several parameters to assign the symbols. Here is an example:
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col=pt.cols)
legend("bottomright", legend=c("I. setosa", "I. versicolor", "I. virginica"), pch=19, col=colorset)
Here, I just told R to plot the legend on the bottom right, but you can also tell it the x-y coordinates of the top left corner of the legend. Often, just using the key words (“bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right” or “center”) works well.
Ok, the legend looks good, but it’s on top of some of the points.
What I would really like to do is plot the legend outside of the plot.
To do this, I have to do two things to set up the plotting region using
the par()
function: Tell it to allow me to plot something
outside the axes (xpd=TRUE
), and expand the margins of the
plotting region, especially to the right of the plot. This can be
specified using the mar=
argument (see ?par
for details). Then, you can set the legend plotting region to be outside
the plot (say at x = 8.2 and y = 3):
par(mar=c(4,4,1,7), xpd=TRUE) #make the margin wider, and allow me to plot outside the box.
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col=pt.cols)
legend(8.2, 3, legend=c("I. setosa", "I. versicolor", "I. virginica"), pch=19, col=colorset)
Sometimes, you want to add extra things like special points or lines
to an existing plot. You can do this using the points()
,
lines()
, text()
and abline()
functions. When you get comfortable with adding elements to plots, you
can start to create really nice plots with lots of information.
To start, let’s assign the plot the we created above (without the
legend) to an object. This will save us from having to run the plotting
function each time we add an element. To do this, we generate the plot,
and then use the recordPlot()
function to save it as an
object (output not shown):
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col=pt.cols) #make the plot again without the legend
example.plot=recordPlot() #save the plot as an object called 'example.plot'
Now, let’s say we want to add some information about the average
Sepal length and width for each species here. We will first calculate
those average values using the tapply()
function:
mean.Sepal.length=tapply(iris$Sepal.Length, iris$Species, mean)
mean.Sepal.width=tapply(iris$Sepal.Width, iris$Species, mean)
mean.Sepal.length
## setosa versicolor virginica
## 5.006 5.936 6.588
mean.Sepal.width
## setosa versicolor virginica
## 3.428 2.770 2.974
We will now use the points()
function to add “x” at the
mean values for all three species:
example.plot
points(x=mean.Sepal.length, y=mean.Sepal.width, pch="x", cex=2) #add an "x" at the mean Sepal length & width for all three species
We can add text too. The text()
function takes arguments
for where the text should be centered, but you can specify whether the
text goes below, left, above or right of the point (using the
pos=
argument), and how far it should be offset from the
point (offset=
argument):
example.plot
points(x=mean.Sepal.length, y=mean.Sepal.width, pch="x", cex=2)
text(x=mean.Sepal.length, y=mean.Sepal.width, labels=c("setosa", "versicolor", "virginica"), pos=1, offset=0.5) #add text below each of the "x" (pos=1), and how far it should be offset (offset = 0.5)
You can also add lines to plots. There are two functions to add
lines: abline()
and lines()
. The
abline()
function is a bit more straightforward. This
function can be used to draw a line that is vertical, horizontal, or
with a specific slope and intercept. Let’s use this function to draw a
vertical line at the mean Sepal length for I. setosa using the
v=
argument within the function. We will use the
lty=3
argument to make a dotted line.
example.plot
abline(v=mean.Sepal.length, lty=3) #add a vertical lines at the mean Sepal length for each species.
We can also add lines for the mean Sepal widths, and make the line colors correspond to the point colors:
example.plot
abline(v=mean.Sepal.length, lty=3, col=colorset) #add vertical lines at the mean Sepal lengths for each species.
abline(h=mean.Sepal.width, lty=3, col=colorset) #add horizontal lines at the mean Sepal widths for each species.
Of course, you will want to sometimes save the figures you generated. There are two simple ways to do that: using the RStudio interface, or using a command.
####4.5.1 Using the RStudio interface Use the “Export” button under “Plots”. I would recommend saving the file as a pdf. Or, if you are just looking to quickly embed a plot in a word file, you can just copy to the clipboard and paste.
####4.5.2 Using R command You can save graphics in many different
formats using R commands. This procedure takes three steps:
1. open up a ‘graphical device’ using a function (e.g.,
pdf()
or png()
)
2. plot a figure into that graphical device
3. close the graphical device
Only after you ‘close’ the graphical device can you see the output. For example, to save a figure as a pdf, you will follw these three lines of code:
pdf(file="exampleplot.pdf") #open up a pdf 'graphical device'
plot(Sepal.Width~Sepal.Length, data=iris, xlab="Sepal Length", ylab="Sepal Width", las=1, pch=19, col=pt.cols) #draw the plot
dev.off() # close the graphical device
Argument | How To Call | Default | Effect |
---|---|---|---|
adj | par() or plot() |
0.5 (center) | text justification |
ann | par() or plot() |
TRUE | axis label on or off |
ask | par() |
FALSE | ask before plotting |
bg | par() or plot() ** |
white | if called in par() : background of plot; if called in
plot() : background of symbol, only if pch=21 through 25
(filled symbols) |
bty | plot() |
o | the type of box around the plot. Select from “o”, “l”, “7”, “c”, “u” or “]” |
cex | par() or plot() ** |
1 | character size expansion (i.e., magnification). If called in
par() : magnifies both symbols and text; if in
plot() : magnifies symbols only |
cex.axis | par() or plot() |
1 | axis magnification |
cex.lab | par() or plot() |
1 | axis label magnification |
cex.main | par() or plot() |
1 | title magnification |
cex.sub | par() or plot() |
1 | subtitle magnification |
col | par() or plot() ** |
black | color. If in par() : color of symbols and box; in plot:
color of symbols |
col.axis | par() or plot() |
black | color of axis values |
col.lab | par() or plot() |
black | color of axis labels |
col.main | par() or plot() |
black | color of main title |
col.sub | par() or plot() |
black | color of sub-title |
family | par() or plot() |
“” | font family: can be “serif”, “sans”, “mono”, “symbol” |
fg | par() or plot() |
black | foreground color. |
fig | par() |
set region of the display device in which to plot. Use vector of form c(x1,x2,y1,y2), with values ranging from 0 to 1. E.g., par(fig=c(0,0.5,0,0.5)) will put the subsequent plot in the lower left corner of the plotting window. You need to also specify new=TRUE to add more plots to the same window. | |
fin | par() |
Same as fig=, but in inches | |
font | plot() |
1 | font type: 1 = plain text, 2 = bold, 3 = italic, 4 = bold italic |
font.axis | par() or plot() |
1 | font type of axis values |
font.lab | par() or plot() |
1 | font type of axis labels |
font.main | par() or plot() |
2 | font type of main title |
font.sub | par() or plot() |
1 | font type of sub-title |
las | par() or plot() |
0 | axis annotation type: 0 = parallel to axis, 1 = horizontal, 2 = perpendicular to axis, 3 = vertical |
log | plot() |
can set axes to be log by log=“x”, log=“y” or log=“xy” | |
lty | par() or plot() |
1 | line type. Specify by integer (0=blank, 1=solid, 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or character (“blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, or “twodash”) |
lwd | par() or plot() |
1 | line width. If called in par() : include line width of
symbols & box. In plot() or lines(): only line width of
plotted lines |
mai | par() |
margin size in inches. Use vector form c(bottom, left, top, right) | |
mar | par() |
c(5,4,4,2)+0.1 | margin size in number of lines of text |
mfcol | par() |
c(1,1) | specify number of rows and columns to split the plotting device in the form c(nrow,ncol). Will plot in order by columns |
mfrow | par() |
c(1,1) | Same as above, but will plot in order by rows |
mfg | par() |
set the order of figures to be plotted from an array | |
mgp | par() or plot() |
c(3,1,0) | margin lines for axis title, label and line |
new | par() |
FALSE | if TRUE, plot the next plot over the pre-existing one |
oma | par() |
set outer margin size in lines of text c(bottom, left, top, right) | |
omd | par() |
set outer margin in “device coordinates” (values of 0 to 1): form c(x1, x2, y1, y2) | |
omi | par() |
set outer margin in inches | |
pch | par() or plot() |
1 | symbol type. See class notes |
tck | par() or plot() |
-0.05 | length of tick marks. If tck=1, shows grid lines |
xaxp | plot() |
can set cordinates of extreme tick marks and intervals. | |
xaxt | par() or plot() |
if xaxt=“n”, suppresses plotting x-axis | |
xlim | plot() |
set the x-axis limits by c(x1,x2) | |
xpd | par() |
FALSE | plot clipping. If FALSE, clips to plot region. if TRUE, clips to figure region, if NA, clips to device region |
yaxp | plot() |
can set cordinates of extreme tick marks and intervals. | |
yaxt | par() or plot() |
if “n”, suppresses plotting y-axis by c(y1,y2) | |
ylim | plot() |
set the y-axis limits by c(y1,y2) |