1. The Study

Shore crab, Carcinus maenas

Shore crab, Carcinus maenas

Today, we will recreate some aspects of a study (Wales et al. 2013)1 on the effect of anthropogenic noise on the metabolic rate of the shore crab, Carcinus maenas. You can download the pdf here.


The study is summarized by these two figures:


2 The Data

The data are publicly accessible on Dryad: http://datadryad.org/resource/doi:10.5061/dryad.36f65 The data are saved in two separate “worksheets” within .xlsx file. We will need to save the two sheets separately as individual .csv files. Follow these steps:

  1. Save the “Wale et al_data file.xlsx” in your working directory folder for the week.
  2. Save the first sheet (“Single exposure experiment”) as a .csv file named “single.csv”
  3. Delete the measurement units for “Oxygen consumed”
  4. Save the second sheet (“Multiple exposure experiment”) as a .csv file names “multiple.csv”

Let’s load the datasets into R and take a quick a look to check that the dataframes look correct.

single = read.csv("data/single_exposure.csv")
head(single)
##   Playback.treatment Crab.mass Oxygen.consumed
## 1            Ambient     50.73        142.4711
## 2            Ambient     59.72        168.6700
## 3            Ambient     54.51        205.1131
## 4            Ambient     57.32        168.3071
## 5            Ambient     84.90        209.4783
## 6            Ambient     49.58        154.7141
mult = read.csv("data/multiple_exposure.csv")
head(mult)
##   Playback.treatment     Day1      Day3      Day5      Day7      Day9
## 1            Ambient 1.921936 0.9849995 0.9842713 0.9828148 0.9828148
## 2            Ambient 1.857020 0.8899608 1.6646716 2.0493675 2.0493675
## 3            Ambient 1.723259 1.1569210 1.1565645 1.1558514 3.4240576
## 4            Ambient 1.943196 0.6500865 1.2966410 3.2363590 5.8227945
## 5            Ambient 1.630638 0.8660895 1.6321760 1.8849740 2.9084740
## 6            Ambient 1.798980 0.5874803 0.3429576 2.0403275 2.0403275
##       Day11    Day13    Day15
## 1 0.9791737 1.456381 1.453468
## 2 2.0405021 2.254127 1.858793
## 3 1.7211203 3.426197 1.723616
## 4 3.8826962 3.883294 2.589859
## 5 2.9007815 2.405955 2.655676
## 6 2.5182593 1.806918 1.800568

3 Testing the relationships between mass, noise, and physiology

(Testing the interaction between a continuous variable and a factor)


A central question here is whether the sound environment affects the physiology of crabs. However, we also know that physiological measures are often affected by body size. Thus, we would like to know whether the relationship between mass and oxygen consumption differ across treatments.

Let’s break this down. We are interested in the relationships between 3 measures:

  1. Oxygen consumption (continuous variable)
  2. Body mass (continuous variable)
  3. Experimental treatment (discrete factor w/ two levels: ambient vs. ship noise)

3.1 Plotting Figure 1

We can plot the relationships between these three things this way:

# Run two regression analyses: one for each treatment
mod1 = lm(Oxygen.consumed ~ Crab.mass, data = subset(single, single$Playback.treatment == 
    "Ambient"))
mod2 = lm(Oxygen.consumed ~ Crab.mass, data = subset(single, single$Playback.treatment == 
    "Ship"))

# set up the x and y coordinates for the regression lines
xv = c(min(single$Crab.mass), max(single$Crab.mass))
yv1 = predict(mod1, list(Crab.mass = xv))
yv2 = predict(mod2, list(Crab.mass = xv))

# Figure 1
plot(Oxygen.consumed ~ Crab.mass, data = single, pch = c(1, 18)[Playback.treatment], 
    cex = 1.5, las = 1, xlim = c(0, 90), ylim = c(0, 375))
lines(xv, yv1, lty = 2, lwd = 1.5)
lines(xv, yv2, lty = 1, lwd = 1.5)
*Figure 1: The relationship between body mass and oxygen consumption in shore crabs. Individuals in the 'Ambient' treatment shown in open circles, individuals in 'Ship' treatment shown in filled diamonds.*

Figure 1: The relationship between body mass and oxygen consumption in shore crabs. Individuals in the ‘Ambient’ treatment shown in open circles, individuals in ‘Ship’ treatment shown in filled diamonds.

So, what should we take away from this figure? There are three relationships shown here:

  1. There seems to be a positive relationship between mass and oxygen consumption

  2. Crabs that were exposed to ship noise (filled diamonds) seem to have higher oxygen consumption than those exposed to ambient noise (open circles)

  3. The crabs that were exposed to ship noise seem to have a steeper relationship between mass and oxygen consumption

How do we test these relationships statistically? When we have a continuous dependent variable (Oxygen consumption) and two independent variables–one continuous (mass) and one discrete (playback treatment), we conduct an Analysis of Covariance, or ANCOVA.2

3.2 Running an ANCOVA

You can run an ANCOVA in R in two different ways. First, you could build a linear model and then compute the analysis of variance table. Second, you could run use the aov() wrapper for a linear model and then just look at the output using a summary function.

ANCOVA Option 1: lm() and anova()

mod = lm(Oxygen.consumed ~ Playback.treatment * Crab.mass, data = single)
anova(mod)

ANCOVA Option 2: aov() and summary()

aov.mod = aov(Oxygen.consumed ~ Playback.treatment * Crab.mass, data = single)
summary(aov.mod)
##                              Df Sum Sq Mean Sq F value   Pr(>F)    
## Playback.treatment            1  39355   39355  34.342 2.06e-06 ***
## Crab.mass                     1  33456   33456  29.195 7.46e-06 ***
## Playback.treatment:Crab.mass  1   7828    7828   6.831   0.0139 *  
## Residuals                    30  34379    1146                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So what does this result tell us?

This model has a significant interaction effect (F = 6.8, P = 0.014). In this case, this means that the slopes of the relationship between mass and oxygen consumption are different between treatments. We can see this in the figure above: In both treatments, oxygen consumption increases with body mass, but heavier individuals increase their consumption rate more when they are exposed to ship noise.

Technically speaking, you don’t really want to interpret the other single-variable results (i.e., the effect of mass and treatment on oxygen consumption independently) because they are NOT independent. If you want to look at the effect of mass per se on oxygen consumption, you need to do that within each treatment (because we know that the relationship is different depending on the treatment!). Likewise, if you want to look at the relationship between treatment and oxygen consumption, you should do a simple ANOVA to compare differences between treatments. You can do those like this (outputs not shown):

# The effects of mass on oxygen consumption within each treatment
summary(lm(Oxygen.consumed ~ Crab.mass, data = subset(single, single$Playback.treatment == 
    "Ambient")))
summary(lm(Oxygen.consumed ~ Crab.mass, data = subset(single, single$Playback.treatment == 
    "Ship")))
# The effect of playback treatment, regardless of mass
summary(aov(Oxygen.consumed ~ Playback.treatment, data = single))

3.3: Table of basic statistical tests

Here is a table with some of the most common statistical tests

Dependent Variable Independent Variable(s) Test function
continuous continuous Linear Regression lm()
continuous categorical ANOVA or t-test aov() or t.test()
categorical categorical Fisher’s exact test or contingency table fisher.test() or chisq.test()
continuous continuous + categorical ANCOVA lm() or aov()
continuous categorical + categorical two-way ANOVA aov()
continuous continuous + continuous multiple regression lm()
binary (e.g., yes/no) continuous logistic regression glm()
counts or proportions continuous poisson regression glm()

4 Examining the effect of repeated exposure to noise

The second part of the study uses a repeated playback of ambient or ship noise to look at cumulative effects of noise on the physiology of shore crabs.

In this part of the study, each individual was exposed to repeated trials of noise playbacks, and their oxygen consumption was measured each time.

We have already import the data for this part of the study at the beginning of this module. Let’s refresh our memory about what this looks like.

head(mult)
##   Playback.treatment     Day1      Day3      Day5      Day7      Day9
## 1            Ambient 1.921936 0.9849995 0.9842713 0.9828148 0.9828148
## 2            Ambient 1.857020 0.8899608 1.6646716 2.0493675 2.0493675
## 3            Ambient 1.723259 1.1569210 1.1565645 1.1558514 3.4240576
## 4            Ambient 1.943196 0.6500865 1.2966410 3.2363590 5.8227945
## 5            Ambient 1.630638 0.8660895 1.6321760 1.8849740 2.9084740
## 6            Ambient 1.798980 0.5874803 0.3429576 2.0403275 2.0403275
##       Day11    Day13    Day15
## 1 0.9791737 1.456381 1.453468
## 2 2.0405021 2.254127 1.858793
## 3 1.7211203 3.426197 1.723616
## 4 3.8826962 3.883294 2.589859
## 5 2.9007815 2.405955 2.655676
## 6 2.5182593 1.806918 1.800568

Ok, what you can see here is that the data is stored with each individual as a row, and each trial as a separate column. We will refer to this format as the “wide-format”. To conduct a repeated-measures analysis on this data, we actually need to reformat this data so that each row represents a trial–i.e., each individual measured on a given day of the experiment. We will call this the “long-format”. Let’s see how to use R to manipulate the format of the data:

8.4.1: Re-formatting the data using the tidyr package

The ‘tidyr’ package is an R package designed for re-arranging data in a flexible way. It can be a really nice way to reshape your data or First, if you have not installed the package on your computer yet:

install.packages("tidyr")

We will proceed in two steps:

  1. Create a column for individual ID (which we simply assign as a unique number)
  2. gather() the data using these attributes:
  • This function will take all columns (unless indicated by “-”) and arrange them into a long-format, with the former column headers arranged in a single column, and the values arranged in a second column.
  • “key=” to give a name you want for the column of variables
  • “value=” to give a name you want for the column of values
  • The columns indicated prepended by “-” will be kept as-is. This will be your “ID” and the variables that are unique to that ID.

Let’s try this with our dataset! We will name the “key” variable as “day_txt” (for reasons that will become clearer below), and the “value” variable as “oxy” (for oxygen consumption).

library(tidyr)  #remember to load the package
mult$ID = 1:nrow(mult)
mult_long = gather(mult, key = day_txt, value = oxy, -ID, -Playback.treatment)
head(mult_long)
##   Playback.treatment ID day_txt      oxy
## 1            Ambient  1    Day1 1.921936
## 2            Ambient  2    Day1 1.857020
## 3            Ambient  3    Day1 1.723259
## 4            Ambient  4    Day1 1.943196
## 5            Ambient  5    Day1 1.630638
## 6            Ambient  6    Day1 1.798980
# mult_long$day_txt=as.ordered(mult_long$day_txt)

You can use the spread() function to re-arrange the data back to a “wide-format”.

mult_wide = spread(mult_long, key = day_txt, value = oxy)
head(mult_wide)
##   Playback.treatment ID     Day1     Day11    Day13    Day15      Day3
## 1            Ambient  1 1.921936 0.9791737 1.456381 1.453468 0.9849995
## 2            Ambient  2 1.857020 2.0405021 2.254127 1.858793 0.8899608
## 3            Ambient  3 1.723259 1.7211203 3.426197 1.723616 1.1569210
## 4            Ambient  4 1.943196 3.8826962 3.883294 2.589859 0.6500865
## 5            Ambient  5 1.630638 2.9007815 2.405955 2.655676 0.8660895
## 6            Ambient  6 1.798980 2.5182593 1.806918 1.800568 0.5874803
##        Day5      Day7      Day9
## 1 0.9842713 0.9828148 0.9828148
## 2 1.6646716 2.0493675 2.0493675
## 3 1.1565645 1.1558514 3.4240576
## 4 1.2966410 3.2363590 5.8227945
## 5 1.6321760 1.8849740 2.9084740
## 6 0.3429576 2.0403275 2.0403275

However, note that this will sort the days incorrectly because “Day11” comes after “Day1” when sorting these values alphabetically. This is another reason to keep your data in long-format and keep days as a separate numerical vector…

In any case, we will only be dealing with the long-format data from here on out.

4.2 Conducting the repeated-measures ANOVA

Now, we are ready to conduct our repeat-measures two-way ANOVA analysis.

  • “Repeated-measures” means you have multiple measurements taken for each subject–in this case, you measured the oxygen consumption rate for a crab at Days 1, 3, 5, 7, 9, 11, 13, and 15.
  • “Two-way ANOVA” means you have a continuous dependent variable and two categorical (discrete) independent variable (i.e., days are treated as discrete factors here)

Running a two-way ANOVA is easy: you simply add two dependent variables and an interaction term using “*" as with the ANCOVA. To add a repeated-measures term, you add what is called an “Error()” term, which indicates which variable is the “ID” or “subjects” that is being repeatedly sampled.

aov.mod = aov(oxy ~ Playback.treatment * day_txt + Error(ID), data = mult_long)
summary(aov.mod)
## 
## Error: ID
##                    Df Sum Sq Mean Sq
## Playback.treatment  1  141.2   141.2
## 
## Error: Within
##                             Df Sum Sq Mean Sq F value   Pr(>F)    
## Playback.treatment           1  34.64   34.64  33.933 3.07e-08 ***
## day_txt                      7   6.03    0.86   0.844   0.5527    
## Playback.treatment:day_txt   7  15.57    2.22   2.179   0.0388 *  
## Residuals                  159 162.32    1.02                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This gives us the result that there is a significant interaction between playback treatment and the changes in the crab’s response to the playback experiment across days. Graphically, we can see from Figure 2 in the paper that oxygen consumption increases for the control treatment (“Ambient”“) but not the”Ship" treatment. As with the single-exposure treatment, we really shouldn’t be interpreting the “general” relationship between day and oxygen consumption nor the treatment and oxygen consumption…

4.3: A little bit of a correction on the paper:

Ok, thus far we have re-created the analyses that were conducted and presented in the paper. Here, I would like to point out a couple of minor but significant-enough errors in this analysis.

  1. First, in the two-way repeated-measures ANOVA presented above, the “ID” variable needs to be assigned as a factor, i.e., a categorical variable that has no order. Currently, this variable is an integer, and as such it is interpreted as something that has rank-order (i.e., individual with ID = 1 has some quantitity that is lower than an individual with ID = 10). This is not correct.

  2. The days are currently not interpreted as factors–categorical variables without any order. Clearly, days are ordered, and we are interested in the change in values across days. One could argue about whether “days” should be treated as a continuous variable.

Let’s fix the first problem first. This is easy to do–we just need to tell R that “ID” is a factor within the aov() function:

aov.mod = aov(oxy ~ Playback.treatment * day_txt + Error(factor(ID)), data = mult_long)
summary(aov.mod)
## 
## Error: factor(ID)
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## Playback.treatment  1  175.2  175.21   30.61 2.04e-05 ***
## Residuals          20  114.5    5.72                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##                             Df Sum Sq Mean Sq F value   Pr(>F)    
## day_txt                      7   6.03  0.8613   2.485   0.0196 *  
## Playback.treatment:day_txt   7  15.57  2.2242   6.416 1.48e-06 ***
## Residuals                  140  48.53  0.3467                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can see that this immediately increases the all of the effects observed here. This is because when “ID” was coded as an integer, it was creating “noise” that is irrelevant. Thus, the interaction effect is actually much stronger than is reported in the paper.

To fix the second problem (Days are not being interpreted as an ordered variable), we need to first re-code the “day” variable so that they are numbers rather than text. But we don’t have to re-type this data or anything–the information is within the character strings. We just need a way to change “Day1” into “1” and “Day3” to “3”, etc. Here, we can use a function called substr() (for substring) to extract certain characters from a character string. In this case, we just want the 4th through 5th character of the “day_txt” column. Try this (output not shown):

substr(mult_long$day_txt, 4, 5)

So we want to use this code to make a new column in the dataframe and make sure that it is interpreted as a number. Let’s do that and look at the first few lines of the revised dataset.

mult_long$day_num = as.numeric(substr(mult_long$day_txt, 4, 5))
head(mult_long)
##   Playback.treatment ID day_txt      oxy day_num
## 1            Ambient  1    Day1 1.921936       1
## 2            Ambient  2    Day1 1.857020       1
## 3            Ambient  3    Day1 1.723259       1
## 4            Ambient  4    Day1 1.943196       1
## 5            Ambient  5    Day1 1.630638       1
## 6            Ambient  6    Day1 1.798980       1

Now, we can use this new “day_num” variable as the variable of interest in our two-way repeated-measures ANOVA to see how the results might differ when days are treated as a continuous variable:

aov.mod2 = aov(oxy ~ Playback.treatment * day_num + Error(factor(ID)), 
    data = mult_long)
summary(aov.mod2)
## 
## Error: factor(ID)
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## Playback.treatment  1  175.2  175.21   30.61 2.04e-05 ***
## Residuals          20  114.5    5.72                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##                             Df Sum Sq Mean Sq F value   Pr(>F)    
## day_num                      1   1.67   1.672   4.215   0.0418 *  
## Playback.treatment:day_num   1   8.15   8.151  20.543 1.17e-05 ***
## Residuals                  152  60.31   0.397                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now you see that the interaction effect is even stronger, although the results don’t really change qualitatively. I suspect that most of the difference comes simply from the change in degrees of freedom when days are treated as a continuous variable rather than categorical variables.

4.4: Accounting for repeated measures using a linear mixed-effects model

An alternative to using the repeated-measures ANOVA here is to use assign “ID” as a random effect. In very general terms, a random effect is a way to assign a hierarchical model in which you are exploring the relationships between a dependent variable and a set of fixed effects within categories (assigned as the random effect). It is a complicated concept, but it is widely applicable in ecological and biological statistics. There are number of R packages that can handle mixed-effects models, including nlme, lme4, MCMCglmm and others. Here is an example using the nlme package: If you don’t have this installed on your computer yet, run:

install.packages("nlme")

Here is the syntax to use to structure a mixed-effects model with “oxy” (oxygen consumption) as the dependent variable, “Playback.treatment” and “day_num” as fixed effects, and “ID” as the random effect:

# mixed-model using ML
library(nlme)
lme.mod = lme(oxy ~ Playback.treatment * day_num, random = ~1 | ID, data = mult_long)
anova(lme.mod)
##                            numDF denDF   F-value p-value
## (Intercept)                    1   152 257.66516  <.0001
## Playback.treatment             1    20  30.61482  <.0001
## day_num                        1   152   4.21465  0.0418
## Playback.treatment:day_num     1   152  20.54278  <.0001

You should see that the result is actually the same as the last version of the two-way repeated-measures ANOVA that we ran (with days as continuous variable and ID as factor). Thus, these are two viable alternative methods for doing the same thing.
More generally, linear mixed-effects models, and its extension, the Generalized Linear Mixed Model (GLMM) can be powerful tools for statistical analysis.

Mixed-models are very complicated, and different packages deal with them in different ways. You need to sit down and really learn how to deal with them. Here are a couple important references:

  • Bolker et al. (2009) Generalized linear mixed models: a practical guide for ecology and evolution, TREE, 24: 127-135.link

  • Zurr et al. (2009) Mixed Effects Models and Extensions in Ecology with R. Springer.


5 Figure 2

Let’s play around with different versions of Figure 2. For all of these, we will need to calculate the standard error, so first, the custom function for calculating standard error:

se = function(x) sqrt(var(x)/length(x))

5.1 The old way

# step 1: create a summarized dataset with means and standard errors
d1 = aggregate(mult_long$oxy, list(mult_long$day_num, mult_long$Playback.treatment), 
    mean)
ses = aggregate(mult_long$oxy, list(mult_long$day_num, mult_long$Playback.treatment), 
    se)
d1$se = ses[, 3]
names(d1) = c("day", "treatment", "mean", "se")

# plot blank plot, populate with dots & lines & standard error bars
plot(1:15, seq(0, 5, length = 15), type = "n", xlab = "", ylab = "", las = 1)
points(mean ~ day, data = subset(d1, treatment == "Ambient"), type = "b", 
    pch = 21, bg = "white")
points(mean ~ day, data = subset(d1, treatment == "Ship"), type = "b", 
    pch = 21, bg = "black")
arrows(d1$day, d1$mean - d1$se, d1$day, d1$mean + d1$se, code = 3, angle = 90, 
    length = 0.1)

5.2 A fancier way to summarize data using dplyr

There are a couple of packages that help us accomplish complex tasks in fewer keystrokes or in a more organized way. Hadley Wickham http://hadley.nz/ is an R whiz that has created a series of groundbreaking R packages including tidyr, dplyr and ggplot2. These packages enhance your R skills. However, the downside is that it takes a little bit of additional learning curve to understand how to use these tools. We have already touched on tidyr. Here, I will introduce you very briefly to dplyr, which is a package for manipulating data, and ggplot2, an popular package for graphics.

Here is how to make the summarized dataset (mean and standard error) using dplyr. You can find a more complete guide on this package here: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
d2 = mult_long %>% group_by(Playback.treatment, day_num) %>% summarise(mean = mean(oxy), 
    se = se(oxy))
head(d2)
## # A tibble: 6 x 4
## # Groups:   Playback.treatment [1]
##   Playback.treatment day_num  mean     se
##   <fct>                <dbl> <dbl>  <dbl>
## 1 Ambient                  1  1.85 0.0795
## 2 Ambient                  3  1.16 0.159 
## 3 Ambient                  5  1.29 0.159 
## 4 Ambient                  7  1.68 0.221 
## 5 Ambient                  9  2.42 0.442 
## 6 Ambient                 11  2.28 0.292

5.3 Plotting the figure with ggplot2

See this page for details on how to use ggplot2: http://docs.ggplot2.org/current/

## make figure 2 with ggplot2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
p = ggplot(d2)
p = p + geom_errorbar(aes(x = day_num, ymax = mean + se, ymin = mean - 
    se, width = 0.5))
p = p + geom_point(aes(x = day_num, y = mean, colour = Playback.treatment), 
    size = 5)
p

5.4 Figure 2 as bar plots

We can display Figure 2 as a series of paired bar plots.

# make it bar plots
tab.mean = xtabs(mean ~ Playback.treatment + day_num, data = d2)
tab.se = xtabs(se ~ Playback.treatment + day_num, data = d2)

bp = barplot(tab.mean, beside = T, ylim = c(0, max(tab.mean) + max(tab.se)))
arrows(bp, tab.mean - tab.se, bp, tab.mean + tab.se, angle = 90, code = 3, 
    length = 0.05)

We can do the same using ggplot2

p = ggplot(d2, aes(x = day_num, y = mean, group = Playback.treatment))
p = p + geom_bar(aes(fill = Playback.treatment), stat = "identity", position = "dodge")
p = p + geom_errorbar(aes(ymax = mean + se, ymin = mean - se), position = "dodge", 
    stat = "identity")
p


  1. Wale, M.A., Simpson, S.D., & Radford, A.N. (2013) Size-dependent physiological responses of shore crabs to single and repeated playback of ship noise. Biology Letters 9: 20121194

  2. The paper calls this analysis a “Two-way ANOVA” (in Results section), but this is technically wrong. A two-way ANOVA is when there is a continuous dependent variable and two discrete independent variables.