Group Exercise 1: Fixing the SampleData_w_errors.csv in R

In Group Exercise 1 for Module 4, I asked you to fix the following in the spreadsheet:

  • The column called “sex” only has “female” and “male”.
  • The column called “size” is numeric
  • The column called “weight” is numeric
  • The column called “X” is removed.

In reality, it is better if you can keep the raw data raw and fix these errors in R. This is because fixing data errors in R will leave a reproducible record of how you changed the data. In contrast, if you go and fix these errors in the spreadsheet, you will likely eventually forget how the original data was changed and you’ll never be able to “unfix” it. Also, if you have a backup copy of this data, you will now have two versions of the data, which will be confusing later. Finally, this prevents any scientific malpractice. So, to the extent that I can, I try to NOT change things in my original data!

Here are how I would fix the problems in the example data:

dat=read.csv("data/SampleData_w_errors.csv")
dat
##    Indiv_ID    sex      age size weight  X
## 1     20-01   male    adult 29.5   66.5  1
## 2     20-02   male juvenile   28   58.5 NA
## 3     20-03 female    adult   26  #N/A! NA
## 4     20-04 female    adult   25   55.5 NA
## 5     20-05 female juvenile   25     62 NA
## 6     20-06   male juvenile   28     61 NA
## 7     20-07 female    adult   26     58 NA
## 8     20-08   male    adult 28.5     65 NA
## 9     20-09   male juvenile 27.5     60 NA
## 10    20-10 female juvenile   26     59 NA
## 11    20-11   male    adult 25.5     62 NA
## 12    20-12 female    adult   27     59 NA
## 13    20-13 female    adult 26.5     60 NA
## 14                            **        NA

Setting “#N/A!” as N/A value

dat=read.csv("data/SampleData_w_errors.csv", na.strings=c("#N/A!", "**"))
dat
##    Indiv_ID    sex      age size weight  X
## 1     20-01   male    adult 29.5   66.5  1
## 2     20-02   male juvenile 28.0   58.5 NA
## 3     20-03 female    adult 26.0     NA NA
## 4     20-04 female    adult 25.0   55.5 NA
## 5     20-05 female juvenile 25.0   62.0 NA
## 6     20-06   male juvenile 28.0   61.0 NA
## 7     20-07 female    adult 26.0   58.0 NA
## 8     20-08   male    adult 28.5   65.0 NA
## 9     20-09   male juvenile 27.5   60.0 NA
## 10    20-10 female juvenile 26.0   59.0 NA
## 11    20-11   male    adult 25.5   62.0 NA
## 12    20-12 female    adult 27.0   59.0 NA
## 13    20-13 female    adult 26.5   60.0 NA
## 14                            NA     NA NA

Remove the mystery column “X”

dat=dat[,which(names(dat)!="X")]
dat
##    Indiv_ID    sex      age size weight
## 1     20-01   male    adult 29.5   66.5
## 2     20-02   male juvenile 28.0   58.5
## 3     20-03 female    adult 26.0     NA
## 4     20-04 female    adult 25.0   55.5
## 5     20-05 female juvenile 25.0   62.0
## 6     20-06   male juvenile 28.0   61.0
## 7     20-07 female    adult 26.0   58.0
## 8     20-08   male    adult 28.5   65.0
## 9     20-09   male juvenile 27.5   60.0
## 10    20-10 female juvenile 26.0   59.0
## 11    20-11   male    adult 25.5   62.0
## 12    20-12 female    adult 27.0   59.0
## 13    20-13 female    adult 26.5   60.0
## 14                            NA     NA

Remove lines for which we have no Individual ID

dat=dat[which(dat$Indiv_ID!=""),]
dat
##    Indiv_ID    sex      age size weight
## 1     20-01   male    adult 29.5   66.5
## 2     20-02   male juvenile 28.0   58.5
## 3     20-03 female    adult 26.0     NA
## 4     20-04 female    adult 25.0   55.5
## 5     20-05 female juvenile 25.0   62.0
## 6     20-06   male juvenile 28.0   61.0
## 7     20-07 female    adult 26.0   58.0
## 8     20-08   male    adult 28.5   65.0
## 9     20-09   male juvenile 27.5   60.0
## 10    20-10 female juvenile 26.0   59.0
## 11    20-11   male    adult 25.5   62.0
## 12    20-12 female    adult 27.0   59.0
## 13    20-13 female    adult 26.5   60.0

Group Exercise 2: Fixing errors in the egg measurement data:

In Group Exercise 2, I provided you with data I collected on American coot eggs before I became practiced in using R… And there are a lot of thigns that need to be fixed here.

1. Periods used as N/A data.

eggs=read.csv("data/EggMeasurements_example.csv")
head(eggs)
##   Nest EggNumber  EggID ChickID Nest_Fate mass length width EggVolume LayDate
## 1 6002         1 6002-1  6002-1    Active 34.4   51.4  35.2  34.60005       .
## 2 6002         2 6002-2  6002-2    Active 34.1   49.4  36.3  35.35432       .
## 3 6002         3 6002-3  6002-3    Active 35.2   49.5  36.2  35.23259       .
## 4 6002         4 6002-4  6002-4    Active   35   48.4  36.2  34.45996       .
## 5 6002         5 6002-5  6002-5    Active 35.1   49.2    36  34.64108  5/5/06
## 6 6002         6 6002-6  6002-6    Active 35.1   49.1  36.1  34.76136  5/6/06
##   Clutch.Avg
## 1   34.40166
## 2   34.40166
## 3   34.40166
## 4   34.40166
## 5   34.40166
## 6   34.40166
str(eggs)
## 'data.frame':    4075 obs. of  11 variables:
##  $ Nest      : chr  "6002" "6002" "6002" "6002" ...
##  $ EggNumber : chr  "1" "2" "3" "4" ...
##  $ EggID     : chr  "6002-1" "6002-2" "6002-3" "6002-4" ...
##  $ ChickID   : chr  "6002-1" "6002-2" "6002-3" "6002-4" ...
##  $ Nest_Fate : chr  "Active" "Active" "Active" "Active" ...
##  $ mass      : chr  "34.4" "34.1" "35.2" "35" ...
##  $ length    : chr  "51.4" "49.4" "49.5" "48.4" ...
##  $ width     : chr  "35.2" "36.3" "36.2" "36.2" ...
##  $ EggVolume : num  34.6 35.4 35.2 34.5 34.6 ...
##  $ LayDate   : chr  "." "." "." "." ...
##  $ Clutch.Avg: num  34.4 34.4 34.4 34.4 34.4 ...

First, you can tell that variables like “mass”, “length” and “width” should be numbers, but they are read as characters.

One way to see what is going on here is to have R show you all of the unique values of the variable. Let’s do this for mass:

unique(eggs$mass)
##   [1] "34.4" "34.1" "35.2" "35"   "35.1" "34.2" "34.3" "33.5" "33.3" "31.2"
##  [11] "21.7" "24.3" "26.3" "26"   "24.9" "26.2" "25.3" "25.7" "25.2" "29.3"
##  [21] "29.8" "28.6" "30.8" "31.8" "32.9" "32.2" "32.5" "30.4" "29.9" "30.7"
##  [31] "33.4" "31.7" "31.1" "28.4" "29"   "29.1" "27.1" "26.8" "26.5" "31.5"
##  [41] "32.4" "31.3" "32"   "30.6" "27.9" "26.9" "28.7" "34.8" "20.1" "18.6"
##  [51] "33"   "31.4" "27.4" "31"   "31.6" "27.7" "27.5" "28.5" "28.8" "32.7"
##  [61] "32.8" "30"   "30.2" "35.8" "29.5" "27.2" "29.6" "33.2" ""     "20.4"
##  [71] "32.1" "29.2" "28.1" "25.6" "26.1" "26.6" "25.8" "20.3" "28.2" "28"  
##  [81] "32.3" "30.1" "30.9" "34.6" "29.4" "32.6" "36.4" "31.9" "26.4" "27.8"
##  [91] "28.3" "36.7" "33.6" "33.9" "34.5" "35.6" "35.9" "36.3" "34"   "33.1"
## [101] "33.8" "30.3" "30.5" "29.7" "."    "35.3" "21.8" "25.9" "38.4" "35.5"
## [111] "36.1" "0"    "26.7" "25.1" "27"   "37.8" "37.6" "35.7" "27.6" "33.7"
## [121] "34.9" "39.3" "34.7" "28.9" "25"   "27.3" "24"   "24.1" "23.5" "25.4"
## [131] "35.4" "24.7" "24.6" "37.5" "24.2" "25.5" "23.7" "24.5" "24.8" "36"  
## [141] "36.2" "22.8" "20.8" "23.9" "23.1" "23.3" "16"   "36.6" "21"   "19.8"
## [151] "19.2" "19.7" "18.8" "23"   "18.1"

You can see that at least one of the things that is going on is that “.” is a value here. That is not a number–so that could be the cause of the problem. You can see that this is the same problem for length and width too.

unique(eggs$length)
##   [1] "51.4" "49.4" "49.5" "48.4" "49.2" "49.1" "49.6" "48.9" "49.3" "49"  
##  [11] "45.6" "46.8" "47.4" "46.7" "46.6" "47.7" "46.4" "47.5" "46"   "46.5"
##  [21] "47.2" "48.1" "47"   "48.3" "48.2" "48.7" "50.4" "49.7" "48"   "45.7"
##  [31] "47.8" "46.3" "46.1" "46.2" "49.8" "48.5" "49.9" "45.5" "51.6" "48.8"
##  [41] "50.9" "51"   "50.5" "50.6" "50.1" "50.8" "50.2" "50"   "47.9" "47.3"
##  [51] "38.5" "45.3" "45.2" "51.1" "52.5" "51.8" "51.3" "48.6" "52"   "45.1"
##  [61] "50.3" "47.6" "46.9" "45.4" "51.2" "53.3" "53"   "52.4" "52.2" "53.1"
##  [71] "52.8" "52.9" "53.2" "52.3" "."    "51.5" "51.9" "44.3" "50.7" "52.1"
##  [81] "52.7" "53.8" "53.4" "54.9" "47.1" "44.9" "54.4" ""     "54.1" "53.7"
##  [91] "45"   "45.9" "45.8" "51.7" "37.8" "44.6" "44.8" "52.6" "58.7" "43.9"
## [101] "44.1" "44.5" "41.8" "43.7" "44.2" "53.5" "44.4" "26.6" "54"   "39"  
## [111] "53.6" "55"   "39.6" "36.9" "54.5" "40"   "54.8" "43.8" "53.9" "54.2"
## [121] "37.6" "54.6"
unique(eggs$width)
##  [1] "35.2" "36.3" "36.2" "36"   "36.1" "35.8" "35.6" "34.9" "34.2" "29.2"
## [11] "30.7" "32.1" "31.8" "31.5" "31.9" "31.7" "31.4" "30.6" "33.5" "33.7"
## [21] "33.3" "35.5" "34.7" "34.6" "33.8" "34.1" "34.4" "33.9" "32.9" "33"  
## [31] "32.8" "32.2" "34.3" "33.4" "34"   "33.1" "35.4" "33.2" "33.6" "35"  
## [41] "35.3" "32.6" "32.5" "32.7" "35.1" "34.5" "32.4" "32"   "32.3" "34.8"
## [51] "35.9" "."    "31.6" "30"   "35.7" "36.5" "30.2" "36.6" ""     "38.4"
## [61] "36.4" "39.2" "31"   "31.1" "343"  "30.4" "28.8" "31.2" "31.3" "37.3"
## [71] "30.9" "29.3" "39"   "27.8" "37"   "22.9" "25.2" "42.6" "42.9" "30.8"
## [81] "30.5" "37.6" "37.9" "36.7" "38.6"

Fix this by telling R that “.” is an na.string

eggs=read.csv("data/EggMeasurements_example.csv", na.strings=".")
unique(eggs$mass)
##   [1] 34.4 34.1 35.2 35.0 35.1 34.2 34.3 33.5 33.3 31.2 21.7 24.3 26.3 26.0 24.9
##  [16] 26.2 25.3 25.7 25.2 29.3 29.8 28.6 30.8 31.8 32.9 32.2 32.5 30.4 29.9 30.7
##  [31] 33.4 31.7 31.1 28.4 29.0 29.1 27.1 26.8 26.5 31.5 32.4 31.3 32.0 30.6 27.9
##  [46] 26.9 28.7 34.8 20.1 18.6 33.0 31.4 27.4 31.0 31.6 27.7 27.5 28.5 28.8 32.7
##  [61] 32.8 30.0 30.2 35.8 29.5 27.2 29.6 33.2   NA 20.4 32.1 29.2 28.1 25.6 26.1
##  [76] 26.6 25.8 20.3 28.2 28.0 32.3 30.1 30.9 34.6 29.4 32.6 36.4 31.9 26.4 27.8
##  [91] 28.3 36.7 33.6 33.9 34.5 35.6 35.9 36.3 34.0 33.1 33.8 30.3 30.5 29.7 35.3
## [106] 21.8 25.9 38.4 35.5 36.1  0.0 26.7 25.1 27.0 37.8 37.6 35.7 27.6 33.7 34.9
## [121] 39.3 34.7 28.9 25.0 27.3 24.0 24.1 23.5 25.4 35.4 24.7 24.6 37.5 24.2 25.5
## [136] 23.7 24.5 24.8 36.0 36.2 22.8 20.8 23.9 23.1 23.3 16.0 36.6 21.0 19.8 19.2
## [151] 19.7 18.8 23.0 18.1

Now, you can tell that R is interpreting mass as a number (it’s not in quotes), and conveniently, it correctly interprets missing values as “NA”.

You can see that mass, length and width all show up as numerics now:

str(eggs)
## 'data.frame':    4075 obs. of  11 variables:
##  $ Nest      : chr  "6002" "6002" "6002" "6002" ...
##  $ EggNumber : chr  "1" "2" "3" "4" ...
##  $ EggID     : chr  "6002-1" "6002-2" "6002-3" "6002-4" ...
##  $ ChickID   : chr  "6002-1" "6002-2" "6002-3" "6002-4" ...
##  $ Nest_Fate : chr  "Active" "Active" "Active" "Active" ...
##  $ mass      : num  34.4 34.1 35.2 35 35.1 35.1 34.2 34.3 33.5 33.3 ...
##  $ length    : num  51.4 49.4 49.5 48.4 49.2 49.1 49.6 48.9 48.4 49.3 ...
##  $ width     : num  35.2 36.3 36.2 36.2 36 36.1 36 35.8 35.6 34.9 ...
##  $ EggVolume : num  34.6 35.4 35.2 34.5 34.6 ...
##  $ LayDate   : chr  NA NA NA NA ...
##  $ Clutch.Avg: num  34.4 34.4 34.4 34.4 34.4 ...

2. Wrong dates

LayDate is the date that the egg was laid, if known. We only have this data for nests that were checked in subsequent dates (because if there are two new eggs two days later, we don’t know which one was laid on which date).

Let’s look at what values this takes:

sort(unique(eggs$LayDate))
##  [1] ""        "5/10/06" "5/10/07" "5/11/06" "5/11/07" "5/12/06" "5/12/07"
##  [8] "5/13/06" "5/13/07" "5/13/08" "5/14/06" "5/14/07" "5/15/06" "5/15/07"
## [15] "5/15/08" "5/16/06" "5/16/07" "5/16/08" "5/17/06" "5/17/07" "5/17/08"
## [22] "5/18/06" "5/18/07" "5/18/08" "5/18/99" "5/19/06" "5/19/07" "5/19/08"
## [29] "5/19/99" "5/20/06" "5/20/07" "5/20/08" "5/21/06" "5/21/07" "5/21/08"
## [36] "5/22/06" "5/22/07" "5/22/08" "5/23/06" "5/23/07" "5/23/08" "5/24/06"
## [43] "5/24/07" "5/24/08" "5/25/06" "5/25/07" "5/25/08" "5/26/06" "5/26/07"
## [50] "5/26/08" "5/27/06" "5/27/07" "5/27/08" "5/28/06" "5/28/07" "5/29/06"
## [57] "5/29/07" "5/30/06" "5/30/07" "5/30/08" "5/31/06" "5/31/07" "5/4/06" 
## [64] "5/5/06"  "5/5/07"  "5/6/06"  "5/6/07"  "5/7/06"  "5/7/07"  "5/8/06" 
## [71] "5/8/07"  "5/9/06"  "5/9/07"  "6/1/06"  "6/1/07"  "6/10/07" "6/14/07"
## [78] "6/2/06"  "6/2/07"  "6/3/06"  "6/3/07"  "6/4/06"  "6/4/07"  "6/5/06" 
## [85] "6/5/07"  "6/5/08"  "6/6/07"  "6/6/08"  "6/7/07"  "6/7/08"  "6/8/07" 
## [92] "6/9/07"

There are a bunch of eggs that were apparently laid on May 18th and 19th, 1999! That can’t be right…