SampleData_w_errors.csv
in
RIn Group Exercise 1 for Module 4, I asked you to fix the following in the spreadsheet:
In reality, it is better if you can keep the raw data raw and fix these errors in R. This is because fixing data errors in R will leave a reproducible record of how you changed the data. In contrast, if you go and fix these errors in the spreadsheet, you will likely eventually forget how the original data was changed and you’ll never be able to “unfix” it. Also, if you have a backup copy of this data, you will now have two versions of the data, which will be confusing later. Finally, this prevents any scientific malpractice. So, to the extent that I can, I try to NOT change things in my original data!
Here are how I would fix the problems in the example data:
dat=read.csv("data/SampleData_w_errors.csv")
dat
## Indiv_ID sex age size weight X
## 1 20-01 male adult 29.5 66.5 1
## 2 20-02 male juvenile 28 58.5 NA
## 3 20-03 female adult 26 #N/A! NA
## 4 20-04 female adult 25 55.5 NA
## 5 20-05 female juvenile 25 62 NA
## 6 20-06 male juvenile 28 61 NA
## 7 20-07 female adult 26 58 NA
## 8 20-08 male adult 28.5 65 NA
## 9 20-09 male juvenile 27.5 60 NA
## 10 20-10 female juvenile 26 59 NA
## 11 20-11 male adult 25.5 62 NA
## 12 20-12 female adult 27 59 NA
## 13 20-13 female adult 26.5 60 NA
## 14 ** NA
Setting “#N/A!” as N/A value
dat=read.csv("data/SampleData_w_errors.csv", na.strings=c("#N/A!", "**"))
dat
## Indiv_ID sex age size weight X
## 1 20-01 male adult 29.5 66.5 1
## 2 20-02 male juvenile 28.0 58.5 NA
## 3 20-03 female adult 26.0 NA NA
## 4 20-04 female adult 25.0 55.5 NA
## 5 20-05 female juvenile 25.0 62.0 NA
## 6 20-06 male juvenile 28.0 61.0 NA
## 7 20-07 female adult 26.0 58.0 NA
## 8 20-08 male adult 28.5 65.0 NA
## 9 20-09 male juvenile 27.5 60.0 NA
## 10 20-10 female juvenile 26.0 59.0 NA
## 11 20-11 male adult 25.5 62.0 NA
## 12 20-12 female adult 27.0 59.0 NA
## 13 20-13 female adult 26.5 60.0 NA
## 14 NA NA NA
Remove the mystery column “X”
dat=dat[,which(names(dat)!="X")]
dat
## Indiv_ID sex age size weight
## 1 20-01 male adult 29.5 66.5
## 2 20-02 male juvenile 28.0 58.5
## 3 20-03 female adult 26.0 NA
## 4 20-04 female adult 25.0 55.5
## 5 20-05 female juvenile 25.0 62.0
## 6 20-06 male juvenile 28.0 61.0
## 7 20-07 female adult 26.0 58.0
## 8 20-08 male adult 28.5 65.0
## 9 20-09 male juvenile 27.5 60.0
## 10 20-10 female juvenile 26.0 59.0
## 11 20-11 male adult 25.5 62.0
## 12 20-12 female adult 27.0 59.0
## 13 20-13 female adult 26.5 60.0
## 14 NA NA
Remove lines for which we have no Individual ID
dat=dat[which(dat$Indiv_ID!=""),]
dat
## Indiv_ID sex age size weight
## 1 20-01 male adult 29.5 66.5
## 2 20-02 male juvenile 28.0 58.5
## 3 20-03 female adult 26.0 NA
## 4 20-04 female adult 25.0 55.5
## 5 20-05 female juvenile 25.0 62.0
## 6 20-06 male juvenile 28.0 61.0
## 7 20-07 female adult 26.0 58.0
## 8 20-08 male adult 28.5 65.0
## 9 20-09 male juvenile 27.5 60.0
## 10 20-10 female juvenile 26.0 59.0
## 11 20-11 male adult 25.5 62.0
## 12 20-12 female adult 27.0 59.0
## 13 20-13 female adult 26.5 60.0
In Group Exercise 2, I provided you with data I collected on American coot eggs before I became practiced in using R… And there are a lot of thigns that need to be fixed here.
eggs=read.csv("data/EggMeasurements_example.csv")
head(eggs)
str(eggs)
## Nest EggNumber EggID ChickID Nest_Fate mass length width EggVolume LayDate
## 1 6002 1 6002-1 6002-1 Active 34.4 51.4 35.2 34.60005 .
## 2 6002 2 6002-2 6002-2 Active 34.1 49.4 36.3 35.35432 .
## 3 6002 3 6002-3 6002-3 Active 35.2 49.5 36.2 35.23259 .
## 4 6002 4 6002-4 6002-4 Active 35 48.4 36.2 34.45996 .
## 5 6002 5 6002-5 6002-5 Active 35.1 49.2 36 34.64108 5/5/06
## 6 6002 6 6002-6 6002-6 Active 35.1 49.1 36.1 34.76136 5/6/06
## Clutch.Avg
## 1 34.40166
## 2 34.40166
## 3 34.40166
## 4 34.40166
## 5 34.40166
## 6 34.40166
## 'data.frame': 4075 obs. of 11 variables:
## $ Nest : chr "6002" "6002" "6002" "6002" ...
## $ EggNumber : chr "1" "2" "3" "4" ...
## $ EggID : chr "6002-1" "6002-2" "6002-3" "6002-4" ...
## $ ChickID : chr "6002-1" "6002-2" "6002-3" "6002-4" ...
## $ Nest_Fate : chr "Active" "Active" "Active" "Active" ...
## $ mass : chr "34.4" "34.1" "35.2" "35" ...
## $ length : chr "51.4" "49.4" "49.5" "48.4" ...
## $ width : chr "35.2" "36.3" "36.2" "36.2" ...
## $ EggVolume : num 34.6 35.4 35.2 34.5 34.6 ...
## $ LayDate : chr "." "." "." "." ...
## $ Clutch.Avg: num 34.4 34.4 34.4 34.4 34.4 ...
First, you can tell that variables like “mass”, “length” and “width” should be numbers, but they are read as characters.
One way to see what is going on here is to have R show you all of the unique values of the variable. Let’s do this for mass:
unique(eggs$mass)
## [1] "34.4" "34.1" "35.2" "35" "35.1" "34.2" "34.3" "33.5" "33.3" "31.2"
## [11] "21.7" "24.3" "26.3" "26" "24.9" "26.2" "25.3" "25.7" "25.2" "29.3"
## [21] "29.8" "28.6" "30.8" "31.8" "32.9" "32.2" "32.5" "30.4" "29.9" "30.7"
## [31] "33.4" "31.7" "31.1" "28.4" "29" "29.1" "27.1" "26.8" "26.5" "31.5"
## [41] "32.4" "31.3" "32" "30.6" "27.9" "26.9" "28.7" "34.8" "20.1" "18.6"
## [51] "33" "31.4" "27.4" "31" "31.6" "27.7" "27.5" "28.5" "28.8" "32.7"
## [61] "32.8" "30" "30.2" "35.8" "29.5" "27.2" "29.6" "33.2" "" "20.4"
## [71] "32.1" "29.2" "28.1" "25.6" "26.1" "26.6" "25.8" "20.3" "28.2" "28"
## [81] "32.3" "30.1" "30.9" "34.6" "29.4" "32.6" "36.4" "31.9" "26.4" "27.8"
## [91] "28.3" "36.7" "33.6" "33.9" "34.5" "35.6" "35.9" "36.3" "34" "33.1"
## [101] "33.8" "30.3" "30.5" "29.7" "." "35.3" "21.8" "25.9" "38.4" "35.5"
## [111] "36.1" "0" "26.7" "25.1" "27" "37.8" "37.6" "35.7" "27.6" "33.7"
## [121] "34.9" "39.3" "34.7" "28.9" "25" "27.3" "24" "24.1" "23.5" "25.4"
## [131] "35.4" "24.7" "24.6" "37.5" "24.2" "25.5" "23.7" "24.5" "24.8" "36"
## [141] "36.2" "22.8" "20.8" "23.9" "23.1" "23.3" "16" "36.6" "21" "19.8"
## [151] "19.2" "19.7" "18.8" "23" "18.1"
You can see that at least one of the things that is going on is that “.” is a value here. That is not a number–so that could be the cause of the problem. You can see that this is the same problem for length and width too.
unique(eggs$length)
## [1] "51.4" "49.4" "49.5" "48.4" "49.2" "49.1" "49.6" "48.9" "49.3" "49"
## [11] "45.6" "46.8" "47.4" "46.7" "46.6" "47.7" "46.4" "47.5" "46" "46.5"
## [21] "47.2" "48.1" "47" "48.3" "48.2" "48.7" "50.4" "49.7" "48" "45.7"
## [31] "47.8" "46.3" "46.1" "46.2" "49.8" "48.5" "49.9" "45.5" "51.6" "48.8"
## [41] "50.9" "51" "50.5" "50.6" "50.1" "50.8" "50.2" "50" "47.9" "47.3"
## [51] "38.5" "45.3" "45.2" "51.1" "52.5" "51.8" "51.3" "48.6" "52" "45.1"
## [61] "50.3" "47.6" "46.9" "45.4" "51.2" "53.3" "53" "52.4" "52.2" "53.1"
## [71] "52.8" "52.9" "53.2" "52.3" "." "51.5" "51.9" "44.3" "50.7" "52.1"
## [81] "52.7" "53.8" "53.4" "54.9" "47.1" "44.9" "54.4" "" "54.1" "53.7"
## [91] "45" "45.9" "45.8" "51.7" "37.8" "44.6" "44.8" "52.6" "58.7" "43.9"
## [101] "44.1" "44.5" "41.8" "43.7" "44.2" "53.5" "44.4" "26.6" "54" "39"
## [111] "53.6" "55" "39.6" "36.9" "54.5" "40" "54.8" "43.8" "53.9" "54.2"
## [121] "37.6" "54.6"
unique(eggs$width)
## [1] "35.2" "36.3" "36.2" "36" "36.1" "35.8" "35.6" "34.9" "34.2" "29.2"
## [11] "30.7" "32.1" "31.8" "31.5" "31.9" "31.7" "31.4" "30.6" "33.5" "33.7"
## [21] "33.3" "35.5" "34.7" "34.6" "33.8" "34.1" "34.4" "33.9" "32.9" "33"
## [31] "32.8" "32.2" "34.3" "33.4" "34" "33.1" "35.4" "33.2" "33.6" "35"
## [41] "35.3" "32.6" "32.5" "32.7" "35.1" "34.5" "32.4" "32" "32.3" "34.8"
## [51] "35.9" "." "31.6" "30" "35.7" "36.5" "30.2" "36.6" "" "38.4"
## [61] "36.4" "39.2" "31" "31.1" "343" "30.4" "28.8" "31.2" "31.3" "37.3"
## [71] "30.9" "29.3" "39" "27.8" "37" "22.9" "25.2" "42.6" "42.9" "30.8"
## [81] "30.5" "37.6" "37.9" "36.7" "38.6"
Fix this by telling R that “.” is an na.string
eggs=read.csv("data/EggMeasurements_example.csv", na.strings=".")
unique(eggs$mass)
## [1] 34.4 34.1 35.2 35.0 35.1 34.2 34.3 33.5 33.3 31.2 21.7 24.3 26.3 26.0 24.9
## [16] 26.2 25.3 25.7 25.2 29.3 29.8 28.6 30.8 31.8 32.9 32.2 32.5 30.4 29.9 30.7
## [31] 33.4 31.7 31.1 28.4 29.0 29.1 27.1 26.8 26.5 31.5 32.4 31.3 32.0 30.6 27.9
## [46] 26.9 28.7 34.8 20.1 18.6 33.0 31.4 27.4 31.0 31.6 27.7 27.5 28.5 28.8 32.7
## [61] 32.8 30.0 30.2 35.8 29.5 27.2 29.6 33.2 NA 20.4 32.1 29.2 28.1 25.6 26.1
## [76] 26.6 25.8 20.3 28.2 28.0 32.3 30.1 30.9 34.6 29.4 32.6 36.4 31.9 26.4 27.8
## [91] 28.3 36.7 33.6 33.9 34.5 35.6 35.9 36.3 34.0 33.1 33.8 30.3 30.5 29.7 35.3
## [106] 21.8 25.9 38.4 35.5 36.1 0.0 26.7 25.1 27.0 37.8 37.6 35.7 27.6 33.7 34.9
## [121] 39.3 34.7 28.9 25.0 27.3 24.0 24.1 23.5 25.4 35.4 24.7 24.6 37.5 24.2 25.5
## [136] 23.7 24.5 24.8 36.0 36.2 22.8 20.8 23.9 23.1 23.3 16.0 36.6 21.0 19.8 19.2
## [151] 19.7 18.8 23.0 18.1
Now, you can tell that R is interpreting mass as a number (it’s not in quotes), and conveniently, it correctly interprets missing values as “NA”.
You can see that mass, length and width all show up as numerics now:
str(eggs)
## 'data.frame': 4075 obs. of 11 variables:
## $ Nest : chr "6002" "6002" "6002" "6002" ...
## $ EggNumber : chr "1" "2" "3" "4" ...
## $ EggID : chr "6002-1" "6002-2" "6002-3" "6002-4" ...
## $ ChickID : chr "6002-1" "6002-2" "6002-3" "6002-4" ...
## $ Nest_Fate : chr "Active" "Active" "Active" "Active" ...
## $ mass : num 34.4 34.1 35.2 35 35.1 35.1 34.2 34.3 33.5 33.3 ...
## $ length : num 51.4 49.4 49.5 48.4 49.2 49.1 49.6 48.9 48.4 49.3 ...
## $ width : num 35.2 36.3 36.2 36.2 36 36.1 36 35.8 35.6 34.9 ...
## $ EggVolume : num 34.6 35.4 35.2 34.5 34.6 ...
## $ LayDate : chr NA NA NA NA ...
## $ Clutch.Avg: num 34.4 34.4 34.4 34.4 34.4 ...
LayDate is the date that the egg was laid, if known. We only have this data for nests that were checked in subsequent dates (because if there are two new eggs two days later, we don’t know which one was laid on which date).
Let’s look at what values this takes:
sort(unique(eggs$LayDate))
## [1] "" "5/10/06" "5/10/07" "5/11/06" "5/11/07" "5/12/06" "5/12/07"
## [8] "5/13/06" "5/13/07" "5/13/08" "5/14/06" "5/14/07" "5/15/06" "5/15/07"
## [15] "5/15/08" "5/16/06" "5/16/07" "5/16/08" "5/17/06" "5/17/07" "5/17/08"
## [22] "5/18/06" "5/18/07" "5/18/08" "5/18/99" "5/19/06" "5/19/07" "5/19/08"
## [29] "5/19/99" "5/20/06" "5/20/07" "5/20/08" "5/21/06" "5/21/07" "5/21/08"
## [36] "5/22/06" "5/22/07" "5/22/08" "5/23/06" "5/23/07" "5/23/08" "5/24/06"
## [43] "5/24/07" "5/24/08" "5/25/06" "5/25/07" "5/25/08" "5/26/06" "5/26/07"
## [50] "5/26/08" "5/27/06" "5/27/07" "5/27/08" "5/28/06" "5/28/07" "5/29/06"
## [57] "5/29/07" "5/30/06" "5/30/07" "5/30/08" "5/31/06" "5/31/07" "5/4/06"
## [64] "5/5/06" "5/5/07" "5/6/06" "5/6/07" "5/7/06" "5/7/07" "5/8/06"
## [71] "5/8/07" "5/9/06" "5/9/07" "6/1/06" "6/1/07" "6/10/07" "6/14/07"
## [78] "6/2/06" "6/2/07" "6/3/06" "6/3/07" "6/4/06" "6/4/07" "6/5/06"
## [85] "6/5/07" "6/5/08" "6/6/07" "6/6/08" "6/7/07" "6/7/08" "6/8/07"
## [92] "6/9/07"
There are a bunch of eggs that were apparently laid on May 18th and 19th, 1999! That can’t be right…