b2-data_cleaning.Rmd
In this tutorial we’ll
Skibiel et al 2013. The evolution of the nutrient composition of mammalian milks. Journal of Animal Ecology 82: 1254-1264. https://doi.org/10.1111/1365-2656.12095
If you haven’t already, download the mammalsmilk package from GitHub (note the “S” between mammal and milk). This is commented out in the code below - only run the code if you have never downloaded it before, or haven’t recently (in case its been updated)
# install_github("brouwern/mammalsmilk")
You can then load the package with library()
library(mammalsmilk)
We’ll start with the data in a raw form. This is most easily accessed using
data("milk_raw")
The originally .csv file is called “Skibiel_mammalsmilk_raw.csv”. See the preceding vignette on data access for how to find these files and tips on loading .csv files if you need to practice this.
We’ll use the following packages. You’ll need to download them if you haven’t already. I’d try just loading them with library() first (in the next section), then installing if needed.
# install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The raw data is from a word file of appendices from the original paper.
I pasted the data by hand into Excel and saved it as a .csv file. Once in .csv form it can be loaded into R and further cleaned up. (In the future I will re-do this all in R).
The data are a bit rough because they were formatted to be a table in a paper. There’s lots of text within cells added as annotations, and probably some errors due to switching from a table in a word doc to .csv.
There are 2 NA values in my “fat” column that shouldn’t be there; they should be valid numeric entries. I use the following code to
Note that is.na() returns TRUE/FALSE values; “TRUE” acts like specific index when you use it for row indexing. So, even though “i.NA” is a vector of TRUES and FALSE, milk_raw$spp[i.NA]" returns just the two species that match the “TRUES”
(There are probably tidyverse ways of doing this but I haven’t looked into it.)
# Look at fat column
summary(milk_raw$fat)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.200 4.575 8.550 14.068 17.575 61.100 2
# Find NAS
i.NA <- is.na(milk_raw$fat)
#Identify the species w/ NAs
milk_raw$spp[i.NA]
## [1] Hippotragus niger Cervus elaphus hispanicus
## 130 Levels: Acomys cahirinus Alces alces ... Zalophus californianus
# Extract the indexes of the appropriate species
i.H.niger <- which(milk_raw$spp == "Hippotragus niger")
i.C.elaphus <- which(milk_raw$spp == "Cervus elaphus hispanicus")
#Overwrite the bad values with correct values
milk_raw[i.H.niger,"fat"] <- 5.0
milk_raw[i.C.elaphus,"fat"] <- 12.6
There are some text annotations from the raw data still hanging out in the dataframe. I will use “regular expressions” to clean these up. Regular expression take practice to use but there are many resources online for learning about them (Cleaning up asterisks (*), as below, is even a little trickier).
What follows is just a brief snapshot into this feature of R. R is known, however, for not having stellar basic regular expression features. Some tools in the tidyverse are available to make things less painful, but I haven’t switched to them yet.
There are letters, commas, and asterisks that were annotations in the original datafile. Easy to remove with find-replace, but also very easy with gsub(), with not chance of changing anything else in your spreadsheet you don’t want
Look at all of the protein column; those asterisks are a problem. There’s also an “s” in one cell.
summary(milk_raw$protein)
## 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2,
## 1 2 1 1 3 1 2 4 1 1 2 1
## 10.3 10.5 10.7 11 11.1 11.2* 11.8 11.9 12.1 12.2 12.3 12.4
## 3 2 3 2 1 1 1 1 1 1 1 1
## 12.5 14.6 15.8 2.1 2.2 2.5 2.6 2.7 3 3.1 3.4 3.6
## 2 1 1 2 1 1 1 2 1 1 1 2
## 3.9* 4 4.1 4.2* 4.3 4.3* 4.4 4.5 4.6 4.8 4.8* 4.9
## 1 2 1 1 1 1 2 2 1 1 1 1
## 5.1 5.2* 5.4 5.5 5.6 5.6* 5.7 5.9 6 6.2 6.3 6.3*
## 1 1 1 1 2 1 2 1 1 1 3 1
## 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1* 7.2 7.3 7.4 7.7
## 1 2 1 1 1 1 3 1 2 2 1 2
## 7.8 7.8S 8 8.1 8.1* 8.2 8.4 8.4* 8.5 8.6* 8.7* 8.8
## 1 1 2 1 1 2 1 1 1 1 1 1
## 8.9 9 9.1 9.2 9.2* 9.4 9.5 9.7 9.9*
## 1 2 1 1 1 2 3 1 1
i.letters <- grep("[a-zA-Z]",milk_raw$protein)
milk_raw$protein[i.letters]
## [1] 7.8S
## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*
There’s the “S”.
#these are equivalent for the simple case of a single letter
i.S <- grep("S",milk_raw$protein) #no brackets
i.S <- grep("[S]",milk_raw$protein)#brackets
milk_raw$protein[i.S]
## [1] 7.8S
## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*
milk_raw$protein <- gsub("[a-zA-Z]", #pattern
"", #replace
milk_raw$protein)#data
Check
grep("[a-zA-Z]",milk_raw$protein)
## integer(0)
These are frequent typos in data
grep(" ", milk_raw$protein)
## integer(0)
but luckily none!
milk_raw$protein.num <- as.numeric(milk_raw$protein)
Compare old and new columns
head(milk_raw[,c("protein.num","protein")])
## protein.num protein
## 1 6.3 6.3
## 2 5.7 5.7
## 3 4.1 4.1
## 4 4.3 4.3
## 5 5.6 5.6
## 6 8.8 8.8
tail(milk_raw[,c("protein.num","protein")])
## protein.num protein
## 125 6.4 6.4
## 126 8.1 8.1
## 127 4.4 4.4
## 128 8.1 8.1
## 129 10.7 10.7
## 130 9.4 9.4
Check other columns
summary(milk_raw[,c("sugar","energy")])
## sugar energy
## 3 : 5 1.44 : 4
## 4.5 : 5 0.81 : 3
## 5.3 : 5 1.13 : 3
## 5.2 : 4 0.49 : 2
## 6.6 : 4 0.5 : 2
## (Other):92 (Other):101
## NA's :15 NA's : 15
summary(milk_raw$sugar)
## <0.025 0.02 0.05 0.1 0.3 0.8 1 1.2 1.3 1.5
## 1 2 1 1 1 1 1 1 1 1
## 1.7 1.8 10.9 11 12.5 14 2.2 2.3 2.5 2.6
## 2 1 1 2 2 1 1 3 2 2
## 2.7 2.8 3 3.1 3.2 3.3 3.4 3.6 3.7 3.8
## 3 1 5 1 1 1 3 2 1 2
## 4 4.1 4.2 4.3 4.4 4.5 4.6 4.8 4.9 5.2
## 2 1 1 1 3 5 2 1 2 4
## 5.3 5.3, 5.6 5.7 5.8S 5.9 6.1 6.2 6.4 6.6
## 5 1 1 3 1 2 1 3 3 4
## 6.7 6.8 6.9 7 7.2 7.4 7.5 7.7 7.8 7.9
## 2 2 2 2 1 1 1 1 3 1
## 8.1 8.4 8.5 8.9 NA's
## 1 1 2 1 15
Remove letters and “<”
milk_raw$sugar <- gsub("[a-zA-Z<]","",milk_raw$sugar)
Convert and check
milk_raw$sugar.num <- as.numeric(milk_raw$sugar)
## Warning: NAs introduced by coercion
head(milk_raw[,c("sugar.num","sugar")])
## sugar.num sugar
## 1 5.2 5.2
## 2 NA <NA>
## 3 5.3 5.3
## 4 4.1 4.1
## 5 4.9 4.9
## 6 NA <NA>
#The followign are equivalent
summary(milk_raw[,c("energy")])
## 0.36 0.45 0.46 0.47 0.49 0.5 0.52 0.54 0.57 0.65 0.68 0.71
## 1 1 1 1 2 2 2 1 2 1 1 1
## 0.72 0.78 0.81 0.82 0.84 0.9 0.91 0.92 0.93 0.96 0.98 1.03
## 1 1 3 1 1 1 2 1 1 1 1 2
## 1.06 1.08 1.09 1.1 1.11 1.13 1.16 1.17 1.19 1.2 1.21 1.23
## 2 1 1 1 1 3 2 2 1 1 2 1
## 1.25 1.26 1.27 1.28 1.31 1.36 1.37 1.41 1.42 1.43 1.44 1.48
## 1 1 1 1 1 2 2 1 1 1 4 1
## 1.5 1.52 1.6 1.61 1.62 1.64 1.65 1.68 1.71 1.79 1.8 1.83
## 1 2 1 1 1 1 1 1 1 2 1 1
## 1.88 1.95 1.96 2 2.06 2.07 2.1 2.18 2.2 2.34 2.45 2.57
## 1 1 1 1 1 1 2 1 1 1 1 1
## 2.59 2.61 2.62 2.65 2.76 2.82 2.85 2.94 3.29 3.4 3.49 3.64
## 1 1 1 1 1 1 1 1 1 2 1 1
## 3.73 4.47 4.71 5.23 5.36 5.41 5.69* 5.89 NA's
## 1 1 1 1 1 1 1 1 15
summary(milk_raw$energy)
## 0.36 0.45 0.46 0.47 0.49 0.5 0.52 0.54 0.57 0.65 0.68 0.71
## 1 1 1 1 2 2 2 1 2 1 1 1
## 0.72 0.78 0.81 0.82 0.84 0.9 0.91 0.92 0.93 0.96 0.98 1.03
## 1 1 3 1 1 1 2 1 1 1 1 2
## 1.06 1.08 1.09 1.1 1.11 1.13 1.16 1.17 1.19 1.2 1.21 1.23
## 2 1 1 1 1 3 2 2 1 1 2 1
## 1.25 1.26 1.27 1.28 1.31 1.36 1.37 1.41 1.42 1.43 1.44 1.48
## 1 1 1 1 1 2 2 1 1 1 4 1
## 1.5 1.52 1.6 1.61 1.62 1.64 1.65 1.68 1.71 1.79 1.8 1.83
## 1 2 1 1 1 1 1 1 1 2 1 1
## 1.88 1.95 1.96 2 2.06 2.07 2.1 2.18 2.2 2.34 2.45 2.57
## 1 1 1 1 1 1 2 1 1 1 1 1
## 2.59 2.61 2.62 2.65 2.76 2.82 2.85 2.94 3.29 3.4 3.49 3.64
## 1 1 1 1 1 1 1 1 1 2 1 1
## 3.73 4.47 4.71 5.23 5.36 5.41 5.69* 5.89 NA's
## 1 1 1 1 1 1 1 1 15
#clean
milk_raw$energy <- gsub("[a-zA-Z<]","",milk_raw$energy)
#convert and check
milk_raw$energy.num <- as.numeric(milk_raw$energy)
## Warning: NAs introduced by coercion
head(milk_raw[,c("energy.num","energy")])
## energy.num energy
## 1 1.21 1.21
## 2 NA <NA>
## 3 1.13 1.13
## 4 0.91 0.91
## 5 1.31 1.31
## 6 NA <NA>
Once we’ve checked that our data have converted properly from character to numeric, we can remove the old columns of character data using select() and negative indexing.
milk_raw2 <- milk_raw %>% dplyr::select(-dry.matter,
-protein,
-sugar,
-energy,
-ref,
-lactation.stage.orig)
Note that during initial cleaning I only remove variable that are completely redundant; otherwise I leave everything in.
Remove “.NUM” and “.num” using gsub()
names(milk_raw2) <- gsub(".NUM","",names(milk_raw2))
names(milk_raw2) <- gsub(".num","",names(milk_raw2))
To polish a dataset for analysis, I like to shorten the variable names
milk_raw2 <- milk_raw2 %>% rename(fam = family,
mass.fem = mass.female,
gest.mo = gestation.month,
lac.mo = lacatation.months,
prot = protein,
dev.birth = dev.stage.at.birth,
ord = order)
Let’s call this cleaned up data just “milk”
milk <- "milk_raw2"
We can save our cleaned data using read.csv()
write.csv(milk, "data/skibiel_mammalsmilk.csv",
row.names = F)
The data produced by the above workflow are actually internal to the package. However, I recommend saving the .csv file to wherever you keep your R work and to practice reloading it.
You can load it directly, however, using data(“milk”)