<ol start="2" style="list-style-type: lower-alpha"> <li>Data cleaning</li> </ol> • mammalsmilk

Introduction

In this tutorial we’ll

load data from Skibiel et al 2013
fix some typos that were introduced during data preparation
clean up some columns of numeric data that contain character data
save the data as a .csv for future use

Important functions used

read.csv
is.na
grep
gsub
as.numeric
dplyr::select
dplyr::rename
write.csv

Original Data

Skibiel et al 2013. The evolution of the nutrient composition of mammalian milks. Journal of Animal Ecology 82: 1254-1264. https://doi.org/10.1111/1365-2656.12095

Preliminaries

Loading the mammalsmilk package

If you haven’t already, download the mammalsmilk package from GitHub (note the “S” between mammal and milk). This is commented out in the code below - only run the code if you have never downloaded it before, or haven’t recently (in case its been updated)

# install_github("brouwern/mammalsmilk")

You can then load the package with library()

library(mammalsmilk)

We’ll start with the data in a raw form. This is most easily accessed using

data("milk_raw")

The originally .csv file is called “Skibiel_mammalsmilk_raw.csv”. See the preceding vignette on data access for how to find these files and tips on loading .csv files if you need to practice this.

Packages

We’ll use the following packages. You’ll need to download them if you haven’t already. I’d try just loading them with library() first (in the next section), then installing if needed.

# install.packages("dplyr")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data preparation

The raw data is from a word file of appendices from the original paper.

Table S1. Milk composition of mammals at mid-lactation
Table S2. Ecology and life histories of mammals whose milk composition has been described

I pasted the data by hand into Excel and saved it as a .csv file. Once in .csv form it can be loaded into R and further cleaned up. (In the future I will re-do this all in R).

Data cleaning

The data are a bit rough because they were formatted to be a table in a paper. There’s lots of text within cells added as annotations, and probably some errors due to switching from a table in a word doc to .csv.

Mannually fix incorrect NAs

There are 2 NA values in my “fat” column that shouldn’t be there; they should be valid numeric entries. I use the following code to

Identify the location of the NAs with is.na()
Identify the specific row of data by using which() to ID the specific species with an NA for fat
Overwrite the NA with the correct data using row indexing
- ie “dat[i.H.niger,“fat”] <- 5.0”

Note that is.na() returns TRUE/FALSE values; “TRUE” acts like specific index when you use it for row indexing. So, even though “i.NA” is a vector of TRUES and FALSE, milk_raw$spp[i.NA]" returns just the two species that match the “TRUES”

(There are probably tidyverse ways of doing this but I haven’t looked into it.)

# Look at fat column
summary(milk_raw$fat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.200   4.575   8.550  14.068  17.575  61.100       2

# Find NAS
i.NA <- is.na(milk_raw$fat)

#Identify the species w/ NAs
milk_raw$spp[i.NA]

## [1] Hippotragus niger         Cervus elaphus hispanicus
## 130 Levels: Acomys cahirinus Alces alces ... Zalophus californianus

# Extract the indexes of the appropriate species
i.H.niger   <- which(milk_raw$spp == "Hippotragus niger")
i.C.elaphus <- which(milk_raw$spp == "Cervus elaphus hispanicus")

#Overwrite the bad values with correct values
milk_raw[i.H.niger,"fat"] <- 5.0
milk_raw[i.C.elaphus,"fat"] <- 12.6

Clean texts from numeric variables

There are some text annotations from the raw data still hanging out in the dataframe. I will use “regular expressions” to clean these up. Regular expression take practice to use but there are many resources online for learning about them (Cleaning up asterisks (*), as below, is even a little trickier).

What follows is just a brief snapshot into this feature of R. R is known, however, for not having stellar basic regular expression features. Some tools in the tidyverse are available to make things less painful, but I haven’t switched to them yet.

Locate unwanted characters with grep()

There are letters, commas, and asterisks that were annotations in the original datafile. Easy to remove with find-replace, but also very easy with gsub(), with not chance of changing anything else in your spreadsheet you don’t want

Look at all of the protein column; those asterisks are a problem. There’s also an “s” in one cell.

summary(milk_raw$protein)

##  1.1*  1.3*   1.4  1.4*   1.5  1.5*   1.6  1.6*  1.7*   1.9    10 10.2, 
##     1     2     1     1     3     1     2     4     1     1     2     1 
##  10.3  10.5  10.7    11  11.1 11.2*  11.8  11.9  12.1  12.2  12.3  12.4 
##     3     2     3     2     1     1     1     1     1     1     1     1 
##  12.5  14.6  15.8   2.1   2.2   2.5   2.6   2.7     3   3.1   3.4   3.6 
##     2     1     1     2     1     1     1     2     1     1     1     2 
##  3.9*     4   4.1  4.2*   4.3  4.3*   4.4   4.5   4.6   4.8  4.8*   4.9 
##     1     2     1     1     1     1     2     2     1     1     1     1 
##   5.1  5.2*   5.4   5.5   5.6  5.6*   5.7   5.9     6   6.2   6.3  6.3* 
##     1     1     1     1     2     1     2     1     1     1     3     1 
##   6.4   6.5   6.6   6.7   6.8   6.9     7  7.1*   7.2   7.3   7.4   7.7 
##     1     2     1     1     1     1     3     1     2     2     1     2 
##   7.8  7.8S     8   8.1  8.1*   8.2   8.4  8.4*   8.5  8.6*  8.7*   8.8 
##     1     1     2     1     1     2     1     1     1     1     1     1 
##   8.9     9   9.1   9.2  9.2*   9.4   9.5   9.7  9.9* 
##     1     2     1     1     1     2     3     1     1

Use grep() to find the index of things
“[a-zA-Z]” = “find all letter, lower and uppercase”

i.letters <- grep("[a-zA-Z]",milk_raw$protein)

milk_raw$protein[i.letters]

## [1] 7.8S
## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*

There’s the “S”.

#these are equivalent for the simple case of a single letter
i.S <- grep("S",milk_raw$protein)  #no brackets
i.S <- grep("[S]",milk_raw$protein)#brackets

milk_raw$protein[i.S]

## [1] 7.8S
## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*

Replace unwanted characters with gsub()

milk_raw$protein <- gsub("[a-zA-Z]", #pattern
                    "",         #replace
                    milk_raw$protein)#data

Check

grep("[a-zA-Z]",milk_raw$protein)

## integer(0)

Use the gsub() command to replace “special” characters

Asterisks: “*"

Asterisks are a “special character” for regular expressions
These can be tricky
Have to put \ in front of it

milk_raw$protein <- gsub("\\*","",milk_raw$protein)

Check

grep("\\*",milk_raw$protein)

## integer(0)

COmmas “,”

Commas are also species characters

milk_raw$protein <- gsub("\\,","",milk_raw$protein)

Check

grep("\\,",milk_raw$protein)

## integer(0)

Check for spaces

These are frequent typos in data

grep(" ", milk_raw$protein)

## integer(0)

but luckily none!

Convert character data to numeric data

Because there were non-numeric characters (S, a command, *) in the column, are loaded “protein” as character data; basically it treated it as words/symbols but not numbers.
We now need to convert it back to numbers
This is done with the command as.numeric()
Converting back and forth can sometimes cause problems, so its good to put the converted info in a new column to check against the old

milk_raw$protein.num <- as.numeric(milk_raw$protein)

Compare old and new columns

head(milk_raw[,c("protein.num","protein")])

##   protein.num protein
## 1         6.3     6.3
## 2         5.7     5.7
## 3         4.1     4.1
## 4         4.3     4.3
## 5         5.6     5.6
## 6         8.8     8.8

tail(milk_raw[,c("protein.num","protein")])

##     protein.num protein
## 125         6.4     6.4
## 126         8.1     8.1
## 127         4.4     4.4
## 128         8.1     8.1
## 129        10.7    10.7
## 130         9.4     9.4

Clean other columns

Check other columns

summary(milk_raw[,c("sugar","energy")])

##      sugar        energy   
##  3      : 5   1.44   :  4  
##  4.5    : 5   0.81   :  3  
##  5.3    : 5   1.13   :  3  
##  5.2    : 4   0.49   :  2  
##  6.6    : 4   0.5    :  2  
##  (Other):92   (Other):101  
##  NA's   :15   NA's   : 15

Sugar

The sugar column has “<” and letters.
I can put anything I want removed inside the brackets
- “[a-zA-Z<]”" will get rid of the letters and “<”

summary(milk_raw$sugar)

## <0.025   0.02   0.05    0.1    0.3    0.8      1    1.2    1.3    1.5 
##      1      2      1      1      1      1      1      1      1      1 
##    1.7    1.8   10.9     11   12.5     14    2.2    2.3    2.5    2.6 
##      2      1      1      2      2      1      1      3      2      2 
##    2.7    2.8      3    3.1    3.2    3.3    3.4    3.6    3.7    3.8 
##      3      1      5      1      1      1      3      2      1      2 
##      4    4.1    4.2    4.3    4.4    4.5    4.6    4.8    4.9    5.2 
##      2      1      1      1      3      5      2      1      2      4 
##    5.3   5.3,    5.6    5.7   5.8S    5.9    6.1    6.2    6.4    6.6 
##      5      1      1      3      1      2      1      3      3      4 
##    6.7    6.8    6.9      7    7.2    7.4    7.5    7.7    7.8    7.9 
##      2      2      2      2      1      1      1      1      3      1 
##    8.1    8.4    8.5    8.9   NA's 
##      1      1      2      1     15

Remove letters and “<”

milk_raw$sugar <- gsub("[a-zA-Z<]","",milk_raw$sugar)

Convert and check

milk_raw$sugar.num <- as.numeric(milk_raw$sugar)

## Warning: NAs introduced by coercion

head(milk_raw[,c("sugar.num","sugar")])

##   sugar.num sugar
## 1       5.2   5.2
## 2        NA  <NA>
## 3       5.3   5.3
## 4       4.1   4.1
## 5       4.9   4.9
## 6        NA  <NA>

Energy

#The followign are equivalent
summary(milk_raw[,c("energy")])

##  0.36  0.45  0.46  0.47  0.49   0.5  0.52  0.54  0.57  0.65  0.68  0.71 
##     1     1     1     1     2     2     2     1     2     1     1     1 
##  0.72  0.78  0.81  0.82  0.84   0.9  0.91  0.92  0.93  0.96  0.98  1.03 
##     1     1     3     1     1     1     2     1     1     1     1     2 
##  1.06  1.08  1.09   1.1  1.11  1.13  1.16  1.17  1.19   1.2  1.21  1.23 
##     2     1     1     1     1     3     2     2     1     1     2     1 
##  1.25  1.26  1.27  1.28  1.31  1.36  1.37  1.41  1.42  1.43  1.44  1.48 
##     1     1     1     1     1     2     2     1     1     1     4     1 
##   1.5  1.52   1.6  1.61  1.62  1.64  1.65  1.68  1.71  1.79   1.8  1.83 
##     1     2     1     1     1     1     1     1     1     2     1     1 
##  1.88  1.95  1.96     2  2.06  2.07   2.1  2.18   2.2  2.34  2.45  2.57 
##     1     1     1     1     1     1     2     1     1     1     1     1 
##  2.59  2.61  2.62  2.65  2.76  2.82  2.85  2.94  3.29   3.4  3.49  3.64 
##     1     1     1     1     1     1     1     1     1     2     1     1 
##  3.73  4.47  4.71  5.23  5.36  5.41 5.69*  5.89  NA's 
##     1     1     1     1     1     1     1     1    15

summary(milk_raw$energy)

##  0.36  0.45  0.46  0.47  0.49   0.5  0.52  0.54  0.57  0.65  0.68  0.71 
##     1     1     1     1     2     2     2     1     2     1     1     1 
##  0.72  0.78  0.81  0.82  0.84   0.9  0.91  0.92  0.93  0.96  0.98  1.03 
##     1     1     3     1     1     1     2     1     1     1     1     2 
##  1.06  1.08  1.09   1.1  1.11  1.13  1.16  1.17  1.19   1.2  1.21  1.23 
##     2     1     1     1     1     3     2     2     1     1     2     1 
##  1.25  1.26  1.27  1.28  1.31  1.36  1.37  1.41  1.42  1.43  1.44  1.48 
##     1     1     1     1     1     2     2     1     1     1     4     1 
##   1.5  1.52   1.6  1.61  1.62  1.64  1.65  1.68  1.71  1.79   1.8  1.83 
##     1     2     1     1     1     1     1     1     1     2     1     1 
##  1.88  1.95  1.96     2  2.06  2.07   2.1  2.18   2.2  2.34  2.45  2.57 
##     1     1     1     1     1     1     2     1     1     1     1     1 
##  2.59  2.61  2.62  2.65  2.76  2.82  2.85  2.94  3.29   3.4  3.49  3.64 
##     1     1     1     1     1     1     1     1     1     2     1     1 
##  3.73  4.47  4.71  5.23  5.36  5.41 5.69*  5.89  NA's 
##     1     1     1     1     1     1     1     1    15

#clean
milk_raw$energy <- gsub("[a-zA-Z<]","",milk_raw$energy)

#convert and check
milk_raw$energy.num <- as.numeric(milk_raw$energy)

## Warning: NAs introduced by coercion

head(milk_raw[,c("energy.num","energy")])

##   energy.num energy
## 1       1.21   1.21
## 2         NA   <NA>
## 3       1.13   1.13
## 4       0.91   0.91
## 5       1.31   1.31
## 6         NA   <NA>

Remove unwanted columns with dplyr

Once we’ve checked that our data have converted properly from character to numeric, we can remove the old columns of character data using select() and negative indexing.

milk_raw2 <- milk_raw %>% dplyr::select(-dry.matter,
                                   -protein,
                                   -sugar,
                                   -energy,
                                   -ref,
                                   -lactation.stage.orig)

Note that during initial cleaning I only remove variable that are completely redundant; otherwise I leave everything in.

Change elements common accross column

Remove “.NUM” and “.num” using gsub()

names(milk_raw2) <- gsub(".NUM","",names(milk_raw2))
names(milk_raw2) <- gsub(".num","",names(milk_raw2))

Rename columns

To polish a dataset for analysis, I like to shorten the variable names

milk_raw2 <- milk_raw2 %>% rename(fam = family,
                                  mass.fem = mass.female,
                            gest.mo = gestation.month,
                            lac.mo = lacatation.months,
                            prot = protein,
                            dev.birth = dev.stage.at.birth,
                            ord = order)

Let’s call this cleaned up data just “milk”

milk <- "milk_raw2"

Save the cleaned data

We can save our cleaned data using read.csv()

write.csv(milk, "data/skibiel_mammalsmilk.csv",
          row.names = F)

The data produced by the above workflow are actually internal to the package. However, I recommend saving the .csv file to wherever you keep your R work and to practice reloading it.

You can load it directly, however, using data(“milk”)

Data cleaning

Nathan Brouwer

2018-11-27

Introduction

Important functions used

Original Data

Preliminaries

Loading the mammalsmilk package

Packages

Data preparation

Data cleaning

Mannually fix incorrect NAs

Clean texts from numeric variables

Locate unwanted characters with grep()

Replace unwanted characters with gsub()

Use the gsub() command to replace “special” characters

Asterisks: “*"

COmmas “,”

Check for spaces

Convert character data to numeric data

Clean other columns

Sugar

Energy

Remove unwanted columns with dplyr

Change elements common accross column

Rename columns

Save the cleaned data

Contents