• Data cleaning
  • " />

    Introduction

    In this tutorial we’ll

    • load data from Skibiel et al 2013
    • fix some typos that were introduced during data preparation
    • clean up some columns of numeric data that contain character data
    • save the data as a .csv for future use

    Important functions used

    • read.csv
    • is.na
    • grep
    • gsub
    • as.numeric
    • dplyr::select
    • dplyr::rename
    • write.csv

    Original Data

    Skibiel et al 2013. The evolution of the nutrient composition of mammalian milks. Journal of Animal Ecology 82: 1254-1264. https://doi.org/10.1111/1365-2656.12095

    Preliminaries

    Loading the mammalsmilk package

    If you haven’t already, download the mammalsmilk package from GitHub (note the “S” between mammal and milk). This is commented out in the code below - only run the code if you have never downloaded it before, or haven’t recently (in case its been updated)

    # install_github("brouwern/mammalsmilk")

    You can then load the package with library()

    library(mammalsmilk)

    We’ll start with the data in a raw form. This is most easily accessed using

    data("milk_raw")

    The originally .csv file is called “Skibiel_mammalsmilk_raw.csv”. See the preceding vignette on data access for how to find these files and tips on loading .csv files if you need to practice this.

    Packages

    We’ll use the following packages. You’ll need to download them if you haven’t already. I’d try just loading them with library() first (in the next section), then installing if needed.

    # install.packages("dplyr")
    library(dplyr)
    ## 
    ## Attaching package: 'dplyr'
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union

    Data preparation

    The raw data is from a word file of appendices from the original paper.

    • Table S1. Milk composition of mammals at mid-lactation
    • Table S2. Ecology and life histories of mammals whose milk composition has been described

    I pasted the data by hand into Excel and saved it as a .csv file. Once in .csv form it can be loaded into R and further cleaned up. (In the future I will re-do this all in R).

    Data cleaning

    The data are a bit rough because they were formatted to be a table in a paper. There’s lots of text within cells added as annotations, and probably some errors due to switching from a table in a word doc to .csv.

    Mannually fix incorrect NAs

    There are 2 NA values in my “fat” column that shouldn’t be there; they should be valid numeric entries. I use the following code to

    • Identify the location of the NAs with is.na()
    • Identify the specific row of data by using which() to ID the specific species with an NA for fat
    • Overwrite the NA with the correct data using row indexing
      • ie “dat[i.H.niger,“fat”] <- 5.0”

    Note that is.na() returns TRUE/FALSE values; “TRUE” acts like specific index when you use it for row indexing. So, even though “i.NA” is a vector of TRUES and FALSE, milk_raw$spp[i.NA]" returns just the two species that match the “TRUES”

    (There are probably tidyverse ways of doing this but I haven’t looked into it.)

    # Look at fat column
    summary(milk_raw$fat)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    ##   0.200   4.575   8.550  14.068  17.575  61.100       2
    # Find NAS
    i.NA <- is.na(milk_raw$fat)
    
    #Identify the species w/ NAs
    milk_raw$spp[i.NA]
    ## [1] Hippotragus niger         Cervus elaphus hispanicus
    ## 130 Levels: Acomys cahirinus Alces alces ... Zalophus californianus
    # Extract the indexes of the appropriate species
    i.H.niger   <- which(milk_raw$spp == "Hippotragus niger")
    i.C.elaphus <- which(milk_raw$spp == "Cervus elaphus hispanicus")
    
    #Overwrite the bad values with correct values
    milk_raw[i.H.niger,"fat"] <- 5.0
    milk_raw[i.C.elaphus,"fat"] <- 12.6

    Clean texts from numeric variables

    There are some text annotations from the raw data still hanging out in the dataframe. I will use “regular expressions” to clean these up. Regular expression take practice to use but there are many resources online for learning about them (Cleaning up asterisks (*), as below, is even a little trickier).

    What follows is just a brief snapshot into this feature of R. R is known, however, for not having stellar basic regular expression features. Some tools in the tidyverse are available to make things less painful, but I haven’t switched to them yet.

    Locate unwanted characters with grep()

    There are letters, commas, and asterisks that were annotations in the original datafile. Easy to remove with find-replace, but also very easy with gsub(), with not chance of changing anything else in your spreadsheet you don’t want

    Look at all of the protein column; those asterisks are a problem. There’s also an “s” in one cell.

    summary(milk_raw$protein)
    ##  1.1*  1.3*   1.4  1.4*   1.5  1.5*   1.6  1.6*  1.7*   1.9    10 10.2, 
    ##     1     2     1     1     3     1     2     4     1     1     2     1 
    ##  10.3  10.5  10.7    11  11.1 11.2*  11.8  11.9  12.1  12.2  12.3  12.4 
    ##     3     2     3     2     1     1     1     1     1     1     1     1 
    ##  12.5  14.6  15.8   2.1   2.2   2.5   2.6   2.7     3   3.1   3.4   3.6 
    ##     2     1     1     2     1     1     1     2     1     1     1     2 
    ##  3.9*     4   4.1  4.2*   4.3  4.3*   4.4   4.5   4.6   4.8  4.8*   4.9 
    ##     1     2     1     1     1     1     2     2     1     1     1     1 
    ##   5.1  5.2*   5.4   5.5   5.6  5.6*   5.7   5.9     6   6.2   6.3  6.3* 
    ##     1     1     1     1     2     1     2     1     1     1     3     1 
    ##   6.4   6.5   6.6   6.7   6.8   6.9     7  7.1*   7.2   7.3   7.4   7.7 
    ##     1     2     1     1     1     1     3     1     2     2     1     2 
    ##   7.8  7.8S     8   8.1  8.1*   8.2   8.4  8.4*   8.5  8.6*  8.7*   8.8 
    ##     1     1     2     1     1     2     1     1     1     1     1     1 
    ##   8.9     9   9.1   9.2  9.2*   9.4   9.5   9.7  9.9* 
    ##     1     2     1     1     1     2     3     1     1
    • Use grep() to find the index of things
    • “[a-zA-Z]” = “find all letter, lower and uppercase”
    i.letters <- grep("[a-zA-Z]",milk_raw$protein)
    
    milk_raw$protein[i.letters]
    ## [1] 7.8S
    ## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*

    There’s the “S”.

    #these are equivalent for the simple case of a single letter
    i.S <- grep("S",milk_raw$protein)  #no brackets
    i.S <- grep("[S]",milk_raw$protein)#brackets
    
    milk_raw$protein[i.S]
    ## [1] 7.8S
    ## 93 Levels: 1.1* 1.3* 1.4 1.4* 1.5 1.5* 1.6 1.6* 1.7* 1.9 10 10.2, ... 9.9*

    Replace unwanted characters with gsub()

    milk_raw$protein <- gsub("[a-zA-Z]", #pattern
                        "",         #replace
                        milk_raw$protein)#data

    Check

    grep("[a-zA-Z]",milk_raw$protein)
    ## integer(0)

    Use the gsub() command to replace “special” characters

    Asterisks: “*"

    • Asterisks are a “special character” for regular expressions
    • These can be tricky
    • Have to put \ in front of it
    milk_raw$protein <- gsub("\\*","",milk_raw$protein)

    Check

    grep("\\*",milk_raw$protein)
    ## integer(0)

    COmmas “,”

    Commas are also species characters

    milk_raw$protein <- gsub("\\,","",milk_raw$protein)

    Check

    grep("\\,",milk_raw$protein)
    ## integer(0)

    Check for spaces

    These are frequent typos in data

    grep(" ", milk_raw$protein)
    ## integer(0)

    but luckily none!

    Convert character data to numeric data

    • Because there were non-numeric characters (S, a command, *) in the column, are loaded “protein” as character data; basically it treated it as words/symbols but not numbers.
    • We now need to convert it back to numbers
    • This is done with the command as.numeric()
    • Converting back and forth can sometimes cause problems, so its good to put the converted info in a new column to check against the old
    milk_raw$protein.num <- as.numeric(milk_raw$protein)

    Compare old and new columns

    head(milk_raw[,c("protein.num","protein")])
    ##   protein.num protein
    ## 1         6.3     6.3
    ## 2         5.7     5.7
    ## 3         4.1     4.1
    ## 4         4.3     4.3
    ## 5         5.6     5.6
    ## 6         8.8     8.8
    tail(milk_raw[,c("protein.num","protein")])
    ##     protein.num protein
    ## 125         6.4     6.4
    ## 126         8.1     8.1
    ## 127         4.4     4.4
    ## 128         8.1     8.1
    ## 129        10.7    10.7
    ## 130         9.4     9.4

    Clean other columns

    Check other columns

    summary(milk_raw[,c("sugar","energy")])
    ##      sugar        energy   
    ##  3      : 5   1.44   :  4  
    ##  4.5    : 5   0.81   :  3  
    ##  5.3    : 5   1.13   :  3  
    ##  5.2    : 4   0.49   :  2  
    ##  6.6    : 4   0.5    :  2  
    ##  (Other):92   (Other):101  
    ##  NA's   :15   NA's   : 15

    Sugar

    • The sugar column has “<” and letters.
    • I can put anything I want removed inside the brackets
      • “[a-zA-Z<]”" will get rid of the letters and “<”
    summary(milk_raw$sugar)
    ## <0.025   0.02   0.05    0.1    0.3    0.8      1    1.2    1.3    1.5 
    ##      1      2      1      1      1      1      1      1      1      1 
    ##    1.7    1.8   10.9     11   12.5     14    2.2    2.3    2.5    2.6 
    ##      2      1      1      2      2      1      1      3      2      2 
    ##    2.7    2.8      3    3.1    3.2    3.3    3.4    3.6    3.7    3.8 
    ##      3      1      5      1      1      1      3      2      1      2 
    ##      4    4.1    4.2    4.3    4.4    4.5    4.6    4.8    4.9    5.2 
    ##      2      1      1      1      3      5      2      1      2      4 
    ##    5.3   5.3,    5.6    5.7   5.8S    5.9    6.1    6.2    6.4    6.6 
    ##      5      1      1      3      1      2      1      3      3      4 
    ##    6.7    6.8    6.9      7    7.2    7.4    7.5    7.7    7.8    7.9 
    ##      2      2      2      2      1      1      1      1      3      1 
    ##    8.1    8.4    8.5    8.9   NA's 
    ##      1      1      2      1     15

    Remove letters and “<”

    milk_raw$sugar <- gsub("[a-zA-Z<]","",milk_raw$sugar)

    Convert and check

    milk_raw$sugar.num <- as.numeric(milk_raw$sugar)
    ## Warning: NAs introduced by coercion
    head(milk_raw[,c("sugar.num","sugar")])
    ##   sugar.num sugar
    ## 1       5.2   5.2
    ## 2        NA  <NA>
    ## 3       5.3   5.3
    ## 4       4.1   4.1
    ## 5       4.9   4.9
    ## 6        NA  <NA>

    Energy

    #The followign are equivalent
    summary(milk_raw[,c("energy")])
    ##  0.36  0.45  0.46  0.47  0.49   0.5  0.52  0.54  0.57  0.65  0.68  0.71 
    ##     1     1     1     1     2     2     2     1     2     1     1     1 
    ##  0.72  0.78  0.81  0.82  0.84   0.9  0.91  0.92  0.93  0.96  0.98  1.03 
    ##     1     1     3     1     1     1     2     1     1     1     1     2 
    ##  1.06  1.08  1.09   1.1  1.11  1.13  1.16  1.17  1.19   1.2  1.21  1.23 
    ##     2     1     1     1     1     3     2     2     1     1     2     1 
    ##  1.25  1.26  1.27  1.28  1.31  1.36  1.37  1.41  1.42  1.43  1.44  1.48 
    ##     1     1     1     1     1     2     2     1     1     1     4     1 
    ##   1.5  1.52   1.6  1.61  1.62  1.64  1.65  1.68  1.71  1.79   1.8  1.83 
    ##     1     2     1     1     1     1     1     1     1     2     1     1 
    ##  1.88  1.95  1.96     2  2.06  2.07   2.1  2.18   2.2  2.34  2.45  2.57 
    ##     1     1     1     1     1     1     2     1     1     1     1     1 
    ##  2.59  2.61  2.62  2.65  2.76  2.82  2.85  2.94  3.29   3.4  3.49  3.64 
    ##     1     1     1     1     1     1     1     1     1     2     1     1 
    ##  3.73  4.47  4.71  5.23  5.36  5.41 5.69*  5.89  NA's 
    ##     1     1     1     1     1     1     1     1    15
    summary(milk_raw$energy)
    ##  0.36  0.45  0.46  0.47  0.49   0.5  0.52  0.54  0.57  0.65  0.68  0.71 
    ##     1     1     1     1     2     2     2     1     2     1     1     1 
    ##  0.72  0.78  0.81  0.82  0.84   0.9  0.91  0.92  0.93  0.96  0.98  1.03 
    ##     1     1     3     1     1     1     2     1     1     1     1     2 
    ##  1.06  1.08  1.09   1.1  1.11  1.13  1.16  1.17  1.19   1.2  1.21  1.23 
    ##     2     1     1     1     1     3     2     2     1     1     2     1 
    ##  1.25  1.26  1.27  1.28  1.31  1.36  1.37  1.41  1.42  1.43  1.44  1.48 
    ##     1     1     1     1     1     2     2     1     1     1     4     1 
    ##   1.5  1.52   1.6  1.61  1.62  1.64  1.65  1.68  1.71  1.79   1.8  1.83 
    ##     1     2     1     1     1     1     1     1     1     2     1     1 
    ##  1.88  1.95  1.96     2  2.06  2.07   2.1  2.18   2.2  2.34  2.45  2.57 
    ##     1     1     1     1     1     1     2     1     1     1     1     1 
    ##  2.59  2.61  2.62  2.65  2.76  2.82  2.85  2.94  3.29   3.4  3.49  3.64 
    ##     1     1     1     1     1     1     1     1     1     2     1     1 
    ##  3.73  4.47  4.71  5.23  5.36  5.41 5.69*  5.89  NA's 
    ##     1     1     1     1     1     1     1     1    15
    #clean
    milk_raw$energy <- gsub("[a-zA-Z<]","",milk_raw$energy)
    
    #convert and check
    milk_raw$energy.num <- as.numeric(milk_raw$energy)
    ## Warning: NAs introduced by coercion
    head(milk_raw[,c("energy.num","energy")])
    ##   energy.num energy
    ## 1       1.21   1.21
    ## 2         NA   <NA>
    ## 3       1.13   1.13
    ## 4       0.91   0.91
    ## 5       1.31   1.31
    ## 6         NA   <NA>

    Remove unwanted columns with dplyr

    Once we’ve checked that our data have converted properly from character to numeric, we can remove the old columns of character data using select() and negative indexing.

    milk_raw2 <- milk_raw %>% dplyr::select(-dry.matter,
                                       -protein,
                                       -sugar,
                                       -energy,
                                       -ref,
                                       -lactation.stage.orig)

    Note that during initial cleaning I only remove variable that are completely redundant; otherwise I leave everything in.

    Change elements common accross column

    Remove “.NUM” and “.num” using gsub()

    names(milk_raw2) <- gsub(".NUM","",names(milk_raw2))
    names(milk_raw2) <- gsub(".num","",names(milk_raw2))

    Rename columns

    To polish a dataset for analysis, I like to shorten the variable names

    milk_raw2 <- milk_raw2 %>% rename(fam = family,
                                      mass.fem = mass.female,
                                gest.mo = gestation.month,
                                lac.mo = lacatation.months,
                                prot = protein,
                                dev.birth = dev.stage.at.birth,
                                ord = order)

    Let’s call this cleaned up data just “milk”

    milk <- "milk_raw2"

    Save the cleaned data

    We can save our cleaned data using read.csv()

    write.csv(milk, "data/skibiel_mammalsmilk.csv",
              row.names = F)

    The data produced by the above workflow are actually internal to the package. However, I recommend saving the .csv file to wherever you keep your R work and to practice reloading it.

    You can load it directly, however, using data(“milk”)