• Data Exploration for Regression Analyses
  • " />

    Introduction

    The data collated by Skibiel et al contain information on several components of milk, including fat, protein, sugar, and total energetic content (“energy”). Here we’ll look at multple y variables, though for most of the subsequent tutorials we’ll focus on % of fat in milk

    The data also contains several “continous predictors” / covariates (x variables). These include

    • size of mom = “mass.fem”
    • duration of pregnancy/gestation = “gest.mon”
    • how long young fed = “lacat.mo” (“number of lactation months”)
    • how big the babies are = “mass.litter”

    THere are also Several “categorical predictors”

    • “diet” = meat, plants, everything
    • “arid” = desert or other; theory predicts arid environments impact evolution of milk
    • “biome” = aquatic or terrestrial (includes bats as terrestrial)

    Finally, there are columns containing taxonomic information

    • order = order
    • fam = family
    • spp = species

    Preliminaries

    Load packages

    library(ggplot2)
    library(cowplot)
    library(arm)
    library(dplyr)
    #library(GGally)
    library(ggpubr)

    Loading the mammalsmilk package

    If you haven’t already, download the mammalsmilk package from GitHub (note the “S” between mammal and milk). This is commented out in the code below - only run the code if you have never downloaded it before, or haven’t recently (in case its been updated)

    # install_github("brouwern/mammalsmilk")

    You can then load the package with library()

    library(mammalsmilk)

    Load data

    data("milk")

    Data exploration

    Look at the data

    # Hom much data?
    dim(milk)
    ## [1] 130  19
    
    #focal columns
    ##(use negative indexing to drop some)
    summary(milk[,-c(2,3,8,9)])
    ##             ord        mass.fem            gest.mo           lac.mo      
    ##  Artiodactyla :23   Min.   :        8   Min.   : 0.400   Min.   : 0.300  
    ##  Carnivora    :23   1st Qu.:      857   1st Qu.: 1.405   1st Qu.: 1.625  
    ##  Primates     :22   Median :     5716   Median : 5.000   Median : 4.500  
    ##  Rodentia     :17   Mean   :  2229475   Mean   : 5.624   Mean   : 6.092  
    ##  Chiroptera   :10   3rd Qu.:   107500   3rd Qu.: 8.365   3rd Qu.: 8.225  
    ##  Diprotodontia:10   Max.   :170000000   Max.   :21.460   Max.   :42.000  
    ##  (Other)      :25                                                        
    ##   mass.litter               diet     arid            biome    
    ##  Min.   :      0.3   carnivore:32   no :91   aquatic    : 22  
    ##  1st Qu.:     42.0   herbivore:61   yes:39   terrestrial:108  
    ##  Median :    423.5   omnivore :37                             
    ##  Mean   :  52563.8                                            
    ##  3rd Qu.:   7038.2                                            
    ##  Max.   :2272500.0                                            
    ##                                                               
    ##        N           fat          gest.month        lacat.mo     
    ##  4      :13   Min.   : 0.20   Min.   : 0.400   Min.   : 0.300  
    ##  3      :11   1st Qu.: 4.65   1st Qu.: 1.405   1st Qu.: 1.625  
    ##  6      :10   Median : 8.55   Median : 5.000   Median : 4.500  
    ##  7      : 9   Mean   :13.99   Mean   : 5.624   Mean   : 6.092  
    ##  5      : 8   3rd Qu.:16.82   3rd Qu.: 8.365   3rd Qu.: 8.225  
    ##  24     : 5   Max.   :61.10   Max.   :21.460   Max.   :42.000  
    ##  (Other):74                                                    
    ##       prot            sugar           energy     
    ##  Min.   : 1.100   Min.   : 0.02   Min.   :0.360  
    ##  1st Qu.: 4.125   1st Qu.: 3.00   1st Qu.:0.965  
    ##  Median : 6.750   Median : 4.70   Median :1.365  
    ##  Mean   : 6.673   Mean   : 4.94   Mean   :1.680  
    ##  3rd Qu.: 9.200   3rd Qu.: 6.60   3rd Qu.:2.045  
    ##  Max.   :15.800   Max.   :14.00   Max.   :5.890  
    ##                   NA's   :16      NA's   :16

    Focal families in dataset

    This code is dense but let’s us look at the most common families

    Tabulate number of families

    table.fam <- table(milk$fam)

    Determine order in data set (not taxonomic order)

    numeric.order. <- order(table.fam,
                            decreasing = T)

    Largest families, in order

    table.fam <- data.frame(table.fam[numeric.order.][1:13])
    
    names(table.fam) <- c("fam","freq")
    
    table.fam
    ##                fam freq
    ## 1          Bovidae   13
    ## 2  Cercopithecidae    8
    ## 3         Cervidae    7
    ## 4          Muridae    7
    ## 5        Otariidae    7
    ## 6         Phocidae    7
    ## 7        Lemuridae    5
    ## 8     Macropodidae    5
    ## 9     Pteropodidae    5
    ## 10 Balaenopteridae    4
    ## 11         Equidae    4
    ## 12         Canidae    3
    ## 13       Leporidae    3

    Plot as barplot; some added gplot code to rotate the axis lables

    ggbarplot(data = table.fam,
              y = "freq",
              x = "fam") +
      theme(axis.text.x = element_text(angle = 90, 
                                       hjust = 1, 
                                       vjust = 0.5))

    Tabulate number of orders

    table.ord <- table(milk$ord)

    Determine order (numeric order, not taxonomic order)

    numeric.order. <- order(table.ord,
                            decreasing = T)

    Largest orders, in order

    table.ord <- data.frame(table.ord[numeric.order.][1:9])
    
    names(table.ord) <- c("ord","freq")
    
    table.ord
    ##               ord freq
    ## 1    Artiodactyla   23
    ## 2       Carnivora   23
    ## 3        Primates   22
    ## 4        Rodentia   17
    ## 5      Chiroptera   10
    ## 6   Diprotodontia   10
    ## 7 Perrissodactyla    7
    ## 8         Cetacea    6
    ## 9      Lagomorpha    3
    • Artiodactyla = 23; Even-toed ungulate
    • Carnivora = 23
    • Primates = 22
    • Rodentia = 17
    • Chiroptera = 10; bats
    • Diprotodontia = 10; marsupials
    • Perrissodactyla = 7; odd-toed ungulates
    • Cetacea = 6
    • Lagomorpha = 3
    ggbarplot(data = table.ord,
              y = "freq",
              x = "ord") +
      theme(axis.text.x = element_text(angle = 90, 
                                       hjust = 1, 
                                       vjust = 0.5))

    Plotting scatter plots

    To condense my plotting code I will save titles and axis labels to R objects.

    Note that for ylab. I insert a line break by putting a backslash in from of an n, for “new line”

    main. <- "Regression data: continous vs continuous"
    xlab. <- "Continous x: mass of female"
    ylab. <- "Continous y: \nfat contenxt of milk"

    Look at % milk fat vs. mass of female

    Note that the plot is dominated an outlier, which the whales!

    ggscatter(y = "fat",
          x = "mass.fem",
          data = milk,
          main = main.,
          xlab = xlab.,
          ylab = ylab.)

    A similar graph can be made using the stock ggplot function qplot(), except the variables are not quoted

    qplot(y = fat,       # no " " !
          x = mass.fem,
          data = milk,
          main = main.,
          xlab = xlab.,
          ylab = ylab.) 

    Look at milk fat vs. log(mass of female)

    • Logging helps linearize the data
    • NOTE: statiticians typically use natural log = ln()
    • Biologists, especially when allometry is invovled, typically use base 10 = log10()
    • Original authors used log10, but I just realized this, so I’ll stick with ln()
    • I believe (?) give similar answers, but log10 I think (?) makes biological interpretation easier (???)
    • ggpubr requires hard-coded variables and can’t do the log transform “on the fly” in the same way as ggplot/qplot

    We can use the mutate() function in dplyr to take the log.

    milk <- milk %>% dplyr::mutate(mass.fem.log = log(mass.fem))

    Look at transformed data

    summary(milk$mass.fem.log)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ##   2.079   6.753   8.651   9.105  11.584  18.951

    Plot transformed data

    main.new <-"Regression data: continous vs log(continuous)" 
    
    ggpubr::ggscatter(y = "fat",
          x = "mass.fem.log",
          data = milk,
          main = main.new,
          ylab = ylab.,
          xlab = xlab.)

    Add “smoother”

    Helps visualize trends

    ggscatter(y = "fat",
          x = "mass.fem.log",
          add = "loess",     #smoother
          data = milk,
          main = main.new,
          ylab = ylab.,
          xlab = xlab.)

    Look at % milk fat by a categorical variable (diet)

    Color-code diets. This might take a second to render.

    ggscatter(y = "fat",
          x = "mass.fem.log",
          color = "diet",
          add = "loess",
          data = milk,
          main = main.new,
          ylab = ylab.,
          xlab = xlab.)

    Look at % milk fat by diet and biome using facets

    • “facets” are a powerful tool in ggplot
    • Turns out ALL of our aquatic spp are carnivores (sorry, no manatees)
    • This shows that “aquatic” and “diet” are “Confounded”
    ggscatter(y = "fat",
          x = "mass.fem.log",
          color = "diet",
          facet.by = "biome", # facet command
          data = milk,
          main = main.new,
          ylab = ylab.,
          xlab = xlab.)

    Boxplots

    It can be useful to make boxplot just by categorical variables. For me, this plot highlights that almost all carnivores are aquatic, and they have a wide range of fat contents.

    ggboxplot(data = milk,
              add = "jitter",
              color = "biome",
              y = "fat",
              x ="diet")

    Focal data subset: Primates & Relatives

    Let’s look at just a subset of the data * Primates * Rodents (this is why we use mouse models) * Rabbits

    data("milk_primates")

    Compare original data and our working subset

    • Original = 130 speices
    • Working data = 42
    # Original data
    dim(milk)
    ## [1] 130  20
    
    #out working subset
    dim(milk_primates)
    ## [1] 42 15

    Plot working data: milk fat ~ female size

    Do log transform

    milk_primates <- milk_primates %>% dplyr::mutate(mass.fem.log = log(mass.fem))

    Make scatterplot; ggscatterhist() allows plots to be put in margin

    Add labels

    This code is fairly dense. What I’ll do is annotate some selected data points with the species names.

    First, make some labels.

    #make column to hold particular spp.names
    milk_primates$spp.focal <- ""
    
    #select spp of interest
    ## This is just accross a span of the data
    i.focal <- c(17,18,19,23,30,34,39)
    
    #look at focals
    milk_primates$spp.focal[i.focal]
    ## [1] "" "" "" "" "" "" ""
    
    #add spp names to spp.focal column
    milk_primates$spp.focal[i.focal] <- as.character(milk_primates$spp[i.focal])
    
    milk_primates$spp.focal <- gsub("(^.*)( )(.*)","\\1",milk_primates$spp.focal)

    plot with names using some fancier ggplot code.

    ggplot(data = milk_primates,
           aes(y = fat,
          x = mass.fem.log)) + 
      geom_point()+
      geom_text(aes(label=spp.focal),
                hjust=1, 
                vjust=0)

    Add regression line

    • % Fat generally declines as body mass increases
    • Grey areas is “confidence band”
    ggscatter(y = "fat",
          x = "mass.fem.log",
          add = "reg.line",   #regression line
          conf.int = TRUE,    #confidence int.
          data = milk_primates) 

    Color code by

    ggscatter(y = "fat",
          x = "mass.fem.log",
          color = "ord",
          add = "reg.line",   #regression line
          conf.int = TRUE,    #confidence int.
          data = milk_primates) 

    Alternate predictors

    We can also look at othe predictor besides the size of the mother, such as how many months are spent gestating.

    First, we should log transform them.

    milk_primates <- milk_primates %>% dplyr::mutate(gest.mo.log = log(gest.month),
        lacat.mo.log = log(lacat.mo),
        mass.litter.log = log(mass.litter))

    Duration of gestation

    Let’s look at a relationship fat ~ log(gest.month)

    Compare all data compiled with by order

    #all data combined
    no.order <- ggscatter(y = "fat",
          x = "gest.mo.log",
          data = milk_primates,
          #color = "ord",
          add = "reg.line",
          conf.int = TRUE)
    
    #split by order
    by.order <- ggscatter(y = "fat",
          x = "gest.mo.log",
          data = milk_primates,
          color = "ord",
          add = "reg.line",
          conf.int = TRUE)
    
    plot_grid(no.order,by.order)

    % Fat generally declines as gestation duration increases. So, the longer animal is pregnant, the less fatty the milk is. However, this trend largely disappears when we plot by order. That is because the Lagomorph data is vary noisy, resulting in a flat line. The negative trend seen when looking at all of the data combined is largely drive by the fact that lagomorphs tend to have higher milkfat and shorter gestations, and primates have lower milk fat and longer gestations. This trend is therefore probably driven mostly by differenes between orders, not due to an underlying relationship with gestation.

    Duration of lactation

    Another relationship to consider fat ~ log(lacat.mo)

    Compare all data compiled with by order

    no.order <- ggscatter(y = "fat",
          x = "lacat.mo.log",
          data = milk_primates,
          #color = "ord",
          add = "reg.line",
          conf.int = TRUE)
    
    by.order <- ggscatter(y = "fat",
          x = "lacat.mo.log",
          data = milk_primates,
          color = "ord",
          add = "reg.line",
          conf.int = TRUE)
    
    plot_grid(no.order,by.order)

    Overall, the % Fat generally declines as lactation duration increases. So, the longer time children are dependent, lower milk fat.

    Again, however, there is strong phyolgenetic single. Within

    Mass of litter

    fat ~ log(mass.litter)

    • % Fat generally declines as litter size increases
    • more or bigger babies, less milk fat
    no.order <- ggscatter(y = "fat",
          x = "mass.litter.log",
          data = milk_primates,
          #color = "ord",
          add = "reg.line",
          conf.int = TRUE)
    
    by.order <- ggscatter(y = "fat",
          x = "mass.litter.log",
          data = milk_primates,
          color = "ord",
          add = "reg.line",
          conf.int = TRUE)
    
    plot_grid(no.order,by.order)

    Correlations among predictors

    ALl of the graphs of all of the predictors looked basically the same. One reason is that all of the predictors are highly correlated, so they are all basically indicating something similar about the animals.

    We can get a sense for these correlations using the ggpairs() function in the GGally package.

    vars. <- c("fat",
               "gest.mo.log",
               "mass.fem.log",
               "lacat.mo.log"  , 
               "mass.litter.log")
    
    GGally::ggpairs(data = milk_primates,
            column = vars.,
            diag =NULL )