• Summarizing data with dplyr

    Introduction

    This tutorial demonstrates how to use dplyr’s summarize() command to calculate summary statistics such as the mean, standard deviation (SD) and sample size (n). First we’ll process a dataset of female fruit flies to determine summary statistics for their wing phenotype (mean, SD, etc.). We’ll use the group_by() command to split the data by the experimental group each fly was in. Then we’ll look at a larger dataset with both male and female flies in the two experimental groups and extend group_by() to split the data by multiple grouping variables: sex (male vs. female) and experimental group (control vs. experimental).

    Preliminaries

    Load libraries

    Any packages not already installed can be installed using the code “install.packages(…)”, such as install.packages("plotrix"). The library() calls below then load them.

    library(shroom)  #data
    library(dplyr)   #summarize() and group_by()
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    library(plotrix) #function for SE
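
    The install step mentioned above can be sketched as a check for which packages are missing before anything is installed. The helper name missing_pkgs() is hypothetical (it is not part of any package), and the install line is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The two CRAN packages this tutorial loads with library()
to_install <- missing_pkgs(c("dplyr", "plotrix"))

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
} else {
  message("All packages are already installed")
}
```

    Checking first avoids re-downloading packages every time the script is run.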

    Load data

    The wingscores_s1F dataset has data for the 24 female flies examined by the 1st student in the dataset (“_s1F”). The “wingscores_s1” dataset has all of the 1st student’s data. We’ll look at both over the course of this tutorial.

    data("wingscores_s1F")
    data("wingscores_s1")

    Numeric summaries with summarize()

    We can calculate the mean of a column by calling mean() inside summarize(). Note that the format is “summary.name = function(data.column)”: “mean =” sets the label for the summary and “mean(wing.score)” is the summary command.

    wingscores_s1F %>% 
      summarize(mean = mean(wing.score))
    #>   mean
    #> 1    4
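
    To see the “summary.name = function(data.column)” pattern without the shroom data, here is a minimal sketch on a small made-up data frame. The object toy_flies and its values are hypothetical and are not taken from the wingscores data.

```r
library(dplyr)

# Hypothetical toy data (not the wingscores data)
toy_flies <- data.frame(wing.score = c(4, 5, 3, 3, 4, 2))

# "mean =" is the label; mean(wing.score) is the summary command
toy_mean <- toy_flies %>%
  summarize(mean = mean(wing.score))

toy_mean

# The same number, computed directly on the column with base R
mean(toy_flies$wing.score)
```

    The difference from calling mean() directly is that summarize() returns a data frame with a labelled column rather than a bare number.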

    summarize() is most powerful when you want multiple summary statistics worked up for the same set of data. We can ask for both the mean and the standard deviation (SD) like this:

    wingscores_s1F %>% 
      summarize(mean = mean(wing.score),
                SD = sd(wing.score))
    #>   mean        SD
    #> 1    4 0.7801895

    A handy function is n(), which counts the sample size (the number of rows in the current group). Note that nothing goes inside the parentheses of n().

    wingscores_s1F %>% 
      summarize(mean = mean(wing.score),
                SD = sd(wing.score),
                n = n())
    #>   mean        SD  n
    #> 1    4 0.7801895 24
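
    One thing worth knowing about n() is that it counts rows, so a row with a missing score is still counted. The sketch below uses a hypothetical toy data frame with one NA to show the difference. Whether the wingscores data contain any missing values is not stated here, so this is only an illustration.

```r
library(dplyr)

# Hypothetical toy data with one missing wing score
toy_flies <- data.frame(wing.score = c(4, 5, NA, 3, 4, 2))

toy_summary <- toy_flies %>%
  summarize(mean   = mean(wing.score, na.rm = TRUE),  # ignore the NA
            SD     = sd(wing.score, na.rm = TRUE),
            n_rows = n(),                             # all rows, NA included
            n_obs  = sum(!is.na(wing.score)))         # non-missing scores only

toy_summary
```

    Here n_rows is 6 while n_obs is 5, so the second is the one to use as the sample size when data are missing.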

    Numeric summaries by group

    group_by() splits the dataframe up by a categorical (aka factor) variable. Here, I split by the 2 experimental conditions (E = “experimental” flies and C = “control” flies) and then calculate the summary stats.

    wingscores_s1F %>%                   # data
      group_by(E.or.C) %>%               # group_by()
      summarize(mean = mean(wing.score), # summarize()
                n = n(),
                sd = sd(wing.score))
    #> # A tibble: 2 x 4
    #>   E.or.C  mean     n    sd
    #>   <fct>  <dbl> <int> <dbl>
    #> 1 C       4.38    16 0.619
    #> 2 E       3.25     8 0.463
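
    To make clear what group_by() is doing, this sketch runs the same kind of grouped summary on a hypothetical toy data frame and compares the group means with base R’s tapply(). The toy_flies values are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: 3 control (C) and 3 experimental (E) flies
toy_flies <- data.frame(
  E.or.C     = factor(c("C", "C", "C", "E", "E", "E")),
  wing.score = c(4, 5, 3, 3, 4, 2)
)

# One row of summary statistics per level of E.or.C
by_group <- toy_flies %>%
  group_by(E.or.C) %>%
  summarize(mean = mean(wing.score),
            n    = n(),
            sd   = sd(wing.score))

by_group

# The same group means with base R, for comparison
manual_means <- tapply(toy_flies$wing.score, toy_flies$E.or.C, mean)
manual_means
```

    The advantage of the dplyr version is that several statistics end up side by side in one table.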

    Cool summarize() tricks

    Here are some demonstrations of the functionality of summarize().

    Using summary functions from other packages

    I can also use functions from other packages. The plotrix package has a standard error function: std.error(). I’ll call it using plotrix::std.error() so it’s obvious that it’s from that package.

    I can use this std.error() function within summarize():

    wingscores_s1F %>% 
      group_by(E.or.C) %>% 
      summarize(mean = mean(wing.score),
                n = n(),
                sd = sd(wing.score),
                se = plotrix::std.error(wing.score))
    #> # A tibble: 2 x 5
    #>   E.or.C  mean     n    sd    se
    #>   <fct>  <dbl> <int> <dbl> <dbl>
    #> 1 C       4.38    16 0.619 0.155
    #> 2 E       3.25     8 0.463 0.164
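
    The se column above is consistent with the usual definition of the standard error, SD divided by the square root of n. The sketch below computes it by hand inside summarize() on a hypothetical toy data frame, assuming that definition, and then checks the arithmetic against the rounded sd and n values printed in the table.

```r
library(dplyr)

# Hypothetical toy data (not the wingscores data)
toy_flies <- data.frame(
  E.or.C     = factor(c("C", "C", "C", "E", "E", "E")),
  wing.score = c(4, 5, 3, 3, 4, 2)
)

# Standard error computed by hand: SD / sqrt(sample size)
toy_se <- toy_flies %>%
  group_by(E.or.C) %>%
  summarize(sd = sd(wing.score),
            n  = n(),
            se = sd(wing.score) / sqrt(n()))

toy_se

# Arithmetic check using the rounded sd and n from the table above
se_C <- 0.619 / sqrt(16)  # close to the 0.155 reported for group C
se_E <- 0.463 / sqrt(8)   # close to the 0.164 reported for group E
```

    Writing the formula out is handy when an extra package is not available, at the cost of a little more typing.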

    Doing math within summarize()

    I can even do math within summarize(). Here, I calculate the half-width of an approximate 95% confidence interval on the fly by multiplying the standard error from std.error() by 1.96 within summarize().

    wingscores_s1F %>% 
      group_by(E.or.C) %>% 
      summarize(mean = mean(wing.score),
                n = n(),
                sd = sd(wing.score),
                se = plotrix::std.error(wing.score),
                CI95 = 1.96*plotrix::std.error(wing.score))
    #> # A tibble: 2 x 6
    #>   E.or.C  mean     n    sd    se  CI95
    #>   <fct>  <dbl> <int> <dbl> <dbl> <dbl>
    #> 1 C       4.38    16 0.619 0.155 0.303
    #> 2 E       3.25     8 0.463 0.164 0.321
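
    Within a single summarize() call, later summaries can refer to earlier ones by name, which saves retyping the standard error. The sketch below uses that on a hypothetical toy data frame to build lower and upper interval bounds. It also adds a t-based multiplier from qt() as an assumed alternative to 1.96 for small samples. This is not something the tutorial itself uses.

```r
library(dplyr)

# Hypothetical toy data (not the wingscores data)
toy_flies <- data.frame(
  E.or.C     = factor(c("C", "C", "C", "E", "E", "E")),
  wing.score = c(4, 5, 3, 3, 4, 2)
)

toy_ci <- toy_flies %>%
  group_by(E.or.C) %>%
  summarize(mean   = mean(wing.score),
            n      = n(),
            se     = sd(wing.score) / sqrt(n),   # n is the column made above
            CI95   = 1.96 * se,                  # normal approximation
            lower  = mean - CI95,
            upper  = mean + CI95,
            CI95_t = qt(0.975, df = n - 1) * se) # t-based alternative

toy_ci
```

    With only 3 flies per group the t-based half-width is noticeably wider than the 1.96 version, which is why 1.96 is described as approximate.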

    Saving summarize() output to an R object

    I can save this summary table to an object if I want:

    s1_females_summary <- wingscores_s1F %>% 
      group_by(E.or.C) %>% 
      summarize(mean = mean(wing.score),
                n = n(),
                sd = sd(wing.score),
                se = plotrix::std.error(wing.score),
                CI95 = 1.96*plotrix::std.error(wing.score))

    And call it up later:

    s1_females_summary
    #> # A tibble: 2 x 6
    #>   E.or.C  mean     n    sd    se  CI95
    #>   <fct>  <dbl> <int> <dbl> <dbl> <dbl>
    #> 1 C       4.38    16 0.619 0.155 0.303
    #> 2 E       3.25     8 0.463 0.164 0.321

    I can even save this to a .csv file:

    write.csv(s1_females_summary, file = "s1_summary.csv")
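
    By default write.csv() also writes the row names as an extra first column. The sketch below shows one way to leave them out and then reads the file back to check it. The small table is typed in from the rounded values printed above, and a temporary file is used so that nothing in the working directory is overwritten.

```r
# Illustrative table typed in from the rounded output shown above
toy_summary <- data.frame(E.or.C = c("C", "E"),
                          mean   = c(4.38, 3.25),
                          n      = c(16L, 8L))

# Write to a temporary file, without the row-name column
out_file <- tempfile(fileext = ".csv")
write.csv(toy_summary, file = out_file, row.names = FALSE)

# Read the file back in to confirm what was saved
read_back <- read.csv(out_file)
read_back
```

    Without row.names = FALSE the file gains an unnamed first column, which read.csv() labels “X” when the file is read back in.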

    Numeric summaries by multiple groups

    You can give group_by() multiple categories. The “wingscores_s1” data has both males and females from both experimental conditions. To make the code cleaner I’ll make a vector called “columns.” that stores the names of the 3 columns I want to work with.

    #names of columns to examine
    columns. <- c("E.or.C", "sex","wing.score")
    
    # check that everything works
    summary(wingscores_s1[,columns.])
    #>  E.or.C sex      wing.score   
    #>  C:33   F:24   Min.   :3.000  
    #>  E:27   M:36   1st Qu.:4.000  
    #>                Median :5.000  
    #>                Mean   :4.667  
    #>                3rd Qu.:5.000  
    #>                Max.   :7.000

    Calculate the mean for both sexes and for both experimental conditions.

    wingscores_s1 %>% 
      group_by(E.or.C,
               sex) %>% 
      summarize(mean = mean(wing.score))
    #> # A tibble: 4 x 3
    #> # Groups:   E.or.C [?]
    #>   E.or.C sex    mean
    #>   <fct>  <fct> <dbl>
    #> 1 C      F      4.38
    #> 2 C      M      4.88
    #> 3 E      F      3.25
    #> 4 E      M      5.32
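
    The “# Groups:” line in the output above shows that the result is still grouped by E.or.C: after summarizing over two grouping variables, only the last one (sex) is dropped. The sketch below shows this on a hypothetical toy data frame and removes the leftover grouping with ungroup().

```r
library(dplyr)

# Hypothetical toy data: 2 conditions x 2 sexes, 2 flies in each combination
toy_flies <- data.frame(
  E.or.C     = factor(rep(c("C", "E"), each = 4)),
  sex        = factor(rep(c("F", "M"), times = 4)),
  wing.score = c(4, 5, 4, 6, 3, 5, 3, 6)
)

# One mean per combination of condition and sex
toy_means <- toy_flies %>%
  group_by(E.or.C, sex) %>%
  summarize(mean = mean(wing.score))

toy_means

# The result is still grouped by the first variable
group_vars(toy_means)

# Remove the leftover grouping before further work
toy_means_flat <- ungroup(toy_means)
group_vars(toy_means_flat)
```

    Ungrouping matters if the table is summarized again, since a second summarize() on the grouped version would work within each E.or.C level rather than across the whole table.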