d-summarizing_data_with_dplyr.Rmd
This tutorial demonstrates how to use dplyr’s summarize() command to calculate summary statistics such as the mean, standard deviation (SD) and sample size (n). First we’ll process a dataset of female fruit flies to determine summary stats on their wing phenotype (mean, SD etc.), using the group_by() command to split the data up by the experimental groups the flies were in. Then we’ll look at a larger dataset with both male and female flies in the two experimental groups and extend group_by() to multiple grouping variables: sex (male vs. female) and experimental group (control vs. experimental).
Any packages not already installed can be installed using the code “install.packages(…)”, such as “install.packages("plotrix")”.
library(shroom) #data
library(dplyr) #summarize() and group_by()
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(plotrix) #function for SE
We can calculate the mean of a column using summarize with the mean() command within it. Note that the format is “summary.name = function(data.column)”. “mean =” generates the label for the summary and “mean(wing.score)” is the summary command.
wingscores_s1F %>%
  summarize(mean = mean(wing.score))
#> mean
#> 1 4
summarize() is most powerful when you want multiple summary statistics worked up for the same set of data. We can ask for both the mean and the standard deviation (SD) like this:
wingscores_s1F %>%
  summarize(mean = mean(wing.score),
            SD = sd(wing.score))
#> mean SD
#> 1 4 0.7801895
A handy function is n(), which counts up the sample size. Note that nothing goes inside the parentheses of n().
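As a minimal sketch of n() in use, the block below builds a small made-up data frame (toy_wings is a hypothetical stand-in, not the shroom data) and adds the sample size to the mean and SD. With the tutorial's data, wingscores_s1F would take the place of toy_wings.

```r
library(dplyr)

# Hypothetical toy data with made-up scores (NOT the shroom dataset)
toy_wings <- data.frame(
  E.or.C     = c("C", "C", "C", "E", "E"),
  wing.score = c(4, 5, 4, 3, 3)
)

# n() is called with empty parentheses; it counts the rows being summarized
toy_summary <- toy_wings %>%
  summarize(mean = mean(wing.score),
            SD   = sd(wing.score),
            n    = n())

toy_summary
```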
group_by() splits the dataframe up by a categorical (aka factor) variable. Here, I split by the 2 experimental conditions, E = “experimental” flies and C = “control” flies, and then calculate the summary stats.
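The split-then-summarize pattern might look like the sketch below. It again uses a hypothetical toy data frame with made-up scores so that it runs on its own; the column names simply mirror those in the shroom data.

```r
library(dplyr)

# Hypothetical toy data with made-up scores (NOT the shroom dataset)
toy_wings <- data.frame(
  E.or.C     = factor(c("C", "C", "C", "E", "E")),
  wing.score = c(4, 5, 4, 3, 3)
)

# group_by() marks the groups; summarize() then returns one row per group
toy_by_group <- toy_wings %>%
  group_by(E.or.C) %>%
  summarize(mean = mean(wing.score),
            SD   = sd(wing.score),
            n    = n())

toy_by_group
```

Each summary is now calculated separately for the C rows and the E rows instead of once for the whole column.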
Here are some demonstrations of the functionality of summarize().
I can also use functions from other packages. The plotrix package has a standard error function: std.error(). I’ll call it using plotrix::std.error() so it’s obvious that it’s from that package.
I can use this std.error() function within summarize().
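A minimal sketch of this, assuming plotrix is installed, is shown below on hypothetical made-up scores. The se_by_hand column is an addition for illustration: with no missing values, the standard error is the SD divided by the square root of the sample size, so the two columns should agree.

```r
library(dplyr)

# Hypothetical toy data with made-up scores (NOT the shroom dataset)
toy_wings <- data.frame(wing.score = c(4, 5, 4, 3, 3))

toy_se <- toy_wings %>%
  summarize(se         = plotrix::std.error(wing.score),   # function from plotrix
            se_by_hand = sd(wing.score) / sqrt(n()))        # same quantity by hand

toy_se
```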
I can even do math within summarize(). Here, I calculate the half-width of an approximate 95% confidence interval by multiplying the standard error from std.error() by 1.96 on the fly within summarize().
wingscores_s1F %>%
  group_by(E.or.C) %>%
  summarize(mean = mean(wing.score),
            n = n(),
            sd = sd(wing.score),
            se = plotrix::std.error(wing.score),
            CI95 = 1.96*plotrix::std.error(wing.score))
#> # A tibble: 2 x 6
#> E.or.C mean n sd se CI95
#> <fct> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 C 4.38 16 0.619 0.155 0.303
#> 2 E 3.25 8 0.463 0.164 0.321
I can save this summary table to an object if I want.
s1_females_summary <- wingscores_s1F %>%
  group_by(E.or.C) %>%
  summarize(mean = mean(wing.score),
            n = n(),
            sd = sd(wing.score),
            se = plotrix::std.error(wing.score),
            CI95 = 1.96*plotrix::std.error(wing.score))
And call it up later.
s1_females_summary
#> # A tibble: 2 x 6
#> E.or.C mean n sd se CI95
#> <fct> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 C 4.38 16 0.619 0.155 0.303
#> 2 E 3.25 8 0.463 0.164 0.321
I can even save this to a .csv file.
write.csv(s1_females_summary, file = "s1_summary.csv")
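By default write.csv() also writes the row names as an extra first column. If that column is not wanted, one option is row.names = FALSE, sketched below with a hypothetical made-up summary table and a temporary file path (both are assumptions for illustration) and then read back in to check the result.

```r
# Hypothetical summary table with made-up values (NOT the shroom results)
toy_summary <- data.frame(
  E.or.C = c("C", "E"),
  mean   = c(4.5, 3.0),
  n      = c(4L, 2L)
)

# Write to a temporary file so nothing in the working directory is overwritten
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_summary, file = csv_path, row.names = FALSE)

# Read the file back to confirm the columns survived the round trip
toy_reloaded <- read.csv(csv_path)
toy_reloaded
```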
You can give group_by() multiple categories. The “wingscores_s1” data has both males and females from both experimental conditions. To make the code cleaner I’ll make a vector called “columns.” that will store the names of the 3 columns I want to work with.
#names of columns to examine
columns. <- c("E.or.C", "sex","wing.score")
# check that everything works
summary(wingscores_s1[,columns.])
#> E.or.C sex wing.score
#> C:33 F:24 Min. :3.000
#> E:27 M:36 1st Qu.:4.000
#> Median :5.000
#> Mean :4.667
#> 3rd Qu.:5.000
#> Max. :7.000
Calculate the mean for both sexes and for both experimental conditions.
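A sketch of grouping by two variables at once is shown below. It uses a hypothetical toy data frame with made-up scores so that it runs without the shroom package; with the tutorial's data, wingscores_s1 would take the place of toy_wings.

```r
library(dplyr)

# Hypothetical toy data with made-up scores (NOT the shroom dataset)
toy_wings <- data.frame(
  E.or.C     = factor(c("C", "C", "C", "C", "E", "E", "E", "E")),
  sex        = factor(c("F", "F", "M", "M", "F", "F", "M", "M")),
  wing.score = c(4, 5, 5, 6, 3, 4, 5, 5)
)

# Listing two columns in group_by() gives one row per combination
# of experimental condition and sex
toy_by_group_sex <- toy_wings %>%
  group_by(E.or.C, sex) %>%
  summarize(mean = mean(wing.score),
            n    = n())

toy_by_group_sex
```

Recent versions of dplyr may print a message about how the result is grouped after summarize(); it is informational and does not change the summary values.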