16.6 dplyr’s group_by() function [_]
For some more info on group_by() see
- https://www.r-bloggers.com/using-r-quickly-calculating-summary-statistics-with-dplyr/
- https://www3.nd.edu/~steve/computing_with_data/24_dplyr/dplyr.html
- http://www.datacarpentry.org/R-genomics/04-dplyr.html
We can use group_by() to split things up by a categorical variable (sex, color, year). Here, we can say “take my.frogs, split up the data by the sex column, and apply the mean() function to each subset.”
my.frogs %>% #[_] the data
group_by(sex) %>% #the group_by() function applied to the sex column
summarise(mean(mass)) #the mean() function, applied to mass.
This might be a bit abstract when you first do it. Again, we’re starting with our whole dataframe (my.frogs), which is piped with %>% over to the group_by() function, which essentially splits it into a male and a female sub-dataset. These two subsets are then piped to summarise(), which applies the mean() function to the mass column in each subset.
Note that the column heading in the output is mean(mass), which is exactly what we typed inside summarise().
Also note that the output is a “tibble”, which is a common feature of the tidyverse. One thing that tibbles do when they are printed is round numbers to a few significant digits for display; the values stored in the tibble are not changed.
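To make the split-and-apply idea concrete, here is a minimal sketch that runs the same pipeline on a small made-up data frame and then repeats the calculation by hand in base R. The object toy.frogs and its values are hypothetical stand-ins for my.frogs, invented only for this illustration.

```r
library(dplyr)

# Hypothetical stand-in for my.frogs (made-up values, for illustration only)
toy.frogs <- data.frame(
  sex  = c("female", "female", "male", "male"),
  mass = c(10.25, 14.25, 6.5, 8.5)
)

# The dplyr version: split by sex, then take the mean of mass in each subset
by.sex <- toy.frogs %>%
  group_by(sex) %>%
  summarise(mean(mass))

# The same idea done by hand in base R
subsets    <- split(toy.frogs, toy.frogs$sex)             # a list with one data frame per sex
hand.means <- sapply(subsets, function(d) mean(d$mass))   # mean of mass within each subset

names(by.sex)   # column headings of the dplyr output
class(by.sex)   # shows that the output is a tibble
hand.means      # named vector of group means from the base R version
```

Both approaches should give the same group means; the dplyr version just does the splitting and recombining for you and hands back a tibble.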
A handy thing about summarise() is you can pass it labels. The following code adds a sensible label by changing “summarise(mean(mass))” to “summarise(mass.mean = mean(mass))”, where “mass.mean = …” defines the label.
my.frogs %>% #[_]
group_by(sex) %>%
summarise(mass.mean = mean(mass))
You can label things anything you like, e.g. “puppies”.
my.frogs %>% #[_]
group_by(sex) %>%
summarise(puppies = mean(mass))
When I first started using dplyr I found this syntax confusing because people often use the name of the function for the name of the column heading, like this
my.frogs %>% #[_]
group_by(sex) %>%
summarise(mean = mean(mass))
For some reason this trips me up because the word “mean” as a label here is not quoted; to me the word mean should be reserved for the function mean(). Also, in general you should always make both input and output self-labeling. If you just look at the output, it’s not obvious what the “mean” being shown is.
You can pass any summary function to summarise(). We can give it sd() to get the standard deviation (sd) of mass by sex. Note that I define the column name using “mass.sd = …”
my.frogs %>% #[_]
group_by(sex) %>%
summarise(mass.sd = sd(mass))
What makes dplyr’s group_by() and summarise() really powerful is that you can pass summarise() multiple summary functions at the same time. Here, I’ll pass mean() and sd(), naming both.
my.frogs %>% #[_]
group_by(sex) %>%
summarise(mass.mean = mean(mass), #give me the mean!
mass.sd = sd(mass)) #give me the sd!
dplyr also has a handy function n() for getting your sample size.
my.frogs %>% #[_]
group_by(sex) %>%
summarise(mass.mean = mean(mass),
mass.sd = sd(mass),
n = n())
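If you want to run the full summary without my.frogs loaded, the following is a self-contained sketch of the same pipeline. The data frame toy.frogs and its values are made up for illustration; the groups are deliberately different sizes so that the n column is informative.

```r
library(dplyr)

# Made-up data standing in for my.frogs (hypothetical values)
toy.frogs <- data.frame(
  sex  = c("female", "female", "female", "male", "male", "male", "male"),
  mass = c(10, 12, 14, 6, 7, 8, 9)
)

# One row per sex, with the mean, sd and sample size of mass
frog.summary <- toy.frogs %>%
  group_by(sex) %>%
  summarise(mass.mean = mean(mass),
            mass.sd   = sd(mass),
            n         = n())

frog.summary
```

Each named argument to summarise() becomes one column in the output, so adding another summary is just a matter of adding another “label = function(column)” line.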
16.6.1 OPTIONAL: Using novel functions with dplyr [O]
This section is optional
If you have defined the my_sd1() function above you can pass it to summarise() too.
my.frogs %>%
group_by(sex) %>%
summarise(mass.sd = my_sd1(mass))
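The my_sd1() function is defined earlier and isn’t repeated here. As a rough sketch of the general idea, the code below writes a small function from scratch and passes it to summarise() next to the built-in sd() for comparison. The function sd_by_hand() and the data frame toy.frogs are hypothetical, made up only for this example; they are not the my_sd1() or my.frogs used elsewhere.

```r
library(dplyr)

# Hypothetical helper written only for this sketch (not the my_sd1() defined earlier).
# It takes a numeric vector and returns a single number: the sample standard deviation.
sd_by_hand <- function(x) {
  deviations <- x - mean(x)
  sqrt(sum(deviations^2) / (length(x) - 1))
}

# Made-up data standing in for my.frogs (hypothetical values)
toy.frogs <- data.frame(
  sex  = c("female", "female", "female", "male", "male", "male", "male"),
  mass = c(10, 12, 14, 6, 7, 8, 9)
)

# Use the custom function exactly like a built-in summary function
custom.summary <- toy.frogs %>%
  group_by(sex) %>%
  summarise(mass.sd.hand = sd_by_hand(mass),
            mass.sd      = sd(mass))

custom.summary
```

The two columns should match. A function you write yourself will work inside summarise() in this way as long as it takes a vector and returns a single value for each group.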
End optional section