17.2 Data exploration plots

We’ll make some plots to look at the overall structure and distribution of the data. After first looking at a Cleveland dot plot, we’ll focus on boxplots but also consider some others, including violin plots.

17.2.1 Cleveland dot plot

A useful tool for looking at your data is a Cleveland dot plot. This is just the variable you are interested in (e.g., mass) plotted against its rank in the data set or some other organization scheme. This allows you to quickly spot any values that are really weird, such as typos.

ggdotchart(data = my.frogs,
           x = "i.row",
           y = "mass",
           rotate = TRUE)
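
The same idea can be sketched with base R alone. The block below is only an illustration, not the code used to build my.frogs: the data frame toy.frogs, its masses, and the planted typo are all made up, and it assumes that i.row is simply the row number of each observation.

```r
# Hypothetical data: ten frog masses (g), with one planted typo
# (24.5 entered as 245)
toy.frogs <- data.frame(mass = c(22.1, 25.3, 19.8, 24.0, 21.7,
                                 23.4, 245, 20.9, 26.2, 22.8))

# Assumption: i.row is just the row number of each observation
toy.frogs$i.row <- seq_len(nrow(toy.frogs))

# Base R Cleveland dot plot: one horizontal line per row, value on the x-axis
dotchart(toy.frogs$mass,
         labels = toy.frogs$i.row,
         xlab   = "Mass (g)",
         ylab   = "Row")

# The point sitting far from the rest identifies the row to check
suspect.row <- toy.frogs$i.row[which.max(toy.frogs$mass)]
suspect.row
```

A value that sits far away from all the others is not necessarily wrong, but it is the first thing to check against the original data sheet.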

For more on Cleveland dot plots and data exploration, see:

Zuur et al. 2010. A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution. https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/j.2041-210X.2009.00001.x

17.2.2 Boxplots

Basic boxplots are easy to make with ggpubr’s ggboxplot() function. Note that “mass” and “sex” are in quotes.

ggboxplot(data = my.frogs, # the data frame
          y = "mass",      # y-axis: a continuous variable; in quotes!
          x = "sex")       # x-axis: a group; in quotes!

17.2.3 Notched boxplot

Notched boxplots aren’t commonly used; the notches work somewhat like confidence intervals around the medians, so if the notches of two boxes don’t overlap, the medians are likely different. We’ll use the original frogarms data frame first for this.

ggboxplot(data = frogarms, #full dataset
          y = "mass",
          x = "sex",
          notch  = TRUE) 

Now try your own subset of the data. The notch calculations are likely to behave poorly with small sample sizes, and R will likely give you several warnings in red.

ggboxplot(data = my.frogs, #my subset
          y = "mass",
          x = "sex",
          notch  = TRUE)
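
To see why small samples cause trouble, it can help to compute the notches by hand. Base R’s boxplot.stats() places them at the median ± 1.58 × IQR / sqrt(n), and ggplot2 documents the same formula for its notches. The sketch below uses made-up, evenly spread values rather than the frog data so that the result is reproducible; the helper notch.width() is a hypothetical name introduced only for this illustration.

```r
# Hypothetical helper: width of the notch that boxplot.stats() computes
notch.width <- function(x) {
  diff(boxplot.stats(x)$conf)   # $conf holds the lower and upper notch
}

# Made-up samples with the same shape but different sizes
small.sample <- qnorm(ppoints(10))    # n = 10
large.sample <- qnorm(ppoints(400))   # n = 400

# Reproduce the notch by hand for the small sample
hinges  <- fivenum(small.sample)[c(2, 4)]   # lower and upper hinge
by.hand <- median(small.sample) +
  c(-1.58, 1.58) * diff(hinges) / sqrt(length(small.sample))

# Should agree with the values from boxplot.stats()
all.equal(by.hand, boxplot.stats(small.sample)$conf)

# The notch narrows as n grows
notch.width(small.sample)
notch.width(large.sample)
```

With about 10 values per group the notch is nearly as wide as the box itself, and with skewed data or fewer values it can extend past the ends of the box, which is when the plotting code starts to complain.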

17.2.4 Filled boxplots

It’s good practice to accent plot elements that relate to different groups. ggplot and ggpubr make this really easy. We can add colored fill to the boxplots using fill = “…”; note that it is “fill”, not “color”. (Color changes the color of the lines.)

ggboxplot(data = my.frogs,
          y = "mass",      # quotes!
          x = "sex",
          notch  = TRUE,
          fill = "sex")

We can turn off the notching by adding a “#” character before it. This is called commenting out that line of code.

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          #notch  = TRUE,
          fill = "sex")

17.2.5 Boxplots with raw data

It’s best to plot your raw data whenever possible. This works best with small datasets. We’ll do this by appending add = “point”.

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          #notch  = TRUE,
          fill = "sex",
          add = "point")   # add = "point", with quotes!

For more best practices in plotting data see

17.2.6 Boxplots with jittered raw data

Jittering is helpful when you have large datasets and want to avoid overlap in the points, though ggpubr::ggboxplot() doesn’t allow much control over the jittering. We’ll append add = “jitter”.

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          fill = "sex",
          add = "jitter") # add = "jitter"


17.2.7 OPTIONAL: Jittering with ggplot2 [O]

The following section is optional.

ggpubr helps simplify ggplot2 code, but in doing so adds some constraints. You can combine ggpubr commands with regular ggplot2 code, though. We’ll use the code from above and also add “+ geom_jitter()”.

This code should produce a plot similar to the one above. Note that there is a “+” after fill = “sex”) and that geom_jitter() goes on the next line.

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          fill = "sex") +  # need the plus!
  geom_jitter()            # the jitter command 

We can make the jittering less extreme by adding “width = 0.1” within geom_jitter().

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          fill = "sex") +  #need the plus!
  geom_jitter(width = 0.1) #reduce the magnitude of the jitter
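
One detail worth knowing: by default geom_jitter() moves points vertically as well as horizontally, so the plotted masses are no longer exactly the measured ones. The following is a minimal sketch, using made-up data (toy.frogs is hypothetical, not the frog dataset), that restricts the jitter to the horizontal direction with height = 0.

```r
library(ggplot2)  # provides geom_jitter()
library(ggpubr)   # provides ggboxplot()

# Hypothetical data: 10 females and 10 males with made-up masses (g)
set.seed(42)
toy.frogs <- data.frame(
  sex  = rep(c("female", "male"), each = 10),
  mass = c(rnorm(10, mean = 25, sd = 3),
           rnorm(10, mean = 21, sd = 3))
)

gg.jitter <- ggboxplot(data = toy.frogs,
                       y = "mass",
                       x = "sex",
                       fill = "sex") +
  geom_jitter(width  = 0.1,  # small sideways spread
              height = 0)    # no vertical noise: y values stay as measured

gg.jitter
```

Since the x-axis is categorical, sideways noise does not change what the plot says, whereas vertical noise would alter the apparent mass of each frog.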

End optional section


17.2.8 Label ggpubr axes

A graph isn’t done until it has labels. This can get annoying in base R graphics (e.g., plot()) and ggplot2, but is easy in ggpubr.

17.2.8.1 Adding axis labels

Adding the arguments “xlab = …” and “ylab = …” adds axis labels. Always add units (e.g., “g” for grams) when applicable.

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          fill = "sex",
          xlab = "Sex",      #x axis (horizontal); quotes!
          ylab = "Mass (g)") #y axis (vertical)

17.2.8.2 Plot title

The argument “main = …” adds a main title at the top of the graph. This is not usually done for publications but is useful for keeping track of things and for presentations.

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          fill = "sex",
          add = "jitter",
          xlab = "Sex",
          ylab = "Mass (g)",
          main = "Mass of Australian frogs by sex") #Main title

17.2.8.3 Refining ggpubr plots

Move the legend to the bottom using legend = “bottom”.

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          fill = "sex",
          xlab = "Sex",
          ylab = "Mass (g)",
          main = "Mass of frogs by sex", # main title
          legend = "bottom")             # location of legend

Change the color palette; note that the colors are within c(…).

ggboxplot(data = my.frogs,
          y = "mass",
          x = "sex",
          fill = "sex",
          xlab = "Sex",
          ylab = "Mass (g)",
          main = "Mass of frogs by sex",
          legend = "bottom",
          palette = c("green","blue"))  # change palette

17.2.9 Plotting multiple plots with cowplot::plot_grid

You can save just about anything as an R object, including a plot. I will use the assignment operator (<-) to assign the output of ggboxplot() to an object called “gg.my.frogs”. Note that here I am using my.frogs.

gg.my.frogs <- ggboxplot(data = my.frogs,
                         y = "mass",
                         x = "sex")

Note that the code runs but nothing happens…

We can see what we just made using is().

is(gg.my.frogs)

This indicates 1) gg.my.frogs is there and 2) it is a “gg” type R object.

I can get the graph if I call just the object gg.my.frogs (e.g., type “gg.my.frogs” into the console and press enter, or highlight just the word “gg.my.frogs” in a script and execute the command).

gg.my.frogs
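
The plot appears here because R auto-prints an object when its name is typed at the top level. Inside a loop or a function nothing is auto-printed, so print() has to be called explicitly. Here is a small sketch with made-up data (toy.frogs and gg.toy are hypothetical names, not objects from this chapter):

```r
library(ggpubr)

# Hypothetical data: 5 females and 5 males with made-up masses (g)
toy.frogs <- data.frame(
  sex  = rep(c("female", "male"), each = 5),
  mass = c(24.1, 26.3, 25.0, 27.2, 23.8,
           20.5, 21.9, 19.7, 22.4, 21.0)
)

# Assignment stores the plot; nothing is drawn yet
gg.toy <- ggboxplot(data = toy.frogs,
                    y = "mass",
                    x = "sex")

class(gg.toy)   # shows the classes of the stored object

# Inside a loop, print() is needed to draw the plot
for (i in 1) {
  print(gg.toy)
}
```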

Now make an object using the full frogarms data.

gg.frogarms <- ggboxplot(data = frogarms, #use original data
                         y = "mass",
                         x = "sex")

Now plot both using the plot_grid() function from the handy cowplot package.

plot_grid(gg.my.frogs,
          gg.frogarms)

Add labels. Note that alignment is off sometimes.

plot_grid(gg.my.frogs, 
          gg.frogarms,
          labels = c("a) My frogs","b) All the frogs"))
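
If long labels collide with the panels, one option is to let plot_grid() generate short letter labels and move the descriptive text into each plot’s title. The sketch below uses made-up data and hypothetical object names; labels = "auto", label_size, and ncol are plot_grid() arguments, and title is the ggboxplot() argument for a plot title.

```r
library(ggpubr)
library(cowplot)

# Hypothetical "full" dataset and a 20-row subset of it
set.seed(1)
toy.all <- data.frame(
  sex  = rep(c("female", "male"), each = 30),
  mass = c(rnorm(30, mean = 25, sd = 3),
           rnorm(30, mean = 21, sd = 3))
)
toy.subset <- toy.all[c(1:10, 31:40), ]   # 10 females, 10 males

gg.subset <- ggboxplot(data = toy.subset,
                       y = "mass",
                       x = "sex",
                       title = "My frogs")

gg.all <- ggboxplot(data = toy.all,
                    y = "mass",
                    x = "sex",
                    title = "All the frogs")

# labels = "auto" adds lowercase letters (a, b, ...) to the panels
gg.both <- plot_grid(gg.subset,
                     gg.all,
                     ncol = 2,
                     labels = "auto",
                     label_size = 12)
gg.both
```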

17.2.10 Violin plots

An interesting alternative to the boxplot that works well with large datasets is the violin plot.

ggviolin(data = frogarms, #use original data
         y = "mass",
         x = "sex")


17.2.11 OPTIONAL: Histograms [O]

The following section is optional.

Histograms are excellent for data exploration. They generally work best with medium to large datasets.

A basic histogram can be made using gghistogram(). Note that there is “x = …” but no “y = …”; the y-axis is computed by the graphing function.

gghistogram(data = my.frogs, 
            x = "mass")

A key concept for ggplot is faceting. Faceting occurs when two or more panels of plots are made from a single dataset, with the panels split by a categorical variable. We can add the argument facet.by = “sex” to make 2 panels, one for female and one for male. Note that because there are only 10 frogs in each group, the graphs aren’t very useful.

gghistogram(data = my.frogs, 
            x = "mass",
            facet.by = "sex")
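
Conceptually, faceting splits the data by the grouping variable and draws the same kind of plot for each piece. The following base R sketch illustrates that idea with made-up data (toy.frogs is hypothetical); it is not how ggpubr implements faceting internally.

```r
# Hypothetical data: 10 females and 10 males with made-up masses (g)
toy.frogs <- data.frame(
  sex  = rep(c("female", "male"), each = 10),
  mass = c(24.1, 26.3, 25.0, 27.2, 23.8, 25.5, 28.1, 24.9, 26.0, 25.2,
           20.5, 21.9, 19.7, 22.4, 21.0, 20.1, 23.3, 21.5, 20.8, 22.0)
)

# One vector of masses per level of sex
mass.by.sex <- split(toy.frogs$mass, toy.frogs$sex)
lengths(mass.by.sex)   # number of frogs that go into each panel

# Draw one histogram per group, side by side, with shared x-axis limits
old.par <- par(mfrow = c(1, 2))
for (group in names(mass.by.sex)) {
  hist(mass.by.sex[[group]],
       main = group,
       xlab = "Mass (g)",
       xlim = range(toy.frogs$mass))
}
par(old.par)   # restore the previous plotting layout
```

Using the same axis limits in every panel is what makes the panels comparable at a glance.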

Just as we did for boxplots, we can change the fill, add a title, etc.