R for Ecological Data Science: A Gentle Introduction

18.2 Background: measures of variation

18.2.1 Variance

The variance is a ubiquitous metric which quantifies the amount of variability in a given set of data. The equation in all its glory is:

\[s^2 = {\frac{\sum\limits_{i=1}^{n} \left(Y_{i} - \bar{Y}\right)^{2}} {n-1}}\]

This can be disorienting if you aren’t used to looking at this notation. We’ll go over it for the sake of being thorough; we mostly need a general conceptual understanding of the variance right now.

\(Y_{i}\) represents each data point, where “i” stands for “index”. \(Y_{1}\) is the first value in the dataset, \(Y_{2}\) is the second value, etc. The Y with the bar over it, \(\bar{Y}\), is the mean; it is called “Y bar” and can be typed out as “Y.bar” in R. n is the sample size. \(\sum\) is the summation operator (the Greek capital letter sigma). \(\left(Y_{i} - \bar{Y}\right)\) is the difference between each value in the data (the Y.i) and the mean value (the Y.bar).

\((Y_{i} - \bar{Y})^{2}\) indicates that each difference between a value and the mean should be squared. These are called the squared differences or squared deviations. \(\sum(Y_{i} - \bar{Y})^{2}\) indicates that all of these values should be summed together; this is called “the sum of squares”. Finally, the sum of squares is divided by n-1.

So, the upshot is: we take each value in the dataset, subtract the mean from it, square each of those differences, add them up, and divide the total by the sample size minus 1.
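The steps just described can be walked through one at a time in R. This is a minimal sketch using a small made-up vector called Y (not the frogarms data); the object names are chosen only for illustration.

```r
# Made-up example data (hypothetical values, not the frogarms data)
Y <- c(2, 4, 4, 5, 7, 8)

Y.bar <- mean(Y)                 # the mean, "Y bar"
n     <- length(Y)               # the sample size

deviations <- Y - Y.bar          # subtract the mean from each value
sq.devs    <- deviations^2       # square each of those differences
sum.of.sq  <- sum(sq.devs)       # add them up: the sum of squares

variance.by.hand <- sum.of.sq / (n - 1)   # divide by n - 1

variance.by.hand
var(Y)                           # the built-in function, for comparison
```

Both of the last two lines should print the same number, since var() carries out the same calculation.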

The variance shows up everywhere in stats, but mostly behind the scenes in calculations. It is reported less often, though it plays an important role in, among other things, genetics and stochastic demography.

The bigger the variance, the more variability. However, because of the squaring that occurs in the numerator, the variance is on a scale all its own: its units are the square of the units of the original data. The variance is therefore never plotted alongside the data; this would be meaningless.

The variance of the frogarm mass data is

var(frogarms$mass)

18.2.2 Standard deviation (SD)

The variance is a bit hard to wrap your mind around; easier and much more frequently reported in practice is the standard deviation. The standard deviation is just the square root of the variance.

\[s = \sqrt{\frac{\sum\limits_{i=1}^{n} \left(Y_{i} - \bar{Y}\right)^{2}} {n-1}}\]

Taking the square root of the variance puts it back into the same scale as the raw data. The standard deviation can therefore be plotted alongside the raw data or the mean.
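One way to see the difference in scale is to change the units of the data and watch how each statistic responds. The sketch below uses made-up masses (not the frogarms data), first in grams and then in milligrams; the object names are hypothetical.

```r
# Made-up masses in grams (hypothetical values, not the frogarms data)
mass.g  <- c(1.2, 1.5, 1.9, 2.4, 3.0)
mass.mg <- mass.g * 1000          # the same masses in milligrams

# The SD changes by the same factor as the data (1000)
sd.ratio <- sd(mass.mg) / sd(mass.g)
sd.ratio

# The variance changes by the square of that factor (1000^2)
var.ratio <- var(mass.mg) / var(mass.g)
var.ratio
```

The SD follows the units of the data, while the variance follows the units squared.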

The standard deviation for the frogarm mass data is

sqrt(var(frogarms$mass))

or equivalently

sd(frogarms$mass)

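Since the standard deviation is the square root of the variance, it is smaller than the variance whenever the variance is greater than 1 (e.g. s < s^2). When the variance is between 0 and 1 the reverse is true, and the standard deviation is the larger of the two.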

The standard deviation and the variance are both measures of variation. They indicate the amount of variation that occurs in the raw data.

18.2.3 Optional: Testing the numerical equivalence of 2 values

The following is optional

We can confirm that 2 values are numerically equivalent using the == operator:

sqrt(var(frogarms$mass)) == sd(frogarms$mass)
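One caution: == tests for exact equality, and two numbers that are mathematically equal can differ in their last decimal places when a computer arrives at them by different routes. Base R's all.equal() compares values with a small tolerance instead. A minimal sketch, using a made-up vector Y rather than the frogarms data:

```r
# Made-up example data (hypothetical values, not the frogarms data)
Y <- c(2, 4, 4, 5, 7, 8)

# Exact comparison of the two routes to the standard deviation
sqrt(var(Y)) == sd(Y)

# Comparison with a small tolerance; isTRUE() guarantees a plain TRUE or FALSE
isTRUE(all.equal(sqrt(var(Y)), sd(Y)))

# A classic case where == is FALSE even though the math says the values are equal
0.1 + 0.2 == 0.3
isTRUE(all.equal(0.1 + 0.2, 0.3))
```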

End optional section

18.2.4 Standard error (SE)

The standard error is often neglected in elementary stats work but it is a central concept. The standard error is the standard deviation divided by the square root of the sample size.

\[SE = \frac{SD}{\sqrt{n}}\]

A very important thing about the standard error: it is not a measure of variation. It does not indicate how much variation there is in the raw data. The SE is an indication of precision. Specifically, it indicates how confident we are about our estimate of the mean from the data. The SE does depend on the amount of variation in the data, but it also depends on the sample size (n).

Base R does not have a function for the SE. We can calculate it like this

sd(frogarms$mass)/sqrt(length(frogarms$mass))

This is a bit dense, so we can break it up. First, assign the numerator and denominator to separate R objects using the assignment operator <-.

# numerator: the standard deviation
frog.sd <- sd(frogarms$mass)
# denominator: the square root of the sample size
frog.sqrt.n <- sqrt(length(frogarms$mass))

Then do the division

frog.sd/frog.sqrt.n

The standard error is frequently used as error bars for plots of means.
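Since base R has no SE function, it can be convenient to wrap the calculation in a small function of your own. The sketch below is one way to do that; std.error is a hypothetical name chosen for this example, and the data are made up (not the frogarms data). It also illustrates that the SE shrinks as the sample size grows, even when the spread of the data stays roughly the same.

```r
# Hypothetical helper function; "std.error" is a name chosen for this sketch
std.error <- function(x) {
  sd(x) / sqrt(length(x))     # SD divided by the square root of n
}

# Made-up example data (not the frogarms data)
Y <- c(2, 4, 4, 5, 7, 8)
std.error(Y)

# Repeat the same values 4 times: similar spread, but 4 times the sample size
Y.big <- rep(Y, times = 4)

sd(Y)                         # spread of the small dataset
sd(Y.big)                     # spread of the big dataset: roughly the same
std.error(Y.big)              # SE of the big dataset: clearly smaller
```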

18.2.5 Confidence Interval (CI)

Confidence intervals are a deep topic in stats. We’ll just touch on them briefly. If you have calculated the standard error (SE) you can approximate the 95% CI for a mean as

\[95\%\ CI = \bar{Y} \pm 1.96 \times SE\]

That is, the mean plus or minus 1.96 (sometimes rounded just to 2.0) multiplied by the SE. The 95% CI is arguably a better choice for an error bar around a mean than the SE (better yet, plot both, which we’ll show how to do down the road).
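As a rough sketch of this approximation in R, the code below computes the lower and upper ends of an approximate 95% CI around a mean. The vector Y is made up (not the frogarms data) and the object names are hypothetical.

```r
# Made-up example data (hypothetical values, not the frogarms data)
Y <- c(2, 4, 4, 5, 7, 8)

Y.bar <- mean(Y)                       # the mean
Y.se  <- sd(Y) / sqrt(length(Y))       # the standard error

ci.half.width <- 1.96 * Y.se           # 1.96 multiplied by the SE

ci.lower <- Y.bar - ci.half.width      # lower end of the approximate 95% CI
ci.upper <- Y.bar + ci.half.width      # upper end of the approximate 95% CI

c(lower = ci.lower, mean = Y.bar, upper = ci.upper)

# The 1.96 comes from the normal distribution (its 97.5th percentile)
qnorm(0.975)
```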