Data Cleaning: Filtering focal rows with dplyr

Introduction

This vignette walks through the process of isolating a subset of a dataframe using the dplyr command filter(). In particular we’ll use filter() to isolate rows of data that correspond to partiular species of birds from a very large dataset. To do this, we’ll have to learn a bit about row the row indices that identify which row data occur in within a dataframe.

Learning objectives

Determine row indices using which()
Practice using the package dplyr
- use dplyr filter

Packages

dplyr
wildlifeR

R code

Load Libraries

The codes used by the American Ornithological Union (AOU) and BBS data for Pennsylvania are in the wildlifeR package. We’ll use the dplyr package to “reshape” into the proper format for analysis.

library(wildlifeR)
library(dplyr)
library(ggplot2)
library(ggpubr)

Data

We’ll use 2 datasets:

AOU codes for each species of bird in North American
Bird counts from the BBS data for PA, and

You can learn more about these data sets using the help command ?BBS_PA and ?AOU_species_codes

Load both sets of data

## AOU species codes
data(AOU_species_codes)

## BBS data for PA
data(BBS_PA)

BBS bird count data

Look at the BBS_PA data.

head(BBS_PA)

##   countrynum statenum Route RPID Year  Aou StopTotal SpeciesTotal
## 1        840       72     1  101 1970 1940         2            2
## 2        840       72     1  101 1970 2730         1            1
## 3        840       72     1  101 1970 3091         2            2
## 4        840       72     1  101 1970 3131         1            1
## 5        840       72     1  101 1970 3160        21           30
## 6        840       72     1  101 1970 3250         1            2

This contains information on every bird species on every BBS route in Pennsylvania since the 1960s.

Using the dim() command we can see that this is a VERY big file

dim(BBS_PA)

## [1] 244185      8

Almost 1/4 million rows! This would be a very very big spreadsheet. When data files get this big we often do things to save space both in terms of file size and also on the screen when viewing it. Note that even though this datafile is about birds, there are no bird names (nor names of countries or states). Everything is coded using numbers.

To identify birds in the BBS dataset the USGS uses codes designated by the Aou (American Ornithological Union). These are typically 4 number codes unique to each species.

AOU Species code dataframe

The dataset we loaded above with the command data(AOU_species_codes) loaded dataset with all of the Aou numeric codes and the corresponding common and latin nmaes of the speices.

Using dim() we can see that there are >1000 different species identified in this dataset

dim(AOU_species_codes)

## [1] 1175    5

Look at the data themselves

head(AOU_species_codes)

##   spp.num alpha.code                  name                       spp order
## 1      10       WEGR         Western Grebe Aechmophorus occidentalis    96
## 2      11       CLGR         Clark's Grebe      Aechmophorus clarkii    98
## 3      12       WCGR Western/Clark's Grebe                    hybrid    97
## 4      20       RNGR      Red-necked Grebe        Podiceps grisegena    94
## 5      30       HOGR          Horned Grebe          Podiceps auritus    93
## 6      40       EAGR           Eared Grebe      Podiceps nigricollis    95

This dataframe tells us the following thigns

spp.num: a numeric code for each species
alapha.code: a 4-letter alphabetic code for each species
name: the common name for each species
spp: scientific name
order: taxonomic order

We want to use the alphabetic code and/or the common name to indentify the numeric species code. We can then use this numeric code to isoalted data from the BBS dataset.

Determine the code for the European starling

If we know the 4-letter alphabetic code we can use the which() command to locate the numeric code.

For example, the code for the European starling is “EUST”. The follwoing code determine which ROW NUMBER the European startling occur in within the AOU_species_codes dataframe (and saves it to an object “EUST.row.number”)

EUST.row.number <- which(AOU_species_codes$alpha.code == "EUST")

If you google “AOU species codes” you can usually locate a PDF or website with the codes.

What we have retrived with which() is a number that is the ROW that has “EUST” in it is

EUST.row.number

## [1] 643

The number happens to be ‘r EUST.row.number’

We can look at what is in row ‘r EUST.row.number’ like this, but giving R the dataframe and using square brackets [ ]. Note the comma after “EUST.row.number”.

AOU_species_codes[EUST.row.number, ]

##     spp.num alpha.code              name              spp order
## 717    4930       EUST European Starling Sturnus vulgaris   864

Or like this. Again , note the comma after “643”.

AOU_species_codes[643, ]

##     spp.num alpha.code              name              spp order
## 717    4930       EUST European Starling Sturnus vulgaris   864

This tells us that the in row 643 there is data on the European starling, Sturnus vulgaris. In particular, it has a species number of 4930. This number is what we need to extract European starling data from the BBS_PA dataframe.

EUST.numeric.code <- 4940

dplyr and filter()

which() is the classic way of doing this in R. You can do the exact same thing with the dplyr function filter(). Note the use of the “pipe” “%>%” which connects the dataset AOU_species_codes with the filter command.

AOU_species_codes %>% filter(alpha.code == "EUST")

##   spp.num alpha.code              name              spp order
## 1    4930       EUST European Starling Sturnus vulgaris   864

OPTIONAL/ADVANCED: Advanced searching using grep

The above approach requires that you know the alphabetic species code, which we fed to the which(). R has a flexible search tool called grep() that allows you to search for text, parts of words, and parts of sentences. grep() allows us to search using words like this within the “$name” column.

We can get the same result as above like this

grep("European starling", 
     AOU_species_codes$name,
     ignore.case = TRUE)

## [1] 643

If we use just the word starling, though, we can more than one number. Each of these is a row number where the word “starling” occurs.

grep("starling", 
     AOU_species_codes$name,
     ignore.case = TRUE)

## [1]  643 1129 1130 1138

We can save thse numbers to an R object then see what they pull up in the AOU names dataframe

Store the row numbers

i.starling <- grep("starling", 
     AOU_species_codes$name,
     ignore.case = TRUE)

See what they correspond to

AOU_species_codes[i.starling, ]

##      spp.num alpha.code                 name               spp order
## 717     4930       EUST    European Starling  Sturnus vulgaris   864
## 1235    8640       SAST      Samoan Starling Aplonis atrifusca   863
## 1236    8650       POST  Polynesian Starling Aplonis tabuensis   862
## 1244    8720       MIST Micronesian Starling     Aplonis opaca   861

There are 4 species of starling in the AOU_species_codes dataframe.

We could also use “European” with grep. There are 4 birds with this word in thei common name

#store the row numbers
i.European <- grep("European", 
     AOU_species_codes$name,
     ignore.case = TRUE)

#see what they match
AOU_species_codes[i.European, ]

##     spp.num alpha.code                       name                 spp
## 195    1380       EGWT European Green-winged Teal  Anas crecca crecca
## 373    2710       EUGP     European Golden-Plover Pluvialis apricaria
## 717    4930       EUST          European Starling    Sturnus vulgaris
## 781    5261       EUGO         European Goldfinch Carduelis carduelis
##     order
## 195    57
## 373   289
## 717   864
## 781  1154

grep() is very flexible and can take parts of works. Let’s try just “Euor”

#store the row numbers
i.Euro <- grep("Euro", 
     AOU_species_codes$name,
     ignore.case = TRUE)

#see what they match
AOU_species_codes[i.Euro, ]

##     spp.num alpha.code                       name                 spp
## 195    1380       EGWT European Green-winged Teal  Anas crecca crecca
## 373    2710       EUGP     European Golden-Plover Pluvialis apricaria
## 717    4930       EUST          European Starling    Sturnus vulgaris
## 781    5261       EUGO         European Goldfinch Carduelis carduelis
##     order
## 195    57
## 373   289
## 717   864
## 781  1154

It would work also wiht “pean”, “star”, or any other set of letters.

Isolate rows from a large dataframe using dplyr’s filter function

Above we used the which() and dplyr::filter() functiosn to figure out the Aou code for our focal bird. Now we’ll use that infromation to create a more manageable subset of BBS data.

First, load dplyr if you haven’t. THen use the big BBS_PA dataset in conjuction with a pipe and filter() to subset just the European starling data. Note that use the “<-” operation to send all the output to a new dataframe

#load dplyr
 #library(dplyr)

#filter just the Euro starling
BBS_PA_EUST <- BBS_PA %>% filter(Aou == 4940)

Here I gave filter the Aou code 4940. I could also have given it the ojbect I made above that stored the number

#object with the code
EUST.numeric.code

## [1] 4940

Use the object instead of the number 4940

BBS_PA_EUST <- BBS_PA %>% filter(Aou == EUST.numeric.code)

The BBS_PA_EUST contains just rows from BBS_PA that contain the numeric code 4940 in the Aou column.

Let’s look at what we’ve done. Call dim() on the original BBS_PA data and the subset BBS_PA_EUST and see how the See how the dataframe has change.

dim(BBS_PA)

## [1] 244185      8

dim(BBS_PA_EUST)

## [1] 1615    8

We’ve gone froma 1/4 million row dataframe BBS_PA to a 1600 row dataframe BBS_PA_EUST.

summary(BBS_PA_EUST)

##    countrynum     statenum      Route             RPID          Year     
##  Min.   :840   Min.   :72   Min.   :  1.00   Min.   :101   Min.   :1966  
##  1st Qu.:840   1st Qu.:72   1st Qu.: 16.00   1st Qu.:101   1st Qu.:1984  
##  Median :840   Median :72   Median : 35.00   Median :101   Median :1997  
##  Mean   :840   Mean   :72   Mean   : 82.77   Mean   :101   Mean   :1995  
##  3rd Qu.:840   3rd Qu.:72   3rd Qu.: 62.00   3rd Qu.:101   3rd Qu.:2006  
##  Max.   :840   Max.   :72   Max.   :907.00   Max.   :101   Max.   :2016  
##       Aou         StopTotal      SpeciesTotal  
##  Min.   :4940   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:4940   1st Qu.: 1.00   1st Qu.: 2.00  
##  Median :4940   Median : 3.00   Median : 6.00  
##  Mean   :4940   Mean   : 4.21   Mean   :10.75  
##  3rd Qu.:4940   3rd Qu.: 6.00   3rd Qu.:14.00  
##  Max.   :4940   Max.   :20.00   Max.   :78.00

What next?

This vignette shows how to take a really big dataframe and isolate just the subset of data that is required. In the wildlifeR two analayzes with these data are outlined: analysis of change over time, and analysis of abundance versus landcover.

If you want to examine change in a spcies over time, see the vignette … which uses the wildlifeR function sample_BBS_routes() to create a simple dataframe for looking at change over time.
If you want to examine change in abundance versus a landcover variable see “Data Cleaning: Merging 2 large data sets with dplyr”, which is an extended vignette which reviews what we’ve done here and extends it to access and merge-in landcover data.

References

Books

A good introduction to using dplyr is:

Beckerman et al’s book “Getting start with R: An introduction for biologists” 2nd ed.

Websites

“Selecting columns and renaming are so easy with dplyr” https://blog.exploratory.io/selecting-columns-809bdd1ef615

References

Books

A good introduction to using dplyr is:

Beckerman et al’s book “Getting start with R: An introduction for biologists” 2nd ed.

Websites

“Selecting columns and renaming are so easy with dplyr” https://blog.exploratory.io/selecting-columns-809bdd1ef615

Reducing the size of a dataframe to isolate focal rows

Nathan Brouwer brouwern@gmail.com @lobrowR

2018-06-20

Introduction

Learning objectives

Packages

R code

Load Libraries

Data

BBS bird count data

AOU Species code dataframe

Determine the code for the European starling

dplyr and filter()

OPTIONAL/ADVANCED: Advanced searching using grep

Isolate rows from a large dataframe using dplyr’s filter function

What next?

References

Books

Websites

References

Books

Websites

Contents