vignettes/data_cleaning_dplyr_filter.Rmd
data_cleaning_dplyr_filter.Rmd
This vignette walks through the process of isolating a subset of a dataframe using the dplyr command filter(). In particular we’ll use filter() to isolate rows of data that correspond to partiular species of birds from a very large dataset. To do this, we’ll have to learn a bit about row the row indices that identify which row data occur in within a dataframe.
The codes used by the American Ornithological Union (AOU) and BBS data for Pennsylvania are in the wildlifeR package. We’ll use the dplyr package to “reshape” into the proper format for analysis.
library(wildlifeR)
library(dplyr)
library(ggplot2)
library(ggpubr)
We’ll use 2 datasets:
You can learn more about these data sets using the help command ?BBS_PA and ?AOU_species_codes
Load both sets of data
## AOU species codes
data(AOU_species_codes)
## BBS data for PA
data(BBS_PA)
Look at the BBS_PA data.
head(BBS_PA)
## countrynum statenum Route RPID Year Aou StopTotal SpeciesTotal
## 1 840 72 1 101 1970 1940 2 2
## 2 840 72 1 101 1970 2730 1 1
## 3 840 72 1 101 1970 3091 2 2
## 4 840 72 1 101 1970 3131 1 1
## 5 840 72 1 101 1970 3160 21 30
## 6 840 72 1 101 1970 3250 1 2
This contains information on every bird species on every BBS route in Pennsylvania since the 1960s.
Using the dim() command we can see that this is a VERY big file
dim(BBS_PA)
## [1] 244185 8
Almost 1/4 million rows! This would be a very very big spreadsheet. When data files get this big we often do things to save space both in terms of file size and also on the screen when viewing it. Note that even though this datafile is about birds, there are no bird names (nor names of countries or states). Everything is coded using numbers.
To identify birds in the BBS dataset the USGS uses codes designated by the Aou (American Ornithological Union). These are typically 4 number codes unique to each species.
The dataset we loaded above with the command data(AOU_species_codes) loaded dataset with all of the Aou numeric codes and the corresponding common and latin nmaes of the speices.
Using dim() we can see that there are >1000 different species identified in this dataset
dim(AOU_species_codes)
## [1] 1175 5
Look at the data themselves
head(AOU_species_codes)
## spp.num alpha.code name spp order
## 1 10 WEGR Western Grebe Aechmophorus occidentalis 96
## 2 11 CLGR Clark's Grebe Aechmophorus clarkii 98
## 3 12 WCGR Western/Clark's Grebe hybrid 97
## 4 20 RNGR Red-necked Grebe Podiceps grisegena 94
## 5 30 HOGR Horned Grebe Podiceps auritus 93
## 6 40 EAGR Eared Grebe Podiceps nigricollis 95
This dataframe tells us the following thigns
We want to use the alphabetic code and/or the common name to indentify the numeric species code. We can then use this numeric code to isoalted data from the BBS dataset.
If we know the 4-letter alphabetic code we can use the which() command to locate the numeric code.
For example, the code for the European starling is “EUST”. The follwoing code determine which ROW NUMBER the European startling occur in within the AOU_species_codes dataframe (and saves it to an object “EUST.row.number”)
EUST.row.number <- which(AOU_species_codes$alpha.code == "EUST")
If you google “AOU species codes” you can usually locate a PDF or website with the codes.
What we have retrived with which() is a number that is the ROW that has “EUST” in it is
EUST.row.number
## [1] 643
The number happens to be ‘r EUST.row.number’
We can look at what is in row ‘r EUST.row.number’ like this, but giving R the dataframe and using square brackets [ ]. Note the comma after “EUST.row.number”.
AOU_species_codes[EUST.row.number, ]
## spp.num alpha.code name spp order
## 717 4930 EUST European Starling Sturnus vulgaris 864
Or like this. Again , note the comma after “643”.
AOU_species_codes[643, ]
## spp.num alpha.code name spp order
## 717 4930 EUST European Starling Sturnus vulgaris 864
This tells us that the in row 643 there is data on the European starling, Sturnus vulgaris. In particular, it has a species number of 4930. This number is what we need to extract European starling data from the BBS_PA dataframe.
EUST.numeric.code <- 4940
which() is the classic way of doing this in R. You can do the exact same thing with the dplyr function filter(). Note the use of the “pipe” “%>%” which connects the dataset AOU_species_codes with the filter command.
AOU_species_codes %>% filter(alpha.code == "EUST")
## spp.num alpha.code name spp order
## 1 4930 EUST European Starling Sturnus vulgaris 864
More on dplyr and filter() below.
The above approach requires that you know the alphabetic species code, which we fed to the which(). R has a flexible search tool called grep() that allows you to search for text, parts of words, and parts of sentences. grep() allows us to search using words like this within the “$name” column.
We can get the same result as above like this
grep("European starling",
AOU_species_codes$name,
ignore.case = TRUE)
## [1] 643
If we use just the word starling, though, we can more than one number. Each of these is a row number where the word “starling” occurs.
grep("starling",
AOU_species_codes$name,
ignore.case = TRUE)
## [1] 643 1129 1130 1138
We can save thse numbers to an R object then see what they pull up in the AOU names dataframe
Store the row numbers
i.starling <- grep("starling",
AOU_species_codes$name,
ignore.case = TRUE)
See what they correspond to
AOU_species_codes[i.starling, ]
## spp.num alpha.code name spp order
## 717 4930 EUST European Starling Sturnus vulgaris 864
## 1235 8640 SAST Samoan Starling Aplonis atrifusca 863
## 1236 8650 POST Polynesian Starling Aplonis tabuensis 862
## 1244 8720 MIST Micronesian Starling Aplonis opaca 861
There are 4 species of starling in the AOU_species_codes dataframe.
We could also use “European” with grep. There are 4 birds with this word in thei common name
#store the row numbers
i.European <- grep("European",
AOU_species_codes$name,
ignore.case = TRUE)
#see what they match
AOU_species_codes[i.European, ]
## spp.num alpha.code name spp
## 195 1380 EGWT European Green-winged Teal Anas crecca crecca
## 373 2710 EUGP European Golden-Plover Pluvialis apricaria
## 717 4930 EUST European Starling Sturnus vulgaris
## 781 5261 EUGO European Goldfinch Carduelis carduelis
## order
## 195 57
## 373 289
## 717 864
## 781 1154
grep() is very flexible and can take parts of works. Let’s try just “Euor”
#store the row numbers
i.Euro <- grep("Euro",
AOU_species_codes$name,
ignore.case = TRUE)
#see what they match
AOU_species_codes[i.Euro, ]
## spp.num alpha.code name spp
## 195 1380 EGWT European Green-winged Teal Anas crecca crecca
## 373 2710 EUGP European Golden-Plover Pluvialis apricaria
## 717 4930 EUST European Starling Sturnus vulgaris
## 781 5261 EUGO European Goldfinch Carduelis carduelis
## order
## 195 57
## 373 289
## 717 864
## 781 1154
It would work also wiht “pean”, “star”, or any other set of letters.
Above we used the which() and dplyr::filter() functiosn to figure out the Aou code for our focal bird. Now we’ll use that infromation to create a more manageable subset of BBS data.
First, load dplyr if you haven’t. THen use the big BBS_PA dataset in conjuction with a pipe and filter() to subset just the European starling data. Note that use the “<-” operation to send all the output to a new dataframe
#load dplyr
#library(dplyr)
#filter just the Euro starling
BBS_PA_EUST <- BBS_PA %>% filter(Aou == 4940)
Here I gave filter the Aou code 4940. I could also have given it the ojbect I made above that stored the number
#object with the code
EUST.numeric.code
## [1] 4940
Use the object instead of the number 4940
BBS_PA_EUST <- BBS_PA %>% filter(Aou == EUST.numeric.code)
The BBS_PA_EUST contains just rows from BBS_PA that contain the numeric code 4940 in the Aou column.
Let’s look at what we’ve done. Call dim() on the original BBS_PA data and the subset BBS_PA_EUST and see how the See how the dataframe has change.
dim(BBS_PA)
## [1] 244185 8
dim(BBS_PA_EUST)
## [1] 1615 8
We’ve gone froma 1/4 million row dataframe BBS_PA to a 1600 row dataframe BBS_PA_EUST.
summary(BBS_PA_EUST)
## countrynum statenum Route RPID Year
## Min. :840 Min. :72 Min. : 1.00 Min. :101 Min. :1966
## 1st Qu.:840 1st Qu.:72 1st Qu.: 16.00 1st Qu.:101 1st Qu.:1984
## Median :840 Median :72 Median : 35.00 Median :101 Median :1997
## Mean :840 Mean :72 Mean : 82.77 Mean :101 Mean :1995
## 3rd Qu.:840 3rd Qu.:72 3rd Qu.: 62.00 3rd Qu.:101 3rd Qu.:2006
## Max. :840 Max. :72 Max. :907.00 Max. :101 Max. :2016
## Aou StopTotal SpeciesTotal
## Min. :4940 Min. : 1.00 Min. : 1.00
## 1st Qu.:4940 1st Qu.: 1.00 1st Qu.: 2.00
## Median :4940 Median : 3.00 Median : 6.00
## Mean :4940 Mean : 4.21 Mean :10.75
## 3rd Qu.:4940 3rd Qu.: 6.00 3rd Qu.:14.00
## Max. :4940 Max. :20.00 Max. :78.00
This vignette shows how to take a really big dataframe and isolate just the subset of data that is required. In the wildlifeR two analayzes with these data are outlined: analysis of change over time, and analysis of abundance versus landcover.
A good introduction to using dplyr is:
Beckerman et al’s book “Getting start with R: An introduction for biologists” 2nd ed.
“Selecting columns and renaming are so easy with dplyr” https://blog.exploratory.io/selecting-columns-809bdd1ef615
A good introduction to using dplyr is:
Beckerman et al’s book “Getting start with R: An introduction for biologists” 2nd ed.
“Selecting columns and renaming are so easy with dplyr” https://blog.exploratory.io/selecting-columns-809bdd1ef615