Chapter 21 Downloading DNA sequences as FASTA files in R

This is a modification of “DNA Sequence Statistics” from Avril Coghlan’s A little book of R for bioinformatics.. Most of the text and code was originally written by Dr. Coghlan and distributed under the Creative Commons 3.0 license.

NOTE: There is some redundancy in this current draft that needs to be eliminated.

21.0.1 Functions

  • library()
  • help()
  • cat()
  • is()
  • class()
  • dim()
  • length()
  • nchar()
  • strtrim()
  • is.vector()
  • table()
  • write()
  • getwd()
  • seqinr::write.fasta()

21.0.2 Software/websites

  • www.ncbi.nlm.nih.gov
  • Text editors (e.g. Notepad++, TextWrangler)

21.0.3 R vocabulary

  • list
  • library
  • package
  • CRAN
  • wrapper
  • underscore _
  • Camel Case

21.0.4 File types

  • FASTA

21.0.5 Bioinformatics vocabulary

  • accession, accession number
  • NCBI
  • NCBI Sequence Database
  • EMBL Sequence Database
  • FASTA file
  • RefSeq

21.1 Learning objectives

By the end of this tutorial you will be able to

  • Download sequences in FASTA format
  • Understand there format and structure
  • Do basic of FASTA data is stored in a vector using is(), class(), length() and other functions
  • Determine the GC content using GC() and obtain other summary data with count()
  • Save FASTA files to your hard drive

21.1.1 Organisms and Sequence accessions

  • Dengue virus: DEN-1, DEN-2, DEN-3, and DEN-4.

The NCBI RefeSeq accessions for the DNA sequences of the DEN-1, DEN-2, DEN-3, and DEN-4 Dengue viruses are NC_001477, NC_001474, NC_001475 and NC_002640, respectively.

According to Wikipedia

“Dengue virus (DENV) is the cause of dengue fever. It is a mosquito-borne, single positive-stranded RNA virus … Five serotypes of the virus have been found, all of which can cause the full spectrum of disease. Nevertheless, scientists’ understanding of dengue virus may be simplistic, as rather than distinct … groups, a continuum appears to exist.” https://en.wikipedia.org/wiki/Dengue_virus

21.1.2 Preliminaries

Note that seqinr the package name is all lower case, though the authors of the package like to spell it “SeqinR”.

library(rentrez)
library(seqinr)
library(compbio4all) 

21.2 DNA Sequence Statistics: Part 1

21.2.1 Using R for Bioinformatics

The chapter will guide you through the process of using R to carry out simple analyses that are common in bioinformatics and computational biology. In particular, the focus is on computational analysis of biological sequence data such as genome sequences and protein sequences. The programming approaches, however, are broadly generalizable to statistics and data science.

The tutorials assume that the reader has some basic knowledge of biology, but not necessarily of bioinformatics. The focus is to explain simple bioinformatics analysis, and to explain how to carry out these analyses using R.

21.2.2 R packages for bioinformatics: Bioconductor and SeqinR

Many authors have written R packages for performing a wide variety of analyses. These do not come with the standard R installation, but must be installed and loaded as “add-ons”.

Bioinformaticians have written numerous specialized packages for R. In this tutorial, you will learn to use some of the function in the SeqinR package to to carry out simple analyses of DNA sequences. (SeqinR can retrieve sequences from a DNA sequence database, but this has largely been replaced by the functions in the package rentrez)

Many well-known bioinformatics packages for R are in the Bioconductor set of R packages (www.bioconductor.org), which contains packages with many R functions for analyzing biological data sets such as microarray data. The SeqinR package is from CRAN, which contains R functions for obtaining sequences from DNA and protein sequence databases, and for analyzing DNA and protein sequences.

For instructions/review on how to install an R package on your own see How to install an R package )

We will also use functions or data from the rentrez and compbio4all packages.

Remember that you can ask for more information about a particular R command by using the help() function or ? function. For example, to ask for more information about the library(), you can type:

help("library")

You can also do this

?library

21.2.3 FASTA file format

The FASTA format is a simple and widely used format for storing biological (e.g. DNA or protein) sequences. It was first used by the FASTA program for sequence alignment in the 1980s and has been adopted as standard by many other programs.

FASTA files begin with a single-line description starting with a greater-than sign > character, followed on the next line by the sequences. Here is an example of a FASTA file. (If you’re looking at the source script for this lesson you’ll see the cat() command, which is just a text display function used format the text when you run the code).

## >A06852 183 residues MPRLFSYLLGVWLLLSQLPREIPGQSTNDFIKACGRELVRLWVEICGSVSWGRTALSLEEPQLETGPPAETMPSSITKDAEILKMMLEFVPNLPQELKATLSERQPSLRELQQSASKDSNLNFEEFKKIILNRQNEAEDKSLLELKNLGLDKHSRKKRLFRMTLSEKCCQVGCIRKDIARLC

21.2.4 The NCBI sequence database

The US National Centre for Biotechnology Information (NCBI) maintains the NCBI Sequence Database, a huge database of all the DNA and protein sequence data that has been collected. There are also similar databases in Europe, the European Molecular Biology Laboratory (EMBL) Sequence Database, and Japan, the DNA Data Bank of Japan (DDBJ). These three databases exchange data every night, so at any one point in time, they contain almost identical data.

Each sequence in the NCBI Sequence Database is stored in a separate record, and is assigned a unique identifier that can be used to refer to that record. The identifier is known as an accession, and consists of a mixture of numbers and letters.

For example, Dengue virus causes Dengue fever, which is classified as a neglected tropical disease by the World Health Organization (WHO), is classified by any one of four types of Dengue virus: DEN-1, DEN-2, DEN-3, and DEN-4. The NCBI accessions for the DNA sequences of the DEN-1, DEN-2, DEN-3, and DEN-4 Dengue viruses are

  • NC_001477
  • NC_001474
  • NC_001475
  • NC_002640

Note that because the NCBI Sequence Database, the EMBL Sequence Database, and DDBJ exchange data every night, the DEN-1 (and DEN-2, DEN-3, DEN-4) Dengue virus sequence are present in all three databases, but they have different accessions in each database, as they each use their own numbering systems for referring to their own sequence records.

21.2.5 Retrieving genome sequence data using rentrez

You can retrieve sequence data from NCBI directly from R using the rentrez package. The DEN-1 Dengue virus genome sequence has NCBI RefSeq accession NC_001477. To retrieve a sequence with a particular NCBI accession, you can use the function entrez_fetch() from the rentrez package. Note that to be specific where the function comes from I write it as package::function().

dengueseq_fasta <- rentrez::entrez_fetch(db = "nucleotide", 
                          id = "NC_001477", 
                          rettype = "fasta")

Note that the “” in the name is just an arbitrary way to separate two words. Another common format would be dengueseq.fasta. Some people like dengueseqFasta, called camel case because the capital letter makes a hump in the middle of the word. Underscores are becoming most common and are favored by developers associated with RStudio and the tidyverse of packages that many data scientists use. I switch between ”.” and ”” as separators, usually favoring “_” for function names and “.” for objects; I personally find camel case harder to read and to type.

Ok, so what exactly have we done when we made dengueseq_fasta? We have an R object dengueseq_fasta which has the sequence linked to the accession number “NC_001477.” So where is the sequence, and what is it?

First, what is it?

is(dengueseq_fasta)
##  [1] "character"               "vector"                 
##  [3] "data.frameRowLabels"     "SuperClassMethod"       
##  [5] "EnumerationValue"        "character_OR_connection"
##  [7] "character_OR_NULL"       "atomic"                 
##  [9] "vector_OR_Vector"        "vector_OR_factor"
class(dengueseq_fasta)
## [1] "character"

How big is it? Try the dim() and length() commands and see which one works. Do you know why one works and the other doesn’t?

dim(dengueseq_fasta)
## NULL
length(dengueseq_fasta)
## [1] 1

The size of the object is 1. Why is this? This is the genomic sequence of a virus, so you’d expect it to be fairly large. We’ll use another function below to explore that issue. Think about this first: how many pieces of unique information are in the dengueseq object? In what sense is there only one piece of information?

If we want to actually see the sequence we can type just type dengueseq_fasta and press enter. This will print the WHOLE genomic sequence out but it will probably run of your screen.

dengueseq_fasta

This is a whole genome sequence, but its stored as single entry in a vector, so the length() command just tells us how many entries there are in the vector, which is just one! What this means is that the entire genomic sequence is stored in a single entry of the vector dengueseq_fasta. (If you’re not following along with this, no worries - its not essential to actually working with the data)

If we want to actually know how long the sequence is, we need to use the function nchar(), which stands for “number of characters”.

nchar(dengueseq_fasta)
## [1] 10935

The sequence is 10935 bases long. All of these bases are stored as a single character string with no spaces in a single entry of our dengueseq_fasta vector. This isn’t actually a useful format for us, so below were’ going to convert it to something more useful.

If we want to see just part of the sequence we can use the strtrim() function. This stands for “String trim”. Before you run the code below, predict what the 100 means.

strtrim(dengueseq_fasta, 100)
## [1] ">NC_001477.1 Dengue virus 1, complete genome\nAGTTGTTAGTCTACGTGGACCGACAAGAACAGTTTCGAATCGGAAGCTTGCTTAAC"

Note that at the end of the name is a slash followed by an n, which indicates to the computer that this is a newline; this is read by text editor, but is ignored by R in this context.

strtrim(dengueseq_fasta, 45)
## [1] ">NC_001477.1 Dengue virus 1, complete genome\nA"

After the \\n begins the sequence, which will continue on for a LOOOOOONG way. Let’s just print a little bit.

strtrim(dengueseq_fasta, 52)
## [1] ">NC_001477.1 Dengue virus 1, complete genome\nAGTTGTTA"

Let’s print some more. Do you notice anything beside A, T, C and G in the sequence?

strtrim(dengueseq_fasta, 200)
## [1] ">NC_001477.1 Dengue virus 1, complete genome\nAGTTGTTAGTCTACGTGGACCGACAAGAACAGTTTCGAATCGGAAGCTTGCTTAACGTAGTTCTAACAGT\nTTTTTATTAGAGAGCAGATCTCTGATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGC\nTGAAACGCGCGAGAAA"

Again, there are the \\n newline characters, which tell text editors and word processors how to display the file. (note that if you are reading the raw code for this chapter there will be 2 slashes in front of the n in the previous sentence; this is an RMarkdown thing)

Now that we a sense of what we’re looking at let’s explore the dengueseq_fasta a bit more.

We can find out more information about what it is using the class()command.

class(dengueseq_fasta)
## [1] "character"

As noted before, this is character data.

Many things in R are vectors so we can ask R is.vector()

is.vector(dengueseq_fasta)
## [1] TRUE

Yup, that’s true.

Ok, let’s see what else. A handy though often verbose command is is(), which tells us what an object, well, what it is:

is(dengueseq_fasta)
##  [1] "character"               "vector"                 
##  [3] "data.frameRowLabels"     "SuperClassMethod"       
##  [5] "EnumerationValue"        "character_OR_connection"
##  [7] "character_OR_NULL"       "atomic"                 
##  [9] "vector_OR_Vector"        "vector_OR_factor"

There is a lot here but if you scan for some key words you will see “character” and “vector” at the top. The other stuff you can ignore. The first two things, though, tell us the dengueseq_fasta is a vector of the class character: that is, a character vector.

Another handy function is str(), which gives us a peak at the context and structure of an R object. This is most useful when you are working in the R console or with dataframes, but is a useful function to run on all R objects. How does this output differ from other ways we’ve displayed dengueseq_fasta?

str(dengueseq_fasta)
##  chr ">NC_001477.1 Dengue virus 1, complete genome\nAGTTGTTAGTCTACGTGGACCGACAAGAACAGTTTCGAATCGGAAGCTTGCTTAACGTAGTTCTA"| __truncated__

We know it contains character data - how many characters? nchar() for “number of characters” answers that:

nchar(dengueseq_fasta)
## [1] 10935

21.3 Saving FASTA files

We can save our data as .fasta file for safe keeping. The write() function will save the data we downloaded as a plain text file.

If you do this, you’ll need to figure out where R is saving things, which requires and understanding R’s file system, which can take some getting used to, especially if you’re new to programming. As a start, you can see where R saves things by using the getwd() command, which tells you where on your hard drive R currently is using as its home base for files.

getwd()
## [1] "/Users/nlb24/OneDrive - University of Pittsburgh/0-books/lbrb-bk/lbrb"

You can set the working directory to where a script file is from using these steps in RStudio:

  1. Click on “Session” (in the middle of the menu on the top of the screen)
  2. Select “Set Working Directory” (in the middle of the drop-down menu)
  3. Select “To source file location” (2nd option)

Then, when you things, it will be in that directory.

write(dengueseq_fasta, 
      file="dengueseq.fasta")

You can see what files are in your directory using list.files()

list.files()

21.4 Next steps

FASTA files in R typically have to be converted before being used. I made a function called compbio4all::fasta_cleaner() which takes care of this.

In the optional lesson “Cleaning and preparing FASTA files for analysis in R” I process the dengueseq_fasta object step by step so that we can use it in analyses. If you are interested in how that functions works check out that chapter; otherwise, you can skip it.

21.5 Exercises

Uncomment the code chunk below to download your own sequence of interest. Change the id = to that of your sequence, and change the db = to “protein” if needed. Change the object name to a descriptive name, such as the name of the gene, e.g. shroom.fasta.

# dengueseq.fasta <- rentrez::entrez_fetch(db = "nucleotide",  # set to "protein" if needed
#                           id = "NC_001477",                  # change accession
#                           rettype = "fasta")

Set your working directory then save the FASTA file to your hard drive using. Be sure to change the name of the object and the file name to be appropriate to your gene.

# write(dengueseq.fasta,        # change object name, e.g.  shroom.fasta
#       file="dengueseq.fasta") # change file name, e.g. shroom.fasta

21.6 Review questions

  1. What does the nchar() function stand for?
  2. Why does a FASTA file stored in a vector by entrez_fetch() have a length of 1 and no dimension?
  3. What does strtrim() mean?
  4. If a sequence is stored in object X and you run the code strtrim(x, 10), how many characters are shown?
  5. What is the newline character in a FASTA file?