Chapter 14 Downloading NCBI sequence data by hand

By: Avril Coghlan.

Adapted, edited and expanded: Nathan Brouwer under the Creative Commons 3.0 Attribution License (CC BY 3.0).

14.1 Preface

The following chapter was originally written by Avril Coghlan. It provides brief, basic information on how to access sequences via the internet. Subsequent chapters provide more details and R code.

14.2 Retrieving genome sequence data via the NCBI website

You can easily retrieve DNA or protein sequence data by hand from the NCBI Sequence Database via its website www.ncbi.nlm.nih.gov.

The genome of the DEN-1 Dengue virus is stored at NCBI as a DNA sequence, and its NCBI accession number is NC_001477. To retrieve the DNA sequence for the Dengue DEN-1 virus from NCBI, go to the NCBI website, type “NC_001477” in the Search box at the top of the webpage, and press the “Search” button beside the Search box.

(While this is the normal workflow, accessions related to well-known organisms can sometimes turn up using a direct Google search. This is the case for NC_001477 - if you Google it you can go directly to the website with the genome sequence.)

On the results page of a normal NCBI search you will see the number of hits to “NC_001477” in each of the NCBI databases on the NCBI website. There are many databases on the NCBI website. For example, PubMed and PubMed Central contain abstracts from scientific papers, the Genes and Genomes database contains DNA and RNA sequence data, the Proteins database contains protein sequence data, and so on.

Most biologists would do this type of work by hand from within their web browser, but it can also be done by writing small programs in scripting languages such as Python or R. In R, the rentrez package is a powerful tool for interacting with NCBI resources. In this tutorial we’ll focus on the web interface. It’s good to remember, though, that almost anything done via the webpage can be automated using a computer script.
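As a rough illustration of that last point, the search-and-download steps described in this chapter can be approximated with a short Python script that talks to NCBI's E-utilities web service. This is only a sketch and not the approach taken in later chapters, which use R. The helper names `build_efetch_url` and `fetch_fasta` are invented for this example, and the download step needs an internet connection.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities "efetch" service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore"):
    """Build the URL that asks NCBI for one record in FASTA format.

    build_efetch_url is a hypothetical helper name used for illustration.
    """
    params = {
        "db": db,            # "nuccore" is the Nucleotide database
        "id": accession,     # for example the accession NC_001477
        "rettype": "fasta",  # ask for FASTA rather than the full record
        "retmode": "text",   # plain text rather than XML
    }
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download the FASTA text for an accession (requires internet access)."""
    with urlopen(build_efetch_url(accession)) as response:
        return response.read().decode("utf-8")


# Building the URL does not contact NCBI; only fetch_fasta() does.
url = build_efetch_url("NC_001477")
print(url)
```

Pasting the printed URL into a web browser should show the same FASTA text that the website's download option produces. Calling `fetch_fasta("NC_001477")` would return that text as a Python string.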

A challenge when learning to use NCBI resources is that there is a tremendous amount of sequence information available and you need to learn how to sort through what the search results provide. As you are looking for the DNA sequence of the Dengue DEN-1 virus genome, you expect to see a hit in the NCBI Nucleotide database. This is indicated at the top of the page where it says “NUCLEOTIDE SEQUENCE” and lists “Dengue virus 1, complete genome.”

When you click on the link for the Nucleotide database, it will bring you to the record for NC_001477 in the NCBI Nucleotide database. This will contain the name and NCBI accession of the sequence, as well as other details such as any papers describing the sequence. If you scroll down you’ll also see the sequence itself.

If you need it, you can retrieve the DEN-1 Dengue virus genome sequence as a FASTA format file in a couple of ways. The easiest is just to copy and paste it into a text file, .R file, or other file. You can also click on “Send to” at the top right of the NC_001477 sequence record webpage (just to the left of the sidebar; it’s fairly small).

After you click on “Send to” you can pick from several options. Choose “File” in the menu that appears, then choose FASTA from the “Format” menu, and click on “Create file”. The sequence will then download. The default file name is sequence.fasta, so you’ll probably want to change it.
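Renaming the downloaded file can also be scripted. Below is a minimal Python sketch, assuming the browser saved the file under the default name sequence.fasta in a folder you know. The function name `rename_download` and the example name `den1.fasta` are made up for illustration.

```python
from pathlib import Path


def rename_download(folder, new_name, default_name="sequence.fasta"):
    """Rename NCBI's default download to something more descriptive.

    rename_download is a hypothetical helper. It refuses to overwrite
    an existing file that already has the new name.
    """
    source = Path(folder) / default_name
    target = Path(folder) / new_name
    if not source.exists():
        raise FileNotFoundError(f"No file named {default_name} in {folder}")
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target
```

For example, `rename_download(".", "den1.fasta")` would rename sequence.fasta in the current working folder. Giving each download a descriptive name avoids confusion when several files all arrive as sequence.fasta.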

You can now open the FASTA file containing the DEN-1 Dengue virus genome sequence using a text editor like Notepad, WordPad, Notepad++, or even RStudio on your computer. To find a text editor on Windows, search for “text” from the Start menu and one will usually come up. (Opening the file in a word processor like Word isn’t recommended.)
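Because a FASTA file is plain text (a header line starting with ">" followed by lines of sequence), it can also be read by a script. The Python sketch below splits a single-record FASTA text into its header and sequence. The function name `parse_fasta` is invented for this example, and the record used is a made-up toy, not the real NC_001477 sequence.

```python
def parse_fasta(text):
    """Split the text of a single-record FASTA file into (header, sequence)."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text should start with a '>' header line")
    header = lines[0][1:]            # header line without the leading ">"
    sequence = "".join(lines[1:])    # join the wrapped sequence lines
    return header, sequence


# Toy record for illustration only; this is NOT the real NC_001477 sequence.
toy_text = ">toy_record made-up example\nACGTAC\nGTTA\n"
header, sequence = parse_fasta(toy_text)
print(header)         # toy_record made-up example
print(len(sequence))  # 10
```

To try this on a downloaded file, read the file's contents into a string first (for example with `open("den1.fasta").read()`, where den1.fasta is a hypothetical file name) and pass that string to `parse_fasta`.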