c 01) Introduction to NCBI databases

This vignette provides an introduction to the general search features of the NCBI databases, including how to search by both accession numbers and other search parameters, such as specific papers. For example, if you don’t know the accession number of a sequence, you can locate it via a paper that worked on that sequence.

To do

replicate searches in R using rentrez

The NCBI Sequence Database

All published genome sequences are available over the internet, as it is a requirement of every scientific journal that any published DNA or RNA or protein sequence must be deposited in a public database. The main resources for storing and distributing sequence data are three large databases: the NCBI database (www.ncbi.nlm.nih.gov/), the *European Molecular Biology Laboratory (EMBL) database (https://www.ebi.ac.uk/ena), and the DNA Database of Japan (DDBJ)** database (www.ddbj.nig.ac.jp/). These databases collect all publicly available DNA, RNA and protein sequence data and make it available for free. They exchange data nightly, so contain essentially the same data.

In this vignette we will discuss the NCBI database. Note however that it contains essentially the same data as in the EMBL/DDBJ databases.

Sequences in the NCBI Sequence Database are identified by an accession number. This is a unique number that is only associated with one sequence. For example, the accession number NC_001477 is for the DEN-1 Dengue virus genome sequence. The accession number is what identifies the sequence. It is reported in scientific papers describing that sequence.

As well as the sequence itself, for each sequence the NCBI database also stores some additional annotation data, such as the name of the species it comes from, references to publications describing that sequence, information on the structure of the protein, etc. Some of this annotation data was added by the person who sequenced a sequence and submitted it to the NCBI database, while some may have been added later by a human curator working for NCBI.

The NCBI Sub-Databases

The NCBI database contains several sub-databases, the most important of which are:

Nucleotide database: contains DNA and RNA sequences
NCBI Protein database: contains protein sequences
EST database: contains ESTs (expressed sequence tags), which are short sequences derived from mRNAs
Genome database: contains DNA sequences for whole genomes
PubMed: contains data on scientific publications

Searching for an accession number in the NCBI database

The FASTA file format is a file format commonly used to store sequence information. The first line starts with the greater than character followed by a name and/or description for the sequence. Subsequent lines contain the sequence itself.

mysequence1 ACATGAGACAGACAGACCCCCAGAGACAGACCCCTAGACACAGAGAGAG TATGCAGGACAGGGTTTTTGCCCAGGGTGGCAGTATG

A FASTA file can contain more than one sequence. If a FASTA file contains many sequences, then for each sequence it will have a header line starting with greater than character followed by the sequence itself.

mysequence1 ACATGAGACAGACAGACCCCCAGAGACAGACCCCTAGACACAGAGAGAG TATGCAGGACAGGGTTTTTGCCCAGGGTGGCAGTATG

mysequence2 AGGATTGAGGTATGGGTATGTTCCCGATTGAGTAGCCAGTATGAGCCAG AGTTTTTTACAAGTATTTTTCCCAGTAGCCAGAGAGAGAGTCACCCAGT ACAGAGAGC

NCBI Sequence Format (NCBI Format)

As mentioned above, for each sequence the NCBI database stores some extra information such as the species that it came from, publications describing the sequence, etc. This information is stored in the NCBI entry or NCBI record for the sequence. The NCBI entry for a sequence can be viewed by searching the NCBI database for the accession number for that sequence. The NCBI entries for sequences are stored in a particular format, known as NCBI format.

To view the NCBI entry for the DEN-1 Dengue virus (which has accession NC_001477), follow these steps:

Go to the NCBI website (www.ncbi.nlm.nih.gov).
Search for the accession number.
On the results page, if your sequence corresponds to a nucleotide (DNA or RNA) sequence, you should see a hit in the Nucleotide database, and you should click on the word “Nucleotide” to view the NCBI entry for the hit. Likewise, if your sequence corresponds to a protein sequence, you should see a hit in the Protein database, and you should click on the word “Protein” to view the NCBI entry for the hit.
After you click on “Nucleotide” or “Protein” in the previous step, the NCBI entry for the accession will appear.

The NCBI entry for an accession contains a lot of information about the sequence, such as papers describing it, features in the sequence, etc. The DEFINITION field gives a short description for the sequence. The ORGANISM field in the NCBI entry identifies the species that the sequence came from. The REFERENCE field contains scientific publications describing the sequence. The FEATURES field contains information about the location of features of interest inside the sequence, such as regulatory sequences or genes that lie inside the sequence. The ORIGIN field gives the sequence itself.

RefSeq

When carrying out searches of the NCBI database, it is important to bear in mind that the database may contain redundant sequences for the same gene that were sequenced by different laboratories (because many different labs have sequenced the gene, and submitted their sequences to the NCBI database).

There are also many different types of nucleotide sequences and protein sequences in the NCBI database. With respect to nucleotide sequences, some many be entire genomic DNA sequences, some may be mRNAs, and some may be lower quality sequences such as expressed sequence tags (ESTs, which are derived from parts of mRNAs), or DNA sequences of contigs from genome projects. That is, you can end up with an entry in the protein database based on sequence derived from a genomic sequence, from sequencing just the gene, and from other routes.

Furthermore, some sequences may be manually curated so that the associated entries contain extra information, but the majority of sequences are uncurated.

As mentioned above, the NCBI database often contains redundant information for a gene, contains sequences of varying quality, and contains both uncurated and curated data.

As a result, NCBI has made a special database called RefSeq (reference sequence database), which is a subset of the NCBI database. The data in RefSeq is manually curated, is high quality sequence data, and is non-redundant; this means that each gene (or splice-form / isoform of a gene, in the case of eukaryotes), protein, or genome sequence is only represented once.

The data in RefSeq is curated and is of much higher quality than the rest of the NCBI Sequence Database. However, unfortunately, because of the high level of manual curation required, RefSeq does not cover all species, and is not comprehensive for the species that are covered so far. To speed up searches and simplify the results in to can be very useful to just search RefSeq. However, for detailed and thorough work the full database should probably be searched and the results scrutinized.

You can easily tell that a sequence comes from RefSeq because its accession number starts with particular sequence of letters. That is, accessions of RefSeq sequences corresponding to protein records usually start with NP_, and accessions of RefSeq curated complete genome sequences usually start with NC_ or NS_.

Querying the NCBI Database

You may need to interrogate the NCBI Database to find particular sequences or a set of sequences matching given criteria, such as:

The sequence with accession NC_001477
The sequences published in Nature 460:352-358
All sequences from Chlamydia trachomatis
Sequences submitted by Caroline Cameron, a sphylis researcher
Flagellin or fibrinogen sequences
The glutamine synthetase gene from Mycobacteriuma leprae
Just the upstream control region of the Mycobacterium leprae dnaA gene
The sequence of the Mycobacterium leprae DnaA protein
The genome sequence of syphilis, Treponema pallidum subspp. pallium
All human nucleotide sequences associated with malaria

There are two main ways that you can query the NCBI database to find these sets of sequences. The first possibility is to carry out searches on the NCBI website. The second possiblity is to carry out searches from R using one of several packages that can interface with NCBI. As of October 2019 rentrez seems to be the best package for this..

Below, I will explain how to manually carry out queries on the NCBI database.

Querying the NCBI Database via the NCBI Website

If you are carrying out searches on the NCBI website, to narrow down your searches to specific types of sequences or to specific organisms, you will need to use “search tags”.

For example, the search tags “[PROP]” and “[ORGN]” let you restrict your search to a specific subset of the NCBI Sequence Database, or to sequences from a particular taxon, respectively. Here is a list of useful search tags, which we will explain how to use below:

[AC] NC_001477[AC] With a particular accession number
[ORGN] Fungi[ORGN] From a particular organism or taxon
[PROP] biomol_mRNA[PROP] Of a specific type (eg. mRNA) or from a specific database (eg. RefSeq)
[JOUR] Nature[JOUR] Described in a paper published in a particular journal
[VOL] 531[VOL] Described in a paper published in a particular journal volume
[PAGE] 27[PAGE] Described in a paper with a particular start-page in a journal
[AU] “Smith J”[AU] Described in a paper, or submitted to NCBI, by a particular author

To carry out searches of the NCBI database, you first need to go to the NCBI website, and type your search query into the search box at the top. For example, to search for all sequences from Fungi, you would type “Fungi[ORGN]” into the search box on the NCBI website.

You can combine the search tags above by using “AND”, to make more complex searches. For example, to find all mRNA sequences from Fungi, you could type “Fungi[ORGN] AND biomol_mRNA[PROP]” in the search box on the NCBI website.

Likewise, you can also combine search tags by using “OR”, for example, to search for all mRNA sequences from Fungi or Bacteria, you would type “(Fungi[ORGN] OR Bacteria[ORGN]) AND biomol_mRNA[PROP]” in the search box. Note that you need to put brackets around “Fungi[ORGN] OR Bacteria[ORGN]” to specify that the word “OR” refers to these two search tags.

Here are some examples of searches, some of them made by combining search terms using “AND”:

NC_001477[AC] - With accession number NC_001477
Nature[JOUR] AND 460[VOL] AND 352[PAGE] - Published in Nature 460:352-358
“Chlamydia trachomatis”[ORGN] - From the bacterium Chlamydia trachomatis
“Berriman M”[AU] - Published in a paper, or submitted to NCBI, by M. Berriman
flagellin OR fibrinogen - Which contain the word “flagellin” or “fibrinogen” in their NCBI record
“Mycobacterium leprae”[ORGN] AND dnaA - Which are from M. leprae, and contain “dnaA” in their NCBI record
“Homo sapiens”[ORGN] AND “colon cancer” - Which are from human, and contain “colon cancer” in their NCBI record
“Homo sapiens”[ORGN] AND malaria - Which are from human, and contain “malaria” in their NCBI record
“Homo sapiens”[ORGN] AND biomol_mrna[PROP] - Which are mRNA sequences from human
“Bacteria”[ORGN] AND srcdb_refseq[PROP] - Which are RefSeq sequences from Bacteria
“colon cancer” AND srcdb_refseq[PROP] - From RefSeq, which contain “colon cancer” in their NCBI record

Note that if you are searching for a phrase such as “colon cancer” or “Chlamydia trachomatis”, you need to put the phrase in quotes when typing it into the search box. This is because if you type the phrase in the search box without quotes, the search will be for NCBI records that contain either of the two words “colon” or “cancer” (or either of the two words “Chlamydia” or “trachomatis”), not necessarily both words.

As mentioned above, the NCBI database contains several sub-databases, including the NCBI Nucleotide database and the NCBI Protein database. If you go to the NCBI website, and type one of the search queries above in the search box at the top of the page, the results page will tell you how many matching NCBI records were found in each of the NCBI sub-databases.

For example, if you search for “Chlamydia trachomatis[ORGN]”, you will get matches to proteins from C. trachomatis in the NCBI Protein database, matches to DNA and RNA sequences from C. trachomatis in the NCBI Nucleotide database, matches to whole genome sequences for C. trachomatis strains in the NCBI Genome database, and so on:

Alternatively, if you know in advance that you want to search a particular sub-database, for example, the NCBI Protein database, when you go to the NCBI website, you can select that sub-database from the drop-down list above the search box, so that you will search that sub-database.

Example: finding the sequences published in Nature 460:352-358

For example, if you want to find sequences published in Nature 460:352-358, you can use the “[JOUR]”, “[VOL]” and “[PAGE]” search terms. That is, you would go to the NCBI website and type in the search box on the top: “Nature”[JOUR] AND 460[VOL] AND 352[PAGE], where [JOUR] specifies the journal name, [VOL] the volume of the journal the paper is in, and [PAGE] the page number.

This should bring up a results page with “50890” beside the word “Nucleotide”, and “1” beside the word “Genome”, and “25701” beside the word “Protein”, indicating that there were 50890 hits to sequence records in the Nucleotide database, which contains DNA and RNA sequences, and 1 hit to the Genome database, which contains genome sequences, and 25701 hits to the Protein database, which contains protein sequences.

If you click on the word “Nucleotide”, it will bring up a webpage with a list of links to the NCBI sequence records for those 50890 hits. The 50890 hits are all contigs from the schistosome worm Schistosoma mansoni.

Likewise, if you click on the word “Protein”, it will bring up a webpage with a list of links to the NCBI sequence records for the 25701 hits, and you will see that the hits are all predicted proteins for Schistosoma mansoni.

If you click on the word “Genome”, it will bring you to the NCBI record for the Schistosoma mansoni genome sequence, which has NCBI accession NS_00200. Note that the accession starts with “NS_”, which indicates that it is a RefSeq accession.

Therefore, in Nature volume 460, page 352, the Schistosoma mansoni genome sequence was published, along with all the DNA sequence contigs that were sequenced for the genome project, and all the predicted proteins for the gene predictions made in the genome sequence. You can view the original paper on the Nature website at http://www.nature.com/nature/journal/v460/n7253/abs/nature08160.html.

Note: Schistmosoma mansoni is a parasitic worm that is responsible for causing schistosomiasis, which is classified by the WHO as a neglected tropical disease.

Avril Coglan, adpated by Nathan Brouwerr

2019-10-10