c01-NCBI_by_GUI2_AC0x.Rmd
This vignette provides an introduction to the general search features of the NCBI databases, including how to search by both accession numbers and other search parameters, such as specific papers. For example, if you don’t know the accession number of a sequence, you can locate it via a paper that worked on that sequence.
All published genome sequences are available over the internet, as it is a requirement of every scientific journal that any published DNA or RNA or protein sequence must be deposited in a public database. The main resources for storing and distributing sequence data are three large databases: the NCBI database (www.ncbi.nlm.nih.gov/), the *European Molecular Biology Laboratory (EMBL) database (https://www.ebi.ac.uk/ena), and the DNA Database of Japan (DDBJ)** database (www.ddbj.nig.ac.jp/). These databases collect all publicly available DNA, RNA and protein sequence data and make it available for free. They exchange data nightly, so contain essentially the same data.
In this vignette we will discuss the NCBI database. Note however that it contains essentially the same data as in the EMBL/DDBJ databases.
Sequences in the NCBI Sequence Database are identified by an accession number. This is a unique number that is only associated with one sequence. For example, the accession number NC_001477 is for the DEN-1 Dengue virus genome sequence. The accession number is what identifies the sequence. It is reported in scientific papers describing that sequence.
As well as the sequence itself, for each sequence the NCBI database also stores some additional annotation data, such as the name of the species it comes from, references to publications describing that sequence, information on the structure of the protein, etc. Some of this annotation data was added by the person who sequenced a sequence and submitted it to the NCBI database, while some may have been added later by a human curator working for NCBI.
The NCBI database contains several sub-databases, the most important of which are:
The FASTA file format is a file format commonly used to store sequence information. The first line starts with the greater than character followed by a name and/or description for the sequence. Subsequent lines contain the sequence itself.
mysequence1 ACATGAGACAGACAGACCCCCAGAGACAGACCCCTAGACACAGAGAGAG TATGCAGGACAGGGTTTTTGCCCAGGGTGGCAGTATG
A FASTA file can contain more than one sequence. If a FASTA file contains many sequences, then for each sequence it will have a header line starting with greater than character followed by the sequence itself.
mysequence1 ACATGAGACAGACAGACCCCCAGAGACAGACCCCTAGACACAGAGAGAG TATGCAGGACAGGGTTTTTGCCCAGGGTGGCAGTATG
mysequence2 AGGATTGAGGTATGGGTATGTTCCCGATTGAGTAGCCAGTATGAGCCAG AGTTTTTTACAAGTATTTTTCCCAGTAGCCAGAGAGAGAGTCACCCAGT ACAGAGAGC
As mentioned above, for each sequence the NCBI database stores some extra information such as the species that it came from, publications describing the sequence, etc. This information is stored in the NCBI entry or NCBI record for the sequence. The NCBI entry for a sequence can be viewed by searching the NCBI database for the accession number for that sequence. The NCBI entries for sequences are stored in a particular format, known as NCBI format.
To view the NCBI entry for the DEN-1 Dengue virus (which has accession NC_001477), follow these steps:
The NCBI entry for an accession contains a lot of information about the sequence, such as papers describing it, features in the sequence, etc. The DEFINITION field gives a short description for the sequence. The ORGANISM field in the NCBI entry identifies the species that the sequence came from. The REFERENCE field contains scientific publications describing the sequence. The FEATURES field contains information about the location of features of interest inside the sequence, such as regulatory sequences or genes that lie inside the sequence. The ORIGIN field gives the sequence itself.
When carrying out searches of the NCBI database, it is important to bear in mind that the database may contain redundant sequences for the same gene that were sequenced by different laboratories (because many different labs have sequenced the gene, and submitted their sequences to the NCBI database).
There are also many different types of nucleotide sequences and protein sequences in the NCBI database. With respect to nucleotide sequences, some many be entire genomic DNA sequences, some may be mRNAs, and some may be lower quality sequences such as expressed sequence tags (ESTs, which are derived from parts of mRNAs), or DNA sequences of contigs from genome projects. That is, you can end up with an entry in the protein database based on sequence derived from a genomic sequence, from sequencing just the gene, and from other routes.
Furthermore, some sequences may be manually curated so that the associated entries contain extra information, but the majority of sequences are uncurated.
As mentioned above, the NCBI database often contains redundant information for a gene, contains sequences of varying quality, and contains both uncurated and curated data.
As a result, NCBI has made a special database called RefSeq (reference sequence database), which is a subset of the NCBI database. The data in RefSeq is manually curated, is high quality sequence data, and is non-redundant; this means that each gene (or splice-form / isoform of a gene, in the case of eukaryotes), protein, or genome sequence is only represented once.
The data in RefSeq is curated and is of much higher quality than the rest of the NCBI Sequence Database. However, unfortunately, because of the high level of manual curation required, RefSeq does not cover all species, and is not comprehensive for the species that are covered so far. To speed up searches and simplify the results in to can be very useful to just search RefSeq. However, for detailed and thorough work the full database should probably be searched and the results scrutinized.
You can easily tell that a sequence comes from RefSeq because its accession number starts with particular sequence of letters. That is, accessions of RefSeq sequences corresponding to protein records usually start with NP_, and accessions of RefSeq curated complete genome sequences usually start with NC_ or NS_.
You may need to interrogate the NCBI Database to find particular sequences or a set of sequences matching given criteria, such as:
There are two main ways that you can query the NCBI database to find these sets of sequences. The first possibility is to carry out searches on the NCBI website. The second possiblity is to carry out searches from R using one of several packages that can interface with NCBI. As of October 2019 rentrez seems to be the best package for this..
Below, I will explain how to manually carry out queries on the NCBI database.
If you are carrying out searches on the NCBI website, to narrow down your searches to specific types of sequences or to specific organisms, you will need to use “search tags”.
For example, the search tags “[PROP]” and “[ORGN]” let you restrict your search to a specific subset of the NCBI Sequence Database, or to sequences from a particular taxon, respectively. Here is a list of useful search tags, which we will explain how to use below:
To carry out searches of the NCBI database, you first need to go to the NCBI website, and type your search query into the search box at the top. For example, to search for all sequences from Fungi, you would type “Fungi[ORGN]” into the search box on the NCBI website.
You can combine the search tags above by using “AND”, to make more complex searches. For example, to find all mRNA sequences from Fungi, you could type “Fungi[ORGN] AND biomol_mRNA[PROP]” in the search box on the NCBI website.
Likewise, you can also combine search tags by using “OR”, for example, to search for all mRNA sequences from Fungi or Bacteria, you would type “(Fungi[ORGN] OR Bacteria[ORGN]) AND biomol_mRNA[PROP]” in the search box. Note that you need to put brackets around “Fungi[ORGN] OR Bacteria[ORGN]” to specify that the word “OR” refers to these two search tags.
Here are some examples of searches, some of them made by combining search terms using “AND”:
Note that if you are searching for a phrase such as “colon cancer” or “Chlamydia trachomatis”, you need to put the phrase in quotes when typing it into the search box. This is because if you type the phrase in the search box without quotes, the search will be for NCBI records that contain either of the two words “colon” or “cancer” (or either of the two words “Chlamydia” or “trachomatis”), not necessarily both words.
As mentioned above, the NCBI database contains several sub-databases, including the NCBI Nucleotide database and the NCBI Protein database. If you go to the NCBI website, and type one of the search queries above in the search box at the top of the page, the results page will tell you how many matching NCBI records were found in each of the NCBI sub-databases.
For example, if you search for “Chlamydia trachomatis[ORGN]”, you will get matches to proteins from C. trachomatis in the NCBI Protein database, matches to DNA and RNA sequences from C. trachomatis in the NCBI Nucleotide database, matches to whole genome sequences for C. trachomatis strains in the NCBI Genome database, and so on:
Alternatively, if you know in advance that you want to search a particular sub-database, for example, the NCBI Protein database, when you go to the NCBI website, you can select that sub-database from the drop-down list above the search box, so that you will search that sub-database.
For example, if you want to find sequences published in Nature 460:352-358, you can use the “[JOUR]”, “[VOL]” and “[PAGE]” search terms. That is, you would go to the NCBI website and type in the search box on the top: “Nature”[JOUR] AND 460[VOL] AND 352[PAGE], where [JOUR] specifies the journal name, [VOL] the volume of the journal the paper is in, and [PAGE] the page number.
This should bring up a results page with “50890” beside the word “Nucleotide”, and “1” beside the word “Genome”, and “25701” beside the word “Protein”, indicating that there were 50890 hits to sequence records in the Nucleotide database, which contains DNA and RNA sequences, and 1 hit to the Genome database, which contains genome sequences, and 25701 hits to the Protein database, which contains protein sequences.
If you click on the word “Nucleotide”, it will bring up a webpage with a list of links to the NCBI sequence records for those 50890 hits. The 50890 hits are all contigs from the schistosome worm Schistosoma mansoni.
Likewise, if you click on the word “Protein”, it will bring up a webpage with a list of links to the NCBI sequence records for the 25701 hits, and you will see that the hits are all predicted proteins for Schistosoma mansoni.
If you click on the word “Genome”, it will bring you to the NCBI record for the Schistosoma mansoni genome sequence, which has NCBI accession NS_00200. Note that the accession starts with “NS_”, which indicates that it is a RefSeq accession.
Therefore, in Nature volume 460, page 352, the Schistosoma mansoni genome sequence was published, along with all the DNA sequence contigs that were sequenced for the genome project, and all the predicted proteins for the gene predictions made in the genome sequence. You can view the original paper on the Nature website at http://www.nature.com/nature/journal/v460/n7253/abs/nature08160.html.
Note: Schistmosoma mansoni is a parasitic worm that is responsible for causing schistosomiasis, which is classified by the WHO as a neglected tropical disease.