Nucleotide Databases: Definition, Types, Examples, Uses

Biological databases store and organize biological data for easy retrieval of information. These centralized resources contain DNA and protein sequences and their associated information. 

Nucleotide databases are a type of biological database containing genetic information, which includes DNA and RNA sequences that come from a variety of sources, including whole genomes, transcriptomes, and individual genes.

Figure: Nucleotide Databases. Image Sources: Respective database websites.

Some of the most widely used nucleotide databases are described below.

International Nucleotide Sequence Database Collaboration

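The International Nucleotide Sequence Database Collaboration (INSDC) is a partnership of three databases that collect and share nucleotide sequence data: GenBank (maintained by NCBI), the EMBL nucleotide database (maintained by EMBL-EBI), and DDBJ. The members exchange data with one another, so a sequence submitted to one becomes available through all three.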

GenBank

  • GenBank is a sequence database that contains a collection of annotated nucleic acid sequence data.
  • It includes various types of genetic material, such as genomic DNA, messenger RNA (mRNA), complementary DNA (cDNA), expressed sequence tags (ESTs), high-throughput raw sequence data, and sequence polymorphisms.
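
GenBank distributes its annotated records as flat files in which keyword lines such as LOCUS, DEFINITION, and ACCESSION describe the entry, ORIGIN opens the sequence, and `//` closes the record. The following is a minimal sketch of reading those fields, assuming a single record with one-line fields. The record text and the function name `parse_genbank_record` are made up for illustration and are not a real GenBank entry or an official API.

```python
import re

# Hypothetical, made-up record laid out like a GenBank flat file (not a real entry).
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up example sequence for illustration.
ACCESSION   EXAMPLE01
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank_record(text):
    """Extract a few header fields and the sequence from one flat-file record."""
    record = {"locus": None, "definition": None, "accession": None, "sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_sequence:
            # Sequence lines mix position numbers, spaces, and bases; keep letters only.
            record["sequence"] += re.sub(r"[^a-zA-Z]", "", line).upper()
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_sequence = True
    return record


record = parse_genbank_record(SAMPLE_RECORD)
print(record["accession"], len(record["sequence"]))
```

Real records also carry multi-line fields and a FEATURES table, which this sketch ignores; an established parser is the better choice for production work.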

European Molecular Biology Laboratory (EMBL)

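  • The EMBL nucleotide sequence database, maintained by the European Molecular Biology Laboratory (EMBL) through its European Bioinformatics Institute (EMBL-EBI) and now known as the European Nucleotide Archive (ENA), is the European member of the INSDC.
  • It stores and distributes nucleotide sequences; EMBL-EBI hosts protein sequence resources alongside it.
  • EMBL also develops tools that help researchers analyze and interpret these data.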

DNA Data Bank of Japan (DDBJ)

  • The DNA Data Bank of Japan (DDBJ) is another nucleotide database that exchanges data with GenBank and EMBL as a member of INSDC.
  • DDBJ collects nucleotide sequence data and maintains bioinformatics tools for data submission and retrieval. It also develops tools for biological data analysis and organizes bioinformatics training courses in Japanese.

Genome Sequence Archive (GSA)

  • The Genome Sequence Archive (GSA) is a database that stores raw sequence data and follows INSDC data standards and structures.
  • GSA was developed to complement the existing INSDC member databases and has now become an important tool for archiving and managing the increasing amount of genomic data.
  • GSA accepts raw sequence reads from various sequencing platforms and stores both the sequence reads and metadata submitted by researchers worldwide. 
  • GSA provides free and unrestricted access to all publicly available data to scientific communities globally. 
  • GSA uses four primary data objects, namely BioProject, BioSample, Experiment, and Run, to organize the submitted data.
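
The four data objects can be pictured as a small hierarchy. The sketch below assumes the arrangement common to INSDC-style archives: a BioProject groups samples and experiments, each Experiment points to one BioSample, and each Run holds the data files from one experiment. All class fields, identifiers, and file names are hypothetical and are not taken from the GSA schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    run_id: str
    files: List[str]  # raw read files produced by one sequencing run


@dataclass
class BioSample:
    sample_id: str
    organism: str


@dataclass
class Experiment:
    experiment_id: str
    sample_id: str  # the BioSample that was sequenced
    platform: str   # sequencing platform used
    runs: List[Run] = field(default_factory=list)


@dataclass
class BioProject:
    project_id: str
    title: str
    samples: List[BioSample] = field(default_factory=list)
    experiments: List[Experiment] = field(default_factory=list)

    def all_files(self) -> List[str]:
        """Collect every data file under this project, across experiments and runs."""
        return [f for exp in self.experiments for run in exp.runs for f in run.files]


# Placeholder identifiers, chosen only for illustration.
project = BioProject("PROJECT-1", "Example sequencing project")
project.samples.append(BioSample("SAMPLE-1", "Oryza sativa"))
experiment = Experiment("EXP-1", sample_id="SAMPLE-1", platform="example platform")
experiment.runs.append(Run("RUN-1", ["reads_1.fastq.gz", "reads_2.fastq.gz"]))
project.experiments.append(experiment)

print(project.all_files())
```

Keeping metadata (project, sample, experiment) separate from the read files in each run lets one sample description be reused across several experiments.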

Single Nucleotide Polymorphism database (dbSNP)

  • The Single Nucleotide Polymorphism database (dbSNP) is a public database of variations in nucleotide sequences.
  • It is maintained by the National Center for Biotechnology Information (NCBI) and accepts submissions from both public and private organizations.
  • It stores a collection of genetic polymorphisms, including single nucleotide substitutions, deletions or insertions, and microsatellite repeat variations. 
  • Each entry in dbSNP includes the sequence context of the polymorphism, the occurrence frequency, and the experimental method used to detect the variation.
  • dbSNP is open to submissions for variations from any species and genome location. 
  • The database supports a wide range of research areas, including physical mapping, functional analysis, pharmacogenomics, association studies, and evolutionary studies.
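
The kind of information held in a variation entry (sequence context, frequency, detection method) and the distinction between substitutions, insertions, and deletions can be sketched with a small record type. This is an illustrative model only: the class `VariationEntry`, its field names, and the example values are hypothetical and do not reproduce the dbSNP schema. An empty string stands for an absent allele.

```python
from dataclasses import dataclass


@dataclass
class VariationEntry:
    flank_5prime: str   # sequence immediately upstream of the variation
    ref_allele: str     # reference allele ("" if absent)
    alt_allele: str     # alternative allele ("" if absent)
    flank_3prime: str   # sequence immediately downstream
    frequency: float    # observed frequency of the alternative allele
    method: str         # experimental method used to detect the variation

    def variation_class(self) -> str:
        """Classify the variation by comparing allele lengths."""
        ref, alt = self.ref_allele, self.alt_allele
        if len(ref) == 1 and len(alt) == 1 and ref != alt:
            return "single nucleotide substitution"
        if len(alt) < len(ref):
            return "deletion"
        if len(alt) > len(ref):
            return "insertion"
        return "other"

    def context(self) -> str:
        """Show the alleles in brackets between their flanking sequences."""
        ref = self.ref_allele or "-"
        alt = self.alt_allele or "-"
        return f"{self.flank_5prime}[{ref}/{alt}]{self.flank_3prime}"


snp = VariationEntry("ACGT", "A", "G", "TTCA", 0.12, "sequencing")
deletion = VariationEntry("ACGT", "AT", "", "TTCA", 0.03, "sequencing")

print(snp.variation_class(), snp.context())
print(deletion.variation_class(), deletion.context())
```

Microsatellite repeat variations would need their own rule (for example, comparing repeat counts) and fall under "other" or the length-based classes in this simplified version.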

Nucleic Acid Database (NDB)

  • The Nucleic Acid Database (NDB) is a collection of three-dimensional nucleic acid structures and their complexes obtained and curated from the Protein Data Bank (PDB). 
  • The database acts as a centralized platform for storing and accessing structural information and annotations related to nucleic acids.
  • NDB includes annotations specific to the structure and function of nucleic acids. It also provides tools that allow users to search the database, download data and structures, analyze nucleic acids, and learn more about them.
  • The database includes RNA and DNA oligonucleotides with two or more bases. It also includes protein-DNA and protein-RNA structures. 

Applications of nucleotide databases

  • Nucleotide databases are used to identify the gene or the function of a particular nucleotide sequence by comparing an unknown sequence with the known sequences in the database.
  • Nucleotide databases can be used to study and examine gene expression by using the sequence information stored in the databases.
  • Nucleotide databases are also used to identify potential drug targets and develop new therapies for genetic diseases.
  • Nucleotide databases also help in identifying genetic variations that may be linked to diseases, which ultimately helps in the development of diagnostic tools and treatments.
  • Nucleotide databases can be used in phylogenetic analysis to infer the evolutionary relationships between organisms by comparing their DNA or RNA sequences.
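
Two of the uses above, identifying an unknown sequence and comparing sequences for phylogenetic analysis, both come down to measuring how similar sequences are. The sketch below uses two deliberately simple measures: shared k-mers (Jaccard similarity) to pick the closest reference, and the proportion of differing sites (p-distance) between two aligned sequences. Real database searches rely on alignment tools such as BLAST rather than this shortcut, and the reference sequences and gene names here are made up.

```python
def kmers(seq, k=4):
    """Return the set of all overlapping substrings of length k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=4):
    """Fraction of k-mers shared by two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def best_match(query, references, k=4):
    """Name of the reference sequence most similar to the query."""
    return max(references, key=lambda name: jaccard(query, references[name], k))


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    differences = sum(1 for x, y in zip(a.upper(), b.upper()) if x != y)
    return differences / len(a)


# Hypothetical reference "database" with made-up sequences.
REFERENCES = {
    "gene_A": "ATGGCCATTGTAATGGGCCGCTGA",
    "gene_B": "ATGAAACGCATTAGCACCACCATT",
}
query = "ATGGCCATTGTTATGGGCCGCTGA"  # differs from gene_A at one site

print(best_match(query, REFERENCES))
print(p_distance(query, REFERENCES["gene_A"]))
```

Pairwise distances like these are the usual input to distance-based tree-building methods, while k-mer comparison avoids the need for an alignment at the cost of precision.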

About Author

Sanju Tamang

Sanju Tamang completed her Bachelor's (B.Tech) in Biotechnology from Kantipur Valley College, Lalitpur, Nepal. She is interested in genetics, microbiome, and their roles in human health. She is keen to learn more about biological technologies that improve human health and quality of life.