Biological databases store and organize biological data for easy retrieval of information. These centralized resources contain DNA and protein sequences, and their associated information.
Primary databases store and make raw sequence data publicly available. However, primary databases alone may not provide all the necessary information, as they often contain minimal annotation information. This is where secondary databases come into play. Secondary databases provide an added layer of information by curating, processing, and analyzing the raw data from primary databases.
Secondary databases are derived from primary databases and contain information that has been manually curated or computationally processed.
The amount of computational processing behind a secondary database varies greatly, depending on the level of information it provides. Some secondary databases simply archive translated sequence data, while others provide extensive annotation and information on structure and function.
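To make "translated sequence data" concrete, the following is a minimal sketch of how a coding nucleotide sequence can be translated into a protein sequence with the standard genetic code. The function name `translate` and the choice to stop at the first stop codon are assumptions made for illustration; this is not the pipeline any particular database uses, and real resources work from annotated coding regions and organism-specific genetic codes.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon.

    Hypothetical helper for illustration only. Unknown codons (for example
    those containing N) are reported as X.
    """
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAATAA"))  # ATG GCC AAA TAA
```

A database that only archives such translations adds little beyond the sequence itself; the curated resources described in the rest of this article layer annotation on top of it.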
There are different secondary databases available that contain information on biological sequences and their attributes, such as expression, structure, function, and interactions. Some examples of secondary databases are SWISS-PROT, PROSITE, Pfam, PRINTS, and BLOCKS.
- SWISS-PROT is a well-known and widely used secondary database of protein sequences that provides detailed annotation, including information on structure, function, and protein family assignment.
- SWISS-PROT sequence data is derived primarily from the TrEMBL database, which stores translated nucleic acid sequences.
- SWISS-PROT stands out from other protein databases for its detailed annotations, minimal redundancy, and high level of integration with other databases.
- Annotations in SWISS-PROT describe protein function, post-translational modifications, domains and sites, secondary and quaternary structure, similarities to other proteins, diseases associated with deficiencies in the protein, sequence conflicts, and variants.
- PROSITE is a database of protein families, domains, and functional sites that contains manually curated information on amino acid patterns and profiles of proteins.
- PROSITE is a secondary protein database that provides tools for analyzing protein sequences and identifying motifs.
- PROSITE contains a large collection of biologically significant signatures. Each signature is associated with information about a protein family, domain, or functional site.
- PROSITE uses two types of signatures, patterns and generalized profiles, to identify conserved regions.
- PROSITE signatures can be used to predict the function and structure of proteins and to help annotate new protein sequences.
- Pfam is another secondary database of protein families and domains that are represented by multiple sequence alignments, profile hidden Markov models (HMMs), and annotations.
- Pfam is accessible online and is used by researchers worldwide for applications that include genome annotation, protein classification, and protein structure prediction.
- Pfam has two components: Pfam-A stores manually curated, high-quality entries, and Pfam-B stores automatically generated, lower-quality entries.
- Pfam provides a platform for analyzing protein sequence data, allowing researchers to search the database for related proteins based on the presence of specific protein domains.
- The PRINTS database contains protein family fingerprints, which are groups of conserved motifs.
- PRINTS is one of several widely used pattern databases, alongside PROSITE, BLOCKS, and Pfam, each with different strengths and weaknesses.
- PRINTS uses a fingerprinting method that detects distant relatives of large and highly divergent protein superfamilies by exploiting conserved regions within sequence alignments.
- BLOCKS is a collection of ungapped multiple alignments of segments of related protein sequences, called blocks, that represent the most conserved regions of proteins.
- BLOCKS contains blocks for a wide variety of protein families, including enzymes, receptors, transporters, and structural proteins.
- Each block is assigned a unique identifier and annotated with information about the proteins it represents, including their names, functions, and structures.
- BLOCKS is widely used as a tool for protein family classification, protein structure prediction, and functional annotation.
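As an illustration of the PROSITE signature patterns mentioned in the list above, the sketch below converts a pattern written in PROSITE notation into a Python regular expression and scans a sequence with it. The function names `prosite_to_regex` and `find_motifs` and the test sequence are made up for this example. The converter handles only the common syntax elements (`x`, `[...]`, `{...}`, repetition counts, and the `<` and `>` anchors) and should be read as a simplification, not as the scanning tool that PROSITE itself provides. The pattern used, `N-{P}-[ST]-{P}`, is the well-known N-glycosylation site signature.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern to a Python regex (simplified sketch)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":  # x means any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        # Single letters and [ABC] alternatives are already valid regex.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_motifs("N-{P}-[ST]-{P}.", "MKNGSAANPTL"))
```

In the made-up sequence, the asparagine at position 3 matches, while the one at position 8 does not because it is followed by a proline. Short patterns like this one match often by chance, which is one reason PROSITE also offers generalized profiles.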
Applications of Secondary Databases
- Secondary databases can be used to predict the structure and function of proteins by identifying homologous proteins with known structures.
- Secondary databases contain functional annotation that helps clarify the roles of proteins in different organisms.
- Secondary databases help identify conserved regions within a sequence, which can point to important functional domains and motifs.
- Secondary databases support evolutionary analysis by allowing protein sequences to be compared across species.
- Secondary databases can help identify potential drug targets through the analysis of protein families and of conserved motifs that are essential for protein function.
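The idea of locating conserved regions can be sketched with a toy example: given a few pre-aligned, equal-length protein segments, report the ungapped stretches where every sequence carries the same residue. The function names, the `min_length` parameter, and the three sequences are invented for illustration. Requiring strict identity in every column is a deliberate simplification; it is not the method BLOCKS or PRINTS uses to build entries, and real tools score similarity between residues.

```python
def conserved_columns(alignment: list[str]) -> list[bool]:
    """Flag alignment columns in which all sequences share one non-gap residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        len(set(column)) == 1 and column[0] != "-"
        for column in zip(*alignment)
    ]


def conserved_blocks(alignment: list[str], min_length: int = 3):
    """Return (start, end, residues) for runs of conserved columns.

    Positions are 0-based and end is exclusive, like Python slices.
    """
    flags = conserved_columns(alignment) + [False]  # sentinel closes a final run
    blocks = []
    start = None
    for i, conserved in enumerate(flags):
        if conserved and start is None:
            start = i  # a new run begins
        elif not conserved and start is not None:
            if i - start >= min_length:
                blocks.append((start, i, alignment[0][start:i]))
            start = None
    return blocks


if __name__ == "__main__":
    toy_alignment = ["MKTAGHWDE", "LKTAGHYDE", "MKTAGHFDQ"]
    print(conserved_blocks(toy_alignment))
```

With the toy alignment, the only run of at least three identical columns is `KTAGH`; the single conserved `D` near the end is too short to be reported.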