1 Bairoch, A. (2000). Serendipity in bioinformatics: the tribulations of a Swiss bioinformatician through exciting times! Bioinformatics. 16: 48–64. A personal narrative conveying the early history of the development of sequence databases and related software tools, events that set the groundwork for the modern bioinformatics landscape.
2 Green, E.D., Rubin, E.M., and Olson, M.V. (2017). The future of DNA sequencing. Nature. 550: 179–181. An insightful perspective regarding the next several decades of the application of DNA sequencing methodologies in novel contexts and the implications of those applications to issues of data storage and data sharing.
3 Rigden, D.J. and Fernández, X.M. (2018). The 2018 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res. 46: D1–D7. The 25th overview of the annual database issue published by Nucleic Acids Research, capturing the wide variety of bioinformatic databases publicly available to the community. This overview is updated yearly, and the individual papers describing these database resources are freely available through the Nucleic Acids Research web site.
1 Apweiler, R. (2001). Functional information in Swiss-Prot: the basis for large-scale characterization of protein sequences. Briefings Bioinf. 2: 9–18.
2 Bairoch, A. (2000). Serendipity in bioinformatics: the tribulations of a Swiss bioinformatician through exciting times! Bioinformatics. 16: 48–64.
3 Baxevanis, A.D. and Bateman, A. (2015). The importance of biological databases in biological discovery. Curr. Protoc. Bioinf. 50: 1.1.1–1.1.8.
4 Benson, D.A., Cavanaugh, M., Clark, K. et al. (2018). GenBank. Nucleic Acids Res. 46: D41–D47.
5 Cook, C.E., Bergman, M.T., Cochrane, G. et al. (2018). The European Bioinformatics Institute in 2017: data coordination and integration. Nucleic Acids Res. 46: D21–D29.
6 Dayhoff, M.O., Eck, R.V., Chang, M.A., and Sochard, M.R. (1965). Atlas of Protein Sequence and Structure. Silver Spring, MD: National Biomedical Research Foundation.
7 Gene Ontology Consortium (2017). Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45: D331–D338.
8 Green, E.D., Rubin, E.M., and Olson, M.V. (2017). The future of DNA sequencing. Nature. 550: 179–181.
9 Karsch-Mizrachi, I., Takagi, T., and Cochrane, G., on behalf of the International Nucleotide Sequence Database Collaboration (2018). The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 46: D48–D51.
10 Kim, H.J., Kim, N.C., Wang, Y.D. et al. (2013). Mutations in prion-like domains in hnRNPA2B1 and hnRNPA1 cause multisystem proteinopathy and ALS. Nature. 495: 467–473.
11 Kodama, Y., Mashima, J., Kosuge, T. et al. (2018). DNA Data Bank of Japan: 30th anniversary. Nucleic Acids Res. 46: D30–D35.
12 Landrum, M.J., Lee, J.M., Benson, M. et al. (2016). ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44: D862–D868.
13 Lee, R.Y.N., Howe, K.L., Harris, T.W. et al. (2018). WormBase 2017: molting into a new stage. Nucleic Acids Res. 46: D869–D874.
14 Lipman, D.J. and Pearson, W.R. (1985). Rapid and sensitive protein similarity searches. Science. 227: 1435–1441.
15 Liu, Q., Shu, S., Wang, R.R. et al. (2016). Whole-exome sequencing identifies a missense mutation in hnRNPA1 in a family with flail arm ALS. Neurology. 87: 1763–1769.
16 Rigden, D.J. and Fernández, X.M. (2018). The 2018 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res. 46: D1–D7.
17 Silvester, N., Alako, B., Amid, C. et al. (2018). The European Nucleotide Archive in 2017. Nucleic Acids Res. 46: D36–D40.
18 Smith, C.L., Blake, J.A., Kadin, J.A. et al., and The Mouse Genome Database Group (2018). Mouse Genome Database (MGD)-2018: knowledgebase for the laboratory mouse. Nucleic Acids Res. 46: D836–D842.
19 Suzek, B.E., Wang, Y., Huang, H. et al., and The UniProt Consortium (2015). UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 31: 926–932.
20 UniProt Consortium (2017). UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45: D158–D169.
This chapter was written by Dr. Andreas D. Baxevanis in his private capacity. No official support or endorsement by the National Institutes of Health or the United States Department of Health and Human Services is intended or should be inferred.
2 Information Retrieval from Biological Databases
Andreas D. Baxevanis
On April 14, 2003, the biological community celebrated the achievement of the Human Genome Project's major goal: the complete, accurate, and high-quality sequencing of the human genome (International Human Genome Sequencing Consortium 2001; Schmutz et al. 2004). The attainment of this goal, which many have compared to landing a person on the moon, has had a profound effect on how biological and biomedical research is conducted and will undoubtedly continue to have a profound effect on its direction in the future. The availability of not just human genome data, but also human sequence variation data, model organism sequence data, and information on gene structure and function provides fertile ground for biologists to better design and interpret their experiments in the laboratory, fulfilling the promise of bioinformatics in advancing and accelerating biological discovery.
One of the most important databases available to biologists is GenBank, the annotated collection of all publicly available DNA and protein sequences (Benson et al. 2017; see Chapter 1). This database, maintained by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH), represents a collaborative effort between NCBI, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ). At the time of this writing, GenBank contained over 200 million sequences and over 300 trillion nucleotide bases. The completion of human genome sequencing and the sequencing of an ever-expanding number of model organism genomes, as well as the existence of a gargantuan number of sequences in general, provides a golden opportunity for biological scientists, owing to the inherent value of these data. However, at the same time, the sheer magnitude of data presents a conundrum to the inexperienced user, resulting not just from the size of the “sequence information space” but from the fact that the information space continues to get larger and larger – by leaps and bounds – at a pace that will continue to accelerate, even though human genome sequencing has long been “completed.”
The effect of the Human Genome Project and other systematic sequencing projects on the continued accumulation of sequence data is illustrated by the growth of GenBank, as shown in Figure 2.1; the exponential growth rate illustrated in the figure is expected to continue for some time to come. The continued expansion of not just the sequence space but also of the myriad biological data that this expansion has made available underscores the need for all biologists to learn how to navigate this information effectively in their work. In some cases, the data found within these virtual treasure troves may even allow investigators to avoid performing expensive experiments themselves.
GenBank (or any other biological database, for that matter) serves little purpose unless the data can be easily searched and entries retrieved in a useful, meaningful format. Otherwise, sequencing efforts such as those described above have no useful end – without effective search and retrieval tools, the biological community as a whole cannot make use of the information hidden within these millions of bases and amino acids, much less the structures they form or the mutations they harbor. Much effort has gone into making such data accessible to the biologist, and a selection of the programs and interfaces resulting from these efforts is the focus of this chapter. The discussion will center on querying databases maintained by NCBI, as these more “general” repositories are far and away the ones most often accessed by biologists, but attention will also be given to specialized databases that provide information not necessarily found through the use of Entrez, NCBI's integrated information retrieval system.