The Integration of Biological Data

Computational biology, an interdisciplinary field that integrates biology and computer science, is growing at an unprecedented rate because of demand from biologists to automate their common processes and to persistently store massive amounts of biological data for later analysis. The explosion in the number of bioinformatics systems and databases has led to a variety of non-uniform bioinformatics databases. Furthermore, biologists have a great need to access many databases and to compare and analyze volumes of data across them. This need can be fulfilled only if biological databases can be integrated at some level; however, the lack of standardization makes integrating the various biological databases for more comprehensive data analysis a challenging task.

The growth of computational biology has been rapid since its early years. For example, in 1980 the journal Science carried a description of a nucleic acid data bank that was offered free of charge and was accessible over the telephone [3]. By 1981 there were already reports of rapid, unorganized growth in this scientific sector, prompting an article in Nature to ask, “Too many databanks?” [3]. This growth of non-homogeneous biological data has continued to the present, driven in recent years by the growth of the Internet as well as by a shift in how general users regard computational biology systems and biological databases. In the early years of bioinformatics, general biologists rarely used the various bio-centric databases available at the time, primarily because of the technical skills involved and because querying these databases rarely produced useful results [3]. In recent years, the connectivity of the Internet, along with “friendlier” interfaces and better search results, has greatly increased the use of biological databases by biologists. It is now common for biologists to search biological databanks daily to answer “routine questions” [3]. This increase in use has in turn driven the development of biological systems and databases.

Biology is a vast area of study, and many types of bioinformatics databases have been developed in response to growing demand from the biological community. Biological databases can be divided into several categories depending on the information they store, such as genomic, nucleic acid, protein, factor and motif, enzyme, plasmid, and organism-specific databases.[*] These groups can often be further divided into general (primary) and special databases. This distinction is important because special databases tend to be smaller and more rigorously defined, serving only a handful of people. As a result, special databases tend to collect more information from primary sources than large primary databases do, and their data tends to be more detailed. Primary databases, on the other hand, are often very generic, storing large volumes of atomic data for many people. These differences make both primary and special databases very useful to the scientific community [3].

Not only are databases becoming larger, but the information they hold is also growing in complexity [3]. DNA sequencing has given scientists access to the information that encodes proteins, the building blocks of living organisms; combined with research on the various cellular structures made up of these proteins, this gives biologists a remarkably rich relationship to exploit. There are many other relationships like this in biological data, including even more complex ones such as metabolic pathways and motifs [3]. Both the fact that biological data is distributed across primary and special databases and the fact that nearly all biological data is related to other biological data in some way point to the growing need to integrate the various biological databases. Integration would give scientists access to the much more specific data provided by the special databanks, together with the volume of data biologists need to obtain accurate results. Furthermore, relational ideas, such as going from DNA sequences to cellular structures or metabolic pathways, can be implemented much more rapidly and effectively when working with interconnected databases that already contain the data elements needed to express the relationship.

The integration of biological databases is not an easy task, primarily because of the number of different DBMSs and data formats involved, as well as the lack of well-defined standards in bioinformatics. The need for standards in bioinformatics is currently being addressed, and several committees have been set up to work on it. The Bio-ontology Standards Group is trying to standardize “domain-specific ontologies and vocabularies to support interoperability of data and software components,” [12] while the Data Model Standards Group is working to “standardize domain-specific analytical data models to help integrate public data with proprietary data across all life science domains in an enterprise.” [12] Another standards group is the W3C Semantic Web Health Care and Life Sciences Interest Group, which aims to research “core vocabularies and ontologies to support cross-community data integration and collaborative efforts, develop guidelines and best practices for resource identification to support integrity and version control, and advance the integration of scientific publication with people, data, software, publications, and clinical trials.” [2] Biotechnology standards should be developed for the various levels of computational biology; for instance, a standard should be developed for database metadata relating to biological information, such as requiring all data of one type to be stored in the same domain, while higher-level standards should define common interfaces for application use. Biological data standards should not be derived solely within the biological computing community; committees working on standards should also look toward other current and developing standards and use existing standards, where applicable, as much as possible. The National Center for Biotechnology Information (NCBI) did this when developing its NCBI Software Development ToolKit by using ASN.1, a standard developed by the International Organization for Standardization (ISO) as a “format used to achieve interoperability between platforms” [13]. By using ASN.1, the NCBI ensured that the “storage and retrieval of data such as nucleotide and protein sequences, structures, genomes, and MEDLINE records” can be permitted across “computers and software systems of all types” in a reliable fashion [13].
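
The two levels of standard described above, a shared domain for one type of data and a common interface for applications, can be illustrated with a small sketch. It is not taken from any of the standards groups cited here: the names SequenceRecord, SequenceSource, FlatFileSource, TableSource, and lookup are hypothetical, and the two source formats are simplified stand-ins for real databases.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class SequenceRecord:
    """Shared 'domain' for one type of data: every source returns this shape."""
    accession: str
    molecule: str   # e.g. "DNA" or "protein"
    sequence: str   # upper-case residues, no whitespace


class SequenceSource(Protocol):
    """Common interface that applications program against (hypothetical)."""
    def fetch(self, accession: str) -> SequenceRecord: ...


class FlatFileSource:
    """Hypothetical wrapper over FASTA-like text ('>id' header, then residues)."""
    def __init__(self, text: str, molecule: str) -> None:
        self._molecule = molecule
        self._entries: dict[str, str] = {}
        current: Optional[str] = None
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(">") and line[1:].split():
                current = line[1:].split()[0]
                self._entries[current] = ""
            elif line and current is not None:
                self._entries[current] += line.upper()

    def fetch(self, accession: str) -> SequenceRecord:
        return SequenceRecord(accession, self._molecule, self._entries[accession])


class TableSource:
    """Hypothetical wrapper over relational-style rows (id, kind, residues)."""
    def __init__(self, rows: list[tuple[str, str, str]]) -> None:
        self._rows = {row[0]: row for row in rows}

    def fetch(self, accession: str) -> SequenceRecord:
        acc, kind, residues = self._rows[accession]
        return SequenceRecord(acc, kind, residues.replace(" ", "").upper())


def lookup(sources: list[SequenceSource], accession: str) -> SequenceRecord:
    """Try each source in turn; the caller never sees the native formats."""
    for source in sources:
        try:
            return source.fetch(accession)
        except KeyError:
            continue
    raise KeyError(accession)
```

An application written against SequenceSource does not need to know whether a record came from a flat file or a table, which is the kind of interoperability a common interface standard is meant to provide.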

Although the development of standards is very beneficial to the unified integration of biological data, standards alone are not enough; they must be put to use in the physical integration of biological databases. There are three main approaches to integrating biological databases. First, integration could occur at the application level, by placing wrappers around databases and then building applications on top of this standard software layer. This approach allows all existing bioinformatics systems and their databases to remain intact, making database integration non-intrusive to the normal activities of each system. However, this method falls short primarily because the additional layers not only add complexity to the system but also add code bloat, resulting in slower performance [12]. Furthermore, this approach “does not address issues of data cleaning and transformations” [12]. Second, the databanks could be integrated with a data-level approach in which no semantic cleaning is introduced. Data-level integration combines data at the data layer through the use of memory-mapped data structures, indices, and database links [12]. Memory-mapped data structures are subsets of data collected from various sources, then normalized and integrated into memory for rapid access [12]. Indices, on the other hand, refer to the indexing of flat files. Indexed flat files allow for robust query performance but do not allow relational database data to be stored natively.
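
As a rough illustration of flat-file indexing, the sketch below records the byte offset of each record in a FASTA-like flat file so that a single record can later be read with one seek instead of a full scan. The file layout ('>' header lines followed by sequence lines) and the function names build_offset_index and read_record are assumptions made for this example, not a description of any particular indexing system.

```python
import io
from typing import BinaryIO


def build_offset_index(handle: BinaryIO) -> dict[str, int]:
    """Map each record identifier to the byte offset of its '>' header line."""
    index: dict[str, int] = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index


def read_record(handle: BinaryIO, index: dict[str, int], record_id: str) -> str:
    """Seek straight to one record and return its sequence without a full scan."""
    handle.seek(index[record_id])
    handle.readline()  # skip the header line
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        parts.append(line.strip().decode("ascii"))
    return "".join(parts)


# Illustrative in-memory "flat file"; the identifiers and sequences are made up.
flat_file = io.BytesIO(b">seqA first\nACGT\nAC\n>seqB second\nTTTT\n")
offsets = build_offset_index(flat_file)
sequence_b = read_record(flat_file, offsets, "seqB")
```

The index answers lookups by identifier quickly, but it stores nothing about relationships between records, which reflects the limitation noted above.
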
Because memory-mapped data structures, indices, and database links operate at a low level, they are more efficient than the middleware approach; but, as with the application-level approach, no cleaning or transformation of the data is performed [12]. These operations must be performed on the data before “complex querying, analysis, and visualizations” can be carried out [12]. Finally, the data-level integration approach does not scale well. The third major type of database integration involves a “data-level integration with semantic cleaning” approach [12]. This method uses a data staging area to which all data from the source applications is transported. In the staging area the data is cleaned and transformed as needed and then linked with the data from the other sources [12]. Once staging is complete, the data is placed in a central, unified abstract database, most likely composed of smaller databases. The main problem with this approach is the time required to “extract, clean, transform, and load” the large volumes of data into the main database [12]. This last approach, like the other two, relies on the development of one or more standards to ensure that all the data will be congruent on at least one level of the system.
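
The staging approach can be sketched in a few lines, under heavy simplifying assumptions. Here two hypothetical sources describe the same gene with different field names and organism spellings; each record is cleaned into one shared shape and then loaded into a single SQLite table standing in for the central database. The synonym table, the field names, and the functions clean and load are illustrative inventions, and the sequence fragments are made up.

```python
import sqlite3
from typing import Optional

# Hypothetical synonym table used for semantic cleaning (illustrative only).
ORGANISM_SYNONYMS = {
    "e. coli": "Escherichia coli",
    "escherichia coli": "Escherichia coli",
    "c. elegans": "Caenorhabditis elegans",
}


def clean(raw: dict) -> Optional[dict]:
    """Normalize one staged record; return None if it cannot be repaired."""
    gene = (raw.get("gene") or raw.get("symbol") or "").strip()
    organism = ORGANISM_SYNONYMS.get((raw.get("organism") or "").strip().lower())
    sequence = "".join((raw.get("seq") or "").split()).upper()
    if not gene or organism is None or not sequence:
        return None
    return {"gene": gene, "organism": organism,
            "sequence": sequence, "source": raw["source"]}


def load(records: list[dict]) -> sqlite3.Connection:
    """Clean every staged record and load the survivors into one unified table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE gene_sequence "
        "(gene TEXT, organism TEXT, sequence TEXT, source TEXT)"
    )
    staged = [rec for rec in map(clean, records) if rec is not None]
    db.executemany(
        "INSERT INTO gene_sequence VALUES (:gene, :organism, :sequence, :source)",
        staged,
    )
    db.commit()
    return db


# Staging area: raw records as they might arrive from two differently shaped sources.
staging_area = [
    {"source": "A", "gene": "lacZ", "organism": "E. coli", "seq": "atg acc"},
    {"source": "B", "symbol": "lacZ", "organism": "Escherichia coli ", "seq": "ATGACC"},
    {"source": "B", "symbol": "", "organism": "E. coli", "seq": "ATG"},  # unusable
]
unified = load(staging_area)
```

After loading, records from both sources can be linked with an ordinary query on gene and organism. Every record has to pass through clean before it is loaded, which is where the time cost of this approach noted above comes from.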

It is important to remember that any biologist is only as good as the data they possess, since the primary roles of biologists are to collect and analyze biological data. In the past, the traditional methods of working by hand and using physical paper media were sufficient to store all the data biologists were working with; however, with the advent of genetic sequencing and protein analysis, storing and analyzing biological data became practical only with computer technology. Rapid and efficient access to the data stored in biological databases is critical to the progress of biology; in fact, scientists often need to work with several databases holding a wide variety of information. Only through the development of standards that can be used to integrate databases can this vision of highly interconnected biological research databases be achieved.

Appendix A – Database Categories and Current Databases[†]

Database Catalogs

DBCAT - public catalog of databases, at Infobiogen
Helix Systems Scientific Databases - locally available databases (NIH only)

Genomic

UCSC Genome Browser - the gold standard of genome browsers
National Center for Biotechnology Information (NCBI) - clearinghouse for all genomic information
Ensembl - joint project between EBI and Sanger Institute
GDB - The Human Genome Database

Nucleic Acids

GCG-Lite Sequence Lookup - search via GCG-Lite, at Helix Systems (NIH only)
Entrez - The Life Sciences Search Engine, at NCBI
SRS - Sequence Retrieval System, at EBI
dbEST - sequence and mapping data on partial, "single-pass" cDNA sequences or ESTs
NDB - Nucleic Acids Database, at Rutgers
Tumor Gene Database - information about targets for cancer-causing mutations

Proteins

GCG-Lite Sequence Lookup - search via GCG-Lite, at Helix Systems (NIH only)
SwissProt - at ExPASy.org
Entrez - The Life Sciences Search Engine, at NCBI
SRS - Sequence Retrieval System, at EBI
Molecules To Go - text-based interface to the PDB on Helix Systems
MEROPS - The Peptidase Database, at the Wellcome Trust Sanger Institute
Human Mitochondrial Protein Database - multiple query types available, at NIST
HIV Protease Database - at NCI-Frederick

Factors and Motifs

Transfac Professional 9.2 - transcription activation factor database (NIH only)
BIOBASE - free registration required

Enzymes

REBASE - good old New England Biolabs
ENZYME Database - at ExPASy.org
IntEnz - at EBI
Biology-oriented newsgroups - via the local server at Helix Systems

Plasmids

Genome Database of Naturally Occurring Plasmids - 'nuff said
NCCB - simple search engine, Royal Netherlands Acad. of Arts and Sciences, Utrecht

Organism-Specific Databases

Mammalian

Mouse Genome Database - synoptic descriptions with bibliographic citations
Portable Dictionary of the Mouse Genome - compact, downloadable database

Worm

WormBase - biology and genome of C. elegans
Caenorhabditis elegans WWW Server - one-stop shopping for C. elegans

Insect

FlyBase - comprehensive database for genetics and molecular biology of Drosophila
Mosquito Genomics WWW Server - clearinghouse for everything about Mosquito genomics

Yeast

Saccharomyces Genome Database - at Stanford

Fungi

Fungal Genetics Stock Center - stocks, sequences, vectors, strains, and gene libraries

Prokaryote

IECA Database Portal - collection of E. coli servers around the world
SubtiList WWW Server - all about B. subtilis, Pasteur Institute
Entrez Microbial Genomes - 196 completed genomes and counting...

Plants

Arabidopsis Information Resource - very nice gateway to our favorite mustard weed
GrainGenes - database for Triticeae and Avena
The Korea Rice Genome Database - at Myongji University
ChromDB - the plant chromatin database, at University of Arizona

Other

Ribosomal Database Project - at Michigan State University
National Human Genome Research Institute - ongoing research into genomics

Bibliography:

1. Ramez Elmasri and Shamkant B. Navathe. Fundamentals of Database Systems, Fourth Edition. Pearson Education, Inc., 75 Arlington St., Suite 300, Boston. 2004.

2. Eric Miller, Tonya Hongsermeier, Eric Neumann, and Brian Gilman. W3C Semantic Web Health Care and Life Sciences Interest Group. W3C 1994-2005. <http://www.w3.org/2001/sw/hcls/> (01/27/06).

3. Dmitrij Frishman, Klaus Heumann, Arthur Lesk, and Hans-Werner Mewes. Comprehensive, comprehensible, distributed and intelligent databases: current status. Bioinformatics Review Vol. 14 no. 7 1998. Pages 551-561. <http://bioinformatics.oxfordjournals.org/cgi/reprint/14/7/551.pdf> (01/27/06).

4. Christian J. Stoeckert Jr, Helen C. Causton, and Catherine A. Ball. Microarray databases: standards and ontologies. Nature Genetics 32, 469 - 473 (2002). <http://www.nature.com/ng/journal/v32/n4s/full/ng1028.html> (01/27/06).

5. H. Boutselakis, D. Dimitropoulos, J. Fillon, A. Golovin, K. Henrick, A. Hussain, J. Ionides, M. John, P. A. Keller, E. Krissinel, P. McNeil, A. Naim, R. Newman, T. Oldfield, J. Pineda, A. Rachedi, J. Copeland, A. Sitnov, S. Sobhany, A. Suarez-Uruena, J. Swaminathan, M. Tagari, J. Tate, S. Tromm, S. Velankar and W. Vranken. E-MSD: the European Bioinformatics Institute Macromolecular Structure Database. EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. October 3, 2002. <http://nar.oxfordjournals.org/cgi/reprint/31/1/458.pdf> (01/27/06).

6. Ardeshir Bayat. Science, medicine, and the future: Bioinformatics. BMJ 2002 April 27; 324(7344): 1018–1022. <http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1122955> (01/27/06).

7. Lincoln D. Stein. Integrating biological databases. Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA. 2003. <http://www.nature.com/nrg/journal/v4/n5/full/nrg1065_fs.html> (01/27/06).

8. Chad Creighton and Samir Hanash. Mining gene expression databases for association rules. Bioinformatics Vol. 19 no. 1 2003. Pages 79–86. Bioinformatics Program and Pediatrics and Communicable Diseases, University of Michigan, Ann Arbor, MI 48109, USA. <http://bioinformatics.oxfordjournals.org/cgi/reprint/19/1/79.pdf> (01/27/06).

9. Guochun Xie, Reynold DeMarco, Richard Blevins, and Yuhong Wang. Storing biological sequence databases in relational form. Bioinformatics Applications Note. Vol. 16 no. 3 2000. Pages 288 – 289. Department of Bioinformatics, Merck & Co., Inc., WP42-300, West Point, PA 19486, USA and Department of Bioinformatics, Merck & Co. Inc., RY80-A1, Rahway, NJ 07065, USA. <http://bioinformatics.oxfordjournals.org/cgi/reprint/16/3/288.pdf> (01/27/06).

10. A. Siepel, A. Farmer, A. Tolopko, M. Zhuang, P. Mendes, W. Beavis and B. Sobral. Bioinformatics. National Center For Genome Research, 2935 Rodeo Park Drive East, Santa Fe, NM 87505, USA. 2000. <http://bioinformatics.oxfordjournals.org/cgi/screenpdf/17/1/83.pdf> (01/27/06).

11. Computational Molecular Biology At NIH. Helix Systems, CIT, NIH. 2006. <http://molbio.info.nih.gov/molbio/db.html> (01/27/06).

12. Kenneth Griffiths and Richard Resnick. Approaches to Integrating Biological Data. NetGenics, Inc. 2000. <http://www.iscb.org/ismb2000/tutorials/griffiths.html> (03/18/06).

13. ASN.1 Summary. 2003. <http://www.ncbi.nlm.nih.gov/Sitemap/Summary/asn1.html> (03/19/06).



[*] See appendix A for examples of each type of database.

[†] Copied from 11 (see Bibliography).