Biotechnology-Aquaculture Interface: The Site of Maximum Impact Workshop
Comparative Genomics
John F. Heidelberg
The Institute for Genomic Research
9712 Medical Center Drive
Rockville, MD 20850
jheidel@tigr.org
Abstract
Since Haemophilus influenzae was completely sequenced and annotated in 1995, the availability of complete microbial genome sequences has changed how microbiologists address many hypotheses, and the resolution and speed with which these questions are answered. The microorganisms selected to date for genome sequencing projects allow many relevant issues to be addressed, including, but not limited to, pathogenesis (e.g., Borrelia burgdorferi, Mycobacterium tuberculosis, Vibrio cholerae, Xylella fastidiosa), bioremediation and industrial processes (e.g., Archaeoglobus fulgidus, Deinococcus radiodurans, Bacillus subtilis), and our understanding of evolution (e.g., Aquifex aeolicus, Methanobacterium thermoautotrophicum, Thermotoga maritima). The numerous applications of the information gained from the complete genome data of these microorganisms are only now beginning to be appreciated.
In 1995 the first genome of a free-living organism, Haemophilus influenzae, was completely sequenced and annotated. This accomplishment ushered in the genomic era for microbiology. Currently the range of genome sequencing projects includes representatives from all three domains of life and provides good coverage of most major groupings within the archaea and eubacteria. However, sequencing projects are relatively concentrated on well-studied groups (e.g., the Proteobacteria and the low-GC Gram-positive bacteria), while other groupings, such as the crenarchaeota, are very under-represented. The microorganisms being sequenced can also be considered by their ecological role. Viewed in this manner, pathogenic bacteria and microorganisms from extreme environments are well represented in current genome sequencing efforts. Organisms of agricultural significance and difficult-to-culture organisms are currently poorly represented by comparison, but with the increasing rate of genome sequencing, it is anticipated that these deficiencies will be temporary.
The diversity of the representative organisms allows for comparative studies of genome composition and gene organization within and across the domains. Insight has been gained into how genes are acquired and shared between organisms, and into the ability of bacteria to change their genome composition rapidly by capturing and maintaining megaplasmids. The latter events have been suggested to increase the competitiveness of Vibrio cholerae in the aquatic ecosystem.
In these early days of genomics, a major challenge to the scientific community is keeping up to date with the remarkable amount of genomic data being released, determining which data are most relevant to a particular study, and determining how best to apply these data to one's own research. This chapter will review the current status of the field of microbial genomics, discuss hypotheses being addressed in environmental microbiology by the use of genomic data, and give an overview of where the field of genomics is going.
Comparison of the Transport Capabilities of Microorganisms
The availability of complete genome data enables systematic genome-wide comparisons that provide insight into the overall physiology of each organism. Comparison of 76 families of cytoplasmic membrane transport systems in each completely sequenced microbial genome provides one example of such a comparative genomic analysis. The overall transport substrate specificities were found to correlate with the organism's lifestyle, i.e., with the concentration and diversity of nutrients in its particular ecological niche. For instance, organisms from deep marine environments were found to possess only a limited array of transporters for organic nutrients but a preponderance of transporters for inorganic cations and anions, which presumably reflects the availability of these substrates in this environment. Similarly, intracellular parasites, such as the chlamydias and Rickettsia prowazekii, have an extensive set of transporters for amino acids and nucleotides, but little ability to transport free sugars, which almost certainly reflects the relative accessibility of these compounds in an intracellular environment.
Thus comparative analysis of transporters appears to provide insights into both the physiology of the organism and the environment in which it dwells. It is anticipated that such comparative genomic studies will become increasingly pertinent as complete genome data becomes available for representative organisms from different phylogenetic lineages and different lifestyles or environments.
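The kind of comparison described in this section can be sketched in a few lines of Python. This is a minimal illustration only, not the method used in the study: the genome names, substrate classes, and counts are invented, and the function name substrate_profile is hypothetical. It converts a list of predicted transporters into a per-genome profile of substrate classes, so that genomes from different niches can be compared side by side.

```python
from collections import Counter

# Hypothetical input: one (genome, substrate_class) pair per predicted
# transporter. Names and counts are invented for illustration only.
TRANSPORTERS = [
    ("genome_A", "inorganic_ion"),
    ("genome_A", "inorganic_ion"),
    ("genome_A", "inorganic_ion"),
    ("genome_A", "sugar"),
    ("genome_B", "amino_acid"),
    ("genome_B", "amino_acid"),
    ("genome_B", "nucleotide"),
    ("genome_B", "inorganic_ion"),
]


def substrate_profile(records):
    """Return {genome: {substrate_class: fraction of that genome's transporters}}."""
    counts = {}
    for genome, substrate in records:
        counts.setdefault(genome, Counter())[substrate] += 1
    profiles = {}
    for genome, counter in counts.items():
        total = sum(counter.values())
        profiles[genome] = {s: n / total for s, n in counter.items()}
    return profiles


if __name__ == "__main__":
    for genome, profile in sorted(substrate_profile(TRANSPORTERS).items()):
        # List the most common substrate classes first.
        ranked = sorted(profile.items(), key=lambda kv: -kv[1])
        print(genome, ", ".join(f"{s}={frac:.2f}" for s, frac in ranked))
```

Fractions are used rather than raw counts so that genomes of different sizes can be compared directly. In this invented example, genome_A is dominated by inorganic ion transporters and genome_B by amino acid and nucleotide transporters.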
Unknowns and Conserved Hypotheticals
It is important to note that for each completed genome sequencing project, an average of 40-50% of the ORFs in the genome are either shared with other organisms but not previously characterized (conserved hypothetical) or completely unknown (unique). Elucidating these unknowns and conserved hypotheticals remains a major challenge. The use of microarrays, in combination with proteomic studies and other functional genomic approaches (see above), should accelerate the rate at which we can begin to characterize these ORFs. Many of these unknown genes are likely to have interesting or important functions relevant to environmental microbiology.
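The bookkeeping behind such a percentage can be sketched as follows. This is a simplified illustration with invented numbers: each ORF is reduced to two hypothetical flags (whether it has a homolog in another organism, and whether a function has been characterized), whereas real annotation pipelines use many more lines of evidence.

```python
def classify_orf(has_homolog, has_characterized_function):
    """Assign an ORF to one of three broad annotation classes."""
    if has_characterized_function:
        return "characterized"
    if has_homolog:
        return "conserved hypothetical"
    return "unknown"


def uncharacterized_fraction(orfs):
    """Fraction of ORFs that are conserved hypothetical or unknown.

    orfs: list of (has_homolog, has_characterized_function) flag pairs.
    """
    if not orfs:
        raise ValueError("no ORFs supplied")
    labels = [classify_orf(homolog, function) for homolog, function in orfs]
    uncharacterized = sum(label != "characterized" for label in labels)
    return uncharacterized / len(labels)


if __name__ == "__main__":
    # Invented toy genome: 55 characterized ORFs, 30 conserved
    # hypotheticals, and 15 ORFs with no match in any other organism.
    toy_genome = [(True, True)] * 55 + [(True, False)] * 30 + [(False, False)] * 15
    print(f"{uncharacterized_fraction(toy_genome):.0%} of ORFs lack a characterized function")
```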
BIOINFORMATICS APPLICATIONS
Databases
As more genomes are finished, the amount of data being generated becomes enormous. To make reasonable use of these data, researchers need to be able to compare data across organisms easily. This requires databases that are openly accessible to the research community and easy to query. A variety of second-generation biological databases have been developed to address particular demands resulting from the mountain of genomic data. An increasingly important feature of databases such as the Omniome and EcoCyc is that they incorporate detailed manual curation of the data in addition to sophisticated automated analysis. Examples of such bioinformatic databases are described below.
TIGR's Comprehensive Microbial Resource (CMR) (http://www.tigr.org/) enables inter- and intra-genomic comparisons of microbial genomes (Fig. 1). The CMR contains a variety of tools for querying the Omniome database, which incorporates the detailed curated genome dataset from each of TIGR's microbial genome projects, as well as the original annotation and further automated annotation by TIGR for all of the non-TIGR microbial genome sequencing projects. Thus, the CMR allows comparison between genomes based on role categories, protein families, best matches, and other criteria, and enables complex queries based on a variety of features.
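To give a sense of what a cross-genome query by role category might look like, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table layout, column names, locus identifiers, and rows are all assumptions made for illustration; they do not reflect the actual schema or query interface of the CMR.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical one-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (genome TEXT, locus TEXT, role_category TEXT)")
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            # Invented rows for illustration only.
            ("genome_A", "A0001", "Transport and binding proteins"),
            ("genome_A", "A0002", "Transport and binding proteins"),
            ("genome_A", "A0003", "Energy metabolism"),
            ("genome_B", "B0001", "Transport and binding proteins"),
            ("genome_B", "B0002", "Hypothetical proteins"),
        ],
    )
    return conn


def genes_per_role(conn, role_category):
    """Count genes assigned to one role category in each genome."""
    rows = conn.execute(
        "SELECT genome, COUNT(*) FROM gene WHERE role_category = ? "
        "GROUP BY genome ORDER BY genome",
        (role_category,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(genes_per_role(conn, "Transport and binding proteins"))
    conn.close()
```

With the invented rows above, the query returns [('genome_A', 2), ('genome_B', 1)]. Storing one row per gene with its genome and role category is what makes this kind of comparison a single grouped query rather than a per-genome script.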
Interpro (
The EcoCyc database (http://ecocyc.PangeaSystems.com/ecocyc/) is an E. coli-specific database combining functional and bioinformatic information describing the metabolic pathways, signal transduction, membrane transport, and gene regulation of E. coli. Each of the proteins and enzymes in E. coli is annotated in detail, including references to the original literature. Thus EcoCyc acts both as an online review article and as a qualitative model of the E. coli biochemical machinery. The Pathway Tools graphical user interface provides a wide variety of query operations and visualization tools. The related MetaCyc database is a more generalized metabolic-pathway database that describes pathways and enzymes of many different organisms. Other metabolic databases include WIT (http://wit.mcs.anl.gov/WIT2) and KEGG (
Orthologues are genes in different organisms that evolved from a common ancestral gene by speciation, and paralogues are homologous genes that diverged by gene duplication and hence may have diverged in function. Accurate predictions of protein function increasingly depend on methods that take into account the assignment of orthologues and paralogues within families of homologous proteins. Phylogenetic approaches are particularly valuable for determining orthology. Other family-based resources for identifying orthologues and paralogues include the COG (Clusters of Orthologous Groups of proteins) database (http://www.ncbi.nlm.nih.gov/COG) and the Pfam database of protein families (http://www.sanger.ac.uk/Software/Pfam/).
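The distinction between orthologues and paralogues can be made concrete with a small sketch. The code below uses the reciprocal-best-hit heuristic, a common first-pass shortcut for proposing putative orthologue pairs. It is not the phylogenetic approach recommended above, nor the procedure used by any of the databases named here. The gene names and similarity scores are invented.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: {(query, subject): similarity score}, e.g. from a sequence search.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores: genome A searched against genome B, and the reverse.
A_VS_B = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a3", "b2"): 120}
B_VS_A = {("b1", "a1"): 245, ("b2", "a2"): 175, ("b2", "a3"): 118}

if __name__ == "__main__":
    print(reciprocal_best_hits(A_VS_B, B_VS_A))
```

In this toy data set, a1/b1 and a2/b2 are reciprocal best hits and would be proposed as orthologue candidates. Gene a3 also matches b2 but is not b2's best hit, which is the pattern one might see for a paralogue of a2. A phylogenetic analysis would be needed to confirm either assignment.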
Recommendations
Short-term (1-3 years):
1) Establish a program to foster collaborations and cross-disciplinary training between aquaculture researchers and genome scientists. This should include programs to train aquaculture post-docs in genomics and bioinformatics.
2) Determine a realistic priority list of organisms for genome sequencing. This should include bacterial and viral pathogens of economically important aquaculture species, genomes and genome maps of aquaculture species, microbes that may help increase the productivity of aquaculture, etc.
3) Support development of publicly available genome databases (e.g., TIGR's CMR) to include tools valuable to aquaculture researchers.
4) Begin genome sequencing and annotation of the organisms on the priority list.
Mid-term (4-7 years):
5) Continue the first four recommendations.
6) Establish a mechanism to evaluate new genomic technologies and their usefulness in addressing the problems in aquaculture.
7) Develop microarray chips for the sequenced microbes and a mechanism to give researchers access to both the chips and the data generated by other researchers.
Long-term (8-10 years):
8) Develop a comprehensive aquaculture database that supports searching and complex queries across all genome sequence, functional genomics, proteomics, and environmental data, aquaculture strategies, etc.
Fig. 1. The TIGR Comprehensive Microbial Resource.