The Seed – Communications of the ACM

Warehouses of genomic data are a fantastic source and archive. However, they do not meet all the needs of scientists working in one of the most important areas in genomics today: genome annotation, the process of identifying the functions of particular regions of genetic sequence data. Scientists need to be able to view and manipulate genomic data, automatically append annotations to it, and exchange information with other researchers. As the fraction of genomic data that is annotated decreases (even as the amount of such data increases), it becomes especially important that scientists be able to exchange preliminary data and early thoughts not yet ready to be committed to the various national archives. We have created, are distributing, and are continuing to extend a toolkit, called the SEED, to address this critical need in genomic research.

A genome may be viewed as a set of genes that encode protein sequences. The function of each gene is determined by the activity of the protein it encodes. Genome annotation is the process of assigning functions to genes. Functions are assigned by any of several methods. The most direct involves determining the function of a gene through experiment. Since far more gene sequences are available in genome databases than the number with directly determined functions, most genes are assigned a function via indirect methods. Such methods include: assigning a function to a gene based on sequence similarity to genes with known function; assigning a function to a gene based on its position in a gene cluster through comparative analysis of many genomes; and inferring function via other techniques in order to detect functional coupling.

Genome annotation is an iterative process that exploits a variety of domain knowledge sources (see Figure 1). For genes that code for enzymes involved in core metabolism, much is known about the biochemical reaction networks in which the enzymes participate. The existence of a known reaction pathway (such as those available in biochemistry databases) provides valuable information supporting inference of function through the process of systematic elimination. This approach, based on the use of known reaction pathways, should be much more valuable than simple similarity methods, which are often unreliable, particularly in the case of paralogous genes (those with common ancestry that have evolved divergent functions).

Good access to and straightforward navigation within, between, and among annotated genomes is critical for life scientists who wish to use genomic data in biomedical research. Scientists studying the structure and function of genes require access to a broad integration of verified, publicly available data (including newly sequenced genomes) in a computing environment that supports comparison of genomes. Additionally, though scientists often wish to include analysis of genomes that have not yet been added to the public integrations, few resources permit integration of data from public sources with data not already in public resources. Software applications for genome comparison and curation typically include only final public data or are proprietary tools created by commercial firms; because they are important commercial assets, these tools are not available for use by the general research community.

The individual researcher simply does not have access to an effective framework for comparative analysis of genomes that can incorporate publicly available genomes, nonpublic data to which the researcher has access (under restrictions or license terms), and the researcher’s own private data. Meeting this need, the SEED makes creation and maintenance of new integrations of genomic data feasible by taking advantage of the collective capabilities of many distributed researchers. The SEED is installed at 30 locations worldwide and is being developed by several institutions.

The name SEED is derived from two sources: One is FIG SEED, a play on the initials of the Fellowship for the Interpretation of Genomes, the company that developed the initial SEED. The other is from a passage in Neal Stephenson’s novel The Diamond Age referring to the nonhierarchical nature of self-reproducing nanotechnology, or biological systems, that requires no centralized infrastructure to function.

Rapid Annotation

The SEED project aims to provide to the biology community a suite of open source tools to enable distributed teams of researchers to rapidly annotate new genomes, particularly microbial genomes. The initial goal (within three years) is to support rapid annotation of the first 1,000 genomes to be sequenced. In particular, the SEED enables researchers to create, collect, and maintain sets of gene annotations organized by groups of related biological and biochemical functions across many organisms. These groups of related functions are subsystems, and each subsystem is a set of biological functions that together implement a specific process. For any organism believed to contain a variant of the process, the subsystem includes a designation of precisely which genes (if any) implement each of the functions. This organizational approach enables biologists to project function assignments to sets of genes in newly sequenced organisms.
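
One way to picture a subsystem is as a table with one column per functional role and one row per genome, where each cell lists the genes (if any) that implement that role. The Python sketch below is an illustration under that assumption only; the class, its methods, and all identifiers are hypothetical and are not the SEED's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Subsystem:
    """Hypothetical model: a set of functional roles and, per genome, the genes that fill them."""
    name: str
    roles: list[str]
    # genome id -> {functional role -> gene ids; an empty list means no gene found}
    rows: dict[str, dict[str, list[str]]] = field(default_factory=dict)

    def assign(self, genome: str, role: str, gene: str) -> None:
        """Record that `gene` implements `role` in `genome`."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role in subsystem {self.name!r}")
        row = self.rows.setdefault(genome, {r: [] for r in self.roles})
        row[role].append(gene)

    def missing_roles(self, genome: str) -> list[str]:
        """Roles with no assigned gene in `genome`: candidates for further annotation."""
        row = self.rows.get(genome, {})
        return [r for r in self.roles if not row.get(r)]


# Made-up identifiers, for illustration only.
glycolysis = Subsystem(
    "Glycolysis (toy)",
    ["hexokinase", "phosphofructokinase", "pyruvate kinase"],
)
glycolysis.assign("genome_A", "hexokinase", "gene_A_0012")
glycolysis.assign("genome_A", "pyruvate kinase", "gene_A_0340")
print(glycolysis.missing_roles("genome_A"))  # ['phosphofructokinase']
```

Listing the roles that remain unfilled for a genome is one simple way such a table could direct an annotator toward genes whose function is still unassigned.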

The subsystem approach is unique among genome annotation systems. Indeed, most existing annotation systems support annotation of one genome at a time (achieving improvements in speed by accelerating the rate at which each genome is annotated). The SEED supports annotation of a single subsystem over hundreds of organisms simultaneously (annotating one subsystem at a time).

The SEED will allow users to readily examine the way a given gene relates to other genes, exposing the clues relevant to the determination of function. The SEED provides eight key functions for navigating and annotating genes:

Locate and visualize clusters of genes relevant to the analysis of a specific subsystem;
Identify similar genes from other organisms;
Visualize a neighborhood on the chromosome;
Compare one neighborhood of genes with other neighborhoods around corresponding genes in other genomes;
Examine genes that implement closely related metabolic functions;
Add and update function assignments and annotations;
Detect inconsistent representation of function; and
Develop a package of assignments and annotations corresponding to a single subsystem.
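
As an illustration of the seventh item, detecting inconsistent representation of function, the following Python sketch flags groups of genes that are expected to share a function but carry different function strings. The grouping into "families" (for example, by sequence similarity) and all names here are assumptions made for the example; the article does not describe how the SEED performs this check.

```python
from collections import defaultdict


def inconsistent_families(
    family_of: dict[str, str], function_of: dict[str, str]
) -> dict[str, set[str]]:
    """Return families whose member genes carry more than one distinct function string."""
    functions_by_family: dict[str, set[str]] = defaultdict(set)
    for gene, family in family_of.items():
        function = function_of.get(gene)
        if function:  # skip genes that have no assignment yet
            functions_by_family[family].add(function)
    return {fam: funcs for fam, funcs in functions_by_family.items() if len(funcs) > 1}


# Made-up genes and families, for illustration only.
family_of = {"gene_1": "fam_1", "gene_2": "fam_1", "gene_3": "fam_2", "gene_4": "fam_2"}
function_of = {
    "gene_1": "Pyruvate kinase",
    "gene_2": "pyruvate kinase (EC 2.7.1.40)",
    "gene_3": "Hexokinase",
    "gene_4": "Hexokinase",
}
flagged = inconsistent_families(family_of, function_of)
print(sorted(flagged))  # ['fam_1']
```

In this toy data the two strings in fam_1 differ only in notation, so the flag points to a naming inconsistency rather than a biological disagreement; telling the two apart is left to the annotator.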

Being able to study each gene as it relates to all these sources of clues distinguishes integrations from simple collections of genomic data. As a result, the SEED should dramatically reduce the effort and cost required to construct genome integrations.

The SEED is also designed to support the work styles of individual biologists who may collaborate in collections of loosely coordinated distributed teams or in tightly coupled teams located at a single institution. Each SEED instance is a self-contained genome annotation system that permits multiple users to access, update, and extend the annotation database via a Web-based user interface; local developers also have access to a rich API and programming shell in Perl and Python. The system’s design ensures that each user has at hand on a local machine all the data and tools required to do annotations. To support distributed teams, the SEED uses a basic peer-to-peer synchronization facility that supports the sharing of information among ad hoc collections of SEED installations. The SEED’s peer-to-peer computing architecture enables sharing and synchronization of the following:

New genomes among servers;
Subsystem definitions and associated gene function assignments;
Gene function assignments (they can be installed selectively);
Gene annotations, including notes, pathways, and diagrams; and
Naming rules, or sets of function translations.
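
The selective installation of gene function assignments mentioned in the list above could look roughly like the following Python sketch. The function name, its signature, and the acceptance policy are hypothetical; they illustrate the idea of a local user deciding which incoming assignments to adopt, not the SEED's actual synchronization code.

```python
from typing import Callable, Optional


def install_assignments(
    local: dict[str, str],
    incoming: dict[str, str],
    accept: Callable[[str, Optional[str], str], bool],
) -> list[str]:
    """Copy assignments from a peer into the local store, but only those `accept` approves.

    `accept(gene, current_function, proposed_function)` encodes the local policy.
    Returns the ids of the genes whose assignment was installed.
    """
    installed = []
    for gene, proposed in incoming.items():
        current = local.get(gene)
        if current != proposed and accept(gene, current, proposed):
            local[gene] = proposed
            installed.append(gene)
    return installed


# Example policy: only fill in genes that have no local assignment yet.
local = {"gene_1": "Hexokinase"}
incoming = {"gene_1": "Glucokinase", "gene_2": "Pyruvate kinase"}
installed = install_assignments(local, incoming, lambda gene, old, new: old is None)
print(installed)  # ['gene_2']
```

Keeping the policy in a separate callable means each installation can apply its own rule (trust a particular colleague, never overwrite curated entries, and so on) without changing the merge logic.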

The SEED’s development involves three major software projects: scalable infrastructure to support construction of large-scale integrations of biological data (with genomic data being the primary initial target); peer-to-peer infrastructure to enable the distributed curation of integrated databases and synchronization of the integrations; and software to enable extensions of the integrated database environment to allow adaptation to new biological applications.

Hundreds of Complete and Partial Genomes

The SEED currently includes the RefSeq data (a nonredundant, integrated set of sequence data for several of the most commonly studied research organisms) distributed by the National Center for Biotechnology Information, along with many genomes not yet deposited in the public archives. It contains 295 complete or near-complete genomes and almost 700 partial genomes, offering access to six basic object types:

Genome. One or more contigs (a contiguous bit of genetic information, the unit actually sequenced by an automated sequencing robot) associated with a particular organism;
Contig. Contiguous DNA sequence;
Feature. Region of DNA sequence that has some associated properties;
Protein sequence. Amino acid sequence;
Functional role. Text description of the role of a protein or DNA sequence; and
Subsystem. Collections of functional roles that are related in some way.
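
The six object types above can be sketched as simple Python data classes to show how they relate to one another. All field names, the 1-based inclusive coordinates, and the forward-strand-only `sequence` helper are assumptions made for this illustration; they do not describe the SEED's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Contig:
    contig_id: str
    dna: str  # contiguous DNA sequence


@dataclass
class Feature:
    feature_id: str
    contig_id: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed inclusive
    properties: dict[str, str] = field(default_factory=dict)

    def sequence(self, contig: Contig) -> str:
        """DNA covered by this feature (forward strand only in this sketch)."""
        return contig.dna[self.start - 1:self.end]


@dataclass
class ProteinSequence:
    feature_id: str
    residues: str  # amino acid sequence


@dataclass
class FunctionalRole:
    description: str


@dataclass
class Subsystem:
    name: str
    roles: list[FunctionalRole]


@dataclass
class Genome:
    organism: str
    contigs: list[Contig]

    def total_length(self) -> int:
        return sum(len(c.dna) for c in self.contigs)


# Made-up data, for illustration only.
contig = Contig("contig_1", "ATGGCC")
genome = Genome("Example organism", [contig, Contig("contig_2", "TTGA")])
feature = Feature("feat_1", "contig_1", start=1, end=3)
print(genome.total_length(), feature.sequence(contig))  # 10 ATG
```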

All of this information is organized by genome but is searchable by many different indices, including annotation, pathway, homology, and subsystem context. Subsystem maps link to the SEED or to external databases. Since different researchers sometimes use different terminology to indicate the same functionality, the SEED also maintains a set of translation rules to convert between various notations. (The SEED’s data-exchange model is outlined in Figure 2.)
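
A translation rule of the kind described above can be as simple as a lookup from a variant spelling to a preferred one. The sketch below, in Python, is a minimal illustration; the rule table, the normalization step, and the function name are hypothetical and are not taken from the SEED.

```python
def normalize(function: str, rules: dict[str, str]) -> str:
    """Map a function string to its preferred form, or return it unchanged if no rule applies.

    Rule keys are assumed to be lowercase with single spaces between words.
    """
    key = " ".join(function.lower().split())
    return rules.get(key, function)


# Illustrative rule: two notations for the same protein.
rules = {"atp synthase subunit alpha": "ATP synthase alpha chain"}

print(normalize("ATP  synthase subunit Alpha", rules))  # ATP synthase alpha chain
print(normalize("Hexokinase", rules))                   # Hexokinase
```

Because such rule sets are themselves among the items that SEED installations can exchange, teams that share rules would converge on the same notation.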

Since each researcher has a complete copy of the code and database, he or she is free to include any data, from any source, restricted or otherwise, for personal use. The common notion of peer-to-peer communication involves anonymous users participating as both client and server in a common search space; queries are launched into that space and, when a query is satisfied, two peers communicate directly. The SEED has somewhat different requirements. Instead of an anonymous group, annotation teams of SEED users are well known to each other and, in fact, desire to circumscribe membership.

Data exchanges are not based on generic queries; instead, queries for synchronization are directed at specific SEED instances and users. To facilitate this model, the system architecture has a Seed Rendezvous Point where a distributed annotation team is able to meet in cyberspace and exchange data. The first version of the Seed Rendezvous Point is implemented as a simple Web service at a well-known address and is really just a place to register presence and act as a third-party store-and-forward facility to enable communication in the face of local firewall issues.
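
The register-and-relay behavior described above can be illustrated with a small in-memory Python sketch. The real Seed Rendezvous Point is described as a Web service; this class, its method names, and the message format are hypothetical stand-ins that show only the store-and-forward idea.

```python
from collections import defaultdict, deque


class RendezvousPoint:
    """Hypothetical in-memory stand-in for a store-and-forward rendezvous service."""

    def __init__(self) -> None:
        self._members: set[str] = set()
        self._mailboxes: dict[str, deque] = defaultdict(deque)

    def register(self, peer: str) -> None:
        """Announce that a SEED instance is present and may exchange data."""
        self._members.add(peer)

    def post(self, sender: str, recipient: str, payload: dict) -> None:
        """Store a message addressed to one specific peer until that peer collects it."""
        if sender not in self._members or recipient not in self._members:
            raise PermissionError("both peers must be registered")
        self._mailboxes[recipient].append((sender, payload))

    def fetch(self, peer: str) -> list[tuple[str, dict]]:
        """Return and clear all messages waiting for `peer`."""
        if peer not in self._members:
            raise PermissionError("peer must be registered")
        box = self._mailboxes[peer]
        messages = list(box)
        box.clear()
        return messages


# Made-up peer names and payload, for illustration only.
hub = RendezvousPoint()
hub.register("seed_lab_a")
hub.register("seed_lab_b")
hub.post("seed_lab_a", "seed_lab_b", {"gene_2": "Pyruvate kinase"})
print(hub.fetch("seed_lab_b"))  # [('seed_lab_a', {'gene_2': 'Pyruvate kinase'})]
```

In a design like this, both peers initiate their own connections to the rendezvous point and neither has to accept an incoming one, which is one plausible reason a third-party relay helps where firewalls block direct contact.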

The SEED’s current architecture will be enhanced in future versions with the addition of more robust security, data-exchange, and collaboration facilities. Using a peer-to-peer model allows distributed annotation teams to be formed where each member works in a particular area of expertise, and only the team is authorized to share that work. By controlling access to data items, researchers using restricted data can still participate in the exchange of data with larger annotation teams while conducting their own research.

As the number of publicly available genomes increases, effective high-throughput annotation will be based on developing consistent annotations of specific subsystems across hundreds of genomes simultaneously. A user of the SEED with knowledge of a specific subsystem will be able to develop a detailed understanding of the variants of the subsystem, the functional roles that make up each variant, and the variant that applies to each of the sequenced organisms.

Conclusion

The SEED represents an ambitious effort to stimulate and support education, research, and collaboration relating to the analysis of genomic data. Its primary objective is to dramatically improve the ability of biologists to construct and propagate large-scale integrations containing hundreds of genomes, expression data, metabolic data, and other forms of biological data needed to support the analysis of organisms. Its most notable benefit will be to enable analysis of newly sequenced genomes in the context of comparative data; such analysis will inevitably lead to far superior initial characterizations.

The SEED leverages a network and functional design incorporating the notion of communicating peers. This design is in keeping with current directions in network services architecture while facilitating collaboration and cooperation in the biological community, including the participation of students and researchers at institutions lacking their own large information technology infrastructures.

Figures

Figure 1. The iterative process of genome annotation.

Figure 2. The SEED data exchange model.

Figure. Artist’s conception of the A. niger beta-D-glucosidase enzyme (foreground) and a yeast cell (background). M. Himmel and S. Decker, National Bioenergy Center, National Renewable Energy Laboratory, Golden, CO; and D. Seely, Pixel Kitchen, Boulder, CO. This work was funded by the U.S. Department of Energy Office of the Biomass Programs.

Figure. Lateral view of the T. reesei cellobiohydrolase I enzyme, which is thought to be the critical agent in the hydrolysis of cellulose in biomass. The flash of light shown emanating from the catalytic domain depicts the energy released as a glycosidic bond is broken (hydrolyzed). M. Himmel and S. Decker, National Bioenergy Center, National Renewable Energy Laboratory, Golden, CO; and D. Seely, Pixel Kitchen, Boulder, CO. This work was funded by the U.S. Department of Energy Office of the Biomass Programs.

Footnotes

This work is supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38.