The Biological Digital Library

If the human genome is the book of life, then the data necessary to make sense of it is the library of life. Rather than a traditional library of books, it is a library of microbial, plant, and animal genomes, of 3D protein models, of experimental data, and of literature. This information is digital by necessity; the time when all known gene sequences could be published on paper is over. The dozens of gigabytes of sequence and structure will likely never touch paper again; it is truly a digital library. How should we deal with this enormous, heterogeneous mass of data? What tools will librarians need to curate it? Who will interpret the data, and how will they access it?

The accompanying figure shows some of the data types the Human Genome Project and its associated research programs are producing. On the traditional digital library side is MEDLINE, a comprehensive index of biomedical research abstracts. With nine million abstracts, carefully curated keywords and links to sequence and structure databases, it represents an immensely valuable research resource. We understand how to search in large textual databases, but within the biological digital library the information sources we can use for searching text are much richer.

GenBank, the primary repository for DNA sequences, represents a very different challenge to information retrieval. GenBank consists of over eight million DNA sequences of varying quality, from thousands of species, containing important genes and repetitive junk. Within the DNA are coded genes that, when translated, yield proteins, the cogs of the cellular machine. The bag-of-words model that is so successful in text retrieval fails for DNA and protein because symbol order is paramount, and matching must be robust in the presence of mutations. This necessitates a different set of techniques from those developed for text.
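The order-sensitivity point can be made concrete with a small sketch. It treats a DNA sequence as a "bag of symbols" and compares two sequences by cosine similarity over symbol counts. The sequence below and the helper names (`bag_of_symbols`, `cosine`) are made up for illustration and are not drawn from GenBank or any particular retrieval system.

```python
import math
from collections import Counter


def bag_of_symbols(seq: str) -> Counter:
    """Count each symbol, discarding all order information."""
    return Counter(seq)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A made-up sequence and a copy with the same composition but the
# order destroyed (symbols sorted alphabetically).
original = "ATGGCCAAGTTC"
shuffled = "".join(sorted(original))

sim = cosine(bag_of_symbols(original), bag_of_symbols(shuffled))
print(original, shuffled, round(sim, 6))
```

The two strings have identical symbol counts, so the bag-of-symbols similarity is maximal even though the second string no longer encodes the same sequence. A representation that ignores order cannot tell them apart.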

Another data type is structure; the 13,000 3D structures for proteins painstakingly determined in laboratories around the world are archived in the Protein Data Bank (PDB). Each structure contains hundreds of atom coordinates, and finding similar structures involves computationally expensive geometrical alignment. The assembly-line processes that have been so successful in producing sequences are only now being successfully applied to structure determination.

In addition to relationships between data of the same type (cosine similarity and citations for text, alignment by dynamic programming for sequences, and geometric alignment for structures), there are several kinds of explicit relationships between different data types. For example, links between MEDLINE abstracts and GenBank or PDB entries specify sequences discussed in a particular paper, or link a structure to a discussion of its interactions with other molecules. Other relationships include shared sequence motifs that indicate common function between protein sequences, or shared Medical Subject Headings (MeSH) terms that indicate common subject matter in abstracts. This graph of relationships can be browsed at the National Center for Biotechnology Information’s Entrez system at www.ncbi.nlm.nih.gov/Entrez. However, to go further we need new algorithms to explore this graph autonomously, and to discover new relationships not evident by eye.
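As an illustration of alignment by dynamic programming, here is a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for readability. Production tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(s: str, t: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -1) -> int:
    """Best global alignment score of s and t under a simple scoring scheme.

    Only two rows of the dynamic programming table are kept in memory.
    """
    # Row 0: aligning the empty prefix of s against prefixes of t.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            substitution = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            deletion = prev[j] + gap       # s[i-1] aligned to a gap
            insertion = curr[j - 1] + gap  # t[j-1] aligned to a gap
            curr[j] = max(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # identical sequences
print(global_alignment_score("GATTACA", "GATCACA"))  # one substitution
print(global_alignment_score("ACGT", "AGT"))         # one deletion
```

Unlike a bag-of-symbols comparison, the score depends on symbol order and degrades gradually with substitutions, insertions, and deletions, which is the kind of robustness to mutation that sequence retrieval needs.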


To understand how this graph can be used, consider three problems. The first problem involves finding relevant abstracts given a query. Once one relevant abstract has been identified, others can be identified using the entire graph of relationships rather than just cosine similarity. For example, two abstracts might be related via a shared sequence: in the figure, abstracts A and B are both linked to sequence Y, which implies they both discuss the same gene. A more distant relationship is the one between abstracts A and C, which are linked to sequences X and Z, respectively. X and Z are highly similar, which implies that A and C discuss genes that are evolutionarily related, that is, genes that almost certainly have the same function in the cell.
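The two kinds of relatedness described above can be sketched as lookups over a small typed graph. The edge lists are reconstructed from the prose description of the figure and should be read as assumptions. The function and variable names are hypothetical and do not correspond to Entrez or any other existing interface.

```python
from collections import defaultdict

# Assumed edges, reconstructed from the text's description of the figure.
LINKS = [("A", "Y"), ("B", "Y"), ("A", "X"), ("C", "Z")]  # abstract -> sequence
SIMILAR = [("X", "Z")]                                    # sequence ~ sequence


def build_indexes(links, similar):
    """Index abstract-sequence links and sequence-sequence similarity."""
    seqs_of = defaultdict(set)
    abstracts_of = defaultdict(set)
    for abstract, seq in links:
        seqs_of[abstract].add(seq)
        abstracts_of[seq].add(abstract)
    neighbors = defaultdict(set)
    for s, t in similar:
        neighbors[s].add(t)
        neighbors[t].add(s)
    return seqs_of, abstracts_of, neighbors


def related_abstracts(abstract, links=LINKS, similar=SIMILAR):
    """Map each related abstract to the strongest relationship found."""
    seqs_of, abstracts_of, neighbors = build_indexes(links, similar)
    result = {}
    # Direct case: both abstracts are linked to the same sequence.
    for seq in seqs_of[abstract]:
        for other in abstracts_of[seq]:
            if other != abstract:
                result[other] = "shared sequence"
    # More distant case: the abstracts are linked to similar sequences.
    for seq in seqs_of[abstract]:
        for sim_seq in neighbors[seq]:
            for other in abstracts_of[sim_seq]:
                if other != abstract:
                    result.setdefault(other, "similar sequences")
    return result


print(related_abstracts("A"))
```

Starting from abstract A, the sketch finds B through the shared sequence Y and C through the similarity between X and Z, mirroring the two cases in the paragraph above.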

A second problem, central to understanding the human genome, is annotating genes with their probable function. This information retrieval problem begins with a gene and looks for nearby abstracts in the graph of relationships. Distance is defined in terms of a variety of relationships, including ones that traverse multiple links. For example, sequence Y can be annotated using material from abstracts A, B, and C; C qualifies because of its relationship through sequence Z.
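One simple way to make "distance in the graph" concrete is shortest-path search with a cost per relationship. The sketch below uses Dijkstra's algorithm. The edge set is an assumption reconstructed from the text (the figure itself may connect Y and Z differently), and the costs (1.0 for an abstract-sequence link, 2.0 for a sequence similarity link) are arbitrary illustrative values, not something the article specifies.

```python
import heapq

# Assumed undirected edges with illustrative costs.
EDGES = [
    ("Y", "A", 1.0), ("Y", "B", 1.0),  # abstracts linked to sequence Y
    ("A", "X", 1.0),                   # abstract A also linked to sequence X
    ("X", "Z", 2.0),                   # X and Z are similar sequences
    ("Z", "C", 1.0),                   # abstract C linked to sequence Z
]
ABSTRACTS = {"A", "B", "C"}


def shortest_distances(source, edges):
    """Dijkstra's algorithm over an undirected weighted graph."""
    graph = {}
    for u, v, cost in edges:
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            candidate = d + cost
            if candidate < dist.get(nxt, float("inf")):
                dist[nxt] = candidate
                heapq.heappush(heap, (candidate, nxt))
    return dist


def nearest_abstracts(sequence, edges=EDGES, abstracts=ABSTRACTS):
    """Abstracts reachable from a sequence, closest first."""
    dist = shortest_distances(sequence, edges)
    reachable = [(a, dist[a]) for a in abstracts if a in dist]
    return sorted(reachable, key=lambda pair: (pair[1], pair[0]))


print(nearest_abstracts("Y"))
```

Under these assumed costs, A and B are the closest sources of annotation for Y, while C is reachable only by a multi-link path that passes through sequence Z and therefore ranks lower.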

The third problem is determining whether two genes are evolutionarily related, which usually involves just aligning the sequences. The similarity of two sequences is often too low to classify them as related with certainty, but by using other relationships between the two (say, shared terms in related MEDLINE abstracts) it will be possible to gather additional evidence to support their relationship. Expressing these information retrieval problems as graph algorithms allows more powerful computations to detect distant relationships and uncover novel associations. We are evaluating ways to express relationships of different types within a consistent probabilistic framework. The graph thus becomes a probabilistic model, and many existing inference techniques are applicable.
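To illustrate how weak evidence from several relationships might add up, the sketch below combines independent pieces of evidence by multiplying likelihood ratios into prior odds, in the manner of naive Bayes. This is a generic textbook technique used here as a stand-in and is not the probabilistic framework the authors are evaluating. The prior and the likelihood ratios are invented numbers, and the independence assumption is a strong simplification.

```python
import math
from typing import Iterable


def combine_evidence(prior: float, likelihood_ratios: Iterable[float]) -> float:
    """Posterior probability of relatedness under an independence assumption.

    prior: prior probability that the two genes are related (0 < prior < 1).
    likelihood_ratios: for each piece of evidence,
        P(evidence | related) / P(evidence | unrelated).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Invented numbers: a weak alignment alone versus the same alignment
# plus shared terms in related abstracts.
p_alignment_only = combine_evidence(0.1, [3.0])
p_with_shared_terms = combine_evidence(0.1, [3.0, 4.0])
print(round(p_alignment_only, 3), round(p_with_shared_terms, 3))
```

With these made-up values, the alignment alone leaves the pair more likely unrelated than related, while adding the second source of evidence tips the balance. The design point is only that independent weak signals accumulate in log-odds space; choosing realistic ratios is the hard part and is outside this sketch.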

Understanding the processes of life is important. As time goes on, more work in medicine will be done in the digital library rather than at the lab bench. Structuring and providing access to the biological digital library is thus crucial to the future of medicine and science. But in addition to its practical importance, it also represents an exciting challenge: to integrate information of very diverse kinds, and leverage the richness of relationships between sequence, structure, and literature.

Figures

Figure. Relationships between objects in the biological digital library.

Footnotes

This work is supported by NSF CAREER grant IIS-9986085.

May 2001 Issue


Communications of the ACM (CACM) is now a fully Open Access publication.
