Epigenomics Now

For nearly a quarter-century, we have had a (mostly) complete listing of the human genome, the three-billion “letter” sequence of DNA, most of which is the same for all of us. This reference copy makes it much easier for scientists to understand biological processes and to identify the individual variations, such as mutations, that contribute to disease. Despite its central role and its extreme usefulness, however, the genome’s impact on health care has been smaller than many proponents had hoped.

Part of the reason is that while most of the cells in your body carry identical DNA, the biological activity of different regions varies widely over time and between different tissues. It is these differences in gene expression that orchestrate the intricate development of tissues and the unique features of various cell types, as well as much of the misbehavior of cells in disease.

Researchers have been refining techniques to map and analyze cellular features that affect gene expression, including chemical modifications of DNA and of the proteins that it wraps around. These changes, which are called “epigenetic” because they do not alter the genetic sequence, can nonetheless persist through cell divisions, and sometimes even transmit altered activity to offspring. (The now-widely-used “-omics” suffix refers to the comprehensive survey of individual “-etic” measurements.)

In contrast with traditional painstaking laboratory techniques, modern biology exploits robotic manipulation of samples and high-throughput data acquisition to assemble enormous data resources. Computer analysis plays a critical role in interpreting these data, at least as important as it has been for DNA sequence data, and the challenges and opportunities are growing.

Sequence Information

The 3 billion DNA letters, chemically known as bases and denoted C, G, T, and A (for cytosine, guanine, thymine, and adenine), are distributed among 23 pairs of chromosomes, each of which contains tens to hundreds of millions of bases. Mapping the sequence involves chopping each strand into short pieces, chemically determining the order of bases in each piece, and computationally matching up overlapping segments to reconstruct the full sequence. (This process conflates regions that have very similar sequences; only recent technology that sequences longer sections has completed the end-to-end genome.)
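
The overlap-matching step can be illustrated with a toy greedy assembler. This is only a sketch under simplifying assumptions (error-free fragments, a single strand, no repeated regions); the function names and the example fragments are invented for illustration and do not describe any production assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; the pieces stay separate
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical fragments that overlap by three bases each.
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))
```

Because the sketch trusts any sufficiently long overlap, two regions with very similar sequences would be merged incorrectly, which is the conflation problem described above.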

The sequence is not enough, however. British biologist C.H. Waddington coined the term “epigenetics” in 1942 to describe how local conditions go beyond genetics to determine cell characteristics. For example, as a single fertilized egg divides and develops, the embryo’s cells become increasingly committed to specialized cell-type identities with distinct patterns of gene expression.

The first step in expressing DNA is its “transcription” into another nucleic acid, RNA, which exactly mirrors the DNA sequence. RNA performs many cellular roles, most familiarly through later “translation” via the genetic code into proteins that form cellular structures or catalyze chemical reactions.

Some proteins, known as transcription factors, bind to target “promoter” sequences in the DNA to regulate further transcription of nearby genes, creating feedback loops that stabilize particular patterns of expression. Although this mechanism matches Waddington’s original concept, epigenetics is now more often used to describe processes that more directly and persistently regulate the activity of specific regions of the genome. The late Nobel laureate Joshua Lederberg once said that epigenetics had already become a “semantic morass” by the late 1950s, and scientists have since identified numerous relevant processes that further complicate the terminology.

One important mechanism is the direct methylation of DNA, the chemical attachment of a methyl group to a C in the DNA chain, which often suppresses expression of a nearby gene. Another epigenetic mechanism involves chemical modification of histone proteins, around which much of nuclear DNA is normally wound tightly. Some of these changes increase gene expression, others decrease it.

To map these modifications, researchers use antibodies that bind to particular “marks.” The DNA is then chopped up and the antibody-bound segments are sequenced. The sequences are then computationally matched to the reference genome to find out which locations were modified.
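
The final matching step can be sketched in a few lines. The example below assumes error-free reads and exact, unique matches against a short made-up reference; real pipelines use indexed aligners that tolerate mismatches, and every name and sequence here is hypothetical.

```python
from collections import Counter


def map_reads(reference: str, reads: list[str]) -> Counter:
    """Count how many reads cover each reference position.

    Only reads that match the reference exactly, at a single location, are used.
    """
    coverage = Counter()
    for read in reads:
        start = reference.find(read)
        if start == -1:
            continue  # no exact match anywhere in the reference
        if reference.find(read, start + 1) != -1:
            continue  # ambiguous: the read matches more than one location
        for pos in range(start, start + len(read)):
            coverage[pos] += 1
    return coverage


def enriched_positions(coverage: Counter, min_reads: int = 2) -> list[int]:
    """Positions covered by at least `min_reads` antibody-bound fragments."""
    return sorted(pos for pos, n in coverage.items() if n >= min_reads)


# Made-up reference and reads; three reads pile up around positions 7 to 10.
reference = "TTGACCGATAGGCTAACGTTCA"
reads = ["CGATAG", "GATAGG", "ATAGGC", "AACGTT"]
print(enriched_positions(map_reads(reference, reads), min_reads=3))
```

Positions where many sequenced fragments pile up are the candidate locations of the mark that the antibody recognizes.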

The information in DNA is central to its biological significance, and it is preserved through the reliable incorporation of complementary bases (C with G and A with T) during DNA duplication to make new cells or new offspring. Epigenetic information is more fragile; it is largely maintained through specialized enzymes that mirror the marks on new complementary chains.
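
The base-pairing rule is simple enough to state as code. This minimal sketch (the function name is an illustrative choice) builds the complementary strand and shows that complementing twice recovers the original sequence, which is why each strand can serve as a template for the other.

```python
# Pairing rule from the text: C with G, A with T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_strand(strand: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return strand.translate(COMPLEMENT)[::-1]


original = "ATGC"
copy = complement_strand(original)
print(copy)                                  # complementary strand
print(complement_strand(copy) == original)   # the round trip restores the sequence
```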

The epigenetic information and the corresponding patterns of gene expression can signify specific cell types, such as neurons or muscle cells. They are also known signatures for particular cancers (although DNA mutations also play an important role). Epigenomics thus informs cancer prognosis, as well as guiding researchers to potential therapeutic targets.

Going for the Code

For many years, researchers have been developing computer tools to analyze various large-scale datasets built on the human genome. For example, in 2015 researchers adapted hidden Markov models to look for an “epigenetic code” that goes beyond the overall density of marks in particular regions for predicting DNA accessibility and transcription.
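
To give a flavor of how a hidden Markov model treats such data, the sketch below decodes a hidden “active” or “inactive” state for each genomic bin from the combination of marks observed there, using the standard Viterbi algorithm. It is not the published method: the two states, the mark labels “A” and “B,” and all probabilities are invented for illustration.

```python
import math

# All states, labels, and probabilities below are hypothetical.
STATES = ("active", "inactive")
START_P = {"active": 0.5, "inactive": 0.5}
TRANS_P = {
    "active": {"active": 0.8, "inactive": 0.2},
    "inactive": {"active": 0.2, "inactive": 0.8},
}
# Probability of seeing each combination of two marks in a bin, per state.
EMIT_P = {
    "active": {"none": 0.1, "A": 0.2, "B": 0.2, "A+B": 0.5},
    "inactive": {"none": 0.7, "A": 0.1, "B": 0.1, "A+B": 0.1},
}


def viterbi(observations, states=STATES, start_p=START_P,
            trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most probable hidden-state path for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at bin t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


bins = ["none", "none", "A+B", "A+B", "A+B", "none", "none"]
print(viterbi(bins))
```

In this toy setup the emission table is where a combinatorial “code” would show up: a state can assign a probability to the pair “A+B” that is not simply implied by the probabilities of “A” and “B” alone.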

“There has been a long debate whether … a code exists in which the combination of these marks is actually something more complex and meaningful than just the sum of them,” said computational biologist Mattia Pelizzola of the Italian Institute of Technology in Milan. He notes that although some marks do have interacting effects, there is little support for a large-scale epigenetic code that had earlier been envisioned.

An important regulator of gene expression is the three-dimensional (3D) structure of the chromatin, the conglomeration of DNA and proteins in the nucleus. It was long clear that there are two distinct states of chromatin, as well as distinct locations in the nucleus, which differ in their gene activity. Recent research has shown much greater complexity, organized in part by the epigenetic modifications.

Researchers have developed tools for probing this 3D structure, for example by chemically cross-linking nearby regions of DNA, which need not even be on the same chromosome. Chopping up the DNA and sequencing the cross-linked segments then provides a comprehensive view of which parts of the DNA are brought into proximity at various length scales by the 3D folding.
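
Once the cross-linked segments have been located in the genome, turning them into a contact map is largely a counting exercise. The sketch below assumes each cross-linked pair has already been mapped to two (chromosome, position) coordinates; the function name, the bin size, and the example pairs are hypothetical.

```python
from collections import Counter


def contact_counts(pairs, bin_size: int) -> Counter:
    """Count cross-linked pairs between fixed-size genomic bins.

    `pairs` holds ((chrom_a, pos_a), (chrom_b, pos_b)) tuples. Keys of the
    result are order-independent pairs of (chromosome, bin index).
    """
    counts = Counter()
    for (chrom_a, pos_a), (chrom_b, pos_b) in pairs:
        bin_a = (chrom_a, pos_a // bin_size)
        bin_b = (chrom_b, pos_b // bin_size)
        counts[tuple(sorted((bin_a, bin_b)))] += 1
    return counts


# Made-up contacts: two between distant bins of chr1, one between chromosomes.
pairs = [
    (("chr1", 1200), ("chr1", 8100)),
    (("chr1", 8900), ("chr1", 1700)),
    (("chr1", 500), ("chr2", 300)),
    (("chr1", 2500), ("chr1", 2900)),
]
for key, n in sorted(contact_counts(pairs, bin_size=1000).items()):
    print(key, n)
```

Choosing a larger or smaller bin size changes the length scale at which proximity is summarized.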

These measurements and their computational analysis revealed the formation of loops, and the resulting “topologically associating domains” that bring together genes and regulatory elements. Understanding how epigenetic marks guide the 3D organization is an important, ongoing challenge that is well suited to artificial intelligence. For example, Jie Liu, assistant professor of computational medicine and bioinformatics at the University of Michigan, and his colleagues used a deep learning model to predict 3D structure from available data such as DNA and histone marks and accessibility.

Deep learning “no longer requires humans to annotate the features,” Liu said. “It has the convolution filters to identify the features automatically from the data.” Nonetheless, he thinks the features that emerge can be understood by people, which is important “so that people can trust the model.”

“There are tons of models like this in this domain,” Liu said. “We are trying to integrate everything into one framework for everything,” he added, using a transformer-based method inspired in part by large language models. “This is really similar.”

A Rich Future

The methodologies for mapping DNA methylation and histone modifications are well established, as are techniques for mapping DNA accessibility and (to a lesser degree) 3D chromatin structure. However, researchers are exploring other, less direct epigenetic mechanisms, which are also ripe for computational study.

Some RNA transcripts, for example, can directly influence gene expression without the need to be translated into protein. These RNA molecules, like transcription factors, can transmit their information to daughter cells. “It has been speculated that at the level of transcriptome, there is something that can be inherited and transmitted through cell division,” said Pelizzola. In his own, early-stage research, Pelizzola is also exploring the “epitranscriptome,” a “recently emerging layer of epi-information” involving chemical modifications of transcribed RNA.

Most epigenetic information is reset during reproduction, but there have been some reports that features can be conveyed through multiple generations. Some biologists suspect such preservation could affect long-term evolution by laying the groundwork for related genetic changes in DNA, but this idea is not universally accepted.

Computational methods will continue to play key roles in the ongoing explosion of biological understanding, Pelizzola said. For one thing, “Brute force computer power is more and more needed for dealing with huge and increasing amounts of data” produced by improved measurement tools. Second, he noted that computer science and multidisciplinary teams are critical for precise computational methods, such as those based on differential equations.

Finally, deep learning and artificial intelligence “are very likely to be very disruptive in terms of revealing unexpected connections,” he said. “The key is having enough data” for training.