Learning to See

How do you look for a needle in a haystack, when you are not sure what the needle looks like? This is the problem that faces scientists as they try to deal with increasingly complex datasets. One answer is to turn machine learning loose on the enormous volumes of data they have captured.

The problem of finding relevant data in genetic databases is one that Simon Roux, a researcher working at the U.S. Department of Energy’s Joint Genome Institute, faced when investigating the role that an obscure and little-understood family of viruses plays in the environment.

Many types of virus, called bacteriophages, infect bacteria. Many of these either kill their hosts or are themselves rejected by an immune response. The bacteriophages that belong to the family inoviridae, by contrast, can remain in the host for long periods. This property has helped make one such “inovirus,” known as M13, a popular choice among bioengineering researchers. The needle-shaped M13 infects the Escherichia coli (E. coli) bacterium, an organism that is very easy to cultivate under laboratory conditions. When the bacteria expel the virus particles that the viral DNA forces them to make, the particles are available in large numbers and can be purified easily, chemically treated to sterilize them, and then formed into artificial structures. A team at the Massachusetts Institute of Technology (MIT) led by Angela Belcher has used M13 scaffolds to make electrical batteries. Seung-Wuk Lee of the University of California, Berkeley (UC Berkeley) has used genetically engineered versions of the same inovirus to create piezoelectric generators.

Figure. Colored transmission electron micrograph of a T4 bacteriophage virus, magnified 100,000 times.

Members of inoviridae can show a much darker side, too. One inovirus has been found to make cholera bacteria much more deadly. Says Roux, “You might think, as we have these viruses with great applications and others with a big impact, we must know a lot about them. But we don’t.”

There are fewer than 100 confirmed species of inovirus. They even seem to elude methods that were developed specifically to find and identify novel species of microorganism. One such technique, the metagenomic survey, takes advantage of the high-speed “next-generation” gene-sequencing (NGS) hardware now available to biologists. Derived from the “shotgun” sequencing used on the Human Genome Project more than 15 years ago, NGS makes it possible to reconstruct genomes from multiple species that may be contained in a single sample, instead of first trying to isolate the DNA of each organism.

The first step is to shred DNA extracted from a biological sample before using enzymes to make enough copies for sequencing. High-performance computers then attempt to piece together the resulting jigsaw into longer sequences. The algorithms do this by aligning segments that appear to overlap before assembling them into different candidate genomes. Normally, in a metagenomic survey, researchers hand-check the results to try to weed out false matches.
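The overlap-and-merge idea can be illustrated with a toy example. The Python sketch below is not how production assemblers work: it assumes error-free reads from a single strand, ignores reverse complements and repeated regions, and uses made-up reads and hypothetical function names (overlap_length, greedy_assemble). It simply merges, again and again, the two fragments whose ends overlap the most.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps enough; keep the remaining contigs apart
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up reads for illustration only; each overlaps the next by four bases.
reads = ["ATGGCGT", "GCGTACG", "TACGTTA", "GTTAGCC"]
contigs = greedy_assemble(reads)
print(contigs)
```

A greedy strategy like this one can join fragments that merely look as though they belong together, which is one reason candidate genomes from a mixed sample still call for the hand-checking described above.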

With bacteria and higher organisms, it is relatively straightforward to ensure that each genome represents a single species. One commonly employed technique looks for variations of one or two essential large genes. Because these particular genes are fundamental to the survival of the organism, they exhibit relatively minor deviations across species, and organisms from the same family will have common changes that are not seen in more distant relatives.

In some cases, metagenomics has revealed thousands of previously unknown organisms lurking in samples from a single location. A group led by Jill Banfield at UC Berkeley took samples from sediment beds at an abandoned uranium mine in Colorado in 2015. From those samples, NGS and computer analysis coupled with manual curation reconstructed more than 2,500 partial and complete genomes, and revealed among them nearly 50 new families of bacteria. Further work led the team to propose a new “tree of life” that they believe explains the evolutionary relationships between microorganisms better than traditional models.

For both bacteria and viruses, metagenomic surveys have produced genomes suitable for study without demanding that each species be cultured in the lab. For many species, that is impossible using current techniques. Viruses present a significant problem as they are closely associated with their hosts and do not grow in isolation.

Siddharth Krishnamurthy, a researcher at the Washington University School of Medicine in St. Louis, says, “Without these large genomic databases and algorithmic approaches to populate them, we would be unaware of whole families of viruses that have never been cultured.”

Yet within these databases, members of the inoviridae family are suspiciously absent. Roux’s hunch was that inoviruses are commonly found in the environment and that detection was the main problem. It seemed that traditional genome-identification and binning tactics did not work well on them. One possibility was to use a tool called VirSorter, developed at the University of Arizona when Roux worked there. This software looks for characteristic nucleotide patterns in genomes, such as sequences that code for the protein shells in which viruses wrap their DNA payloads for transport to new victims.

“This work started when we realized that these viruses were missed by the probabilistic techniques used in VirSorter. The short story is that these inovirus genomes are too short and their genes are too variable for a VirSorter-like approach to identify,” Roux says.

One approach that some groups have tried is to look at the statistical composition of the many tiny fragments of DNA that the sequencer reads. Although the reasons why are not yet understood, analysis of known viral genomes has shown that closely related genomes show a bias in the way nucleotides are used even in short sequences, known as k-mers.
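To make the notion of k-mer composition concrete, the following Python sketch counts every overlapping k-mer in a sequence and compares two sequences by the similarity of their k-mer frequencies. It is an illustration under stated assumptions rather than the method of any tool named here: the three sequences are invented, k is set to 3 only to keep the example small, and cosine similarity is simply one convenient way to compare two frequency profiles. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_profile(sequence, k=3):
    """Relative frequency of every overlapping k-mer in `sequence`."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse k-mer frequency profiles."""
    dot = sum(value * q.get(kmer, 0.0) for kmer, value in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # a sequence shorter than k has no k-mers to compare
    return dot / (norm_p * norm_q)


# Invented sequences: the first two share an AT-rich bias, the third does not.
profile_a = kmer_profile("ATATATATGCATATATAT")
profile_b = kmer_profile("TATATATACGTATATATA")
profile_c = kmer_profile("GGCCGGCCGGCCGGCCGG")

print(cosine_similarity(profile_a, profile_b))  # high: similar composition
print(cosine_similarity(profile_a, profile_c))  # zero: no shared 3-mers
```

The first two sequences never align end to end, yet their profiles are close because both are dominated by the same few k-mers. That is the kind of signal a classifier can use without relying on similarity to any known gene.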

The DiscoVir tool developed by Krishnamurthy and colleagues uses machine learning trained on k-mer data to separate the genomes of unidentified viruses that infect plants and animals, rather than bacteria, from the bacterial and fungal material in metagenomic surveys. Machine learning makes it possible to use features that do not rely on similarity to known genetic sequences and to apply rules that are more likely to find virus candidates.

“In principle, if we knew the sequence of every virus on the planet, there would be no value in using a machine learning algorithm for virus identification,” Krishnamurthy says. “The greatest asset that I believe machine learning brings to viral identification is the ability of these algorithms to identify different combinations of variables that can lead to the positive prediction of a virus.

“Things like support vector machines and random forests don’t require all viruses to have the same properties. This is an important feature of viral classification because biologically, there are no molecular attributes that are specific to all viruses that are not present in any non-viruses, which is one of the reasons why it’s so hard to give an all-encompassing definition of a virus,” Krishnamurthy adds.

Roux and colleagues used a different set of features in the machine-learning algorithms they used to find their missing inoviruses. He explains, “We manually identified a set of 10 ‘fuzzy’ but distinctive features of inoviridae. None of these features is individually a clear sign, but some combinations of typically four or five of them is usually a great indicator.”

Roux’s group tried a number of machine learning algorithms, but found that a random-forest classifier provided the best results. Deep learning methods could not be used, because the training set was too small.
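How a random forest turns several weak clues into one verdict can be shown with a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the 10 features are anonymous yes/no flags (the article does not say what the real ones are), the training rows are synthetic, and the tree depth, forest size, and function names are arbitrary. It is not the classifier Roux's group built. Each tree is grown on a bootstrap resample of the data and considers only a random handful of features at each split; the forest's score is the share of trees that vote "inovirus-like."

```python
import random

N_FEATURES = 10  # the article mentions 10 features; their definitions are not given


def make_training_rows():
    """Synthetic (features, label) rows; NOT real inovirus data."""
    rows = []
    for start in range(N_FEATURES):
        for width in (5, 6):  # positives show a run of five or six features
            on = {(start + offset) % N_FEATURES for offset in range(width)}
            rows.append(([int(i in on) for i in range(N_FEATURES)], 1))
        # negatives show only one or two features
        for on in ({start}, {start, (start + 3) % N_FEATURES}):
            rows.append(([int(i in on) for i in range(N_FEATURES)], 0))
    return rows


def majority(rows):
    """Most common label among rows (0 on a tie)."""
    positives = sum(label for _, label in rows)
    return int(positives * 2 > len(rows))


def gini(rows):
    """Gini impurity of a list of labelled rows."""
    if not rows:
        return 0.0
    p = sum(label for _, label in rows) / len(rows)
    return 2 * p * (1 - p)


def build_tree(rows, rng, depth=3, max_features=3):
    """Grow one tree, testing a random subset of features at each split."""
    if depth == 0 or gini(rows) == 0.0:
        return majority(rows)
    best = None
    for feature in rng.sample(range(N_FEATURES), max_features):
        absent = [r for r in rows if r[0][feature] == 0]
        present = [r for r in rows if r[0][feature] == 1]
        if not absent or not present:
            continue  # this feature does not separate anything here
        impurity = (len(absent) * gini(absent)
                    + len(present) * gini(present)) / len(rows)
        if best is None or impurity < best[0]:
            best = (impurity, feature, absent, present)
    if best is None:
        return majority(rows)
    _, feature, absent, present = best
    return (feature,
            build_tree(absent, rng, depth - 1, max_features),
            build_tree(present, rng, depth - 1, max_features))


def tree_predict(tree, features):
    """Walk from the root to a leaf (an int label) for one feature vector."""
    while isinstance(tree, tuple):
        feature, if_absent, if_present = tree
        tree = if_present if features[feature] else if_absent
    return tree


def train_forest(rows, n_trees=101, seed=0):
    """Grow each tree on its own bootstrap resample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(build_tree(sample, rng))
    return forest


def forest_score(forest, features):
    """Fraction of trees voting 'inovirus-like' for this feature vector."""
    return sum(tree_predict(tree, features) for tree in forest) / len(forest)


forest = train_forest(make_training_rows())
print(forest_score(forest, [1] * N_FEATURES))  # every feature present
print(forest_score(forest, [0] * N_FEATURES))  # no feature present
```

Because each tree sees different rows and different features, no single feature decides the outcome; a sequence scores highly only when several of the flags point the same way, which mirrors the "four or five of them" pattern Roux describes.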

Once the team had a way of finding inoviruses in existing sequence data, the results were startling. The software identified thousands of probable inovirus genomes, many of which could be classified into six broad families. The team removed by hand around 70 sequences that did not seem to be inovirus genomes or were too unusual to be considered viable candidates.

“This doesn’t guarantee that every sequence we highlighted is an inovirus genome, but we feel confident that we do not have a large subset of our sequences that do not represent plausible inoviruses,” Roux says. “Beyond these in silico analyses, the real ‘proof’ will have to be done through lab experiments. But having looked at a lot of these sequences for the past year, I can tell that, to my eye, it really looks as though we found several thousand plausible inoviruses that had not been previously reported.”

Some of the newly identified inoviruses appear to infect bacteria that, up to now, were thought not to have any inoviruses associated with them, and the group as a whole appears to be far more numerous than the dozens listed in existing virus databases imply. “The inoviridae are a full viral order and associated with all types of bacteria. They are basically everywhere,” Roux says.

Krishnamurthy sees an important role for machine learning alongside conventional techniques in the continuing quest to map the Earth’s biological diversity. “They have the potential to work synergistically with their alignment-based counterparts. The minute an alignment-independent classifier finds one member virus of a novel family, alignment-based methods can be used to rapidly scan previously sequenced data to find more closely related members, so that the members of the viral family can be expanded.”

In turn, as the genome databases expand and become more accurate, better training data becomes available to the machine learning models, which will let them, in Roux’s view, find “gold in other people’s data surplus.” Similar work is likely to reveal more from the genomic and other data that scientists have already obtained. He adds, “The ’omics approaches generate so much data that no one can look at every potentially interesting piece in their dataset.”


December 2018 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
