Mining scientific data for patterns and relationships has been a common practice for decades, and the use of self-mutating genetic algorithms is nothing new, either. But now a pair of computer scientists at Cornell University have pushed these techniques into an entirely new realm, one that could fundamentally transform the methods of science at the frontiers of research.
Writing in a recent issue of the journal Science, Hod Lipson and Michael Schmidt describe how they programmed a computer to take unstructured and imperfect lab measurements from swinging pendulums and mechanical oscillators and, with just the slightest initial direction and no knowledge of physics, mechanics, or geometry, derive equations representing fundamental laws of nature.
Conventional machine learning systems usually aim to generate predictive models that might, for example, calculate the future position of a pendulum given its current position. However, the equations unearthed by Lipson and Schmidt represented basic invariant relationships, such as the conservation of energy, of the kind that govern the behavior of the universe.
The technique may come just in time as scientists are increasingly confronted with floods of data from the Internet, sensors, particle accelerators, and the like in quantities that defy conventional attempts at analysis. "The technology to collect all that data has far, far surpassed the technology to analyze it and understand it," says Schmidt, a doctoral candidate and member of the Cornell Computational Synthesis Lab. "This is the first time a computer has been used to go directly from data to a free-form law."
The Lipson/Schmidt work features two key advances. The first is their search for invariants, or "conservations," rather than for predictive models. "All laws of nature are essentially laws of conservation and symmetry," says Lipson, a professor of mechanical engineering. "So looking for invariants is fundamental."
Given crude initial conditions and some indication of what variables to consider, the genetic program churned through a large number of possible equations, keeping and building on the most promising ones at each iteration and eliminating the others. The project's second key advance was finding a way to identify the large number of trivial equations that, while true and invariant, are coincidental and not directly related to the behavior of the system being studied.
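The keep-and-build-on loop described above can be illustrated with a toy evolutionary search. The sketch below is not the Cornell software: the expression encoding, the helper names (random_tree, mutate, evolve), and the population settings are all assumptions made for illustration, and the fitness function is a stand-in (squared error against a known target expression) rather than the invariant-based criterion the researchers used.

```python
import math
import random

# Hypothetical building blocks: binary operators and leaf symbols.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
LEAVES = ["x", "v", 1.0, 2.0]

def random_tree(rng, depth=2):
    """Build a random expression as a nested tuple (op, left, right) or a leaf."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(LEAVES)
    op = rng.choice(list(OPS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x, v):
    """Evaluate an expression tree at the point (x, v)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x, v), evaluate(right, x, v))
    if tree == "x":
        return x
    if tree == "v":
        return v
    return tree  # numeric constant

def mutate(tree, rng):
    """Return a copy of the tree with one randomly chosen subtree replaced."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))

def evolve(fitness, rng, pop_size=50, generations=30, keep=10):
    """Keep the best candidates each generation and refill with their mutants."""
    population = [random_tree(rng) for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[:keep]          # the most promising equations
        history.append(fitness(survivors[0]))
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - keep)]
        population = survivors + children      # the rest are eliminated
    population.sort(key=fitness)
    return population[0], history

# Stand-in fitness: mean squared error against an assumed target expression.
POINTS = [(x / 4.0, v / 4.0) for x in range(-4, 5) for v in range(-4, 5)]

def squared_error(tree):
    total = 0.0
    for x, v in POINTS:
        diff = evaluate(tree, x, v) - (x * x + v * v)
        total += diff * diff
    err = total / len(POINTS)
    return err if math.isfinite(err) else math.inf
```

Calling evolve(squared_error, random.Random(0)) returns the best expression found and the best score per generation. Because the survivors are carried over unchanged, that score can only stay the same or improve from one generation to the next.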
Lipson and Schmidt found that trivial equations could be weeded out by looking at ratios of rates of change in the variables under consideration. The program was written to favor those equations that were able to use these ratios to predict connections between variables over time. "This was one of the biggest challenges we were able to overcome," Lipson says. "There are infinite trivial equations and just a few interesting ones."
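One way to picture the ratio test is with an idealized oscillator. If a candidate f(x, v) really is constant along the motion, then the way v changes relative to x in the data must equal the ratio implied by the candidate's own partial derivatives, dv/dx = -(df/dx)/(df/dv). The sketch below is an illustration under stated assumptions, not the authors' code: the simulated data, the function names, and the log-based error measure are all choices made here for clarity.

```python
import math

def simulate_oscillator(n=200, dt=0.05):
    """Return samples (x, v, dx/dt, dv/dt) for the idealized oscillator x'' = -x."""
    samples = []
    for i in range(n):
        t = i * dt
        x, v = math.cos(t), -math.sin(t)
        samples.append((x, v, v, -x))  # dx/dt = v and dv/dt = -x
    return samples

def partials(f, x, v, h=1e-5):
    """Central-difference estimates of df/dx and df/dv at (x, v)."""
    fx = (f(x + h, v) - f(x - h, v)) / (2 * h)
    fv = (f(x, v + h) - f(x, v - h)) / (2 * h)
    return fx, fv

def ratio_error(f, samples, eps=1e-6):
    """Mean log error between the observed dv/dx and the one implied by f.

    Returns infinity when f never yields a usable prediction, which is
    what happens for candidates that are constant for trivial reasons.
    """
    errors = []
    for x, v, dx_dt, dv_dt in samples:
        fx, fv = partials(f, x, v)
        if abs(dx_dt) < eps or abs(fv) < eps:
            continue  # the ratio is undefined at this sample
        observed = dv_dt / dx_dt   # dv/dx seen in the data
        predicted = -fx / fv       # dv/dx implied by f(x, v) = constant
        errors.append(math.log(1.0 + abs(observed - predicted)))
    return sum(errors) / len(errors) if errors else math.inf

def energy(x, v):
    """Energy-like quantity that is conserved by the oscillator."""
    return x * x + v * v

def not_conserved(x, v):
    """A quantity that varies along the motion."""
    return x + v

def trivial(x, v):
    """Always equals 1, but says nothing about how x and v are related."""
    return math.sin(x) ** 2 + math.cos(x) ** 2
```

On the simulated samples, the energy-like candidate scores close to zero, the non-conserved candidate scores poorly, and the trivial identity is rejected outright because its derivatives carry no information about how the variables move together.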
Like human scientists, the software favors equations with the fewest terms. "We want to find the simplest equation powerful enough to predict the dynamics of the system," Schmidt says.
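The preference for fewer terms can be read as a trade-off between accuracy and complexity. A minimal sketch of that trade-off follows. The candidate list, its term counts, and its error values are made-up toy numbers, and the two helper functions are hypothetical names introduced here for illustration, not part of the researchers' software.

```python
def pareto_front(candidates):
    """Return candidates that no other candidate beats on both terms and error.

    Each candidate is a tuple (expression, term_count, error).
    """
    front = []
    for cand in candidates:
        _, terms, error = cand
        dominated = any(
            o_terms <= terms and o_error <= error
            and (o_terms < terms or o_error < error)
            for _, o_terms, o_error in candidates
        )
        if not dominated:
            front.append(cand)
    return sorted(front, key=lambda c: c[1])

def simplest_adequate(candidates, tolerance):
    """Return the candidate with the fewest terms whose error is within tolerance."""
    adequate = [c for c in candidates if c[2] <= tolerance]
    if not adequate:
        return None
    return min(adequate, key=lambda c: (c[1], c[2]))

# Toy values for illustration only; they are not measured results.
CANDIDATES = [
    ("x", 1, 1.50),
    ("x + v", 2, 0.90),
    ("x*x + v*v", 2, 0.01),
    ("x*x + v*v + 0.001*x", 3, 0.012),
    ("x*x + v*v + 0.01*x*v + 0.002*v", 4, 0.008),
]
```

With these toy numbers, simplest_adequate(CANDIDATES, 0.05) picks the two-term expression: the four-term candidate is slightly more accurate, but the extra terms buy very little.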
Scientific data has become so voluminous and complex in many disciplines today that scientists often don't know what to look for or even how to start analyzing it. So they are applying artificial intelligence, via machine learning, to giant data sets without precisely specifying in advance a desired outcome. Unlike AI systems of the past, which were usually driven by hard-coded expert rules, the idea now is to have the software evolve its own rules primed with an arbitrary starting point and a few simple objectives.
Automating the discovery of natural laws marks a major step into a realm that was previously inhabited solely by humans.
Eric Horvitz, an AI specialist at Microsoft Research, says it's only the beginning. "Computers will grow to become scientists in their own right, with intuitions and computational variants of fascination and curiosity," says Horvitz. "They will have the ability to build and test competing explanatory models of phenomena, and to consider the likelihoods that each of the proposed models is correct. And they will understand how to progress through a space of inquiry, where they consider the best evidence to gather next and the best new experiments to perform."
A major challenge facing the European Organization for Nuclear Research (CERN) is how to use the 40 terabytes of data that its Large Hadron Collider is expected to produce every day. Processing that amount of data would be a challenge even if scientists knew exactly what to look for, but in fact they can hardly imagine what truths might be revealed if only the right tests are performed. CERN researchers have turned to Lipson and Schmidt for help in finding a way to search for scientific laws hidden in the data. "It could be a killer app for them," Schmidt says.
Indeed, Lipson and Schmidt have received so many requests to apply their techniques to other scientists' data that they plan to turn their methodology and software into a freely distributed tool.
Josh Bongard, a computer scientist and roboticist at the University of Vermont, says the Lipson/Schmidt approach is noteworthy for its ability to find equations with very little assumed in advance about their form. "That gives the algorithm more free rein to derive relationships that we might not know about," he says.
Bongard says earlier applications of machine learning to discovery have not scaled well, often working for simple systems, such as a single pendulum, but breaking down when applied to a chaotic system like a double pendulum. Further evidence of the scalability of the Lipson/Schmidt algorithm is its apparent ability to span different domains, from mechanical systems to very complex biological ones, he says.
The Lipson/Schmidt work takes search beyond "mining," where a specific thing is sought, to "discovery," where "you are not sure what you are looking for, but you'll know it when you find it," Bongard says. A key to making that possible with large stores of complex data is having algorithms that are able to evolve building blocks from simple systems into successively more complex models.
Such methods aim to complement the efforts of scientists, not to replace them, as some critics have suggested they might. "These algorithms help to bootstrap science, to help us better investigate the data and the models by acting like an intelligent filter," says Bongard.
Scientific research for decades has followed a well-known path from data collection (observation) to model formulation and prediction, to laws (expressed as equations), and finally to a higher-level theoretical framework or interpretation of the laws. "We have shown we can go directly from data to laws," says Schmidt, "so we are wondering if we can go from laws to the higher theory."
He and Lipson are now trying to automate that giant last step, but admit they have little idea how to do it. Their tentative first step uses a process of analogy; a newly discovered but poorly understood equation is compared with similar equations in areas that are understood.
For example, they recently mined a large quantity of metabolic data provided by Gurol Suel, assistant professor at the University of Texas Southwestern Medical Center. The algorithm came up with two "very simple, very elegant" invariants, so far unpublished, that are able to accurately predict new data. But neither they nor Suel has any idea what the invariants mean, Lipson says. "So what we are doing now is trying to automate the interpretation stage, by saying, 'Here's what we know about the system, here's the textbook biology; can you explain the new equations in terms of the old equations?'"
Lipson says the ultimate challenge may lie in dealing with laws so complicated they defy human understanding. Then, automation of the interpretation phase would be extremely difficult. "What if it's like trying to explain Shakespeare to a dog?" he asks.
"The exciting thing to me is that we might be able to find the laws at all," says Lipson. "Then we may have to write a program that can take a very complicated concept and break it down so humans can understand it. That's a new challenge for AI."
Lipson, H. and Schmidt, M. Distilling free-form laws from experimental data. Science 324 (Apr. 3, 2009), 81-85.
Waltz, D. and Buchanan, B. Automating science. Science 324 (Apr. 3, 2009), 43-44.
King, R., Whelan, K., Jones, F., Reiser, P., Bryant, C., Muggleton, S., Kell, D., Oliver, S. Functional genomic hypothesis generation and experimentation by a robot scientist. Nature 427, 6971 (Jan. 15, 2004), 247-252.
Bongard, J. and Lipson, H. Automated reverse engineering of nonlinear dynamical systems. In Proceedings of the National Academy of Sciences 104, 24 (June 6, 2007), 9943-9948.
Koza, J. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, 1992.
©2009 ACM 0001-0782/09/1100 $10.00