Several years ago, a group of researchers from OpenAI, one of the leading artificial intelligence research labs in the world, noticed a surprising phenomenon while training a neural network. Neural networks typically learn general features and relationships from their training data gradually, a process called generalization. After a recommended amount of training time, a network is thought to have reached its peak performance; beyond that point, it starts memorizing the training data and performing worse on previously unseen data. After accidentally leaving their model to train for much longer, however, the team found it suddenly had a lightbulb moment, showing a near-perfect ability to predict patterns in all types of test data.
The OpenAI researchers called the phenomenon ‘grokking,’ a term coined by science-fiction writer Robert A. Heinlein in his 1961 novel Stranger in a Strange Land to describe an understanding of something that is so profound, you essentially merge with it.
Now other teams are trying to better understand it.
“This phenomenon is very counterintuitive in terms of how we classically think of neural network training,” said Ahmed Imtiaz Humayun, a Ph.D. student working on deep learning theory and generative modeling at Rice University in Houston, and a student researcher at Google. “That’s why we want to know what is happening.”
Grokking was first observed by the OpenAI team in a specific setting: a deep neural network was being trained to perform a mathematical task that involved adding two numbers, dividing the sum by a prime number, and outputting the remainder, or the amount left over. The team also used a small dataset, which usually means a model takes more time to generalize to new data than it would with a larger one. In recent work, Humayun and his colleagues wanted to learn whether grokking was limited to such particular instances, or whether it is more widespread.
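The task amounts to modular arithmetic: predicting (a + b) mod p for a prime p. A minimal sketch of how such a dataset could be generated follows; the modulus and the train/test split here are illustrative assumptions, not necessarily the values used in the original experiments.

```python
# Minimal sketch: build a modular-addition dataset of the kind
# described above, i.e. predict (a + b) mod p for a prime p.
# The modulus 97 and the 50/50 split are illustrative choices.
import random

P = 97  # a prime modulus (assumed for illustration)

# Every (a, b) pair with its label (a + b) mod P.
examples = [((a, b), (a + b) % P) for a in range(P) for b in range(P)]

# Grokking was observed with small training sets; hold out half the data.
random.seed(0)
random.shuffle(examples)
split = len(examples) // 2
train, test = examples[:split], examples[split:]

print(train[0])  # e.g. ((13, 58), 71)
```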
“It could be that this is a more general phenomenon that happens for a lot of different cases,” said Humayun.
In their experiments, the team examined whether deep neural networks trained on widely used, large image datasets would grok. When using standard test datasets, subsets of the data held aside for evaluation and never shown to the model during training, the phenomenon had not been observed. They wanted to see whether a more extreme, alternative test dataset could induce grokking, so they added a small amount of noise to images in the standard test dataset to alter them slightly. Called adversarial examples, these altered images look unchanged to the human eye but can confuse neural networks. "It's a way of creating worst-case examples," said Humayun.
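One standard way to craft such perturbations is the fast gradient sign method (FGSM); whether the team used this particular attack is an assumption here, but the sketch below (in PyTorch) illustrates how a small, targeted change to an image is computed from the model's own gradients.

```python
# A minimal sketch of one common adversarial attack, FGSM. Whether the
# researchers used this exact attack is an assumption; it simply shows
# how a small, human-imperceptible perturbation can be crafted.
import torch
import torch.nn.functional as F

def fgsm_example(model, image, label, epsilon=0.03):
    """Perturb `image` by epsilon in the direction that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step along the sign of the gradient; epsilon controls how visible
    # the change is (small values are imperceptible to the human eye).
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()  # assumes pixels in [0, 1]
```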
A neural network’s performance on adversarial examples does not always improve with generalization, so Humayun and his colleagues were not sure what to expect. They created adversarial examples for several large test datasets: the image collections CIFAR10 and CIFAR100, as well as MNIST, a dataset of handwritten digits. They found that grokking occurred in all of these tests. The team also witnessed the phenomenon with a large language model (LLM) tested on adversarial examples based on text from Shakespeare.
“It’s very intriguing and inspiring that grokking can happen for these large settings as well,” said Humayun. “This might be answering some very fundamental questions about how neural networks learn in general: grokking (may be) a symptom of a neural network and how it learns.”
It also could point to parallels with human learning. Richard Baraniuk, a professor of electrical and computer engineering at Rice University and one of Humayun’s co-authors, mentioned the oft-cited rule that it takes 10,000 hours of practice for a person to become an expert in any given area. Grokking, he said, suggests that a neural network may need more repetition to truly master a task. “(Perhaps) after this long time period, stable expertise or the ability to apply what it has learned in these new situations really emerges,” he said.
The fact that grokking seems to be widespread suggests that understanding it better could lead to improvements in how neural networks are trained. There are currently computationally expensive techniques that can be used to improve the stability and robustness of their performance in a shorter training time. Yet the discovery of grokking shows that neural networks can reach this type of performance naturally. The challenge is to come up with a reliable way of inducing this lightbulb moment earlier on. “There could be a unique and different pathway compared to the direction in which current research is (going),” said Humayun.
Another team has been examining whether grokking can be predicted in early stages of training. In recent work, Irina Rish, a professor in the department of computer science at the Université de Montréal in Canada, and her team analyzed patterns in the learning curves of a neural network to see if they could provide clues about whether grokking would happen later on.
“There was intuition coming from the fact that before a major transition in complex systems, in physics and climate and in the brain, like epileptic seizures, some quantity of interest usually tends to have high variance and oscillates,” said Rish.
The researchers ran experiments that involved training a transformer model, a type of neural network that learns relationships and context from sequences of data, with arithmetic data. They used settings where grokking occurred and others where it was absent to investigate whether there were early patterns that correlated with the phenomenon. They found that in models that eventually grokked, there were oscillations in the early stages of training. These fluctuations appeared in the training loss, a measure of how far a model's predictions fall from the data it is being trained on.
Rish thinks these results could help make neural network training more efficient. If a similar network is being trained and the behavior her team identified is not present, there may be no need to continue that particular run, since grokking is unlikely to occur. Screening out runs that are not promising would save time and compute, which has cost implications, she said.
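As a rough illustration of how such a signal could be used in practice, one could monitor the variance of the training loss over sliding windows and screen out runs that never oscillate; the window size and threshold below are arbitrary assumptions, not values from Rish's study.

```python
# A rough sketch of the early-warning idea: flag training runs whose
# loss curve oscillates (has high variance) early on. The window size
# and threshold are arbitrary assumptions, not values from the study.
import statistics

def oscillation_score(losses, window=50):
    """Mean variance of the training loss over consecutive windows."""
    variances = [
        statistics.variance(losses[i:i + window])
        for i in range(0, len(losses) - window, window)
    ]
    return statistics.mean(variances) if variances else 0.0

def worth_continuing(losses, threshold=1e-3):
    # A flat early loss curve (low variance) would suggest grokking is
    # unlikely, so the run could be stopped early to save compute.
    return oscillation_score(losses) >= threshold
```

In this sketch, a run whose early loss history stays flat would be flagged for early stopping, while one that oscillates would be allowed to continue training.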
Although there has recently been a surge in research into grokking, Rish thinks cross-disciplinary collaborations could help provide new insight since similar behavior is exhibited in other complex natural and artificial systems. Her team is interested in working with statistical physicists, for example. “It will take a real edge to understand the general behavior of complex dynamical systems,” said Rish. “We need different perspectives that artificial intelligence people may not have the proper background for.”
Sandrine Ceurstemont is a freelance science writer based in London, U.K.