Deep learning should not work as well as it seems to: according to traditional statistics and machine learning, any analysis that has too many adjustable parameters will overfit noisy training data, and then fail when faced with novel test data. In clear violation of this principle, modern neural networks often use vastly more parameters than data points, but they nonetheless generalize to new data quite well.
The shaky theoretical basis for generalization has been noted for many years. One proposal was that neural networks implicitly perform some sort of regularization—a statistical tool that penalizes the use of extra parameters. Yet efforts to formally characterize such an “implicit bias” toward smoother solutions have failed, said Roi Livni, an advanced lecturer in the department of electrical engineering at Israel’s Tel Aviv University. “It might be that it’s like a needle in a haystack, and if we look further, in the end we will find it. But it also might be that the needle is not there.”
A Profusion of Parameters
Recent research has clarified that learning systems operate in an entirely different regime when they are highly overparameterized, such that more parameters let them generalize better. Moreover, this property is shared not just by neural networks but by more comprehensible methods, which makes more systematic analysis possible.
“People were kind of aware that there were two regimes,” said Mikhail Belkin, a professor in the Halıcıoğlu Data Science Institute of the University of California, San Diego. However, “I think the clean separation definitely was not understood” prior to work he and colleagues published in 2019. “What you do in practice,” such as forced regularization or early stopping of training, “mixes them up.”
Figure. Kernel machines like the one above transform data that is not linearly separable in its original space into a higher-dimensional space in which it becomes linearly separable.
Belkin and his co-authors systematically increased the complexity of several models and confirmed the classical degradation of generalization. Their analysis revealed a sharp peak in the prediction error as the number of model parameters became high enough to fit every training point exactly. Beyond this threshold, however, generalization improved again, so the overall curve showed what they called “double descent.”
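The effect is easy to reproduce outside of neural networks. The short Python sketch below is an illustrative toy setup, not the authors’ experiment: it fits a random-feature model of growing size to a small, noisy one-dimensional dataset using the minimum-norm least-squares solution. The test error typically spikes near the interpolation threshold, where the number of features matches the number of training points, and then falls again as the parameter count keeps growing.

# Toy double-descent demonstration (illustrative sketch, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.1):
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + noise * rng.standard_normal(n)
    return x, y

def random_features(x, W, b):
    # Fixed random projection followed by a cosine nonlinearity.
    return np.cos(x @ W + b)

n_train = 40
x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(500, noise=0.0)

for n_feat in [5, 10, 20, 40, 80, 160, 640]:
    W = rng.standard_normal((1, n_feat)) * 3.0
    b = rng.uniform(0, 2 * np.pi, size=n_feat)
    # Minimum-norm least-squares fit; it interpolates once n_feat >= n_train.
    coef = np.linalg.pinv(random_features(x_tr, W, b)) @ y_tr
    test_mse = np.mean((random_features(x_te, W, b) @ coef - y_te) ** 2)
    print(f"features={n_feat:4d}  test MSE={test_mse:.3f}")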
A highly overparameterized model—beyond the peak—has a huge, complex manifold of solutions in parameter space that can fit the training data equally well—in fact perfectly, explained Andrea Montanari, Robert and Barbara Kleist Professor in the School of Engineering and professor in the departments of electrical engineering, statistics, and mathematics of Stanford University. Training, which typically starts with a random set of parameters and then repeatedly tweaks them to better match training data, will settle on solutions within this manifold close to the initialization point. “Somehow these have the property, a special simplicity, that makes them generalize well,” he said. “This depends on the initialization.”
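That pull toward the starting point is easiest to see in a linear stand-in for a neural network. In the hypothetical example below, gradient descent on an overparameterized least-squares problem converges to the interpolating solution closest to its random initialization, because the updates never leave the subspace spanned by the training data.

# Illustrative sketch: gradient descent picks the interpolator nearest its start.
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 200                      # far more parameters than data points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w0 = 0.1 * rng.standard_normal(d)   # random starting parameters
w = w0.copy()
lr = 0.01
for _ in range(20_000):             # plain gradient descent on squared error
    w -= lr * X.T @ (X @ w - y) / n

# The interpolating solution closest to w0 in Euclidean distance:
# w0 + X^T (X X^T)^{-1} (y - X w0).
w_closest = w0 + X.T @ np.linalg.solve(X @ X.T, y - X @ w0)

print("training residual:      ", np.linalg.norm(X @ w - y))
print("distance from w_closest:", np.linalg.norm(w - w_closest))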
Quantitative metrics of generalization are challenging, though, cautioned Gintare Karolina Dziugaite of Google Brain in Toronto, and there are limits on what we should expect from “explanations” for it. One obvious measure is the performance of a trained model when faced with held-out data. “It will be quite precise, but from the explanation perspective, it is essentially silent,” she said. General theories, by contrast, do not depend on the details of the data, but “it’s well-appreciated at this point that such theories cannot explain deep learning in practice,” Dziugaite said. “Any satisfactory theory of generalization should lie between those two regimes.”
Dziugaite also noted that memorizing the training set, as overfitting does, could actually be useful in some situations, such as when a dataset includes small subpopulations. A tool that seems to generalize well on average might miss underrepresented examples, such as dark-skinned people in facial recognition data.
Boaz Barak, a professor of computer science at Harvard University, regards generalization as only one aspect of the power of neural networks. “If you want to talk about generalization in a mathematically well-defined way, you need to think of the situation where you have some distribution over the population and you’re getting samples from that distribution,” he said. “That’s just not how things work” for real-world datasets.
Good generalization, on average, also does not address the “fragility” problem, in which neural networks sometimes make inexplicable, egregious errors in response to novel inputs. However, “We are still far from having a way to fix that problem in a principled way,” said Montanari.
Kernel Machines
Belkin’s “most important discovery was that [the overparameterized regime] is really general,” Montanari said. “It’s not limited to neural networks.” As a result, “people started looking at this phenomenon in simpler models.”
Belkin, for example, has championed the venerable kernel machines for both their explanatory and practical power. When used as binary classifiers, kernel machines search in a very high-dimensional feature space for simple surfaces that separate two groups of data points that are intermingled when they are projected into fewer dimensions. To perform this separation, they exploit a mathematical “kernel trick” that computes the distances between pairs of points in the high-dimensional space without the need to compute their actual coordinates.
Kernel machines include support vector machines, which were widely explored for machine learning before the recent ascendence of deep learning. “It’s in a sense a simpler model,” Belkin said. “If you cannot even understand what’s going on with them, then you cannot understand neural networks.”
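A minimal sketch of these ideas, using scikit-learn’s support vector classifier as the kernel machine (the two-ring dataset is an illustrative choice, not one discussed here): a linear decision boundary fails on intermingled rings of points, while an RBF kernel, which implicitly works in a much higher-dimensional feature space through pairwise similarities alone, separates them almost perfectly.

# Illustrative sketch: the kernel trick separating data that a line cannot.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two intermingled classes: an inner ring and an outer ring in the plane.
X, y = make_circles(n_samples=400, factor=0.4, noise=0.08, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear").fit(X_tr, y_tr)       # no kernel trick
rbf = SVC(kernel="rbf", gamma=2.0).fit(X_tr, y_tr)  # implicit high-dimensional features

print("linear kernel accuracy:", linear.score(X_te, y_te))   # near chance
print("RBF kernel accuracy:   ", rbf.score(X_te, y_te))      # near perfect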
Furthermore, Belkin has come to believe that kernel machines may already contain the most important features of deep learning. “I don’t want to say everything about neural networks can be explained by kernels,” he said, “but I think that maybe the interesting things about neural networks are representable by kernels now.”
In some limiting cases, the connection can be made mathematically precise. One important limit is when a neural network has layers of infinite “width” (as contrasted with the “depth”—the number of layers—that gives deep learning its name). It has long been known that such wide networks, when randomly initialized, can be described as a Gaussian process, which is a type of kernel.
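A rough numerical check of that correspondence, for the simplest case of a single hidden layer of ReLU units with standard-normal weights (the particular inputs below are arbitrary): the average product of two hidden activations, taken over many random units, matches a closed-form expression proportional to the degree-1 arc-cosine kernel.

# Illustrative sketch: random wide ReLU layer versus its limiting kernel.
import numpy as np

rng = np.random.default_rng(0)
x1 = np.array([1.0, 0.5, -0.3])
x2 = np.array([0.2, -1.0, 0.7])

# Monte Carlo average of relu(w.x1)*relu(w.x2) over many random hidden units.
W = rng.standard_normal((1_000_000, 3))
relu = lambda z: np.maximum(z, 0.0)
empirical = np.mean(relu(W @ x1) * relu(W @ x2))

# Closed-form expectation, proportional to the degree-1 arc-cosine kernel.
n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
theta = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
closed_form = n1 * n2 * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

print("Monte Carlo estimate:", empirical)
print("limiting kernel:     ", closed_form)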
The connection persists during training, as shown in a highly cited 2018 NeurIPS presentation by Arthur Jacot, a graduate student at Switzerland’s École Polytechnique Fédérale de Lausanne, and his colleagues. “We approximate the nonlinear model of neural networks by a local linear model,” he said. This Neural Tangent Kernel, or NTK, determines precisely how the solution evolves during training.
For infinitely wide networks, the authors showed the NTK does not depend on the training data and does not change during training. Jacot said they are still examining other conditions for a neural network to be in this “NTK regime,” including having a large variance in the initial parameters.
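For a concrete, if simplified, picture of what the NTK is, the sketch below computes it directly from its definition for a tiny one-hidden-layer ReLU network: the kernel value for two inputs is the inner product of the gradients of the network’s output with respect to all of its parameters. As the hidden layer is made wider, the values concentrate around a fixed number, in line with the infinite-width result described above. The specific network and inputs are illustrative assumptions, not taken from the paper.

# Illustrative sketch: empirical NTK of f(x) = a . relu(W x) / sqrt(m).
import numpy as np

rng = np.random.default_rng(0)

def ntk(x1, x2, W, a):
    m = len(a)
    def grads(x):
        pre = W @ x                        # pre-activations of the hidden layer
        h = np.maximum(pre, 0.0)           # relu(W x)
        g_a = h / np.sqrt(m)               # gradient w.r.t. the output weights a
        g_W = ((a * (pre > 0))[:, None] * x[None, :]) / np.sqrt(m)  # w.r.t. W
        return np.concatenate([g_a, g_W.ravel()])
    # NTK(x1, x2) = inner product of parameter gradients at the two inputs.
    return grads(x1) @ grads(x2)

x1 = np.array([1.0, -0.5])
x2 = np.array([0.3, 0.8])
for m in [10, 100, 10_000]:                # increasing hidden-layer width
    W = rng.standard_normal((m, 2))
    a = rng.standard_normal(m)
    print(f"width {m:6d}: empirical NTK(x1, x2) = {ntk(x1, x2, W, a):.4f}")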
“I became more committed to the kernel thing after this NTK paper,” said Belkin, “because they essentially showed that wide neural networks are just kernels,” which makes generalization easier to model.
Feature Learning
Kernels do not automatically do everything, though. “The main difference between kernel machines and neural networks is that neural networks learn the features from the data,” said Barak. “Learning from the data is an important feature of the success of deep learning, so in that sense, if you need to explain it, you need to go beyond kernels.” Optimizing feature recognition might even push neural network designers to avoid the NTK regime, he suggested, “because otherwise they may degenerate into kernels.”
“It’s very easy to come up with examples in which neural networks work well and no kernel methods work well,” said Montanari. He suspects the practical success of neural networks is “probably due to a mixture” of the linear part, which is embodied in kernels, with feature learning, which is not.
For his part, Belkin remains hopeful—although not certain—that kernels will be able to do it all, including feature identification. “There are mathematical results showing that neural networks can compute certain things that kernels cannot,” he said, but “that doesn’t actually show me that real neural networks in practice compute those things.”
“It’s not always true that neural networks are close to kernel methods,” Jacot acknowledged. Still, he emphasized that the NTK can still be defined, and can describe a network’s evolution, even outside of the NTK regime, which makes it easier to analyze what the networks are doing. “With NTK you can really compare different architectures” to see whether they are sensitive to particular features, he said. “That’s already very important information.”
Convolutional neural networks have proven powerful in image recognition, for example, in part because their internal connections make them insensitive to displacement of an object. “Even though these are not learned features, they are still quite complex, and result from the architecture of the network,” Jacot said. “Just having these kinds of features leads to a huge improvement in performance” when they are built into kernel methods.
For other tasks, though, the features that neural networks identify may be difficult for designers to recognize. For such tasks, Barak suggested, one approach “would be kind of a merging of neural networks and kernels, in the sense that there is the right kernel for the data, and neural networks happen to be a good algorithm to successfully learn that kernel.” In addition, “We have some evidence for universal features that depend on the data, not on any particular algorithm that you’re using to learn it. If we had a better understanding of that, then maybe generalization would come out the side from that.”
Further Reading

Belkin, M., Hsu, D., Ma, S., and Mandal, S.
Reconciling modern machine-learning practice and the classical bias–variance trade-off, Proc. Nat. Acad. Sci. 116, 15849 (2019), https://bit.ly/3EgkBYb

Belkin, M.
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation, Acta Numerica 30, 203 (2021), https://bit.ly/3GUJvhq

Jacot, A., Gabriel, F., and Hongler, C.
Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), https://bit.ly/32bQmo0