GLOM: Teaching Computers to See the Way(s) We Do

At the virtual Collision technology conference earlier this year, deep learning pioneer Geoffrey Hinton explained how he conceived of a new type of neural network that, he said, would be able to perceive things the way people do.

Hinton, an emeritus distinguished professor in the department of computer science of the Faculty of Arts & Science at Canada's University of Toronto, and also an Engineering Fellow at Google, is responsible for some of the biggest breakthroughs in deep learning and neural networks. He was honored as co-recipient of the 2018 ACM A.M. Turing Award, along with Yoshua Bengio and Yann LeCun, for conceptual and engineering breakthroughs that have made deep neural networks a critical component of modern computing.

In his Collision talk, Hinton pointed out that most neural networks performing object classification rely on representations produced by convolutional neural networks. These networks work well at classifying inputs such as images or words, even winning competitions such as the ImageNet Large Scale Visual Recognition Challenge, but they perceive images very differently from the way people do, which can sometimes lead to "crazy errors."

"They use lots of texture information, which people are insensitive to," Hinton said, "but they fail to use a lot of shape information, which people are very sensitive to."

Convolutional neural nets also are not very good at extrapolating to new viewpoints. "People can see an object from one viewpoint and then extrapolate to many different viewpoints," Hinton said, "but convolutional nets need to see many different viewpoints in order to understand them."

To illustrate this, Hinton showed images of a school bus that had been partially obscured with visual noise. Humans can still easily recognize the images as depicting school buses, Hinton said, while convolutional neural networks mistook them for ostriches.

Hinton wants to design neural networks so they have different ways of seeing the same thing, just as humans do. "That will make them far more interpretable, and far less likely to make crazy errors like that school bus," Hinton said.

GLOM

Hinton's February 2021 paper "How to Represent Part-Whole Hierarchies in a Neural Network" describes the design of an artificial intelligence system that models human perception. The paper does not describe a working system, Hinton said, but a single idea about representation, which would permit advances made by different groups (including transformers, neural fields, contrastive representation learning, distillation, and capsules) to be combined into a single imaginary system that he calls GLOM, short for "agglomeration." The appellation comes from the way in which the model would process information; as similarities or patterns emerge in the model, they begin to work in parallel and "glom together," Hinton said.

Hinton explained that GLOM answers the question of how neural networks with fixed architectures can parse an image into a part-whole hierarchy that has a different structure for each image. The idea is to simply use islands of identical vectors to represent the nodes in the parse tree.
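As a rough illustration of the "islands of identical vectors" idea, the Python sketch below reads a parse tree out of two levels of per-location vectors. It is only a toy under stated assumptions: the one-dimensional row of six locations, the two levels, the vector values, and the helper names find_islands and parse_tree are all invented for this example and do not come from Hinton's paper. It also shows only how islands could be read as tree nodes, not how a network would learn to form them.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Island = Tuple[int, int, Vector]  # (start, end_exclusive, shared vector)
Span = Tuple[int, int]


def same_vector(a: Vector, b: Vector, tol: float = 1e-6) -> bool:
    """Treat two vectors as identical if every component agrees within tol."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def find_islands(row: Sequence[Vector], tol: float = 1e-6) -> List[Island]:
    """Group adjacent locations that hold the same vector into islands."""
    islands: List[Island] = []
    start = 0
    for i in range(1, len(row) + 1):
        # Close the current island at the end of the row or when the vector changes.
        if i == len(row) or not same_vector(row[i], row[start], tol):
            islands.append((start, i, row[start]))
            start = i
    return islands


def parse_tree(lower: List[Island], upper: List[Island]) -> Dict[Span, List[Span]]:
    """Attach each lower-level island to the upper-level island that contains it."""
    tree: Dict[Span, List[Span]] = {(ps, pe): [] for ps, pe, _ in upper}
    for cs, ce, _ in lower:
        for ps, pe, _ in upper:
            if ps <= cs and ce <= pe:
                tree[(ps, pe)].append((cs, ce))
                break
    return tree


# Hypothetical data: six image locations, each holding one vector per level.
part_level = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (0.5, 0.5), (0.5, 0.5)]
object_level = [(2.0, 2.0)] * 4 + [(3.0, 1.0)] * 2

part_islands = find_islands(part_level)
object_islands = find_islands(object_level)
tree = parse_tree(part_islands, object_islands)
```

In this toy input, the part level forms three islands and the object level forms two, so the first object (locations 0 to 3) ends up with two parts and the second object (locations 4 and 5) with one. The architecture stays fixed at six locations and two levels; only the pattern of agreement among the vectors changes from one input to the next, which is the point the paragraph above makes.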

According to Hinton, human vision is a sampling process, with sharp resolution at the center of the retina and lower resolution around the edges. When viewing objects, people have the innate ability to instantaneously determine what an object is, based on a partial view. However, traditional AI vision systems see a static image at uniform resolution and process that image in its entirety. With GLOM, Hinton's idea is to create a system that recognizes images based on the relationships between their parts, as humans do, rather than by processing a static image as a whole, as today's AI systems do.

Hinton says if GLOM can be made to work, it should significantly improve the interpretability of the representations produced by transformer-like systems when applied to vision or language.

While he clearly intended GLOM to be a biologically realistic model of perception, Hinton acknowledges he does not yet have a learning procedure for GLOM that is any more biologically plausible than backpropagation, the algorithm commonly used to train neural networks.

Hinton said he started thinking about what eventually became GLOM when he began writing a design document for a software engineer, describing a system to be developed. He realized that he ought to explain the reasoning behind some of the design decisions, and as he added those explanations, the document got longer and longer. "Then I thought, 'if I'm going to write this big design document and it is going to have the reasons behind the design, that should be interesting to other people.'"

"I wanted to share those design decisions because lots of people are thinking about systems like this," Hinton said, "and they might want to build systems that use some of these ingredients."

John Delaney is a freelance writer based in Woodstock, NY, USA.