Seeking Artificial Common Sense


Although artificial intelligence (AI) has made great strides in recent years, it still struggles to provide useful guidance about unstructured events in the physical or social world. In short, computer programs lack common sense.

“Think of it as the tens of millions of rules of thumb about how the world works that are almost never explicitly communicated,” said Doug Lenat of Cycorp, in Austin, TX. Beyond these implicit rules, though, commonsense systems need to make proper deductions from them and from other, explicit statements, he said. “If you are unable to do logical reasoning, then you don’t have common sense.”

This combination is still largely unrealized; despite impressive recent successes in extracting patterns from massive data sets of speech and images, machine-learning systems often fail in ways that reveal their shallow “understanding.” Nonetheless, many researchers suspect hybrid systems that combine statistical techniques with more formal methods could approach common sense.

Importantly, such systems could also genuinely describe how they came to a conclusion, creating true “explainable AI” (see “AI, Explain Yourself,” Communications 61, 11, Nov. 2018).

Elusive Metrics

It can be difficult to determine whether a system really has common sense, or is just faking it. “That’s a classic issue with how AI is perceived versus what it’s doing,” said David Ferrucci of Elemental Cognition in Wilton, CT, who previously led IBM’s Watson project, which displayed superhuman performance on the television quiz show Jeopardy! “One of the problems with AI is that we project,” Ferrucci said, offering the example of thinking that, because human contestants understood what they read, Watson must understand, too. “That’s not the way it’s solving the problem,” he said.

Because tasks and assessments of common sense often are formulated linguistically, they become “tied up with issues of representation and how we reason in language,” said Ellie Pavlick of Brown University in Providence, RI. “Does succeeding on [a specific] test mean you’re succeeding at language, does it mean you’re succeeding at common sense, or somewhere weird in between?”

One revealing example is the “Winograd Schema Challenge,” proposed in 2012 as an improvement on the venerable Turing test. This task requires a system to resolve what a pronoun refers to, like “they” in the sentence “The city councilmen refused the demonstrators a permit because they {feared, advocated} violence.” The sensible answer depends on whether the verb is “feared” (councilmen) or “advocated” (demonstrators). Nothing in the syntax dictates the choice. “We believed that in order to solve the Winograd Schema challenge, a system would need commonsense reasoning,” said Leora Morgenstern, now at PARC in Palo Alto, CA. “This is not the case.”
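The structure of a Winograd schema can be made concrete with a small sketch. The class and function names below (`WinogradSchema`, `always_first`, `accuracy`) are illustrative assumptions, not part of any official challenge toolkit; only the sentence and its two answers come from the example above. The point is that a resolver that ignores the swapped word cannot do better than chance on a schema pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradSchema:
    """One schema: a sentence template plus the referent each special word implies."""
    template: str      # sentence with a {verb} slot for the special word
    pronoun: str       # the ambiguous pronoun to resolve
    candidates: tuple  # possible referents, in order of appearance
    answers: dict      # special word -> correct referent

    def sentence(self, word):
        return self.template.format(verb=word)

    def is_correct(self, word, guess):
        return self.answers[word] == guess


schema = WinogradSchema(
    template=("The city councilmen refused the demonstrators a permit "
              "because they {verb} violence."),
    pronoun="they",
    candidates=("councilmen", "demonstrators"),
    answers={"feared": "councilmen", "advocated": "demonstrators"},
)


def always_first(schema, word):
    """Hypothetical baseline: ignores the special word, picks the first candidate."""
    return schema.candidates[0]


def accuracy(resolver, schema):
    """Fraction of the schema's variants that the resolver answers correctly."""
    words = list(schema.answers)
    hits = sum(schema.is_correct(w, resolver(schema, w)) for w in words)
    return hits / len(words)


print(schema.sentence("feared"))
print(accuracy(always_first, schema))
```

Because the two variants differ in a single word and have opposite answers, the baseline gets exactly one of the two right.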

In constructing such ambiguous sentences, Morgenstern and others seek to avoid answers in which the alternative word choices would be statistically correlated with the correct pronoun assignment. Such statistical analysis, trained on enormous databases, is used in large-scale machine-learning language models like BERT, developed by Google, which are extremely good at exploiting correlations in the occurrence and arrangement of words in written language.

Contrary to their initial expectation, however, “When you look at this vast trove of sentences that have been collected, the statistical information has been captured” that is needed for disambiguating pronouns, Morgenstern said. “It reflects, in a way, some part of the commonsense knowledge that exists.” Still, she says the challenge fell short because these systems clearly do not have common sense, which would connect words or concepts that are widely separated in the text, or perhaps not even written down.

Although researchers have developed lots of “challenge” databases to try to measure common sense, “I don’t think we have very good metrics,” said Ernest Davis of New York University. “That’s part of the problem with this research area.”

Developing assessments is an important goal, agreed Matt Turek, who runs the Machine Common Sense program at the U.S. Defense Advanced Research Projects Agency (DARPA). Just a year or so into the four-year program, some participants had already achieved most of the program goals for existing benchmarks for common sense. “What’s really driven that level of results is the rapid increase in the capability of these large-scale machine-learning-based language models,” he said. “That doesn’t mean that they’ve learned common sense.”


Turek said multiple-choice or true/false questions are not as informative as tasks that require generating new answers. It is hard to score such unstructured responses at large scale, however, which is critical for providing feedback for machine learning.

Competing Representations

The longest-standing approach to embodying common sense does not depend on automated training, but on explicit, symbolic rules. These relationships often are represented as knowledge graphs in which nodes describing a concept are connected by arrows describing their relationship, such as “Napping” ➔ “Causes” ➔ “Energy.”
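As a rough illustration of the knowledge-graph idea, the sketch below stores relationships as (head, relation, tail) triples and chains one relation across several hops. The `KnowledgeGraph` class and its methods are hypothetical names chosen for this example, and the second edge (“Energy” causes “Alertness”) is an invented fact added only to show chaining; it does not describe how any system named in this article is built.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal triple store: labeled arrows between concept nodes."""

    def __init__(self):
        # (head, relation) -> set of tail concepts
        self._edges = defaultdict(set)

    def add(self, head, relation, tail):
        self._edges[(head, relation)].add(tail)

    def query(self, head, relation):
        """Concepts directly connected to `head` by `relation`."""
        return sorted(self._edges.get((head, relation), set()))

    def follow(self, head, relation, max_hops=5):
        """Concepts reachable from `head` by chaining `relation` up to max_hops times."""
        seen, frontier = set(), [head]
        for _ in range(max_hops):
            next_frontier = []
            for node in frontier:
                for tail in self._edges.get((node, relation), ()):
                    if tail not in seen:
                        seen.add(tail)
                        next_frontier.append(tail)
            if not next_frontier:
                break
            frontier = next_frontier
        return sorted(seen)


kg = KnowledgeGraph()
kg.add("Napping", "Causes", "Energy")    # the example from the text
kg.add("Energy", "Causes", "Alertness")  # hypothetical edge, for illustration only

print(kg.query("Napping", "Causes"))
print(kg.follow("Napping", "Causes"))
```

Binary triples like these are easy to store and traverse, which is part of their appeal; richer statements need more elaborate representations.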

The CYC system, which Lenat has been working on for decades, extends this tradition of formal representation. He stresses, however, that binary relationships, and even third- and fourth-order relationships, are not rich enough. When reading a Communications article, for example, “You have no trouble following five, six, seven-deep nested modals about expectation, belief, desire, opposition to, and so on,” Lenat said. “Those are exactly the kinds of things that, kicking and screaming, we were led to represent in building the CYC system in the 1980s,” using formal representations of higher-order and modal logic. “The rest of the world is sort of stuck back where Marvin Minsky was in 1965,” he said.

Drawing inferences from these complex relationships is computationally challenging, however. “Over the last 35 years, we’ve identified about 1,000 different ways of speeding up logical inference so that our systems can do complicated reasoning in real time,” Lenat said. His company boasts clients in specialized applications like military and medical uses.

Because these projects are proprietary, however, other researchers cannot easily assess them, said Yejin Choi of the University of Washington and the Allen Institute for Artificial Intelligence. Choi noted that representing knowledge in such abstract constructs makes it hard for others to interpret or augment it; she relies on exactly that kind of outside contribution, crowdsourcing her work through Amazon Mechanical Turk. “Natural language is far more expressive than what we know how to describe only using logical forms,” she said.

In contrast, Pavlick notes that “you can have common sense without any ability to speak language.” Despite her background studying natural language, she has begun exploring “reference-heavy” systems that use virtual worlds to develop “the notion of the things that language refers to and then learn language to refer to those things.”

Prospects

Despite their differing preferred representations, many researchers agree that success in machine common sense will depend on combining different approaches. In the DARPA program, Turek said, “Some of the interesting work is on the interplay between graph representations—which have been around for a while and are quite rich, and allow you to do various types of reasoning—and deep learning, which might give you a good feature representation for text or images.”

Similarly, the human brain’s two hemispheres combine intuitive heuristics with formal reasoning, Lenat said. “Most AI systems in the future, even in the near future, will have the same kind of architecture.”

A recent project from Choi’s team called COMET (for Commonsense Transformers for Automatic Knowledge Graph Construction) combines a semi-structured knowledge graph with self-supervised machine learning, like that underlying large-scale language models. COMET has earned praise for plausibly completing sentences, which Choi said is a better task than teaching AI systems “to cheat better on multiple-choice questions.”


Elemental Cognition also is trying “to build a machine that does both” statistical language generation and capturing a “fundamental reality that causes the language,” Ferrucci said. In contrast to Watson, “I ultimately want to produce a model that is aligned with how humans acquire, structure, and communicate information.” Part of that process is using interactive exchanges with human users to “develop and maintain a shared understanding of the world,” he said.

Such interactive learning, inspired by the way children learn, is also one of two thrusts of the DARPA program, Turek noted. “There are really fundamental questions at play, both for the child developmental psychology community and how we might apply that to inspire a new generation of artificial intelligence techniques,” he said. “One of the things that’s really inspiring about human infants is their ability to get broad general knowledge and then apply that successfully to specific challenges.”

“Humans are often very good, if they have the right kind of framework, at learning from a single example,” Davis agreed, unlike deep-learning systems, which demand enormous datasets. A child exposed to an iguana, for example, can immediately deduce that, like other animals, an iguana is born small and will eventually die.
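The iguana example amounts to inheriting properties through a category link: with the right framework in place, one new fact is enough. The sketch below is a minimal illustration under that assumption; the dictionaries `IS_A` and `PROPERTIES` and the function `inherited_properties` are hypothetical names, and this is not a description of how children or any cited system actually reason.

```python
# Framework already in place before the new example arrives.
IS_A = {}  # concept -> its parent category
PROPERTIES = {"animal": {"is born small", "will eventually die"}}


def inherited_properties(concept):
    """Collect properties of a concept and of every category above it."""
    props, seen = set(), set()
    while concept is not None and concept not in seen:
        seen.add(concept)  # guards against accidental cycles in IS_A
        props |= PROPERTIES.get(concept, set())
        concept = IS_A.get(concept)
    return props


# Before the single example, nothing is known about iguanas.
print(inherited_properties("iguana"))

# One observation: an iguana is an animal.
IS_A["iguana"] = "animal"
print(sorted(inherited_properties("iguana")))
```

A single added link is enough to transfer everything known about the parent category, which contrasts with the many examples a purely statistical learner would need.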

Pavlick notes that better training algorithms could help machine-learning systems generalize. “Their training isn’t really set up to incentivize them to learn the kind of representations … that would allow them to do this kind of quick generalization.”

She warns, however, that training can easily encode prejudice. “It’s hard to come up with a clear way of differentiating between the OK kinds of probabilistic associations and inferences that people make in common sense and ones that are just bad stereotypes.”

One advantage of including a formal component is that it inherently produces a rigorous explanation for its conclusions, Lenat said, not just a post-hoc rationalization. “It’s been understood for a very long time that common sense involves both the knowledge base and the inference ability,” agreed Morgenstern. “There has been a shift in emphasis,” she said, but “you need both.”

Further Reading

Davis, E. and Marcus, G.
Commonsense reasoning and commonsense knowledge in artificial intelligence, Communications of the ACM 58, No. 9, August 2015, https://dl.acm.org/doi/10.1145/2701413

Davis, E. and Marcus, G.
Rebooting AI: Building Artificial Intelligence We Can Trust, Pantheon Books, New York, 2019, https://amzn.to/37YBkke

Kocijan, V., Lukasiewicz, T., Davis, E., Marcus, G., and Morgenstern, L.
A Review of Winograd Schema Challenge Datasets and Approaches, April 23, 2020, https://arxiv.org/abs/2004.13831

Bosselut, A., Rashkin, H., Sap, M., Malaviya, C., Celikyilmaz, A., and Choi, Y.
COMET: Commonsense Transformers for Automatic Knowledge Graph Construction, https://arxiv.org/abs/1906.05317

