Opinion
Artificial Intelligence and Machine Learning

Between the Booms: AI in Winter

After people stopped caring, artificial intelligence got more interesting.


Observing the tsunami of artificial intelligence (AI) hype that has swept over the world in the past few years, science fiction writer Ted Chiang staked out a contrarian position. “Artificial intelligence,” he insisted, was just a “poor choice of words … back in the ’50s” that had caused “a lot of confusion.” Under the rubric of intelligence, verbs such as “learn,” “understand,” and “know” had been misappropriated to imply sentience where none existed. The right words, he suggested, would have been “applied statistics.” Chiang was correct that AI has always been a fuzzy term used to market specific technologies in a way that has little inherent connection to cognition. It is also true that most current AI-branded technologies work by modeling the statistical properties of large training datasets.

But Chiang’s implication that AI has been consistently and uniformly statistical since the 1950s is quite wrong. The approaches that dominated the field from the 1960s to the 1980s owed nothing whatsoever to statistics or probability. In this column, I look at the shift of artificial intelligence research toward probabilistic methods and at the revival of neural networks. It is a complicated story, because the shift toward probabilistic methods in artificial intelligence was not initially driven by neural networks, and the revival of neural networks was until recently more likely to be branded as machine learning than as AI.

As I explained in my last column, 20th-century interest in artificial intelligence peaked in the 1980s, driven by enthusiasm for expert systems and a flood of public money. AI was moving for the first time beyond the laboratory and into a swarm of startup companies and research groups in large companies. Then the bubble burst and the famous AI winter set in.

The shift was brutal, as changes in technological fashion often are. Nobody wanted to fund startups anymore as sales of their products and services slumped. System development groups inside companies could no longer expect that associating themselves with AI would win resources and respect, though some continued under other names. In the 1980s, anything that automated complex processes by applying encoded rules had been called an expert system. The same basic idea was rebranded as business logic during the 1990s as part of the push for distributed computer architectures. Rule-based automation was also central to the emerging field of network security.

Universities shift more slowly. I have seen no evidence that courses in AI disappeared from the curriculum or that established AI faculty decamped in large numbers for other areas of computer science. But once it was no longer a hot area, grants for AI research became harder to get, which in turn curtailed opportunities for graduate students and postdocs. Combined attendance at AI conferences dropped sharply from its 1986 peak, finally bottoming out around 1999.a

New Approaches

During its time out of the spotlight, AI became more pluralistic as loss of faith in formerly dominant approaches created space for new ideas. By the time conference attendance fully recovered in the early 2010s, the AI community had spread itself over about a dozen different conferences focused on areas such as robotics, neural networks, and computer vision, rather than the two generalist conferences that formerly dominated (the meeting of the American Association for Artificial Intelligence and the International Joint Conference on Artificial Intelligence).

Rodney Brooks, a charismatic builder of robots, succeeded Patrick Winston as head of MIT’s Artificial Intelligence Laboratory in 1997. Brooks felt that traditional AI had run out of steam: “Nobody talks about replicating the full gamut of human intelligence any more. Instead we see a retreat into specialized subproblems.”1 To reinvigorate AI, Brooks championed embodied intelligence. Autonomous robots of the 1960s, such as SRI’s Shakey, translated sensor inputs into a model of the environment, planned a course of action within that model, and then tried to carry out the plan in the real world.3 Insects have tiny brains but are nevertheless capable of carrying out fluid movements and collaborating on complex tasks. Robots, argued Brooks, should likewise achieve intelligent behavior by interacting dynamically with their environment.

By the early 2000s, millions of robotic vacuum cleaners produced by iRobot, a company founded by Brooks and his students, were roaming along walls and across floors. Sony’s Aibo series of artificial dogs, introduced in 1999, were intended as surrogate pets but found a following in universities where students programmed them to compete in robotic soccer leagues.

Genetic algorithms were another hot area. Rather than relying on experts to encode knowledge in rules, systems would begin by trying out a set of candidate methods and seeing which worked best. Different possible elements of a solution, the metaphorical genes, were recombined repeatedly with more weight given to those that performed better. Performance gradually improved in a metaphorical process of evolution. The appropriation of biological terminology added excitement to what was, in essence, an iterative optimization approach based on the well-established principle of hill climbing.
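
To make the mechanism concrete, here is a minimal Python sketch of a genetic algorithm. Everything in it (the bit-string representation, the count-the-ones fitness function, the population size, and the mutation rate) is an illustrative assumption rather than a reconstruction of any particular system from the period; it simply shows candidate solutions being recombined and mutated, with better performers favored as parents.

    import random

    GENES, POP, GENERATIONS = 20, 30, 50

    def fitness(candidate):
        return sum(candidate)                      # toy objective: count the 1s

    def crossover(a, b):
        cut = random.randrange(1, GENES)           # recombine two parent "genomes"
        return a[:cut] + b[cut:]

    def mutate(candidate, rate=0.02):
        return [g ^ 1 if random.random() < rate else g for g in candidate]

    # Start from random candidates and repeatedly breed from the better half.
    population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
    for _ in range(GENERATIONS):
        parents = sorted(population, key=fitness, reverse=True)[:POP // 2]
        population = [mutate(crossover(*random.sample(parents, 2)))
                      for _ in range(POP)]

    print(max(map(fitness, population)))           # best fitness climbs over generations

Swap any scoring routine in for the toy fitness function and the same loop performs the kind of iterative, hill-climbing optimization described above.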

The appropriation of biological metaphors deepened with artificial life, a concept originated by Christopher Langton and promoted by the Santa Fe Institute, which was attempting to build an interdisciplinary science around complexity theory.7 Computing pioneers John von Neumann and Alan Turing had both been captivated by the idea of mathematically modeling self-reproducing systems. Artificial life revived the spirit of cybernetics, which sought common mechanisms across real and simulated biological systems. Interest blossomed in the early 1990s when Tom Ray, an ecologist, developed the Tierra system, in which self-reproducing programs ran on a virtual computer prone to mutating them. Ray claimed that the evolutionary dynamics observed as these programs competed for processor time and storage followed the same patterns observed in biological populations, but at much higher speeds.

For a while, perhaps because both included the word artificial, artificial life was seen by some as closely connected to artificial intelligence. In a 1996 AI textbook, philosopher Andy Clark positioned artificial life (broadly defined to include embodied intelligence and genetic algorithms) alongside symbolic AI and connectionism as the three main approaches to the field.2 Artificial life fizzled as a metadiscipline, just as cybernetics had done decades earlier, though the simulation of biological processes has developed further under the flag of systems biology, as has interest in designing synthetic DNA.

The Probabilistic Turn

A long drought of Turing awards for AI was broken in 2011 when Judea Pearl of the University of California, Los Angeles was recognized for his work rebuilding artificial intelligence on a new foundation of statistical reasoning. Pearl began his computer science research career as an expert on heuristic search, but by the mid-1980s was developing a new form of knowledge representation for AI: Bayesian networks. In Bayesian statistics, probabilities are conventionally expressed as degrees of belief in a hypothesis. This created a natural affinity with the preoccupation of AI researchers with knowledge and reasoning but clashed with the reliance of early AI systems on logics in which statements were known with certainty to be true or false. Some expert systems included estimates of confidence along with their conclusions, but these were calculated from confidence values arbitrarily assigned to individual rules.

Pearl’s networks connected beliefs to observations, updating their estimates as new data arrived. This research culminated in 1988 with the publication of Probabilistic Reasoning in Intelligent Systems.15 Pearl’s work had implications for social science as well as computer science, allowing researchers to work backward from naturally occurring data to probable causes, rather than forward from controlled experiments to results.
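
The flavor of belief updating can be conveyed with a toy calculation. The sketch below is not Pearl’s inference algorithm, just Bayes’ rule applied to a single cause-and-evidence pair with invented probabilities: observing the evidence revises the degree of belief in the cause.

    # Toy two-node belief update (Cause -> Evidence) with invented numbers.
    p_cause = 0.01                     # prior degree of belief in the cause
    p_evidence_given_cause = 0.9       # likelihood of the observation if the cause holds
    p_evidence_given_not_cause = 0.05  # chance of seeing the evidence anyway

    p_evidence = (p_evidence_given_cause * p_cause
                  + p_evidence_given_not_cause * (1 - p_cause))
    posterior = p_evidence_given_cause * p_cause / p_evidence
    print(round(posterior, 3))         # belief rises from 0.01 to roughly 0.154

A Bayesian network chains many such conditional relationships together across a graph of variables, so that new evidence propagates through the whole structure.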

According to AI researcher Stuart J. Russell, “Pearl’s Bayesian networks provided a syntax and a calculus for multivariate probability models, in much the same way that George Boole provided a syntax and a calculus for logical models.” Its impact, Russell insisted, was profound: “Within just a few years, leading researchers from both the logical and the neural-network camps within AI had adopted a probabilistic—often called simply the modern—approach to AI.”b

Russell himself cemented that association by choosing Artificial Intelligence: A Modern Approach as the title of the textbook he wrote with Peter Norvig.17 It covered various approaches to AI, befitting the new era of pluralism, but paid special attention to probabilistic methods. The first of what are now four editions appeared in 1995, quickly becoming the standard undergraduate text at universities around the world.

Russell and Norvig took intelligent agents as “the unifying theme” of their textbook, in another departure from the orthodoxy of the 1980s. The agent concept harkened back to AI’s roots in cybernetics and the interaction of organisms with environments. Even something like a thermostat switch is an agent, according to Russell and Norvig, but only in the lowest of four classes in a hierarchy of intelligence that runs from reflex agents to utility-based agents. Ideal agents acted rationally, often with incomplete information, to maximize likely returns based on estimated probabilities.

The agent concept aligned them with work in economics and operations research that shared AI’s focus on search and optimization but had not, despite Herb Simon’s preoccupation with organizational decision making, previously been discussed in AI textbooks. By the early 2000s a growing stream of work on constraint satisfaction problems in AI journals was yielding techniques with broad conceptual and industrial applicability.

The agent concept resonated outside academia too. People were beginning to conduct business over computer networks, so it seemed plausible that software agents would roam cyberspace on their behalf, searching for relevant information or great deals. A similar vision of an intelligent personal assistant guided early handheld digital devices, such as Apple’s ill-fated Newton.

Big Data, Small Program

A central challenge in natural language processing is correctly parsing sentences into nouns, verbs, and other parts of speech. Parsing human languages, English in particular, is enormously hard to automate even when given written text. The parsing process is essential to machine translation but, less obviously, it is also central to automatic transcription of spoken words. Trying to recognize individual phonemes (distinct sounds) in speech is inherently error prone. Turning sounds into words is much easier when the structure of the sentence can be used to prioritize word choices that make sense.

Researchers collected pairs of sentences with similar forms but entirely different grammatical structures to illustrate the impossibility of writing simple parsing rules. Thirty years on, one such pair, credited to Anthony Oettinger, is the thing I remember most clearly from my class in natural language processing: “Time flies like an arrow./Fruit flies like a banana.” In 1977, computational linguist Yorick Wilks observed in this context that “what almost all AI language programs have in common … is strong emphasis on the role of knowledge.”18 Only a system wise enough to know that fruit flies exist but time flies do not would be able to recognize “flies” as a verb in the first sentence and as part of a noun phrase in the second.

By this logic we would still be waiting for viable speech recognition systems, because efforts to systematically encode background knowledge have failed. As media scholar Xiaochang Li has shown, the groundwork for modern natural language processing was instead laid by research that took a fundamentally different approach. During the 1970s, IBM was drawn to speech recognition research as a potential new market for powerful mainframe systems in the face of increasing competition from minicomputers. It turned out that deep understanding was not required to parse text, just vast amounts of training data and computer time. Li notes that the IBM group’s “director Fred Jelinek credited their success to a conceptual shift away from the fixation on human language faculties and expertise, infamously joking that the systems improved every time he fired a linguist.”10

Rather than encode linguistic knowledge in explicit rules, Jelinek aimed to train models automatically. The mathematical core of the system was a hidden Markov model, essentially a way of using transition probabilities from one word to the next in English text to guess what word the speaker intended. In an era before the Web or other mainstream electronic publishing, finding a large corpus of English sentences on which to train the system was not easy. IBM scraped it together from various places, finding the motherlode in the massive collection of legal documents the firm had transcribed into 100 million machine-readable words while defending itself from antitrust actions. This was the origin of what would soon be called the big data approach to machine learning, though Li points out that Wilks termed it the “big data, small program” approach, a term I find more meaningful because it captures both halves of the shift: from elaborate programming to extensive training data as the foundation of superior performance.
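
A toy Python sketch conveys the spirit of the approach, though not IBM’s actual models: word-to-word transition probabilities estimated from a corpus (here a few invented words) are used to choose among acoustically confusable transcriptions. The corpus, the candidate phrases, and the crude unsmoothed estimates are all illustrative assumptions.

    from collections import Counter

    # Invented mini-corpus standing in for millions of words of training text.
    corpus = "recognize speech with statistics recognize speech with data".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def transition(prev, word):                # crude, unsmoothed probability estimate
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    def score(words):                          # product of word-to-word transition probabilities
        total = 1.0
        for prev, word in zip(words, words[1:]):
            total *= transition(prev, word)
        return total

    # Two candidate transcriptions of similar sounds; the model prefers the
    # word sequence whose transitions it has actually seen in training data.
    candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
    print(max(candidates, key=score))          # -> ['recognize', 'speech']

Real systems combined such language-model scores with acoustic scores for each candidate word and searched over many hypotheses at once, but the principle is the same.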

By 1984, IBM had a working system that could reliably transcribe words chosen from a five-thousand-word vocabulary, though, as a mainframe had to work overnight to process each sentence, the technology was hard to demonstrate.11 In the late 1980s, IBM extended its statistical techniques to machine translation, achieving equally spectacular results.16

ARPA had categorized its speech recognition program of the 1970s as an artificial intelligence initiative. Within IBM, however, the work of Jelinek’s group was not conceptualized as AI, another example of the instability over time of artificial intelligence as an analytical category. By about 1992, statistical parsing had moved into the mainstream of natural language processing research, displacing systems based on hand-crafted rules. Mainframes were no longer required, as the cost of processor power and storage was dropping rapidly, putting systems with gigabytes of memory within reach. Mitch Marcus, a computer scientist at the University of Pennsylvania, led an effort to put together a huge training corpus of text.

The new technology was commercialized in products such as Dragon NaturallySpeaking, produced by veterans of speech recognition teams at IBM and Carnegie Mellon. By the end of the 1990s it was delivering usable transcription of continuous speech on consumer personal computers. The same basic approach came to smartphones in 2011 when Apple integrated the Siri assistant into its iPhones. Siri’s speech recognition was initially performed on powerful computers running in cloud data centers rather than on the phone itself.

Another large-scale language processing effort was launched at IBM around 2005 to produce a computer cluster able to retrieve plausible answers to natural language questions from a repository of millions of documents. The resulting system, Watson, was branded as an example of artificial intelligence. It won widespread attention when it dethroned human champions on the television quiz show “Jeopardy!” in 2011. Watson worked by locating phrases in its database that were statistically similar to the question posed (or rather, in the quirky format of “Jeopardy!”, by finding the question that matched the answer).

Big data techniques are probabilistic and rooted in Bayesian statistics, but they do not align fully with the kind of belief networks discussed by Judea Pearl. The mathematical weights being adjusted and the connections between them have no documented correspondence to specific pieces of knowledge. The systems are thus unable to explain their reasoning. Pearl, in contrast, imagined probabilistic inference performed by models that implemented logical relationships defined by human experts. Such a system could support its output by pointing to chains of meaningful relationships connecting conclusions to evidence. His recent work has centered on causal reasoning. From that viewpoint, Pearl has critiqued current approaches as “just fitting a curve to data.”5

The Return of Neural Networks

Most of the systems currently branded as AI are based on the training of simulated neural networks. These systems also take big data approaches, though the mathematical nature of the model is different. In 1994, as a student, I coauthored a profile of the artificial intelligence research group at Manchester University. The group’s professor, David Brée, described AI as mostly just “thrashing around” with perhaps three “islands of success,” of which the “most recent” was neural nets. Two of its seven members worked with neural networks. During our interview they debated another member, whose training in philosophy at Princeton left him so committed to formal logic that he insisted AI should really stand for “automated inference.”4

Rereading Brée’s suggestion that neural networks were recent arrivals startled me, because I now know that they go back to the earliest days of artificial intelligence. Sometimes they were implemented as special-purpose electronic devices. The most famous, the Mark I Perceptron designed by Frank Rosenblatt in the late 1950s, was funded by the U.S. military as an experiment in image recognition. It was intended to mimic the functioning of the human eye. A crude, 400-pixel camera, analogous to a retina, generated inputs that were processed through a layer of 512 adjustable weights, analogous to neurons, to drive eight outputs. As the machine was trained, the weights assigned to possible connections between input and output automatically strengthened and weakened to improve the fit of the model with the sample data. Dressing up these weights with the cybernetic language of neurons implied a direct parallel with the workings of brains, harking back to the foundational work of Warren McCulloch and Walter Pitts.
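
The learning rule itself is tiny, which the following sketch tries to convey. It is a software caricature of the weight-adjustment idea, not the Mark I hardware: a handful of made-up two-pixel examples, a single layer of weights, and a threshold output, with each error nudging the weights toward a better fit.

    # Single-layer perceptron learning rule on an invented toy task:
    # classify two-pixel "images" by whether the first pixel is brighter.
    data = [((0.9, 0.1), 1), ((0.2, 0.8), 0), ((0.7, 0.3), 1), ((0.1, 0.6), 0)]
    weights, bias, lr = [0.0, 0.0], 0.0, 0.1

    for _ in range(20):                              # a few passes over the examples
        for (x1, x2), target in data:
            output = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            error = target - output                  # -1, 0, or +1
            weights[0] += lr * error * x1            # strengthen or weaken each connection
            weights[1] += lr * error * x2
            bias += lr * error

    print(weights, bias)                             # weights now separate the two classes

Minsky and Papert’s critique, discussed below, was precisely that no amount of such adjustment lets a single layer of weights capture patterns such as exclusive-or.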

Rosenblatt was initially as prominent as the four men (Herb Simon, Allen Newell, Marvin Minsky, and John McCarthy) subsequently remembered as the founders of artificial intelligence. Back in the 1950s, “neuron nets” had been listed as an area of interest in the funding proposal for the Dartmouth summer school that launched the “artificial intelligence” brand.6 Minsky was invited because of his early work in that area. But by the mid-1970s Rosenblatt was dead while Simon and Newell had successfully positioned the idea that human brains and computers were both symbol processing machines at the heart of artificial intelligence.

Minsky himself had helped them to dismantle Rosenblatt’s legacy. Minsky showed in an influential 1969 book with Seymour Papert that having only one layer of weights between input and output greatly limited the range of patterns that a Perceptron could recognize.14 As I demonstrated in my last column, by the time AI moved into the mainstream of computer science education in the 1980s all discussion of neural networks had vanished from its textbooks.

The scope of artificial intelligence had been redefined to exclude connectionism. That banished connectionism from most computer science departments, but it did not kill it. Historian Aaron Mendon-Plasek has emphasized the extent to which, even in the 1950s and 1960s, many researchers defined themselves around pattern recognition or machine learning.12,13 Mendon-Plasek insists that pattern recognition, rather than being a mere subfield of AI (and hence of computer science), had always defined a largely distinct research community. Much of this work took place in engineering schools. For example, Manchester’s neural network specialists had backgrounds in electrical engineering and physics.

Limiting the focus of mainstream AI to symbolic approaches also created opportunities for the emerging interdisciplinary field of cognitive science, which aimed to model and explain the functioning of human brains with input from fields such as linguistics, psychology, and neuroscience. Expanding simple Perceptron-style networks with additional hidden layers between inputs and outputs eliminated the limitations documented by Minsky and Papert but raised a new problem: How to use training data to adjust the weights assigned to connections when inputs and outputs were joined by multiple paths, each composed of several neurons? In 1986, an article published in Nature, “Learning representations by back-propagating errors,” by David Rumelhart, Geoffrey Hinton, and Ronald J. Williams, showcased an emerging answer to that question. Rumelhart worked at Stanford but in psychology rather than computer science. Hinton, who went on to lead the revival of neural networks, was trained in both psychology and, at the University of Edinburgh, artificial intelligence.
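
The answer, back-propagation, applies the chain rule to push the output error back through the hidden layer. The sketch below is a minimal illustration under arbitrary assumptions (three hidden units, sigmoid activations, a learning rate of 1.0, and the exclusive-or task, which a single layer of weights cannot solve); it is not the Nature paper’s notation or experiments.

    import math, random

    random.seed(0)
    def sigmoid(z): return 1 / (1 + math.exp(-z))

    H, lr = 3, 1.0                                   # hidden units and learning rate (arbitrary)
    w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]   # input -> hidden weights
    b1 = [0.0] * H
    w2 = [random.uniform(-1, 1) for _ in range(H)]                       # hidden -> output weights
    b2 = 0.0
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]          # exclusive-or

    def forward(x1, x2):
        h = [sigmoid(w1[j][0] * x1 + w1[j][1] * x2 + b1[j]) for j in range(H)]
        return h, sigmoid(sum(w2[j] * h[j] for j in range(H)) + b2)

    for _ in range(10000):
        for (x1, x2), target in data:
            h, y = forward(x1, x2)
            d_y = (y - target) * y * (1 - y)             # error signal at the output
            for j in range(H):
                d_h = d_y * w2[j] * h[j] * (1 - h[j])    # error propagated back to hidden unit j
                w2[j] -= lr * d_y * h[j]
                w1[j][0] -= lr * d_h * x1
                w1[j][1] -= lr * d_h * x2
                b1[j] -= lr * d_h
            b2 -= lr * d_y

    print([round(forward(a, b)[1], 2) for (a, b), _ in data])  # typically approaches [0, 1, 1, 0]

Deep learning frameworks now automate this gradient bookkeeping for networks with millions of weights, but the underlying idea is the same.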

The gradual revival of neural networks took place largely outside the elite artificial intelligence community centered on MIT, Stanford, and Carnegie Mellon. Hinton was, at that point, a young faculty member at Carnegie Mellon but soon departed for the University of Toronto, in part because of his opposition to the involvement of American computer scientists in the Reagan administration’s “Star Wars” missile defense initiative. With a revolving cast of collaborators, Hinton made Canada the center of work on what he called deep learning, in a reference to the presence of many intermediate layers of neurons.

To computer scientists of the late 1970s, neural networks were an old, discredited thing. To computer scientists of the mid-1990s, they were a new, exciting thing. Students who started the Manchester program a year after me encountered them in the mainstream second-year AI courses. The next year, Russell and Norvig devoted one of the 27 chapters in their textbook to neural networks.

Powerful as the back propagation method was, it relied, like statistical parsing, on access to vast amounts of computer power and training data. Its first high-profile success came when Bell Labs applied neural networks to digit recognition. There are only 10 decimal digits, but they can be written in many different styles. Initial work took place from the mid-1980s using training data provided by the U.S. Postal Service, which was interested in automating the reading of ZIP codes. Bell Labs even worked on specialized neural network chips.8 After successful field trials, Bell Labs’ digit recognition technology was integrated in 1996 into NCR’s commercial check-reading machines to speed check clearing.9 The system’s creators included Bell Labs staff member Yann LeCun and Yoshua Bengio, then a postdoctoral researcher. The two Frenchmen were both trained in engineering as well as computer science.

The subsequent history of neural nets has been well reported by Cade Metz, who emphasizes that despite the flurry of interest in the mid-1990s neural nets remained marginal until the 2010s. Development continued outside the traditional centers of AI with new algorithms and network types developed by groups led by Hinton in Toronto, LeCun at New York University, Bengio in Montreal and Jürgen Schmidhuber in Lugano, Switzerland. This work went far beyond just applying back propagation. The check recognition system, for example, relied on graph transformer networks. Work on applying neural networks to machine translation inspired the attention mechanism to focus on relevant contexts. LeCun invented convolutional networks able to recognize significant features of training data with less human guidance. Hinton, LeCun and Bengio shared the 2018 ACM A.M. Turing Award in recognition of their contributions.

Unlike most of the methods produced by traditional AI, these techniques proved both generalizable and practical. Metz recounts that in 2009 Hinton launched a summer project at Microsoft to apply neural nets to machine transcription. “In a matter of months, a professor and his two graduate students matched a system that one of the world’s largest companies had worked on for more than a decade.” Another Hinton student took the new approach to Google, where it was quickly deployed to Android phones.

Until recently neural networks were more likely to be promoted as machine learning, deep learning, or big data than as artificial intelligence. Ongoing fallout from the AI winter had stigmatized artificial intelligence, while the dominance of symbolic AI in groups such as the Association for the Advancement of Artificial Intelligence made neural network specialists more comfortable elsewhere. The Conference on Neural Information Processing Systems, initiated in 1987, grew by the 2010s to be much larger than the AAAI meeting. In 2011 Michael Wooldridge, serving as chair of the main European AI conference, reached out to include machine learning specialists on its program committee but found that few were interested in participating.19

The most dramatic demonstration of the maturity and flexibility of the new techniques took place in 2012 with AlexNet, a system developed by Hinton with his students Alex Krizhevsky and Ilya Sutskever to enter the ImageNet image recognition competition. AlexNet combined several of the group’s techniques, including a deep convolutional network, running on graphical processing units as a cheap source of highly parallel computational power. It greatly outperformed all other competitors, including programs using specialized algorithms that had been refined for years.c Their paper has, as of 2024, been cited more than 150,000 times. Other neural net systems won high profile competitions to screen molecules for potential drugs and recognize traffic signs.

These triumphs ushered in a new investment frenzy around machine learning, which mimicked and eventually outstripped the 1980s boom in expert systems. In the final installment in this series I will address the similarities and differences between today’s hothouse world of generative AI hype and the good old-fashioned kind we knew back in the 20th century.

    References

    • 1. Brooks, R.A. Intelligence without representation. Artificial Intelligence 47 (1991), 139–159.
    • 2. Clark, A. Philosophical foundations. In Artificial Intelligence, M.A. Boden, Ed. Academic Press, New York, 1996, 1–23.
    • 3. Elzway, S. Armed algorithms: Hacking the real world in Cold War America. Osiris 38 (2023), 147–164.
    • 4. Haigh, T., Bartlett, N., and Williment, M. AI group profile. Manchester University Computer Science Magazine 9 (1994), 6–9.
    • 5. Hartnett, K. How a pioneer of machine learning became one of its sharpest critics. The Atlantic (May 19, 2018).
    • 6. Kline, R.R. Cybernetics, automata studies, and the Dartmouth conference on artificial intelligence. IEEE Annals of the History of Computing 33, 4 (Oct.–Dec. 2011), 5–16.
    • 7. Artificial Life: An Overview. C.G. Langton, Ed. MIT Press, Cambridge, MA, 1995.
    • 8. Law, H. Bell Labs and the ‘neural’ network. BJHS Themes 8 (2023), 143–154.
    • 9. LeCun, Y. et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (1998), 2278–2324.
    • 10. Li, X. There’s no data like more data: Automatic speech recognition and the making of algorithmic culture. Osiris 38 (2023), 165–182.
    • 11. Li, X. Divination engines: A media history of text prediction. Ph.D. dissertation, New York University, 2017.
    • 12. Mendon-Plasek, A. Mechanized significance and machine learning: Why it became thinkable and preferable to teach machines to judge the world. In The Cultural Life of Machine Learning, J. Roberge and M. Castelle, Eds. Palgrave Macmillan, Cham, Switzerland, 2021.
    • 13. Mendon-Plasek, A. Irreducible worlds of inexhaustible meaning: Early 1950s machine learning as subjective decision making, creative imagining and remedy for the unforeseen. BJHS Themes 8 (2023), 65–80.
    • 14. Minsky, M., and Papert, S. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, 1969.
    • 15. Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA, 1988.
    • 16. Poibeau, T. Machine Translation. MIT Press, Cambridge, MA, 2017.
    • 17. Russell, S.J., and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, 1995.
    • 18. Wilks, Y. Time flies like an arrow. New Scientist (Dec. 15, 1977).
    • 19. Wooldridge, M. A Brief History of Artificial Intelligence: What It Is, Where We Are, and Where We Are Going. Flatiron Books, New York, 2021.
