
Can Artificial Intelligence be Open Sourced?

The question about the complex and nuanced reality of open source AI, especially large language models, is not whether it will emerge as a powerful force; it already has.


At what was billed as a “fireside chat” at Tel Aviv University in June 2023, the very first question from the audience posed to OpenAI CEO Sam Altman and chief scientist Ilya Sutskever was, “Could open source LLMs (large language models) potentially match GPT-4’s abilities without additional technical advances, or is there a ‘secret sauce’ in GPT-4 unknown to the world that sets it apart from the other models?”

After nervous laughter and applause, Sutskever said, “You don’t want to think about it in binary black-and-white terms where there is a secret sauce that will never be rediscovered,” adding that perhaps someday, an open source model would reproduce GPT-4—“but when it will be, there will be a much more powerful model in the companies, so there will always be a gap between the open source models and the private models, and this gap may even be increasing.”

In the ensuing months, despite Sutskever’s caution that binary thinking about future AI development is too simplistic, numerous published opinions have staked out diametrically opposed positions on whether open sourcing AI, particularly generative AI, is an imperative social necessity to counter corporate concentration, or opens an existentially threatening Pandora’s box of anarchic instructions for making weapons or promulgating disinformation at massive scale. Examples of these seemingly incompatible positions include “Make No Mistake – AI Is Owned by Big Tech,” published in MIT Technology Review, and “Open-Source AI Is Uniquely Dangerous,” published in IEEE Spectrum.

The question regarding the complex and nuanced reality around open source AI, especially in the context of large language models, however, is not whether it will emerge as a powerful force. It already has, and it is complicated. For instance, while OpenAI has not open sourced its flagship GPT model, it has open sourced others, including Point-E (for point cloud diffusion in 3D models), the speech recognition model Whisper, and the image-text prediction model CLIP. And while Meta’s LLM LLaMa 2 is classified as open source, some use cases mandate that users contact Meta for a license (neither Meta nor OpenAI responded to requests for comment).
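The practical difference is tangible: an open sourced model such as Whisper can be downloaded and run locally by anyone. What follows is a minimal sketch, assuming the openai-whisper Python package (and its ffmpeg dependency) is installed; the audio filename is a hypothetical placeholder:

    # Minimal sketch: running OpenAI's open sourced Whisper model locally.
    # Assumes "pip install openai-whisper" and ffmpeg; the input file is
    # a hypothetical placeholder.
    import whisper

    model = whisper.load_model("base")           # downloads openly published weights
    result = model.transcribe("interview.mp3")   # hypothetical audio file
    print(result["text"])

Nothing comparable is possible with GPT-4, whose weights are accessible only through OpenAI’s hosted API.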

In February 2024, Paris-based Mistral, which debuted with open source models, announced a partnership with Microsoft for its new flagship LLM, which was not open sourced (https://tcrn.ch/3wMRJJ0); almost simultaneously, IBM announced it had optimized one of Mistral’s open source LLMs, greatly reducing latency by shrinking the model’s size and memory requirements (https://ibm.co/3PgNaNq).
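IBM’s specific optimization recipe is not detailed here, but one common way to cut a model’s size and memory footprint, and with it latency, is weight quantization. Below is a generic sketch using the Hugging Face transformers and bitsandbytes libraries; the model name is illustrative, not IBM’s actual artifact:

    # Generic sketch of 8-bit weight quantization, one common technique for
    # reducing an LLM's size and memory requirements. The model id is
    # illustrative; this is not IBM's specific optimization.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-7B-v0.1"                # an openly licensed model
    quant_config = BitsAndBytesConfig(load_in_8bit=True)  # ~half the fp16 footprint

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # requires the accelerate package
    )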

Given such complexities, the ecosystem around open source AI is coalescing quickly, aided perhaps by a highly public, yet secretive, boardroom drama at OpenAI in November 2023, in which Altman was ousted and reinstated as CEO in the course of just over a week.

Another element aiding the emergence of the global open source AI movement has nothing to do with closed-door corporate machinations, and everything to do with the long-term success of open source software in the largest deployments imaginable over the past 25 years. Just one week after OpenAI’s drama furtively played itself out, 50 corporations, universities, and research organizations worldwide announced the formation of the AI Alliance in December 2023. Anchored by IBM and Meta, the alliance, with a stated goal to “advance safe, responsible AI rooted in open innovation,” grew by another 25 members in February 2024.

Anthony Annunziata, head of open AI innovation at IBM and overseer of that company’s role in the alliance, said IBM’s history of embracing open source software gave it recognized legitimacy to serve as one of the alliance’s anchors. IBM was one of the first corporate entities to fully embrace Linux, the Apache Software Foundation, and the Eclipse integrated development tool environment, among other open source efforts. In 2019, IBM bought open source pioneer Red Hat for $34 billion.

The alliance’s work is split into six focus areas: safe, secure, and trusted AI; open foundation models encompassing multilingual, multimodal, and science models; diversified AI hardware; AI skills development; policy advocacy; and open source frameworks and tools. Annunziata said working groups had already started around two of them: trust and safety tooling, and policy advocacy.

Alliance is Not Alone

The AI Alliance is not the only organization quickly ramping up advocacy and action in open source AI. The Open Source Initiative (OSI) also announced in February 2024 its timeline for creating a formal definition of what constitutes open source AI, with a 1.0 definition slated for release in October (until then, an informal consensus holds that open foundation models are those with widely available weights).
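In practice, “widely available weights” has a concrete, operational meaning: anyone can fetch a model’s weight files directly, with no negotiation beyond, at most, accepting a license. A minimal sketch using the huggingface_hub client library; the repository name is illustrative:

    # Minimal sketch of "widely available weights": the weight files of an
    # open foundation model can be downloaded directly. The repo id is
    # illustrative; gated models also require accepting a license first.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="mistralai/Mistral-7B-v0.1")
    print("weights downloaded to", local_dir)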

Also in February, Carnegie Mellon University (CMU) began work on a multi-stakeholder Open Forum for AI (OFAI) under the leadership of Sayeed Choudhury, the university’s associate dean for digital infrastructure and director of CMU’s Open Source Programs Office (OSPO). OFAI’s other charter members include the Atlantic Council, George Washington University, Georgia Institute of Technology, the University of Texas at Austin, and—in another signal of a coalescing ecosystem—the OSI.

“There are some key questions around AI that remain unanswered. Maybe it’s not in big tech’s best interests, or maybe they don’t have the people to do so—but they are moving at speed and scale, which is a good thing in many ways,” Choudhury said. “But they are not necessarily asking those foundational research questions.”

Choudhury also said that among the overarching questions the OFAI will address is the very notion that foundational model research is so resource-intensive that only the largest corporate entities can take it on.

“Don’t get me wrong,” he said. “There is absolutely a place for big tech in this that is absolutely critical. They are providing some key infrastructure on which almost everybody can build. But I also think one of the key roles of OFAI is to just put a reality check on that narrative: is that, in fact, inevitable? And is that the only path forward for AI? Is that ‘desirable’?”

Rishi Bommasani, society lead at Stanford University’s Center for Research on Foundation Models, said the growing global ecosystem around public investment and open source AI reflects a realization that, as Choudhury postulated, industry becoming the dominant wellspring of AI innovation is not inevitable, and that today’s widely deployed and publicized models such as ChatGPT are only the most visible elements of data-intensive possibilities.

“Industry has focused on texts and images, as has the field of AI for much of its history, but those are not the only modalities where we now have sufficient amounts of data,” Bommasani said. “The idea of getting academics, and also other entities, more compute resources is useful if we want to see models being built with different incentives and goals.

“We won’t necessarily duplicate what an OpenAI is building. The LLMs they build for products are fine, but maybe they are not the assets that would be most useful for science. Maybe we need to invest tremendous amounts of compute into models for proteins, or other modalities where commercial viability may or may not exist, but the scientific value is even greater.”

Bob Shorten, head of the Dyson School of Design Engineering at AI Alliance member Imperial College London, noted that in addition to the fundamental research expected at university AI labs, there are resources, such as what he called a “key health data resource in the U.K.,” that are not available to commercial developers.

“What is the role of the university in this space where resource and capability largely sits in the corporate sector?” Shorten asked in an email. “We would argue it is around innovation, trust and ethics, and conceptual development.”

Public Investment Accelerates

Around the world, national and regional governments indeed are beginning to invest significant amounts in building AI infrastructure that can rival industry and, as Bommasani suggested, take advantage of data that may not be immediately attractive to proprietary developers:

  • In the U.S., following directives issued in an executive order from President Joseph Biden in October 2023, the National Science Foundation (NSF) announced an estimated $2.6-billion investment in the National Artificial Intelligence Research Resource (NAIRR), which began pilot activities in January 2024. Among the four pillars of the pilot is NAIRR Open, which the NSF says “will enable open AI research through access to diverse AI resources via the NAIRR Pilot Portal and coordinated allocations.”

  • The U.K. announced in November 2023 a £225-million ($286-million) investment in a supercomputer at the University of Bristol, powered by 5,000 Nvidia GH200 superchips and capable of 200 quadrillion calculations per second. Dubbed Isambard-AI, it will offer computing capacity for researchers and industry to make AI-driven breakthroughs in fields such as robotics, big data, climate research, and drug discovery.

  • Japanese researchers at the Tokyo Institute of Technology, Tohoku University, Fujitsu Ltd., and the nationally funded RIKEN research agency began collaborative work on native Japanese LLMs in May 2023. The research, slated to run through the end of March 2024, trains models on the Fugaku supercomputer, and the partners plan to publish their results on GitHub and the AI open source repository Hugging Face in fiscal 2024.

  • The European Commission (EC) announced its €1-billion ($1.07-billion) annual minimum investment in AI, as stipulated in 2021, was met in both that year and 2022. In January 2024, the EC also released an updated communication on its coordinated AI plan that emphasizes generative AI more heavily than previous guidelines. Key among its provisions is accelerating the development and deployment of Common European Data Spaces; the first stipulated property of these spaces is that they “are open for the participation of all organizations and individuals.”

The EC plan also established two consortia, including the Alliance for Language Technologies, “to address the shortage of European languages data for the training of AI solutions, as well as to uphold Europe’s linguistic diversity and cultural richness. This will support the development of European large language models.”

Adjusting the Practices of ‘Open Software’ to ‘Open AI’

The mantra first credited in 1999 to Eric Raymond in his seminal treatise on open source software, The Cathedral and the Bazaar—“Given enough eyeballs, all bugs are shallow”—has served as a useful aphorism about the utility and overall safety of open source software for more than 20 years. But applying the principles of open source code to form a similar ecosystem for AI will be much more painstaking, according to Stanford’s Bommasani.

“Because it’s a black box, even if I tell you that LLaMa 2, or GPT-4 for that matter, has a problem with hallucinating, even in a narrow domain, it’s not obvious,” he said. “I can’t go to line 43 of the code and change something and it’s fixed. First of all, the intervention might require retraining the model, which is much more capital intensive than going to fix some lines of code.”
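Bommasani’s contrast can be made concrete with a schematic, hypothetical illustration: a bug in conventional software is a one-line edit, whereas a model’s behavior shifts only through further, compute-intensive training. A toy sketch in PyTorch, with all names and data invented for illustration:

    # Toy contrast of the "no line 43 to fix" point; names and data are hypothetical.
    import torch

    # Conventional software: the fix is a one-line edit.
    def miles_to_km(miles):
        return miles * 1.609  # was "* 1.6" -- patched in place, done

    # A neural model: there is no single line to edit. Behavior changes only
    # through further training on corrected data.
    model = torch.nn.Linear(10, 1)  # stand-in for an LLM's weights
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    corrected_examples = [(torch.randn(1, 10), torch.randn(1, 1)) for _ in range(100)]

    for x, y in corrected_examples:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()  # weights drift gradually; no single edit fixes them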

CMU’s Choudhury concurred, citing recent research from the university to underline the point.

“A lot of practices of the software world are being ported into the AI world,” he said. “But a key question for me is that AI has so many artifacts—data artifacts, software artifacts, and the orchestration of those artifacts—that I don’t think we can just do a 1:1 mapping of what used to happen in the software world into the AI world. So OFAI will basically take the best of what we have learned in those previous contexts and then update, augment, and apply them in a way that is more rigorous.”

As an example, Choudhury mentioned “red-teaming,” in which “good guy” researchers emulate possible attack vectors of an adversary. While red-teaming has been touted by the Biden administration as a linchpin of assuring generative AI is secure, CMU researchers found in a January 2024 study that “while red-teaming may be a valuable big-tent idea for characterizing a broad set of activities and attitudes aimed at improving the behavior of GenAI models, gestures towards red-teaming as a panacea for every possible risk verge on security theater.”

But OFAI will go beyond purely technical or legalistic policy issues, Choudhury also said; for example, consulting people with varied life experiences will be an integral part of the OFAI’s mission.

“If you don’t have people who have different life experiences testing these models as well, interacting with the researchers, we aren’t going to have rich foundational evaluations of these systems,” he said. “So OFAI will be using the technological, the policy, the legal, and the community aspects in order to basically come up with a much more fleshed through definition of what it means to have openness in AI. Then we’ll build prototypes around that and test them in the community all toward developing policy recommendations.”

For IBM’s Annunziata, the formation of coalitions such as the AI Alliance and the OFAI is an indicator that the global open source AI community wants to ensure that questions around the technology—whether making sure research questions consider similar properties across models, or policy issues such as reaching consensus among different regions—receive efficient and transparent discussion.

“What we want to do is, instead of having this conversation be fragmented in a million places, to try to centralize some of it,” he said.

Likewise, Francis Beland, executive director of OASIS Open, which oversees numerous open source standards including the Open Document Format, said the AI community, both open source and proprietary, recognizes the necessity to coordinate efforts to create a fair landscape of technology and practices. At a recent OASIS-hosted closed-door discussion, Beland said there was recognition everybody is “seeing the same problems.”

“They are trying to fix them,” Beland said. “There is some political positioning in there where they don’t want to be seen doing this or that. But from an OASIS perspective, all of them are aware of this and are trying to be part of the solution. Some have to do it behind the scenes, some in front, but it is much different from where I sit from what the public sees.”
