Legally Speaking

Legal Challenges to Generative AI, Part I

Questioning the legality of using in-copyright works for training data and producing outputs derived from copyrighted training data.

Generative artificial intelligence (AI) has captured considerable popular attention recently. ChatGPT and DALL-E have given members of the general public opportunities to use AI systems to generate text and image outputs for fun and a wide range of other purposes. Google and Meta have announced their intentions to launch similar AI systems soon.

Generative AI has also caught the attention of lawyers who question the legality of ingesting in-copyright works as training data and producing outputs derived from copyrighted training data.

Lawyers representing four programmers (identified so far as John Does) have, for instance, sued GitHub, Microsoft, and OpenAI, alleging that GitHub’s Copilot and OpenAI’s Codex AI programs have violated laws by using publicly available open source code (including programs the Does developed) posted on GitHub’s site as training data for their generative AI systems. Also illegal, say these Does, are Copilot and Codex outputs of code sequences in response to user prompts insofar as the sequences are substantially similar or virtually identical to open source code used as training data.

The Does claim to represent a class of programmers whose legal rights GitHub and OpenAI have violated. They want a federal court to issue an injunction against these generative AI systems and to award the class $9 billion in statutory damages.

This column focuses on the Doe v. GitHub lawsuit as the first of a two-part series on legal challenges to generative AI. A subsequent column will address two similar lawsuits brought against Stability AI for its use of images as training data and for producing outputs based on the training data.

Background on Codex and Copilot

GitHub is an Internet hosting service for software development and version control. It reports having more than 100 million registered developers and hosting 372 million code repositories, including 28 million public repositories. Microsoft acquired GitHub for $7.5 billion in 2018.

OpenAI developed Codex as a generative AI model trained on billions of lines of publicly available computer source code, including code available in GitHub’s public repositories. Codex discerns statistical patterns in the structure of existing code. It infers these patterns based on a complex probabilistic analysis of the training data. In response to a user’s prompt, Codex produces code to implement the desired function.

In June 2021, GitHub and OpenAI launched Copilot as a cloud-based AI technology that uses Codex to assist the development of software. GitHub users can install Copilot as an extension to various code editors. Copilot treats a user’s input to a code editor as a prompt and generates suggested code that may be suitable for the developer’s purposes. Copilot subscriptions are available to GitHub users for $10 per month or $100 per year.
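As a concrete sketch of that interaction (the prompt and function here are invented for illustration, not taken from Copilot's actual output): the developer types a comment and a function signature, and the model proposes a body inferred from statistical patterns in its training code.

```python
# Hypothetical Copilot-style interaction. Only the comment and the
# "def" line are typed by the developer; the body is the sort of
# completion a code model might suggest.

# --- developer's prompt ---
# return the n-th Fibonacci number, iteratively
def fibonacci(n):
    # --- model's suggested completion ---
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(10))  # → 55
```

If such a suggested body closely matches a specific open source function from the training data, the question the Does raise is whether serving it without attribution or license notices is lawful.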

What Laws Were Arguably Violated?

Although the complaint says Copilot and Codex are engaged in “software piracy on an unprecedented scale,” it does not actually claim GitHub or OpenAI have infringed any copyrights. This is curious because open source software is generally protected by copyright and copyright is the legal basis on which open source licenses are predicated.

The Does’ most significant claim is that Copilot and Codex wrongfully removed copyright notices and other copyright-relevant information from open source programs ingested as training data.

The intentional removal or alteration of copyright management information (CMI) from copies of copyrighted works with knowledge that the removal or alteration of CMI is likely to induce, enable, facilitate, or conceal copyright infringements is illegal under § 1202 of Title 17 of the U.S. Code.

A second principal claim is that GitHub and OpenAI have breached open source license agreements by failing to respect license terms, such as requirements to give attribution to the open source developers whose code has been ingested and is being used to generate outputs and to include copyright notices in reused code.

The Does charge GitHub and OpenAI with several other legal violations, including misrepresentation of others’ licensed code as their own, fraud, unjust enrichment, and violations of California unfair competition and privacy laws. This column omits discussion of these subsidiary claims because the lawsuit will primarily focus on the two principal claims.

Motions to Dismiss

Lawyers initiate lawsuits by filing complaints explaining the legal theories on which the lawsuits are based and core facts that support those legal theories. If courts uphold at least one theory in a case, the plaintiffs may be eligible for certain remedies, such as injunctions and damages.

After reviewing complaints, defendants’ lawyers sometimes decide to file motions to dismiss complaints for failure to state claims on which courts could grant the remedies requested in the complaint.

When considering motions to dismiss complaints, courts assume that all of the facts stated in the complaint are true (even if the defendants’ lawyers plan to contest their truthfulness if the court denies their motions).

Instead of filing an answer to the Does’ complaint, which would typically admit some allegations, deny others, and raise defenses, GitHub and OpenAI filed motions to dismiss it for failure to state claims on which relief could be granted.

Among other things, GitHub and OpenAI point out the Does have not identified any code in which they claim rights. Nor have they specified any injury they suffered as a result of GitHub’s or OpenAI’s acts. Most of their claims are speculative and conclusory, not specific about the elements necessary to succeed on the merits.

OpenAI also moved to dismiss because the Does have not identified themselves. Courts do not usually allow plaintiffs to sue anonymously or pseudonymously absent special circumstances (for example, when there is a risk of retaliation). Procedural rules require plaintiffs to ask a court for permission to file lawsuits as Does. These Does failed to do this. It is, moreover, difficult for defendants to formulate adequate defenses if they do not know who is suing them.

Removal of Copyright Information Claims

The big money claim in the Doe v. GitHub lawsuit ($9 billion) asserts GitHub and OpenAI illegally removed CMI from source code used as training data.

Section 1202(c) defines CMI as including information identifying the work, its author and/or copyright owner, terms and conditions for use of the work, and/or identifying numbers or symbols representing identifying information.

To violate § 1202, a defendant must have intentionally removed CMI from copies of a work or must have distributed copies of a work knowing its CMI had been removed. In addition, a defendant must know or have reason to know that the CMI removal “will induce, enable, facilitate, or conceal an infringement” of copyright in the work.
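For source code, the CMI at issue is typically the license header at the top of a file. A minimal sketch of what "removal of CMI" means in this setting (the header text and function are hypothetical):

```python
# A toy source file whose leading comment block is CMI in the
# § 1202(c) sense: it identifies the author and the license terms.
source = '''\
# Copyright (c) 2020 Jane Doe
# Licensed under the MIT License; see LICENSE for terms.
def greet(name):
    return "Hello, " + name
'''

def strip_leading_comments(code):
    """Drop leading comment lines. On the Does' theory, a training
    pipeline that does this to ingested code is removing CMI."""
    lines = code.splitlines()
    while lines and lines[0].lstrip().startswith("#"):
        lines.pop(0)
    return "\n".join(lines)

stripped = strip_leading_comments(source)
print(stripped)  # the code survives, but author and license are gone
```

The legal question is not whether such stripping happens but whether it was done with the double knowledge § 1202 requires.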

Courts can award anywhere between $2,500 and $25,000 in statutory damages for each violation of this law. (When actual damages are difficult to prove, as with removal of CMI, legislatures sometimes decide to establish a statutory damage remedy to ensure some meaningful compensation is available to victims of a law’s violation.)
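Some back-of-envelope arithmetic puts the $9 billion demand in context (the implied violation counts are my illustration, not figures from the complaint):

```python
# Per-violation statutory damages range under § 1202, and the number
# of violations the $9 billion demand would imply at each end of it.
demand = 9_000_000_000
minimum, maximum = 2_500, 25_000

print(demand // minimum)  # 3,600,000 violations at the $2,500 minimum
print(demand // maximum)  # 360,000 violations at the $25,000 maximum
```

Even at the statutory maximum, the demand presupposes hundreds of thousands of distinct violations, which is why class-wide claims over training data can reach such large sums.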

This double knowledge requirement may be difficult to satisfy in the Doe v. GitHub lawsuit, as it was in Stevens v. Corelogic. Stevens is a photographer who specializes in taking digital photographs of houses on behalf of real estate agents. Stevens’ photographs are typically posted on Multiple Listing Service (MLS) platforms.

Some metadata about Stevens’ photographs is automatically created when his digital camera takes photographs. He can also add metadata to digital image files manually using photo-editing software. Metadata embedded in digital files may be invisible to anyone who looks at the image.

Corelogic provides software to MLS for displaying real estate photographs of houses for sale. Because image files can be very large, Corelogic resizes the images and saves the resized images so they occupy less storage space and load faster on MLS sites. In the process of resizing photographs, Corelogic’s software did not preserve invisible metadata.
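A toy model of that mechanism (the field names and sizes are invented): the resized copy is built fresh from the pixel data, so the invisible metadata simply never gets carried over; nothing has to be deliberately deleted.

```python
# A "digital photo" modeled as pixel data plus invisible metadata.
photo = {
    "pixels": [[0] * 400 for _ in range(300)],  # 400x300 image
    "metadata": {"Artist": "Stevens", "Copyright": "(c) Stevens"},
}

def resize(image, factor):
    # Build a smaller image by sampling every `factor`-th pixel.
    small = [row[::factor] for row in image["pixels"][::factor]]
    return {"pixels": small}  # fresh dict: metadata is not copied

thumb = resize(photo, 10)
print("metadata" in thumb)  # False: the CMI vanished as a side effect
```

This is why intent was so hard to show in Stevens: the metadata loss is an incidental byproduct of routine image processing, not a deliberate act of removal.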

Stevens sued Corelogic for violating § 1202 because of its removal of CMI embedded in his photographs. The Ninth Circuit Court of Appeals affirmed a lower court ruling in Corelogic’s favor, holding that Stevens had not shown that Corelogic intentionally removed the CMI, nor that the removal would facilitate copyright infringement.

GitHub and OpenAI argue the Stevens case supports their assertion the Does have not stated a viable claim for violation of § 1202.

Breach of License Claims

The Does’ complaint identifies some open source licenses the Does themselves have used for software they developed that they claim GitHub and OpenAI have wrongfully included in Codex and Copilot. The Does say these and other class members’ open source licenses require attribution and inclusion of copyright notices in any reuses of their software.

As a defense against the breach of license claims, GitHub relies both on its terms of service and on license rights developers give GitHub when they choose to make their program code part of a public repository.

GitHub requires all of its users to agree to its terms of service. Included in these terms is a license granting GitHub the right to “store, archive, parse, and display … and make incidental copies” of the users’ code, as well as to “parse it into a search index or otherwise analyze it” and “share” the resulting code in public repositories with other users.

And when GitHub users decide to make their code available on its site, they must choose whether to make their code repositories private or public. Users who decide to make their repositories public “grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub’s functionality.”

Why No Copyright Claim?

The biggest mystery in the Doe v. GitHub case is why there is no copyright claim in the complaint. One possibility is the Does are seeking copyright registration certificates for their programs. This is a necessary procedural requirement for U.S. copyright owners who want to sue someone for infringement.

Another possibility is the Does do not want to litigate fair use defenses that GitHub and OpenAI would almost certainly raise if sued for copyright infringement. (Fair use may not be a viable defense to the CMI removal or license breach claims.)

Fair uses are not infringements of copyrights. Courts consider four factors in making fair use determinations: the purpose of the challenged use; the nature of the copyrighted work; the amount and substantiality of the taking; and harms to the market for the work.

Existing U.S. precedents seem to support such a defense if the Does sue GitHub and OpenAI for infringement. The closest precedent is the Authors Guild v. Google case. The Second Circuit Court of Appeals held that Google had made fair use of millions of in-copyright books it scanned to enable computational analysis of a database of these books and for purposes of indexing their contents to serve up snippets of text in response to user search queries.

The court held that Google had made transformative uses of the in-copyright books because the corpus facilitated greater access to information. While Google copied the whole of each book, this was necessary to achieve its transformative purpose of indexing book contents for computational analysis and search. Because Google only served up three short snippets from each book, the snippets were unlikely to undercut the market for the books.

Under the Google decision, ingesting publicly available source code would seem to be as fair as the scanning of books to index their contents. And the snippets of code that Copilot provides in response to user prompts are analogous to the snippets of text from books that Google provides in response to user search queries. Because the court found both the scans and the snippets to be fair uses, GitHub and OpenAI would seem to have plausible fair use defenses.

Conclusion

Generative AI has raised some new technology issues courts have not yet addressed. While the Doe v. GitHub complaint raises some interesting theories of liability, it is far from clear courts will find Copilot or Codex to be unlawful. GitHub is arguing Copilot is socially beneficial because it “crystallizes the knowledge gained from billions of lines of public code, harnessing the collective power of open source software and putting it at every developer’s fingertips.” In May 2023, a trial court denied the GitHub and OpenAI motions to dismiss as to the removal of CMI and breach of license claims, so the lawsuit will now proceed to address the merits. It remains to be seen how receptive the court will be to GitHub’s and OpenAI’s defenses.
