U.S. Copyright Office’s Questions about Generative AI

Aware of the consternation generative AI has stirred up in the last year and a half among individual authors, artists, and copyright industry sectors, the U.S. Copyright Office, on its own initiative, published a Notice of Inquiry (NOI) document on August 30, 2023. It invited interested parties to submit written comments responding to dozens of questions about generative AI copyright-related issues.

In late October, the Office received approximately 10,000 comments in response to the NOI questions. The Office expects to publish a report in 2024 offering its perspective on how these questions should be answered and perhaps recommending legislation.

This column reviews various positions taken in a non-random sample of comments on the most significant questions raised in the NOI. (I read fewer than 100 of them, but those I read were generally from organizations and individuals with which or whom I was familiar and whose comments were well developed and likely to be of interest to the Office.)

The questions on which the Office requested comments include whether making copies of in-copyright works for purposes of training models is fair use or infringement, whether outputs of generative AI infringe copyrights in training data, whether developers of generative AI should be required to disclose details about their training datasets, and whether developers of generative AI should have to label outputs as AI-generated works.

One takeaway from my review of the NOI comments is that on none of those issues is there a consensus view among the commentaries I reviewed. The Office faces a tough choice: Should it simply describe the many differences of opinion about these issues without taking sides? Or should it take positions on the merits that will inevitably make many commentators unhappy?

The Ingestion and Output Copyright Questions

The NOI posed several questions about the unauthorized use of in-copyright works as training data for developing generative AI models and a few about generative AI outputs. The NOI’s questions are highly pertinent to the copyright claims at issue in the lawsuits pending against major generative AI developers.

The lawsuits challenge the legality of making copies of in-copyright works for purposes of using those works to train generative AI models. The lawsuits also accuse generative AI developers of directly or indirectly infringing copyright owners’ derivative work rights because AI-generated outputs are based upon data from works on which the models were trained. (My November 2023 Legally Speaking column, “Legal Challenges to Generative AI, Part II,” discussed these claims and likely defenses.)

Most copyright industry and creator rights organizations unsurprisingly urged the Office to opine that reproducing copies of copyrighted works for purposes of training models constitutes infringement. They also assert that generative AI outputs infringe copyrights because they are derived from the ingested training data and harm creators’ markets by unfairly competing with and suppressing demand for human-authored works.

The copyright industry comments also generally urged the Office to reject an opt-out regime under which developers would be allowed to use works as training data unless copyright owners specifically instructed them not to.

The Motion Picture Association (MPA), which is usually a copyright maximalist, was more equivocal about the training data issue. It disagreed with those who argue that all uses of copyrighted works as training data are infringing, but also with those who think all uses of such works are fair uses.

This suggests MPA’s members may want to develop and use generative AI tools for motion picture projects. This is perhaps unsurprising, given that well before generative AI captured popular attention, movie studios were using computer software to generate stunt scenes, animations, and the like. Generative AI can be a very useful tool for creators.

Although the Business Software Alliance (BSA) also generally agrees with other copyright industry maximalist groups when copyright controversies arise, it begs to differ as to generative AI copyright issues.

BSA’s comment generally endorsed fair use defenses for building a corpus of in-copyright works as a training dataset and making incidental copies of those works during the training process. Training does not exploit the expression in those works, BSA noted, but only enables the model to make statistical correlations between and among component elements of the works.

The BSA comment pointed out that even if some outputs of generative AI systems may be substantially similar to particular inputs, infringement claims, if any, should be made against the user whose prompts yielded those outputs, not against the system’s developer.

Insofar as generative AI systems have substantial non-infringing uses, the BSA thinks they should qualify for a safe harbor from liability that the Supreme Court has twice endorsed when copyright industries have sued technology developers because their products allow users to make unauthorized copies of protected works.

Unsurprisingly, Anthropic, Google, Meta, Microsoft, OpenAI, and Stability AI, all of whom are defendants in generative AI copyright lawsuits, elaborated on the main points in the BSA submission. Their comments also explained the training process in some detail to help the Office understand that this process does not exploit the expression in the works used as training data. Datasets are, in fact, distinct entities from models that enable generative outputs.

Several library, academic, and civil society submissions explained ways in which generative AI systems promote knowledge creation and dissemination and thereby advance the constitutional purpose of copyright.

Among the other supportive comments was one submitted by an ad hoc group of artists who use generative AI as a tool in the process of their creative work. This comment cautioned against overbroad interpretations of copyright law that would undermine these artists’ ability to use these tools for ideation and other creative purposes.

The Authors Alliance, a nonprofit organization that represents the interests of authors who want to take advantage of opportunities to make their works more widely available in the digital age (of which I am board president), also supported fair use defenses for generative AI development because these tools are very useful to authors for research, idea testing, and editing purposes.

A Training Data Disclosure Requirement?

The NOI asked several questions about whether developers of AI models and creators of training datasets should be required to collect, retain, and disclose records about datasets they use to train their models. Related questions asked about what level of detail should be required, to whom records would be disclosed, and how costly it would be to impose such a duty on AI developers.

In general, the copyright industry and creator rights organizations supported requiring AI developers to collect and maintain records about training datasets. The records should, they said, be detailed enough so that copyright owners could find out whether their works had been included in the datasets and used in the training process. These commentators wanted copyright owners to be able to gain access to those records.

Generative AI developers do not generally make information about their training datasets available to the public and have reportedly resisted requests to identify works in training datasets. This means copyright owners have a difficult time at present determining whether their works were so used and, if so, to what extent, although some tools are being developed to detect the use of individual works as training data.

The Silverman complaint^a against OpenAI, for instance, speculated that her work was used as training data because ChatGPT provided a detailed summary of her book’s contents. But she cannot know this for sure unless OpenAI is required to disclose information about training data during a legal discovery process.

The generative AI developers who addressed this set of questions unsurprisingly opposed training data disclosure mandates. Some expressed concerns that such mandates would undermine their trade secrecy rights. Others emphasized practical difficulties with complying with such obligations. Anthropic suggested that “model cards” could provide some information about models and datasets.

Many submitted comments did not address the NOI’s record-keeping questions. The comment I submitted with two colleagues (Matthew Sag and Christopher Sprigman) set forth four difficulties we foresaw if such a requirement were adopted.

First, AI researchers may have only very limited information about the works comprising the training data. They may rely on training materials collected or curated by third parties. In many cases, the only information researchers will have about a work is that it was associated with a particular URL at a particular point in time.

Second, any requirement to provide accurate information about the title, ownership, and chain of licensing of individual works in the training data would substantially increase the cost of developing AI models for research and commercial purposes. Although data science has come to recognize the importance of maintaining information about the provenance of data, this has focused on data about datasets, not about individual items within datasets.

Third, if such an obligation were imposed, it is unclear what level of effort in tracing ownership, and so forth, would satisfy the requirement.

Fourth, such an obligation may conflict with data privacy laws to the extent that it requires AI developers to collect and store personally identifying information about individuals in multiple jurisdictions.

Labeling of AI-Generated Content

Another set of NOI questions focused on whether a law should require AI-generated outputs to be labeled or otherwise publicly identified as having been generated by AI. Related questions included who should be responsible for identifying a work as AI-generated, whether there are technical or practical barriers to labeling or identification requirements, and what consequences should flow from a failure to label a particular work or removal of a required label.

Several copyright industry groups’ comments (but not MPA’s) favored labeling mandates. The Copyright Clearance Center, for instance, suggested generative AI developers would benefit if required to label outputs as AI-generated, because they could then avoid training their next-generation models on AI-generated content rather than only on high-quality human-created content.

Universal Music Group recommended that the failure to label AI-generated outputs and the act of removing required labels should subject developers and other actors to sanctions. It suggested the Federal Trade Commission should be authorized to fine developers for violating labeling requirements. It also favored allowing private lawsuits to be initiated against egregious violators.

Some developers of generative AI systems (for example, the Getty Images Generative AI system) have voluntarily adopted watermarking or other techniques to identify AI system outputs. Adobe’s comment suggests AI-generated outputs should be tagged with its content credentials system, as now happens automatically when users generate images with its Firefly AI system.

The content credentials system adds metadata to AI-generated images. This metadata includes a visual thumbnail of the outputs; the name of the issuer of the content credential; the software or hardware device used to produce the outputs; the generative AI tool used to produce the outputs; and general editing and processing used to produce the outputs. Adobe’s comment recommended its widespread adoption in the U.S.
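For readers who think in data structures, the metadata fields just listed can be pictured as a simple record. The following is only a rough sketch under stated assumptions: the class name, field names, and example values are hypothetical illustrations of the five categories described above, not Adobe's actual content credentials schema or API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ContentCredential:
    """Hypothetical record mirroring the five metadata categories above.

    All names here are illustrative assumptions, not a real schema.
    """
    thumbnail: bytes                    # visual thumbnail of the output
    issuer: str                         # name of the issuer of the credential
    produced_with: str                  # software or hardware device used
    generative_ai_tool: Optional[str]   # generative AI tool used, if any
    edits: List[str] = field(default_factory=list)  # general editing and processing

    def is_ai_generated(self) -> bool:
        # Treat the presence of a generative AI tool as the label signal.
        return self.generative_ai_tool is not None

    def to_json(self) -> str:
        # Bytes are not JSON-serializable, so encode the thumbnail as hex.
        record = asdict(self)
        record["thumbnail"] = self.thumbnail.hex()
        return json.dumps(record, sort_keys=True)


# Example values are made up for illustration.
credential = ContentCredential(
    thumbnail=b"\x89PNG",
    issuer="Example Issuer",
    produced_with="Example Image Editor",
    generative_ai_tool="Example Generative Tool",
    edits=["crop", "color adjustment"],
)
print(credential.is_ai_generated())
print(credential.to_json())
```

A record like this travels with the image as metadata, which is also why such labels are only as durable as the metadata itself.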

Generative AI developers who addressed the labeling issue were not enthusiastic about making this into a legal obligation, but were sanguine about labeling as a voluntary measure. Some pointed out that watermarks are, as a technical matter, quite easy to remove. It may be difficult to determine who was responsible for the removal. Some noted that no standards presently exist about what labeling would be appropriate.

My comment (with Sag and Sprigman) suggests any proposal for labeling or otherwise identifying AI-generated material should be carefully calibrated to the specific public interest objective the regulation aims to achieve. This calibration is important because different policy objectives necessarily entail different kinds of labeling and different thresholds for identification.

Our comment noted the line between AI-generated and human-generated may sometimes be difficult to draw, as when a person enters prompts, reviews the outputs, and then edits the AI-generated text. The line will also be difficult to draw when people use AI-powered editing tools to manipulate a work that was initially human-authored, or add significant human authorship to an image that was initially AI-generated.

For example, photos taken on an iPhone in “portrait mode” would be AI-generated, according to some definitions—significant aesthetic features of the work are determined by a machine learning algorithm—but there is no obvious consumer interest in having all such images labeled or watermarked. Indeed, we imagine many iPhone users would object to such an interference.

In other contexts, our comment explained, labeling may be important because the public needs to know whether the content they are being presented with has been manipulated, or even entirely manufactured. If a news report features an image of the Pope in a white puffy jacket, the tools used to create the image are far less important than the fact that the image is fake. Accordingly, in certain contexts, any manipulation of the image or text should be disclosed.

Conclusion

The Copyright Office should be wary of taking stances either for or against the training data fair use defenses and claims that outputs infringe, as courts are the appropriate institutions to decide these questions. Determining whether particular uses of in-copyright works are fair or infringing requires a fact-intensive analysis. Comments submitted in response to the Office’s NOI do not provide it with sufficient information to make well-informed judgments on those issues.

On the training data disclosure and labeling requirement issues, the Office may, however, need to take a stance, as legislation would be necessary to impose such requirements on generative AI developers.

The Office may be influenced on the training data disclosure issue by what the European Union eventually does in its AI Act. The latest draft of that Act would impose such a requirement on AI developers, but this requirement has been contentious in the E.U.

Because some AI developers are voluntarily adopting labeling measures, the Office may decide to endorse this approach and encourage standards bodies to work on labeling specifications so that developers and consumers will be able to discern the provenance of AI outputs.

While this column did not address another major set of questions in the NOI—namely, whether AI-generated outputs are copyrightable—the Office received many comments addressing these questions. Not everyone agrees with the Office’s current policy against copyrights in AI-generated works, so the Office may have to rethink or refine its policy analysis.

In other words, the Office has its hands full with tackling the many complex issues the NOI introduced to the AI policy arena.

Footnotes

a. The performer Sarah Silverman joined two lawsuits against OpenAI and Meta concerning copyright infringement, accusing the companies of training their AI models using her writing without permission.

March 2024 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.