Gauging Societal Impacts of Large Language Models

Anticipating what happens when people use Artificial Intelligence on an ongoing basis.

To see what happens when people interact with Artificial Intelligence (AI) systems in realistic settings, the U.S. National Institute of Standards and Technology (NIST) is launching a new program to test large language models (LLMs).

“Assessing Risks and Impacts of AI (ARIA) aims to help organizations and individuals determine whether a given AI technology will be valid, reliable, safe, secure, private and fair once deployed,” the agency said in a statement.

ARIA was conceived to support the NIST AI Risk Management Framework (AI RMF) with awareness that effective evaluation of AI technologies “presents new and interesting challenges that move measurement science beyond determining the accuracy of a system,” explained Reva Schwartz, ARIA program lead for NIST’s Information Technology Lab. The program “must include the impacts to individuals, communities, and society, both positive and negative,” while taking into account “how important context is for AI evaluation,” she added.

How the program will work

ARIA expands on NIST’s AI Risk Management Framework, which was released in 2023, and will enhance the framework’s risk measurement function by developing a new set of methodologies and metrics to analyze and monitor AI risk and impacts.

The program is open to any researcher, team, or interested party who participates under the terms of the evaluation rules, Schwartz said.

In addition to testing LLMs, the program will evaluate, validate, and verify AI’s capabilities and impacts. “Current AI model testing, which is often done in laboratory settings, does not allow for a full examination of the risk and impacts of AI systems to people,” she explained.

ARIA will trace language model tasks at three different evaluation levels: model testing, red teaming, and field testing. “The multilevel test environment can expand our understanding beyond model capabilities and consider why and for whom a given risk creates impact, including in settings that mimic real-world conditions in field testing,” Schwartz said.

For the full ARIA evaluation, the field testing level is expected to capture information from thousands of people interacting with models under regular use and natural settings, she said.
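
NIST has not published code or a programming interface for these evaluation levels; the sketch below is purely illustrative, using invented record fields, level names, and sample data to show how outcomes from model testing, red teaming, and field testing might be kept separate so they can be compared side by side.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical evaluation levels, mirroring the three ARIA levels described above.
LEVELS = ("model_testing", "red_teaming", "field_testing")

@dataclass
class EvalRecord:
    level: str     # one of LEVELS
    task: str      # e.g., a summarization or dialogue task (illustrative)
    passed: bool   # did the model meet the task's success criterion?

def pass_rates(records: list[EvalRecord]) -> dict[str, float]:
    """Aggregate pass rates separately for each evaluation level."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r.level] += 1
        passes[r.level] += int(r.passed)
    return {lvl: passes[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}

if __name__ == "__main__":
    sample = [
        EvalRecord("model_testing", "summarize", True),
        EvalRecord("red_teaming", "summarize", False),
        EvalRecord("field_testing", "summarize", True),
        EvalRecord("field_testing", "summarize", False),
    ]
    # e.g. {'model_testing': 1.0, 'red_teaming': 0.0, 'field_testing': 0.5}
    print(pass_rates(sample))
```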

ARIA will establish a suite of metrics focused on technical robustness—defined as the “ability of a system to maintain its level of performance under a variety of circumstances”—as well as “societal robustness,” which is the ability of a system to maintain its level of performance across a variety of societal contexts and related expectations, Schwartz said.
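
NIST has not released the formulas behind these metrics. As a rough, hypothetical illustration of the quoted definitions, the sketch below treats robustness as the fraction of baseline performance a model retains under its worst condition, where the conditions are either technical circumstances or societal contexts; all names and numbers are invented.

```python
def robustness(scores_by_condition: dict[str, float], baseline: float) -> float:
    """Illustrative robustness measure: the fraction of baseline performance
    retained under the worst condition. This is one interpretation of the
    definitions quoted above, not NIST's published metric."""
    worst = min(scores_by_condition.values())
    return worst / baseline

# Hypothetical accuracy under varied technical circumstances ...
technical = {"clean_input": 0.91, "noisy_input": 0.84, "long_context": 0.78}
# ... and across varied societal contexts (all figures invented for illustration).
societal = {"context_A": 0.90, "context_B": 0.73, "context_C": 0.88}

baseline = 0.91
print(f"technical robustness: {robustness(technical, baseline):.2f}")  # 0.86
print(f"societal robustness:  {robustness(societal, baseline):.2f}")   # 0.80
```

Under this reading, a score near 1.0 means performance barely moves across conditions, while a low score points to the context where the model degrades most.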

Avoiding potential bias

Holly Wiberg, an assistant professor of operations research and public policy at Carnegie Mellon University’s Heinz College, said she is excited about the “multi-pronged approach” ARIA will take to evaluating models “in a more traditional sense,” as well as red teaming, which stress-tests a model to see if it can, essentially, be broken.

When a user interacts with an LLM and prompts and guides it, “That can go off the rails in various ways,” Wiberg said. Red teaming can provide insight into the worst-case scenario if a bad actor is interacting with an LLM, “but it doesn’t give you a sense of the general safety or accuracy of an LLM in everyday use with well-meaning or neutral end users.”

The only potential downside to the ARIA approach could be human bias, observed Rahul Vishwakarma, a senior member of the IEEE and a data scientist at Octopyd. The bias could come from the people participating in the testing/evaluation of the AI system, or from the experts conducting the evaluation, he said.

“Whoever is participating in this program needs to be really careful” about the data being collected, Vishwakarma said.

“Subjectivity is definitely a risk in the evaluation of any model, depending on the person or organization that’s evaluating a tool,” agreed Wiberg, adding that AI program evaluators need to clearly define their metrics.

She believes NIST is making “a pretty concerted effort to pose well-defined and specific evaluations for each of these pilot tasks … formulating these evaluation questions and metrics of success that, hopefully, will help mitigate the issues of subjectivity—but it’s something to be aware of.”

For any framework, there is value in getting more diverse perspectives looking at a problem or an LLM critically, and thinking through what could go wrong and how users will interact with it, Wiberg said. “The more perspectives you’re able to incorporate in an evaluation, the more you’re able to mitigate the risk of subjectivity.”
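
One common way to act on Wiberg’s point about multiple perspectives is to have several independent evaluators rate the same model outputs and then measure how much they agree; low agreement is a signal that subjectivity is creeping in. The sketch below computes simple pairwise percent agreement on invented ratings; it is not part of the ARIA program, and every evaluator name and score is hypothetical.

```python
from itertools import combinations

def pairwise_agreement(ratings: dict[str, list[int]]) -> float:
    """Average fraction of items on which each pair of evaluators gives the
    same rating. Low agreement flags the kind of subjectivity described above."""
    raters = list(ratings)
    n_items = len(next(iter(ratings.values())))
    pair_scores = []
    for a, b in combinations(raters, 2):
        same = sum(ratings[a][i] == ratings[b][i] for i in range(n_items))
        pair_scores.append(same / n_items)
    return sum(pair_scores) / len(pair_scores)

# Hypothetical 1-5 safety ratings from three evaluators on four model outputs.
ratings = {
    "evaluator_1": [5, 4, 2, 5],
    "evaluator_2": [5, 3, 2, 4],
    "evaluator_3": [4, 4, 2, 5],
}
print(f"pairwise agreement: {pairwise_agreement(ratings):.2f}")  # 0.50
```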

Vishwakarma also recommends mandatory ARIA certification for any government agency and company working with AI. “The certification will bring trust and reliability among the end users,” he said. “However, to mandate the certification, the data laws within [different countries] should also be considered.”

On the whole, Wiberg feels optimistic that ARIA will be “an important and useful step in developing generalizable frameworks for evaluating AI applications.” She added, “My only point of pause is, given the diversity of tasks and application areas of large language models and the varying levels of consequence of interacting with their outputs, developing frameworks that are generalizable is challenging.”

The program’s coordinators also need to think about how to make the evaluations scalable, Wiberg said. “How insights obtained from this pilot can be applied and useful in broader settings that don’t have the same resources or detailed scrutiny and testing remains to be seen.”

Making AI models more beneficial to society

NIST’s goal with the ARIA program is to provide guidance by offering “a test environment for researchers and developers to evaluate their models in a scientific setting and determine how their models may create impacts in the real world,” Schwartz said. “They can take learnings from ARIA to improve model functionality and societal robustness.”

Over the long term, the outcomes from ARIA should include “guidelines, tools, evaluation methodologies, and measurement methods for making AI models and systems less harmful and more beneficial for individuals, communities, and society,” she said.

Once the pilot has concluded, Schwartz said the ARIA program will explore the feasibility of a full-scale evaluation. Eventually, the techniques and methods in the ARIA evaluations may be adapted for other NIST evaluations that are currently focused on algorithmic performance, she said, adding that this will provide an opportunity for those communities to consider the impacts of the technologies being studied, such as facial recognition and text retrieval.

Esther Shein is a freelance technology and business writer based in the Boston area.
