
Communications of the ACM

ACM TechNews

We Could Run Out of Data to Train AI Language Programs


Running out of words.

If data shortages push AI researchers to incorporate more diverse data sets into the training process, it would be a “net positive” for language models, says USC's Swabha Swayamdipta.

Credit: Stephanie Arnett/MITTR

Researchers at the Epoch artificial intelligence (AI) research and forecasting organization warn of the potential depletion of data for training AI language algorithms as early as 2026.

The creation of more powerful and capable language models requires finding ever-more training texts.

AI researchers categorize this data as high quality or low quality; Epoch's Pablo Villalobos said high-quality text is the preferred training data, because researchers want models to replicate the language of well-written sources.

The University of Southern California's Swabha Swayamdipta said data shortages could prompt a "net positive" redefinition of low and high quality that benefits language models.

Researchers also may invent methods for extending the life of training data, with Swayamdipta suggesting a model could be trained on the same data multiple times.

From MIT Technology Review

Abstracts Copyright © 2022 SmithBucklin, Washington, DC, USA


 
