The exponential growth of big data has powered a multitude of technology innovations. Yet, it was the mass adoption of Generative AI (GenAI) that pushed large language models (LLMs), built on enormous volumes of data, into the public consciousness.
Following the launch of OpenAI’s ChatGPT in 2022, other technology companies raced to scale up, supercharge, and release their own LLMs; innovations such as Google’s Gemini, Anthropic’s Claude, and Apple Intelligence have become mainstream in a staggeringly short timeframe. By March 2024—according to a report by the World Bank—the top 40 GenAI products were attracting nearly 3 billion visits each month, and Statista has forecast there will be nearly 730 million global users of AI tools by the end of this decade.
However, this high-speed rollout of GenAI and LLMs has also seen a snowballing of widely publicized snags, such as intellectual property disputes and concerns about bias, privacy, and sustainability. In parallel, users in specialist sectors, such as law or finance, soon discovered that LLMs did not always meet their domain-specific needs, and lay users found—often to their amusement—that GenAI tools are prone to making things up; the euphemism “hallucinations” quickly entered common parlance.
Against this backdrop, smaller models and datasets have emerged as a solution to some of their larger cousins’ drawbacks. Techniques such as knowledge distillation (transferring knowledge from a large model to a smaller one) and pruning (removing redundant model parameters, such as weights, without degrading accuracy) are also supporting the shrink. These techniques, along with developments in edge computing enabled by smaller models that can run ‘on device,’ raise questions about the dominance of gigantic datasets and models.
Bigger is not always better
“The heavy reliance of LLMs on massive datasets raises real concerns about sustainability,” said Amir H. Gandomi, a professor of data science at the University of Technology Sydney, Australia. “I think we’re going to see some changes down the line. For starters, there’s the whole issue of content scraping; copyright disputes and new regulations are already pushing companies to rethink how they get their training data.”
Gandomi also flags data quality, misinformation, and noise as factors that undermine the reliability of large models, which are also hungry for energy and computing power, he said. With co-authors Ishfaq Hussain Rather and Sushil Kumar of Jawaharlal Nehru University, New Delhi, India, Gandomi recently published a review of deep learning techniques for democratizing AI with smaller datasets.
“What’s great about smaller, curated datasets is they’re faster to process, cost less to use, and are ideal for real-time applications or resource-constrained environments,” Gandomi explained. “While they may sometimes lack the generalizability of large-scale datasets, removing irrelevant or noisy data makes [them] more accurate and easier to interpret, which is especially critical in fields like healthcare, where every decision matters.”
According to Gandomi, the shift toward “smarter, resource-efficient models” is being enabled by methods such as transfer learning, knowledge distillation, and retrieval-augmented generation. “There’s a lot of potential in techniques like transfer learning and active learning, which can make small datasets even more powerful,” he said, adding that open-source AI is also supporting a move to smaller models as it makes tools and models publicly accessible.
“Techniques like synthetic data generation and few-shot learning are making it possible to do more with less, so individuals, startups, and researchers can build impactful AI without relying on massive infrastructure,” said Gandomi. “At the end of the day, the more accessible AI becomes, the more innovation stays in everyone’s hands, not just the big players.”
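Of these techniques, transfer learning is perhaps the simplest to picture: a model pretrained on a large, general corpus is reused, and only a small task-specific component is trained on the curated dataset. The Python sketch below is a minimal illustration of that pattern; the checkpoint, toy examples, and hyperparameters are placeholders rather than anything drawn from Gandomi’s work.

```python
# Hedged sketch of transfer learning with a small dataset: reuse a pretrained
# encoder (a generic BERT checkpoint here) and train only a lightweight
# classification head on a handful of labeled examples. The texts, labels,
# and hyperparameters are placeholders, not drawn from any study cited above.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
for param in encoder.parameters():        # freeze the large pretrained model
    param.requires_grad = False

head = torch.nn.Linear(encoder.config.hidden_size, 2)  # tiny task-specific layer
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

texts = ["the trial met its primary endpoint", "results were inconclusive"]
labels = torch.tensor([1, 0])             # toy stand-in for a small curated dataset

batch = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    features = encoder(**batch).last_hidden_state[:, 0]  # [CLS] embeddings

loss = F.cross_entropy(head(features), labels)  # only the small head is trained
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```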
Julian Faraway, an expert in data and statistics at the University of Bath in the U.K., similarly points to the importance of precision over size in specialist sectors such as healthcare.
Faraway highlights the advantages of smaller datasets in pharmaceuticals, where candidate drugs are tested via randomized clinical trials, which “produce very high-quality but small datasets, which enables reliable decisions. In contrast, bigger observational datasets have been less useful. It can be difficult to extract useful information, due to unseen biases and defects in the data.”
While big data can be useful for predictive tasks, it is unreliable for explanation, Faraway said. “If you want causal explanations, small data from designed experiments is often more effective. A knowledge of causation is critical in improving systems.”
Downsizing, distilling, and pruning
While open-source technologies may democratize AI and enable anyone to build smaller, sector-specific models, the biggest names in the industry are also embracing the ‘downsizing’ trend. In 2024, Google distilled Gemini 1.5 Flash from Gemini 1.5 Pro, and launched Gemma, a family of lightweight LLMs: Gemma 2B has two billion parameters and Gemma 7B has seven billion.
Meanwhile, Meta FAIR researchers Andrey Gromov and Kushal Tirumala and their co-authors analyzed a “layer-pruning strategy” for open-weight LLMs; that is, LLMs whose parameters are publicly accessible. The team found “minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed.”
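Conceptually, the strategy is straightforward: remove a contiguous block of transformer layers and keep the rest of the network intact. The sketch below illustrates the idea in Python using GPT-2 as a small, publicly available stand-in; the block chosen for removal is arbitrary, and a real pipeline would re-benchmark the pruned model and typically apply a small amount of fine-tuning to recover lost performance.

```python
# Illustrative layer-pruning sketch: drop a contiguous block of transformer
# layers from an open-weight model and keep everything else. GPT-2 is used
# only as a small public stand-in; which block to remove is arbitrary here.
import torch.nn as nn
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")    # 12 transformer blocks
blocks = model.transformer.h                       # nn.ModuleList of blocks

drop_start, drop_end = 6, 9                        # remove blocks 6, 7, and 8
kept = nn.ModuleList(
    block for i, block in enumerate(blocks)
    if not (drop_start <= i < drop_end)
)
model.transformer.h = kept
model.config.n_layer = len(kept)                   # keep the config consistent

print(f"kept {len(kept)} of {len(blocks)} transformer blocks")
# The pruned model would normally be re-evaluated and lightly fine-tuned.
```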
At Nvidia research labs, a team of researchers used pruning and knowledge distillation to demonstrate that an existing LLM could be re-trained with less than 3% of its original training data. The resultant small language models (SLMs), named the Minitron family, were “significantly cheaper to obtain compared to training each model from scratch (requiring up to 40× fewer training tokens), while still performing favorably to a number of similarly sized community models,” according to their report.
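Knowledge distillation, the other half of that recipe, follows a familiar pattern: a small ‘student’ model is trained to match the softened output distribution of a larger, frozen ‘teacher,’ alongside the ordinary task loss. The PyTorch sketch below is a toy illustration of that pattern with made-up models and arbitrary hyperparameters, not a reproduction of Nvidia’s pipeline.

```python
# Toy knowledge-distillation sketch (PyTorch): a small "student" is trained to
# match the softened output distribution of a larger, frozen "teacher", mixed
# with the ordinary hard-label loss. Model sizes, temperature, and the loss
# weighting are illustrative values only.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

T = 2.0        # temperature: softens the probability distributions
alpha = 0.5    # balance between distillation loss and hard-label loss

def distill_step(x, labels):
    with torch.no_grad():
        teacher_logits = teacher(x)                # teacher stays frozen
    student_logits = student(x)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One training step on random data standing in for a real batch.
x = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
print(f"distillation loss: {distill_step(x, labels):.3f}")
```

The temperature is the design choice that matters most here: raising it exposes more of the teacher’s knowledge about how classes relate to one another, which the student would miss if it were trained on hard labels alone.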
Edge computing—where data is processed locally ‘on device’ for faster response and reduced reliance on the cloud—is also fueling the development of small, high-performance models. Microsoft, for example, has launched phi-3.5-mini, a 3.8-billion parameter language model trained on 3.3 trillion tokens, which is designed to run locally on a user’s smartphone.
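The on-device pattern itself is easy to sketch: load a small open-weight checkpoint and generate text entirely on local hardware, with no call to a cloud API. The example below does that in Python; the Hugging Face model identifier is an assumed public release of phi-3.5-mini, and any similarly sized open model could be swapped in.

```python
# Minimal sketch of the 'on device' pattern: load a small open-weight model
# and generate text on local hardware, with no cloud API involved. The
# checkpoint name is an assumed public Hugging Face ID for phi-3.5-mini;
# any similarly sized open model can be swapped in.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct")

prompt = "In one sentence, why do small language models suit edge devices?"
result = generator(prompt, max_new_tokens=60)
print(result[0]["generated_text"])
```

Once quantized, a model of this size occupies only a few gigabytes of memory, which is what makes the low-latency, local scenario described above feasible on phones and laptops.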
Gandomi believes edge computing could be a “game-changer” for the future of smaller datasets and models. “Smaller LLMs work well here because they require lower latency and less computational power, making them ideal for real-time edge environments.”
As edge computing pushes AI to be more efficient and practical, developers will move away from the “bigger is better” mindset, said Gandomi. “It’s making space for smaller, smarter models that can get the job done without needing massive resources. It’s about making AI work smarter, not just bigger.”
Technology companies have invested billions of dollars in building extremely large models, and it is safe to assume their associated AI products will be embedded into our daily lives—unless legal or regulatory hitches throw a wrench into the works. However, as sustainability and domain-specificity become more important to users, developers continue to shrink models, and processes shift to the edge, small could—in some cases—be the new big.
Karen Emslie is a location-independent freelance journalist and essayist.