A Hybrid Future for AI

The drive for efficiency brings large language models out of the cloud.

Nvidia’s rise to a $2-trillion valuation at the beginning of 2024 underscored the extraordinary computing demands of artificial intelligence systems that power ChatGPT and a host of other cloud services that create videos, music, and computer programs on demand.

Scaling of compute and memory has provided much of the impetus behind the surge in interest in generative AI based on large language models (LLMs). As models get bigger, they seem to exhibit emergent behaviors that make them more useful. But, because growth in parameter counts has easily outstripped Moore’s Law, such scaling comes at a high cost. Much of the concern around resource usage has focused on the enormous arrays of graphics processing units (GPUs) and accelerators in the clusters used to train models for weeks at a time. Inferencing has a far lower computational demand per token than training, but an influx of users can quickly overwhelm available resources. That limit appears much sooner if the queries are complex and contain thousands of tokens, each of which roughly equates to four characters of text.

The problem has led to the launch of services such as GPT-For-Work and Artificial Analysis that attempt to predict the financial and energy cost of cloud-based LLMs before deployment. According to estimates from Artificial Analysis, processing a million tokens using OpenAI’s GPT-4 Turbo costs $15. Twenty million requests resulting in maximum-length outputs could easily translate into a bill of more than $1 million. However, the GPT-4o version launched in the spring of 2024 would likely cut that number by 50%, although it also gains efficiency by halving the maximum length of its output.
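
A quick back-of-the-envelope check shows how those figures add up. The sketch below, in Python, assumes the roughly $15-per-million-token blended estimate cited above and a hypothetical maximum output of 4,096 tokens per request; both numbers are illustrative rather than official pricing.

    # Back-of-the-envelope cost of maximum-length responses (illustrative figures).
    PRICE_PER_MILLION_TOKENS = 15.00    # USD, the blended estimate cited above
    MAX_OUTPUT_TOKENS = 4096            # assumed maximum-length response
    REQUESTS = 20_000_000

    total_tokens = REQUESTS * MAX_OUTPUT_TOKENS
    cost = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
    print(f"{total_tokens:,} tokens -> ${cost:,.0f}")    # 81,920,000,000 tokens -> $1,228,800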

The cost problem is made worse by the way in which LLMs operate during inferencing. During training, an LLM can ingest entire sentences at a time. This makes full use of the parallel arithmetic engines in GPUs. Inferencing, by contrast, demands a serial feedback loop with a single token driving each iteration.

“The model can’t generate a new token before it has produced the last. And each token has to go through the entire network. We have to run all these weights through the compute unit each time,” explained Joseph Soriaga, senior director of technology at Qualcomm AI Research.
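
In pseudocode terms, that serial loop looks something like the sketch below, where model() is a placeholder for the full network. Real engines cache intermediate activations, but each step still streams the full set of weights through the compute units to produce a single token.

    # Sketch of auto-regressive decoding: one full pass over the model per token.
    # model() is a placeholder that returns the next token given the tokens so far.
    def generate(model, prompt_tokens, max_new_tokens, end_token):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            next_token = model(tokens)     # the entire network runs to produce one token
            tokens.append(next_token)
            if next_token == end_token:    # token N+1 cannot start before token N exists
                break
        return tokens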

Despite the enormous memory-bandwidth requirements of LLMs, the execution units in GPUs are often woefully underused during inferencing because the work is serial. Operators of chatbots with a large user base can gain some parallelism by batching independent queries together. Yet applications that are sensitive to latency make batching harder to justify. Users are also increasingly concerned about privacy, knowing their questions, documents, and conversations will be uploaded to the cloud and possibly used in downstream training and fine-tuning.

For all these reasons, many researchers and commercial suppliers expect more of the work of LLMs to be offloaded to users’ own devices and servers. Though they cannot approach the performance of GPUs designed for cloud servers, consumer devices have the clear benefits of mass adoption, plus electricity bills paid by the users themselves. But how to achieve this when even relatively mature LLMs, such as the GPT-3.5 engine behind the original ChatGPT, need hundreds of gigabytes of low-latency memory to stream their weights to the processor?
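
The arithmetic behind that memory figure is straightforward. The sketch below assumes a 175-billion-parameter model, in line with what has been reported for the GPT-3 family, stored as 16-bit weights; both are assumptions rather than published figures for the ChatGPT service.

    # Rough weight-memory estimate for a GPT-3-class model (assumed figures).
    params = 175_000_000_000      # assumed parameter count
    bytes_per_weight = 2          # 16-bit storage; activations and caches come on top
    gigabytes = params * bytes_per_weight / 1e9
    print(f"~{gigabytes:.0f} GB of weights")    # ~350 GB, far beyond any phone or laptop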

Researchers are attacking this issue on three fronts: the model itself, the data on which it is trained, and system-level engineering. Some of the model-focused techniques come from work on scaling down older convolutional neural networks and are in widespread use. For example, analysis of the neuron weights in a trained model often finds many that are so close to zero they can be pruned from the model with little loss in accuracy.

Unfortunately, pruning does not map well onto the matrix-multiplication engines of the GPUs and neural accelerators in consumer platforms, because the surviving weights are scattered irregularly and the hardware still processes full, dense matrices. It is possible to cull more than half the weights in a model and still not see a speedup.
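
A minimal sketch of magnitude pruning, written here with NumPy for illustration, shows why: zeroing small weights leaves the matrices the same shape, so a dense matrix-multiply engine does exactly the same amount of work unless the zeros are repacked into a sparse format.

    import numpy as np

    # Magnitude pruning: zero out weights whose absolute value falls below a threshold.
    def prune_by_magnitude(weights, sparsity):
        threshold = np.quantile(np.abs(weights), sparsity)
        return np.where(np.abs(weights) < threshold, 0.0, weights)

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4096, 4096)).astype(np.float32)
    w_pruned = prune_by_magnitude(w, sparsity=0.6)      # cull ~60% of the weights
    x = rng.normal(size=(4096,)).astype(np.float32)
    y = w_pruned @ x     # still a full dense multiply: no speedup without sparse kernels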

A team led by Christopher Ré, associate professor of computer science at Stanford University, and Beidi Chen, assistant professor of electrical and computer engineering at Carnegie Mellon University, worked on an alternative approach to pruning for LLMs, calling it contextual sparsity. This works on the intuition that each input to an LLM will only activate a tiny portion of the overall model on each pass. If you can predict ahead of time which of those portions will make a difference, you can deactivate as much as 85% of the model. In effect, it prunes the network for each request, delivering almost a threefold reduction in latency. The team used a machine-learning model to determine which groups to activate based on the input, as this maps better onto the matrix-multiplication engines of GPUs than a hash-based algorithm they used initially.
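
The sketch below gives a highly simplified picture of the idea rather than the team’s Déjà Vu implementation: a lightweight predictor scores which blocks of a feed-forward layer matter for the current input, and only those blocks are computed.

    import numpy as np

    # Simplified illustration of contextual sparsity (not the authors' code):
    # a cheap predictor picks the blocks of MLP neurons to run for this input.
    def sparse_mlp(x, w_up_blocks, w_down_blocks, predictor, keep_fraction=0.15):
        scores = predictor(x)                               # one relevance score per block
        k = max(1, int(len(w_up_blocks) * keep_fraction))   # keep the top-scoring ~15%
        active = np.argsort(scores)[-k:]
        out = np.zeros(w_down_blocks[0].shape[0], dtype=x.dtype)
        for i in active:
            hidden = np.maximum(w_up_blocks[i] @ x, 0.0)    # this block's neurons + ReLU
            out += w_down_blocks[i] @ hidden                # its contribution to the output
        return out                                          # ~85% of the layer never runs

    rng = np.random.default_rng(0)
    w_up_blocks = [rng.normal(scale=0.02, size=(128, 512)).astype(np.float32) for _ in range(32)]
    w_down_blocks = [rng.normal(scale=0.02, size=(512, 128)).astype(np.float32) for _ in range(32)]

    def toy_predictor(x):
        # Toy stand-in for the learned predictor: probe a few neurons of each block.
        return np.array([float(np.abs(w[:8] @ x).mean()) for w in w_up_blocks])

    y = sparse_mlp(rng.normal(size=(512,)).astype(np.float32), w_up_blocks, w_down_blocks, toy_predictor)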

Hardware designers have embraced quantization, including block-scaled “microscaling” formats, in which weights are stored and processed using formats with far less precision than the 32-bit floating-point words used in training. AI accelerators routinely support parallel execution on narrow integer and even 4-bit floating-point formats, which dramatically cuts memory size and bandwidth. Many papers have shown ways to fine-tune number-format choices to better match the distribution of weights in real-world models and avoid the loss in accuracy that comes with a move away from high-resolution floating point. However, quantization does not directly address the lack of parallelism in LLM inferencing.
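
As a concrete, if simplified, picture of how this saves memory, the sketch below quantizes one small block of weights to signed 4-bit integers with a shared scale factor, in the spirit of block-scaled formats; the block size and format details are illustrative assumptions.

    import numpy as np

    # Symmetric 4-bit quantization with one scale per block of weights (sketch).
    def quantize_block(block, bits=4):
        qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit values
        scale = float(np.max(np.abs(block))) / qmax
        if scale == 0.0:
            scale = 1.0
        q = np.clip(np.round(block / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale                             # real kernels pack two 4-bit values per byte

    def dequantize_block(q, scale):
        return q.astype(np.float32) * scale

    rng = np.random.default_rng(0)
    block = rng.normal(scale=0.02, size=(64,)).astype(np.float32)    # one 64-weight block
    q, scale = quantize_block(block)
    error = float(np.max(np.abs(block - dequantize_block(q, scale))))
    print(f"worst-case error in this block: {error:.5f}")
    # 32 bits per weight shrink to 4 bits plus a shared scale: roughly 8x less memory.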

Increased parallelism could come from another widespread technique for building smaller models: knowledge distillation. This technique uses a full-sized model to train a far smaller model at the cost of overall accuracy. Yet that loss of accuracy need not be a major issue if the smaller model helps speed up the much larger version. That is the logic that led two groups working independently within the conglomerate Alphabet to develop more or less the same method to slash the overhead of running tokens through a server-based model.

The core idea appeared more than five years ago: predict a series of outputs to give the engine many tokens to work on in parallel. However, both the DeepMind and Google teams determined the techniques used then to be too inaccurate to deliver a useful speedup: the large model had to reject too many of the candidate tokens. Instead, both teams proposed using a less-accurate but much more lightweight model to act as a “draft” engine. The draft model still runs auto-regressively but, with far fewer weights, its outputs appear far more rapidly. Once enough tokens have been generated, the group is presented to the full model, which verifies them in a single pass, keeping the candidates it agrees with and replacing the first one it rejects with a single, more accurate token of its own.
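
In simplified, greedy form, one round of that draft-and-verify loop looks like the sketch below; draft_model() and full_model are placeholders, and published methods such as the one by Leviathan et al. (see Further Reading) use a probabilistic acceptance rule that exactly preserves the large model’s output distribution.

    # Simplified, greedy sketch of speculative decoding (placeholder model objects).
    def speculative_step(draft_model, full_model, tokens, draft_len=4):
        # 1. The cheap draft model proposes several tokens auto-regressively.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_model(tokens + draft))
        # 2. The full model checks the whole proposal in a single parallel pass.
        verified = full_model.score_positions(tokens, draft)   # one prediction per position
        accepted = []
        for proposed, preferred in zip(draft, verified):
            if proposed == preferred:
                accepted.append(proposed)    # keep tokens the full model agrees with
            else:
                accepted.append(preferred)   # replace the first disagreement...
                break                        # ...and discard the rest of the draft
        return tokens + accepted             # even the worst case yields one verified token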

A key advantage of this approach is that the worst-case token rate is only slightly lower than that of the original model. Experiments showed the technique can deliver an overall speedup of more than 2.5 times. Qualcomm is among the organizations working on device-based AI that see this combination of draft and full models as a way of splitting the workload between consumer devices and the cloud. In this scenario, a consumer device runs the draft model and presents groups of candidate tokens to a cloud-hosted model for verification. Though this split architecture adds round-trip network delays, the resulting token rate may prove better than sending the original request to the server. Qualcomm has achieved token rates approaching 20 per second with a Llama-7B model running solely on its Snapdragon processor, which Soriaga said opens the possibility of splitting workloads between devices and the cloud.

For systems that need to work with models that cannot be downsized easily, a team from Yandex, Hugging Face, and the University of Washington developed their Petals engine to exploit the same kind of crowdsourcing that underpinned scientific projects like Folding@Home and SETI@Home.

Petals divides the inference work of an LLM across a peer-to-peer network of machines, many of which could be donated by volunteers, providing a complement to federated learning, which does the same for training. But this work has also shown how distributing workloads, including speculative decoding, may run into performance issues. Though speed is largely unaffected by network bandwidth, latency is an issue. For Petals, the computation rate drops by more than half as round-trip latency increases from a minimum of 5ms to the 100ms that is common in cloud deployments.
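
A rough latency model makes the effect clear: when every generated token has to traverse a chain of peers in sequence, the round-trip delays add directly to the per-token time, regardless of bandwidth. The stage counts and compute times below are illustrative assumptions, not Petals measurements.

    # Rough per-token latency model for pipelined, peer-to-peer inference (toy numbers).
    def tokens_per_second(stages, compute_ms_per_stage, rtt_ms):
        per_token_ms = stages * (compute_ms_per_stage + rtt_ms)
        return 1000.0 / per_token_ms

    for rtt in (5, 100):    # round-trip latency in milliseconds
        rate = tokens_per_second(stages=4, compute_ms_per_stage=20.0, rtt_ms=rtt)
        print(f"RTT {rtt:3d} ms -> ~{rate:.1f} tokens/s")
    # With these assumed numbers the rate falls from ~10 to ~2 tokens per second,
    # echoing the kind of drop reported for Petals as latency grows.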

In their review of the many different techniques being pursued across the industry for LLM optimization, Emory University associate professor of computer science Liang Zhao and colleagues found there is a lack of comprehensive benchmarks. Those that exist are quite narrow in scope, making it hard to trade off optimizations. Each will have different effects on energy consumption, memory usage, processing cost, and latency.

“We believe that establishing such a benchmark is crucial and would prove immensely beneficial for researchers and practitioners,” said Zhao, whose team has surveyed numerous cost-focused optimization techniques. In that work, they found significant obstacles to evaluating how each change affects performance.

“The size of LLMs poses significant challenges for conducting extensive combinatorial evaluations, particularly when comparing optimizations like pruning or quantization that may involve retraining models or applying changes post-training,” Zhao added.

In parallel with work on making existing LLMs smaller, researchers are looking at changes in architecture and training strategies that would yield more efficient inferencing engines. Nikhil Sardana and Jonathan Frankle of Databricks subsidiary MosaicML argued at NeurIPS 2023 that it can make sense to spend more on training a model for longer if deployment demands a relatively small model.

Others are designing LLMs for more consistent results than are typically achieved using knowledge distillation. Named after Matryoshka nesting dolls, MatFormer, developed by a team from several U.S. universities working with Google Research, jointly optimizes a full-size Transformer network alongside a set of smaller, nested networks. The researchers claimed that, when the nested models are split out and used independently, their output distributions stay much closer to each other because they are all trained in the same procedure.
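
The sketch below gives a simplified picture of the nesting idea rather than the MatFormer implementation itself: each smaller model uses a prefix of the full model’s feed-forward width, so its weights are literally a slice of the larger network’s, and joint training keeps their outputs aligned.

    import numpy as np

    # Simplified nested ("Matryoshka-style") feed-forward layer, not MatFormer itself.
    def nested_ffn(x, w_up, w_down, width_fraction=1.0):
        hidden_size = int(w_up.shape[0] * width_fraction)      # e.g. 0.25, 0.5, or 1.0
        hidden = np.maximum(w_up[:hidden_size] @ x, 0.0)       # prefix of the up-projection
        return w_down[:, :hidden_size] @ hidden                # matching slice going down

    rng = np.random.default_rng(0)
    w_up = rng.normal(scale=0.02, size=(2048, 512)).astype(np.float32)
    w_down = rng.normal(scale=0.02, size=(512, 2048)).astype(np.float32)
    x = rng.normal(size=(512,)).astype(np.float32)
    small = nested_ffn(x, w_up, w_down, width_fraction=0.25)   # cheap nested sub-model
    full = nested_ffn(x, w_up, w_down, width_fraction=1.0)     # full-size model
    # Joint training optimizes all widths together, which is what keeps the
    # smaller models' output distributions close to the full model's.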

With progress being made on multiple fronts, the rapid growth of research in this area shows the industry is keen to find a solution to the spiraling cost of AI even if the work is itself often expensive to carry out.

Further Reading

  • Bai, G., Chai, Z., Ling, C., Wang, S., Lu, J., Zhang, N., Shi, T., Yu, Z., Zhu, M., Zhang, Y., Yang, C., Cheng, Y., and Zhao, L.
    Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models, arXiv:2401.00625 (2024)
  • Liu, Z., Wang, J., Dao, T., Zhou, T., Yuan, B., Song, Z., Shrivastava, A., Zhang, C., Tian, Y., Ré, C., and Chen, B.
    Déjà Vu: Contextual Sparsity for Efficient LLMs at Inference Time, Proceedings of the 40th International Conference on Machine Learning (2023), Article No. 919, pp. 22137-22176
  • Leviathan, Y., Kalman, M., and Matias, Y.
    Fast Inference from Transformers via Speculative Decoding, Proceedings of the 40th International Conference on Machine Learning (2023), Article No. 795, pp. 19274-19286
  • Devvrit, K., Kudugunta, S., Kusupati, A., Dettmers, T., Chen, K., Dhillon, I.S., Tsvetkov, Y., Hajishirzi, H., Kakade, S.M., Farhadi, A., and Jain, P.
    MatFormer: Nested Transformer for Elastic Inference, Workshop on Advancing Neural Network Training (WANT) at NeurIPS 2023, arXiv:2310.07707
