There is no denying that deep learning, especially generative AI, has deeply transformed how we use computers. It has also quickly become the focus of a very large fraction of the world's computing resources.
Most deep-learning system implementations involve expressing the machine learning (ML) model in some higher-level framework (for example, Caffe in the early days, then TensorFlow, and now PyTorch). At inference time (execution), the models in early ML systems were essentially interpreted: the framework evaluated the model like an interpreter, calling lower-level operator libraries that implemented the actual computation required for each portion of the model. These libraries are highly hardware-dependent and were often written by hand! Today, deep learning runs on a very broad range of devices, from low-power sensor nodes to gigantic, specialized machines in datacenters spread across the globe. The cross-product of model operators and the hardware users want to run them on is very large, so this purely library-based approach is simply not sustainable and cannot move at the pace at which AI models are developed. It also misses many optimization opportunities that involve crossing operator boundaries, from the model representation down to how the code for a given operator is generated on different hardware.
Performance optimizations are essential to making deep learning deployable and economically viable. Over the past five years or so, there have been several significant efforts to bring a compilation-based (as opposed to library-based) approach to deep learning. The idea is to use compiler techniques to optimize and generate code for a given model on a specific hardware target. Using a compiler makes supporting new models and new hardware more sustainable, and it makes it possible to apply end-to-end optimizations. I have been fortunate to be involved in one such early (and still going strong) effort, Apache TVM. ML engineers love compilers because they make life simpler, letting them spend more time building better models rather than hand-optimizing them. And hardware vendors love them because models can now run well on their hardware and make good use of its computing resources. ML compilation is now becoming ubiquitous, with vendors like NVIDIA offering compilers that target their hardware and PyTorch 2.0 embracing compilation natively.
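To make the compilation-based approach concrete, here is a minimal sketch, assuming a PyTorch 2.x installation, of what it looks like from the user's side; the toy model is my own illustration, not anything from the paper:

import torch
import torch.nn as nn

# A small example model; any nn.Module would do.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

# Ask the framework to compile the model into optimized code for the
# current backend, rather than interpreting it operator by operator.
compiled_model = torch.compile(model)

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    y = compiled_model(x)  # the first call triggers compilation
print(y.shape)             # torch.Size([1, 10])

The first call is slower because it includes compilation; subsequent calls reuse the generated code.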
ML compilers honor the semantics of model architectures; that is, they strive not to change their numerical behavior. But many different model architectures effectively offer the same functionality with similar accuracy, and different architectures may expose different lower-level compiler optimization opportunities. Searching for the architecture variant that best enables those lower-level optimizations is therefore an attractive way to unlock further performance, and that is the concept proposed in the accompanying paper.
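As an illustration (mine, not taken from the paper), consider two convolution variants that play a similar architectural role yet have very different compute structures, and therefore give a compiler different optimization opportunities:

import torch
import torch.nn as nn

# Variant A: a standard 3x3 convolution.
conv_standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Variant B: a depthwise-separable convolution with the same input and
# output shapes. Its compute and memory-access pattern is quite
# different, so a compiler may optimize it very differently on a given
# hardware target.
conv_separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 64, kernel_size=1),                        # pointwise
)

x = torch.randn(1, 64, 56, 56)
assert conv_standard(x).shape == conv_separable(x).shape

After training, either variant may reach similar accuracy, but only one of them may map well onto a given accelerator.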
The key idea presented in the paper is to express model architecture search as a program transformation, so that it can be naturally unified with the optimization and compilation process. A central question in this unification is how to express the legality of transformations at the model architecture level. For example, if there are several implementation choices for a given convolution in the model architecture, that choice can be expressed as a code transformation. Since many behaviors are equivalent accuracy-wise, the optimization process needs guidance on which architecture-level transformations are considered equivalent. A brute-force way would be to simply retrain and evaluate accuracy for every choice, but that is computationally expensive.
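To show the shape of such an architecture-level transformation, here is a toy sketch; the miniature IR and the rewrite rule are hypothetical illustrations of mine, not the paper's or TVM's actual representation:

from dataclasses import dataclass, replace

# Hypothetical toy IR: a model is simply a list of operator nodes.
@dataclass(frozen=True)
class Conv2d:
    channels: int
    kernel: int
    groups: int = 1

def to_depthwise_separable(op):
    # One architecture-level rewrite: replace a standard convolution
    # with a depthwise + pointwise pair.
    if isinstance(op, Conv2d) and op.groups == 1 and op.kernel > 1:
        return [replace(op, groups=op.channels),  # depthwise
                Conv2d(op.channels, kernel=1)]    # pointwise
    return [op]

model = [Conv2d(64, kernel=3), Conv2d(64, kernel=3)]
variant = [new_op for op in model for new_op in to_depthwise_separable(op)]

Expressed this way, the architecture choice becomes just another rewrite the compiler can apply and evaluate alongside its usual lower-level transformations.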
The authors propose a clean and efficient way to obviate the need for expensive retraining and evaluation at every choice, quickly rejecting variants that do not preserve acceptable accuracy. It works as follows: define an equivalence notion based on representational capacity, which determines the ability of a given network architecture to learn; then use a cheap-to-compute legality check that determines whether a candidate transformation would significantly compromise representational capacity, and if so, discard that transformation without retraining. The paper's experimental evaluation shows that this check is effective at discarding invalid transformations. Importantly, the paper implements an end-to-end system in the Apache TVM framework and shows encouraging performance results, often more than 3x better inference performance with equivalent accuracy.
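A hedged sketch of that workflow, with placeholder names (capacity_proxy, compile_and_benchmark, tolerance) that are mine rather than the paper's API, might look like this:

def search(model, candidate_transformations, capacity_proxy,
           compile_and_benchmark, tolerance=0.05):
    # Cheap proxy for representational capacity; no retraining needed.
    baseline = capacity_proxy(model)
    best_model, best_latency = model, compile_and_benchmark(model)

    for transform in candidate_transformations:
        variant = transform(model)
        # Legality check: discard variants whose estimated capacity
        # drops too far below the baseline.
        if capacity_proxy(variant) < (1 - tolerance) * baseline:
            continue
        latency = compile_and_benchmark(variant)
        if latency < best_latency:
            best_model, best_latency = variant, latency

    return best_model

Only variants that pass the cheap check are compiled and measured, which is what keeps the overall search tractable.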
This line of work will only grow in importance, given the appetite for all the shiny new AI features in applications we use daily. Without squeezing as much performance as possible out of the underlying hardware, AI will clearly have limited impact. So, it is on us system builders to create efficient systems to sustain that. I hope you enjoy reading this paper as much as I did. Onward!