Many believe the company that enables real intelligence on end devices (such as mobile and IoT devices) will define the future of computing. Racing toward this goal, many companies, from tech giants such as Google, Microsoft, Amazon, Apple, and Facebook to startups, spend tens of billions of dollars each year on R&D.
Assuming hardware is the major constraint for enabling real-time mobile intelligence, more companies dedicate their main efforts to developing specialized hardware accelerators for machine learning and inference. Billions of dollars have been spent to fuel this intelligent hardware race.
This article challenges that view. Drawing on CoCoPIE, a recent real-time AI optimization framework, it maintains that with effective compression-compilation co-design, a large potential remains untapped for enabling real-time artificial intelligence (AI) on mainstream end devices.
The principle of compression-compilation co-design is to design the compression of deep learning models and their compilation to executables in a hand-in-hand manner. This synergistic method can effectively optimize both the size and speed of deep learning models, and it can also dramatically shorten the tuning time of the compression process, largely reducing the time to market of AI products. When applied to models running on mainstream end devices, the method can produce real-time experience across a set of AI applications that had been broadly perceived as possible only with special AI accelerators.
Forgoing the need for special hardware for real-time AI has some profound implications, thanks to the multifold advantages of mainstream processors over special hardware:
These reasons suggest that whenever mainstream processors can meet the speed and efficiency requirements of an AI application, they should be the preferred device to consider. The industry, however, currently puts much emphasis on special AI hardware development, based on the assumed insufficiency of mainstream processors in meeting real-time requirements. In the rest of this article, we explain why that assumption is largely biased when compression-compilation co-design is used, how the principle can be materialized effectively into a practical framework, CoCoPIE, and the implications for the AI industry.
Compression and compilation are the two key steps in fitting a deep learning model onto hardware for efficient execution. Model compression is a common technique for reducing the size and improving the speed of deep learning models. Compression techniques fall into two categories, pruning and quantization. Pruning removes layers, convolution filters, or channels, while quantization reduces the precision of parameters (for example, from floating-point to short integer). Compilation refers to the process of generating executable code from a given deep learning model. It is, in essence, a process of mapping the high-level operations in deep learning to the low-level instructions that the underlying hardware supports. The process plays a critical role in optimizing the code for efficient execution.
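As a rough illustration of the two compression categories just described, the following minimal sketch prunes whole filters by their L1 norm and quantizes the remaining weights to 8-bit integers. The function names, the L1-norm ranking criterion, and the symmetric int8 scheme are assumptions chosen for illustration; they are not taken from CoCoPIE.

```python
from typing import List, Tuple


def prune_filters(filters: List[List[float]], keep: int) -> List[List[float]]:
    """Structured pruning: keep the `keep` filters with the largest L1 norm.

    The L1-norm criterion is an illustrative assumption, not CoCoPIE's method.
    Surviving filters stay in their original order.
    """
    norms = [sum(abs(w) for w in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [filters[i] for i in kept]


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(values: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate floating-point weights."""
    return [v * scale for v in values]
```

Pruning shrinks the number of parameters, while quantization shrinks the storage of each parameter; the two can be applied together.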
The principle of compression-compilation co-design is to design the two components for AI in a hand-in-hand manner. The synergy may be exhibited at three levels:
All the examples mentioned are part of a software framework for mobile AI named CoCoPIE. We will next give an overview of CoCoPIE based on previous publications,8,18 and then use each of its main components to explain the compression-compilation co-design principle and the significant benefits.
CoCoPIE stands for Compression-Compilation co-design for Performance, Intelligence, and Efficiency. It is a software framework we have recently put together that holds numerous records on real-time AI on mainstream end devices in both performance and energy efficiency.
CoCoPIE consists of two main components, both of which reflect the compression-compilation co-design principle: CoCo-Gen generates efficient DNN execution code via a synergy of pattern-based DNN pruning and pattern-aware code generation; CoCo-Tune dramatically shortens the process of identifying the appropriate set of DNN parameters to prune by means of a composability-based compiler framework. Here, we explain each of the two components and how compression-compilation co-design makes them possible.
CoCo-Gen: Pattern-based pruning and code generation. Weight pruning reduces the number of weights in a DNN. As shown in Figure 1, prior weight pruning methods fall into two categories: general, non-structured pruning, where arbitrary weights can be pruned; and structured pruning, which prunes filters or channels in a way that produces regular and smaller weight matrices. Because regular structures fit GPU/CPU execution better, DNNs from regular compression have shown better speeds than those from irregular compression,10,16,20 but they are subject to more notable accuracy loss.10,20
We introduce a new method, pattern-based pruning, which features fine-grained pruning patterns inside coarse-grained structures.
Figure 2 illustrates the basic idea of pattern-based pruning. For each kernel (in a CONV filter), a fixed number of weights are pruned, and the remaining weights (white cells) form specific "patterns." We define the example in Figure 2 as 4-entry pattern pruning, since every kernel reserves four non-zero weights out of the original 3 × 3 kernel (the most commonly used kernel). It can generalize to other kernel sizes and fully connected layers. Each kernel has the flexibility of choosing among a number of predefined patterns.
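To make 4-entry pattern pruning concrete, here is a minimal sketch that keeps four weights of a 3 × 3 kernel according to whichever predefined pattern preserves the most weight energy. The four patterns listed and the greedy energy criterion are hypothetical stand-ins for illustration; the actual pattern set and the ADMM-based selection are described in the cited work.

```python
from typing import List, Sequence, Tuple

# A 3x3 kernel is stored row-major, so index 4 is the center weight.
# These four 4-entry patterns are hypothetical examples, not CoCoPIE's set.
PATTERNS: List[Tuple[int, int, int, int]] = [
    (0, 1, 3, 4),  # top-left 2x2 block
    (1, 2, 4, 5),  # top-right 2x2 block
    (3, 4, 6, 7),  # bottom-left 2x2 block
    (4, 5, 7, 8),  # bottom-right 2x2 block
]


def apply_pattern(
    kernel: Sequence[float],
    patterns: Sequence[Tuple[int, ...]] = PATTERNS,
) -> Tuple[int, List[float]]:
    """Pick the pattern that retains the largest sum of squared weights.

    Returns the chosen pattern id and the pruned kernel, in which every
    weight outside the pattern is set to zero.
    """
    best = max(
        range(len(patterns)),
        key=lambda p: sum(kernel[i] ** 2 for i in patterns[p]),
    )
    keep = set(patterns[best])
    pruned = [w if i in keep else 0.0 for i, w in enumerate(kernel)]
    return best, pruned


kernel = [0.1, 0.9, 0.8, 0.0, 0.5, 0.7, 0.1, 0.0, 0.2]
pattern_id, pruned_kernel = apply_pattern(kernel)
```

Because every pruned kernel is tagged with a small pattern id, a compiler can know at code-generation time exactly which four positions are non-zero.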
At the theory and algorithm levels, such patterns exhibit similarities to the connection structure in human visual systems.12,13,15 At the compiler level, the known patterns allow a compiler to reorder and generate code at the filter and kernel level such that kernels with the same pattern can be grouped together for consecutive execution, thereby maximizing instruction-level parallelism. At the hardware level, 4-entry patterns perfectly fit the SIMD architecture in embedded processors, for both CPUs and GPUs.
The selection of appropriate patterns for a kernel can be achieved via search through an extended ADMM-based framework,15 which can be sped up via a composability-based method as we will discuss.
The method can be used together with connectivity pruning, which cuts the connections between certain input and output channels, to achieve even higher weight pruning/acceleration rates.
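Connectivity pruning can be pictured as removing whole kernels, that is, the link between one input channel and one output filter. The sketch below assumes a simple criterion (drop the kernels with the smallest squared L2 norm); that criterion, the function name, and the use of None to mark a cut connection are illustrative assumptions rather than the published procedure.

```python
from typing import List, Optional

Kernel = List[float]


def connectivity_prune(
    layer: List[List[Kernel]], keep_fraction: float
) -> List[List[Optional[Kernel]]]:
    """Cut the weakest input-to-output connections of one CONV layer.

    layer[f][c] is the kernel connecting input channel c to output filter f.
    Kernels outside the top `keep_fraction` by squared L2 norm become None.
    """
    scored = [
        (sum(w * w for w in k), f, c)
        for f, row in enumerate(layer)
        for c, k in enumerate(row)
    ]
    n_keep = max(1, round(len(scored) * keep_fraction))
    kept = {(f, c) for _, f, c in sorted(scored, reverse=True)[:n_keep]}
    return [
        [k if (f, c) in kept else None for c, k in enumerate(row)]
        for f, row in enumerate(layer)
    ]
```

Kernels that survive this step can then still be reduced to a 4-entry pattern, which is how the two forms of pruning combine.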
Figure 3 shows the overview of the internal workflow of CoCo-Gen.18 After pattern-based training performs kernel pattern and connectivity pruning, execution code generation performs multiple pattern-based optimizations. Similar to other DNN compilers (for example, TVM2), CoCo-Gen converts DNN models into computational graphs and applies multiple graph-based optimizations. It works on a fine-grained DNN layerwise representation (LR), which captures the kernel patterns and tuning-related information. It employs an optimization, filter kernel reorder, to address two challenges of pattern-based pruning (heavy control-flow instructions, and thread divergence and load imbalance) by grouping the filters with similar lengths and patterns together. It stores DNN models in a novel format, compressed weight storage, specifically designed for the kernel patterns and connectivity pruning, which yields a much better compression rate than the conventional compressed sparse row (CSR) format. It further uses a register-level optimization, load redundancy elimination, to maximize memory performance. In sum, by allowing compilers to treat pruned kernels as special patterns, compression-compilation co-design unleashes the power of compiler optimizations for best matching DNN models with the underlying hardware (see Niu et al.18 for more details).
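The idea behind filter kernel reorder can be sketched as two sorts: kernels inside a filter are ordered by pattern, and filters are ordered by how many kernels they retain and by their pattern sequence. The data layout below (a list of (input_channel, pattern_id) pairs per filter) and the exact sort keys are assumptions made for illustration; the real implementation operates on CoCo-Gen's layerwise representation.

```python
from typing import List, Tuple

KernelRef = Tuple[int, int]  # (input_channel, pattern_id), hypothetical layout


def filter_kernel_reorder(
    filters: List[List[KernelRef]],
) -> List[Tuple[int, List[KernelRef]]]:
    """Group similar work together after pattern and connectivity pruning.

    Returns (original_filter_index, reordered_kernels) pairs in the new
    execution order.
    """
    # Within each filter, place kernels with the same pattern side by side,
    # so one specialized code path runs consecutively without branching.
    reordered = {
        f: sorted(kernels, key=lambda k: (k[1], k[0]))
        for f, kernels in enumerate(filters)
    }
    # Across filters, place filters with equal length and the same pattern
    # sequence next to each other, so neighboring threads get similar loads.
    order = sorted(
        reordered,
        key=lambda f: (len(reordered[f]), [p for _, p in reordered[f]]),
    )
    return [(f, reordered[f]) for f in order]


layout = filter_kernel_reorder(
    [[(0, 2), (1, 0), (2, 2)], [(0, 1)], [(1, 0), (2, 2), (3, 2)], [(2, 1)]]
)
```

The original filter indices are kept in the result so that output channels can be written back to their correct positions.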
CoCo-Tune: A compiler framework for fast pruning. Finding the best set of filters or connectivities to prune, as in the pattern-based pruning described in the previous section, can be very time consuming. For a DNN with W filters, the entire configuration space of pruned networks can be as large as 2^|W|, even if only filter pruning is considered (adding pattern variations would worsen the complexity further). It often takes hours to evaluate just one configuration (that is, training the pruned network and then testing it).
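A quick back-of-the-envelope calculation shows why exhaustive search is out of reach. The sketch assumes, purely for illustration, one hour per configuration and a single machine; both numbers are placeholders, not measurements from the article.

```python
def configuration_count(num_filters: int) -> int:
    """Each filter is either kept or pruned, giving 2^|W| configurations."""
    return 2 ** num_filters


def exhaustive_search_years(num_filters: int, hours_per_config: float = 1.0) -> float:
    """Years needed to evaluate every configuration sequentially.

    hours_per_config = 1.0 is an assumed placeholder value.
    """
    hours = configuration_count(num_filters) * hours_per_config
    return hours / (24 * 365)


# Even a toy network with only 20 filters would take over a century.
years_for_20_filters = exhaustive_search_years(20)
```

Real CNNs have thousands of filters, so practical pruning has to restrict itself to a promising subspace and make each evaluation cheaper.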
CoCo-Tune is a compiler-based framework that shortens the time by up to 180×. Its inspiration comes from software engineering. More specifically, it explores composability, a property (fundamental in software engineering) that we discovered in the training of a collection of pruned CNN models. The basic observation is that two CNN networks in the promising subspace often differ in only some layers, and the training results of the common layers can be reused across networks to save some training time. More generally, CoCo-Tune views the networks to search as compositions of a set of building blocks (a block is a sequence of CNN layers). It pretrains (some of) these building blocks and assembles them into the to-be-explored networks.
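The reuse idea can be sketched with a cache keyed by block position and block configuration. Here a network is a tuple of per-layer pruning rates, a block is a fixed-size slice of layers, and train_block is a hypothetical callback standing in for the actual pretraining step; all three are simplifying assumptions, and the real framework also fine-tunes each assembled network.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

BlockKey = Tuple[int, Tuple[Hashable, ...]]


def assemble_networks(
    networks: Sequence[Sequence[Hashable]],
    block_size: int,
    train_block: Callable[[BlockKey], object],
) -> Tuple[List[List[object]], int]:
    """Build every candidate network from pretrained blocks, training each
    distinct block only once.

    Returns the assembled networks and the number of blocks actually trained.
    """
    cache: Dict[BlockKey, object] = {}
    assembled = []
    for net in networks:
        blocks = []
        for start in range(0, len(net), block_size):
            key = (start, tuple(net[start:start + block_size]))
            if key not in cache:
                cache[key] = train_block(key)  # the expensive step
            blocks.append(cache[key])
        assembled.append(blocks)
    return assembled, len(cache)


candidates = [(0.3, 0.3, 0.5, 0.5), (0.3, 0.3, 0.7, 0.5), (0.5, 0.3, 0.5, 0.5)]
nets, trained = assemble_networks(candidates, 2, lambda key: f"block{key}")
```

In this toy run, three networks of two blocks each need only four block trainings instead of six, because two blocks are shared.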
To identify the best set of building blocks to pretrain, CoCo-Tune uses a novel algorithm. It represents all layers of all to-be-explored networks as a sequence of symbols, applies the hierarchical compression algorithm Sequitur17 to produce a context-free grammar (CFG), and uses that grammar to quickly find the most reusable building blocks.
We integrate the techniques into a compiler-based framework, CoCo-Tune, which, for an arbitrary CNN (in Caffe Prototxt format) and other inputs, automatically generates TensorFlow code to build teacher-student learning structures to materialize composability-based CNN pruning (see Guan et al.8 for more details).
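The goal of this step, finding layer sequences shared by many candidate networks, can be illustrated with a much simpler substitute for Sequitur: brute-force counting of repeated contiguous subsequences. The sketch below is explicitly not Sequitur and does not build a grammar; the symbol encoding (one string per layer-and-pruning-rate combination) and the savings score are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def most_reusable_blocks(
    networks: Sequence[Sequence[str]], min_len: int = 2, top: int = 3
) -> List[Tuple[Tuple[str, ...], int]]:
    """Rank contiguous layer sequences by how much training they could save.

    Each symbol encodes one layer with one pruning setting. A block shared
    by n networks is scored (n - 1) * len(block): the layer trainings
    avoided if the block is pretrained once and reused.
    """
    counts: Counter = Counter()
    for net in networks:
        # Count each distinct block at most once per network.
        blocks = {
            tuple(net[i:j])
            for i in range(len(net))
            for j in range(i + min_len, len(net) + 1)
        }
        counts.update(blocks)
    scored = [((n - 1) * len(b), b) for b, n in counts.items() if n > 1]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(block, saved) for saved, block in scored[:top]]


ranking = most_reusable_blocks(
    [["a1", "b1", "c1", "d1"], ["a1", "b1", "c2", "d1"], ["a1", "b1", "c1", "d2"]]
)
```

This enumeration is quadratic in network depth per network, which is why a linear-time grammar-inference algorithm is attractive when the number of candidate networks is large.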
Evaluation and demos. Results on DNNs: We evaluate CoCo-Gen on a Samsung Galaxy S10 cell phone with the latest Qualcomm Snapdragon 855 mobile platform, which consists of a Qualcomm Kryo 485 octa-core CPU and a Qualcomm Adreno 640 GPU. Six representative DNNs are used in this evaluation: VGG-16 (VGG), ResNet-50 (RNT), and MobileNet-V2 (MBNT), each trained on two datasets, ImageNet and CIFAR-10. The accompanying table characterizes these DNNs and lists the number of pruning patterns and the loss of prediction accuracy caused by our pruning. Figure 4 shows the CPU and GPU performance of CoCo-Gen compared to TFLite,6 TVM,2 and MNN.1 CoCo-Gen outperforms all other frameworks in all cases. On CPU, CoCo-Gen achieves 12× to 44.5× speedup over TFLite, 2.3× to 8.1× over TVM, and 1.9× to 15.5× over MNN. On GPU, CoCo-Gen achieves 2.5× to 20×, 4.1× to 11.4×, and 2.5× to 6.2× speedup over TFLite, TVM, and MNN, respectively. For the largest DNN (VGG) and largest dataset (ImageNet), CoCo-Gen completes CONV layers on a single input within 18.9ms on GPU, meeting the real-time requirement (usually 30 frames/sec, that is, 33ms/frame).
In terms of energy consumption, CoCo-Gen uses 8.6× less energy than TVM. The power consumption rate of the entire mobile device under CoCo-Gen is 4.1W, slightly higher than the 3.8W under TVM executions (tested with the Qualcomm Trepn power profiler), but its 9.3× shorter execution time leads to the large savings in energy.
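The energy figure follows from energy = power × time. A small check using only the numbers quoted above (4.1W, 3.8W, and the 9.3× shorter execution time) reproduces the 8.6× saving; the function name is just a label for this illustration.

```python
def energy_saving_factor(
    power_new_w: float, power_old_w: float, time_speedup: float
) -> float:
    """How many times less energy the new system uses than the old one.

    Energy is power multiplied by time, so
    E_old / E_new = (P_old / P_new) * (t_old / t_new).
    """
    return (power_old_w / power_new_w) * time_speedup


# CoCo-Gen at 4.1 W versus TVM at 3.8 W, with 9.3x shorter execution time.
saving = energy_saving_factor(4.1, 3.8, 9.3)
```

The slightly higher power draw costs about 8 percent, which the much shorter run time more than offsets.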
The results also consistently outperform a number of ASIC and FPGA solutions in both performance and energy efficiency. Figure 5 compares performance and energy efficiency against special ASIC hardware, including Google's cloud TPU-V2 and edge TPU,7 NVIDIA Jetson AGX Xavier, Cambricon MLU-100, Eyeriss,3 and others, and compares accuracy and energy efficiency against the FPGA solution ESE9 (FPGA 2017 Best Paper Award) from DeePhi. The comparisons are on the same network models, and weight quantization is not used in the CoCo-Gen solution (Eyeriss and ESE use 12-bit fixed-point quantization).
The better results of CoCo-Gen come from three reasons. First, the compression-compilation co-design more effectively matches models with hardware. Second, smartphone chips are built with the most advanced technology (for example, 7nm, 11nm technology), while FPGA/ASIC solutions are based on older and less energy-efficient 28nm or 40nm technologies. Third, current ASIC/FPGA solutions are often optimized for a specific DNN type/size (for example, edge TPU for small-scale DNNs, Cambricon MLU-100 for large-scale DNNs), while CoCo-Gen, as a software method, can better adapt to the networks.
Real application demos: We also demonstrate the efficacy of CoCo-Gen through three interesting and key DNN applications: style transfer,5 DNN coloring,11 and super resolution.4 The style transfer model is based on a generative network23 trained on Microsoft COCO.14 DNN coloring uses the Places scene24 dataset to train a novel architecture that can jointly extract and fuse global and local features to perform the final colorization. The super resolution model mainly utilizes residual blocks with wider activation and linear low-rank convolution21 trained on the DIV2K19 dataset. With structured pruning and compiler optimization, we implement the models on a Samsung Galaxy S10 mobile phone. Video demos show that our implementations achieve real-time inference on an off-the-shelf mobile device.
Figure 6 shows sample inputs and outputs of the three applications. According to our study,18 CoCo-Gen on unpruned models can already outperform existing DNN frameworks (for example, TVM) in speed thanks to its advanced optimizations (for example, operator replacement and SIMD optimizations). When combined with pattern-based pruning, CoCo-Gen produces 4.2×, 3.6×, and 3.7× extra speedups on style transfer, coloring, and super resolution, respectively. An inference in each of the applications completes within 75ms, demonstrating the promise of real-time performance of complex DNN applications on off-the-shelf devices. (Please see video demos of the applications on the CoCoPIE YouTube channel.)
This article has introduced the concept of compression-compilation co-design and how it is materialized into a software framework CoCoPIE for real-time AI on mobile devices. The promising progress opens up many potential directions for future development. We briefly discuss two of them:
The first is to expand the scope of the co-design-based optimizations. So far, the principle of compression-compilation co-design has been focused on DNN models. Besides DNN, a real-world AI application often includes a lot of other parts, such as data collection, data preprocessing, the use of the DNN prediction in follow-up operations, and so on. Even though DNN may play an important role in the overall application, its optimizations may not be sufficient for the entire application to meet users' needs. So, an important direction is on how to generalize the co-design principle into holistic optimizations to the entire AI-based applications.
The second is to increase the applicability of the co-design-based optimizations. This direction relates to privacy and security. As these are two important factors in the construction and deployment of many AI models, how to integrate them into the co-design process is worth pursuing. For instance, model pruning typically requires access to both the models and the training datasets, but there are scenarios where datasets may not be accessible to the model optimizer due to either privacy policies or artificial boundaries among corporations. Effective ways to circumvent these roadblocks could expand the applicability of the optimizations.22 This direction also relates to the way the optimization framework delivers its service (for example, standalone software versus cloud-based service).
In summary, the research reported in this article has provided strong evidence of the promise of the co-design principle, indicating it is possible to instill AI directly on existing commodity computing devices while offering even higher speeds and better energy efficiency than special AI accelerating hardware does. The results open new opportunities for democratizing AI capability on end devices, while cautioning against the broad perception on the indispensability of special AI hardware for real-time AI on end devices. We believe these results will prompt the industry to reexamine the directions and strategies on the pursuit of mobile AI.
Figure. Watch the authors discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/cocopie
3. Chen, Y., Krishna, T., Emer, J., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In Proceedings of IEEE Intern. Solid-State Circuits Conf. Digest of Technical Papers, 2016, 262–263.
7. Google Cloud TPU. Google cloud TPU, 2017; https://cloud.google.com/tpu/
11. Iizuka, S., Simo-Serra, E., and Ishikawa, H. Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graphics 35, 4 (July 2016).
14. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, L. Microsoft coco: Common objects in context. In Proceedings in European Conf. on Computer Vision. Springer, 2014, 740–755.
19. Timofte, R., Agustsson, E., Gool, L., Yang, M., and Zhang, L. Ntire challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conf. Computer Vision and Pattern Recognition Workshops, 2017, 114–125.
24. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, 2014, 487–495.
Contact email@example.com for details on the technology being commercialized through CoCoPIE LLC.
©2021 ACM 0001-0782/21/6