
CoCoPIE: Enabling Real-Time AI on Off-the-Shelf Mobile Devices via Compression-Compilation Co-Design

A new framework allows intelligence on mainstream end devices without special hardware.
  1. Introduction
  2. Key Insights
  3. Compression-Compilation Co-Design: The Concept
  4. CoCoPIE
  5. Conclusion and Future Work
  6. References
  7. Authors
  8. Footnotes

Many believe the company that enables real intelligence on end devices (such as mobile and IoT devices) will define the future of computing. Racing toward this goal, many companies, whether tech giants such as Google, Microsoft, Amazon, Apple, and Facebook or startups, spend tens of billions of dollars each year on R&D.


Key Insights

  • Before eagerly pursuing hardware for real-time AI on end devices, it is worth having a closer look at whether mainstream end devices are sufficient.
  • For real-time AI, there is still a large potential to tap into on mainstream end devices.
  • Through innovative compression-compiler co-design, the CoCoPIE framework has demonstrated a promising approach to unlocking the potential.

Assuming hardware is the major constraint for enabling real-time mobile intelligence, many companies dedicate their main efforts to developing specialized hardware accelerators for machine learning and inference. Billions of dollars have been spent to fuel this intelligent hardware race.

This article challenges the view. Drawing on CoCoPIE, a recent real-time AI optimization framework, it maintains that with effective compression-compiler co-design, a large potential is still untapped for enabling real-time artificial intelligence (AI) on mainstream end devices.

The principle of compression-compilation co-design is to design the compression of deep learning models and their compilation to executables in a hand-in-hand manner. This synergistic method can effectively optimize both the size and speed of deep learning models, and also can dramatically shorten the tuning time of the compression process, largely reducing the time to the market of AI products. When applied to models running on mainstream end devices, the method can produce real-time experience across a set of AI applications that had been broadly perceived possible only with special AI accelerators.

Foregoing the need for special hardware for real-time AI has some profound implications, thanks to the multifold advantages of mainstream processors over special hardware:

  • Time to market: Special hardware often takes multiple years before it reaches the market. The creation of the associated compiler and system software further lengthens the process. Applications using such hardware often need to use special APIs and meet many special constraints (for example, tiling computations to a certain size), which lengthens the time to market of AI products.
  • Cost: Developing a special ASIC processor is costly, and adding it into existing systems incurs extra expenses.
  • Technology maturity: Unlike general-purpose processors, special hardware has a much smaller production volume; the technology available for its production is hence usually several generations behind that of general-purpose processors. Most AI accelerators, for instance, are based on 28nm to 65nm CMOS technology, with a transistor density over 10× lower than that of state-of-the-art mobile CPUs or GPUs.
  • Speed: As a consequence of the old technology, special processors run much slower than general-purpose processors do.
  • Eco-system: General-purpose processors have a well-developed eco-system (debugging tools, optimization tools, security measures), which makes the development of high-quality applications much easier than on special processors.
  • Adoption: For all these reasons, the adoption of a special processor is usually limited to the company that creates it and its few close customers. As a result, an AI application developed for the processor can be adopted by a limited number of devices.

These reasons suggest that whenever mainstream processors can meet the speed and efficiency requirements of an AI application, they should be the preferred device to consider. The industry, however, currently puts much emphasis on special AI hardware development, based on the assumed insufficiency of mainstream processors in meeting real-time requirements. In the rest of this article, we explain why the assumption is largely biased when compression-compilation co-design is used, how the principle can be materialized effectively into a practical framework, CoCoPIE, and the implications for the AI industry.


Compression-Compilation Co-Design: The Concept

Compression and compilation are the two key steps in fitting a deep learning model onto hardware for efficient execution. Model compression is a common technique for reducing the size and improving the speed of deep learning models. Compression techniques fall into two categories, pruning and quantization. Pruning removes layers, convolution filters, or channels, while quantization reduces the precision of parameters (for example, from floating-point to short integer). Compilation refers to the process of generating executable code from a given deep learning model. It is, in essence, a process of mapping the high-level operations in deep learning to the low-level instructions that the underlying hardware supports. The process plays a critical role in optimizing the code for efficient execution.
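The two compression categories described above can be illustrated with a minimal sketch. It is not taken from CoCoPIE: the function names, the L1-norm ranking rule, the symmetric 8-bit scheme, and the toy weights are all assumptions chosen for illustration.

```python
def quantize_int8(weights):
    """Symmetric uniform quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def prune_filters(filters, keep):
    """Structured pruning: keep only the `keep` filters with the largest L1 norm."""
    norms = [sum(abs(w) for w in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    # Preserve the original filter order among the survivors.
    return [filters[i] for i in sorted(ranked[:keep])]


# Hypothetical toy layer: three filters with two weights each.
filters = [[0.50, -1.27], [0.03, -0.02], [0.90, 0.64]]

pruned = prune_filters(filters, keep=2)      # drops the weakest filter
flat = [w for f in pruned for w in f]
quantized, scale = quantize_int8(flat)       # floats -> 8-bit integer range
```

Pruning shrinks the number of parameters, while quantization shrinks the storage per parameter; the two are independent and can be combined, as the sketch does.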

The principle of compression-compilation co-design is to design the two components for AI in a hand-in-hand manner. The synergy may exhibit at three levels:

  • Demands/Preferences Level: At this level, the synergy is on taking the preferences or demands of one component into consideration when designing the other component. An example is pattern-based pruning, which creates a regular computation pattern that the compilation step can effectively map to vector units or GPUs.
  • Perspective/Insight Level: At this level, the synergy is on taking the perspective or insights in the domain of one component when treating the problems in the domain of the other component. An example is composability-based pruning, which generalizes the principle of composability or modularity in programming systems into a novel DNN model pruning method to dramatically reduce the needed computations.
  • Methodology Level: At this level, the synergy is on closely integrating the methodology of the two components together. We will illustrate this synergy through a compiler framework that automatically generates code to enable a new way of deep learning pruning, which speeds the process by up to 180×.

All the examples mentioned are part of a software framework for mobile AI named CoCoPIE. We will next give an overview of CoCoPIE based on previous publications,8,18 and then use each of its main components to explain the compression-compilation co-design principle and the significant benefits.



CoCoPIE

CoCoPIE stands for Compression-Compilation co-design for Performance, Intelligence, and Efficiency. It is a software framework we have recently put together that holds numerous records on real-time AI on mainstream end devices in both performance and energy efficiency.

CoCoPIE consists of two main components, both of which reflect the compression-compilation co-design principle: CoCo-Gen generates efficient DNN execution code via a synergy of pattern-based DNN pruning and pattern-aware code generation; CoCo-Tune dramatically shortens the process of identifying the appropriate set of DNN parameters to prune through a composability-based compiler framework. Here, we explain each of the two components and how compression-compilation co-design makes them possible.

CoCo-Gen: Pattern-based pruning and code generation. Weight pruning reduces the number of weights in a DNN. As shown in Figure 1, prior weight pruning methods fall into two categories: general, non-structured pruning, where arbitrary weights can be pruned; and structured pruning, which prunes filters or channels in a way that produces regular and smaller weight matrices. Because regular structures fit GPU/CPU execution better, DNNs from regular compression have shown better speeds than those from irregular compression,10,16,20 but they are subject to more notable accuracy loss.10,20

Figure 1. (a) Non-structured weight pruning and (b) two types of structured weight pruning.

We introduce a new method, pattern-based pruning, which features fine-grained pruning patterns inside coarse-grained structures.

Figure 2 illustrates the basic idea of pattern-based pruning. For each kernel (in a CONV filter), a fixed number of weights are pruned, and the remaining weights (white cells) form specific “patterns.” We define the example in Figure 2 as 4-entry pattern pruning, since every kernel retains four non-zero weights out of the original 3 × 3 kernel (the most commonly used kernel size). The method can generalize to other kernel sizes and to fully connected layers. Each kernel has the flexibility to choose among a number of predefined patterns.
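As a rough illustration of 4-entry pattern pruning, the sketch below picks, for one 3 × 3 kernel, the predefined pattern that preserves the most weight magnitude. The four patterns listed and the magnitude-based selection rule are assumptions made for this example; the article's own selection uses an extended ADMM-based search.

```python
# Hypothetical pattern library: each pattern keeps 4 of the 9 positions of a
# 3x3 kernel (row-major indices 0..8). All four include the center (index 4).
PATTERNS = [
    (0, 1, 3, 4),
    (1, 2, 4, 5),
    (3, 4, 6, 7),
    (4, 5, 7, 8),
]


def choose_pattern(kernel):
    """Return the id of the pattern that retains the largest total magnitude."""
    return max(range(len(PATTERNS)),
               key=lambda p: sum(abs(kernel[i]) for i in PATTERNS[p]))


def apply_pattern(kernel, pattern_id):
    """Zero every weight outside the chosen pattern."""
    keep = set(PATTERNS[pattern_id])
    return [w if i in keep else 0.0 for i, w in enumerate(kernel)]


kernel = [0.1, 0.9, 0.8,
          0.0, 0.7, 0.6,
          0.2, 0.1, 0.0]

pattern_id = choose_pattern(kernel)
pruned_kernel = apply_pattern(kernel, pattern_id)
```

Because every kernel ends up with exactly four non-zero weights at positions drawn from a small known set, a compiler only needs to record the pattern id per kernel rather than arbitrary sparse indices.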

Figure 2. Illustration of (a) kernel pattern pruning on CONV kernels, and (b) connectivity pruning by removing kernels.

At the theory and algorithm levels, such patterns exhibit similarities to the connection structure in human visual systems.12,13,15 At the compiler level, the known patterns allow a compiler to re-order and generate code at the filter and kernel level such that kernels with the same pattern can be grouped together for consecutive executions, thereby maximizing instruction-level parallelism. At the hardware level, 4-entry patterns perfectly fit the SIMD architecture in embedded processors, for both CPUs and GPUs.

The selection of appropriate patterns for a kernel can be achieved via search through an extended ADMM-based framework,15 which can be sped up via a composability-based method as we will discuss.

The method can be used together with connectivity pruning, which cuts the connections between certain input and output channels, to achieve even higher weight pruning/acceleration rates.

Figure 3 shows an overview of the internal workflow of CoCo-Gen.18 After pattern-based training performs kernel pattern and connectivity pruning, execution code generation performs multiple pattern-based optimizations. Similar to other DNN compilers (for example, TVM2), CoCo-Gen converts DNN models into computational graphs and applies multiple graph-based optimizations. It works on a fine-grained DNN layerwise representation (LR), which captures the kernel patterns and tuning-related information. It employs an optimization, filter kernel reorder, to address two challenges of pattern-based pruning (heavy control-flow instructions, and thread divergence and load imbalance) by grouping the filters with similar lengths and patterns together. It stores DNN models in a novel form, compressed weight storage, specifically designed for the kernel patterns and connectivity pruning, yielding a much better compression rate than the conventional compressed sparse row (CSR) format does. It further uses a register-level optimization, load redundancy elimination, to maximize memory performance. In sum, by allowing compilers to treat pruned kernels as special patterns, the compression-compilation co-design unleashes the power of compiler optimizations for best matching DNN models with the underlying hardware (see Niu et al.18 for more details).
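The grouping idea behind filter kernel reorder can be sketched as two sorts. This is a simplified illustration, not CoCo-Gen's implementation: the data layout (each filter as a list of hypothetical `(input_channel, pattern_id)` pairs for the kernels that survive connectivity pruning) and the exact sort keys are assumptions.

```python
def reorder(filters):
    """Group kernels by pattern within each filter, and filters by length.

    `filters[f]` lists the surviving kernels of filter f as
    (input_channel, pattern_id) pairs. Returns (original_filter_index,
    reordered_kernels) pairs so results can be mapped back after execution.
    """
    # Within a filter: kernels sharing a pattern become adjacent, so the same
    # branch-free code path can run over them consecutively.
    grouped = [sorted(kernels, key=lambda k: (k[1], k[0])) for kernels in filters]
    # Across filters: similar lengths become adjacent, so threads that process
    # neighboring filters receive similar amounts of work.
    order = sorted(range(len(filters)), key=lambda f: (len(grouped[f]), f))
    return [(f, grouped[f]) for f in order]


# Hypothetical layer with four filters of uneven length after pruning.
filters = [
    [(0, 2), (1, 0), (3, 2)],
    [(2, 1)],
    [(0, 0), (1, 1), (2, 0)],
    [(1, 3)],
]

reordered = reorder(filters)
```

Keeping the original filter index alongside each reordered filter is what lets the output channels be written back to their correct positions.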

Figure 3. Overview of CoCo-Gen acceleration framework.

CoCo-Tune: A compiler framework for fast pruning. Finding the best set of filters or connectivities to prune, as required by the pattern-based pruning in the previous section, can be very time consuming. For a DNN with W filters, the entire configuration space of the pruned network can be as large as 2^|W|, even if only filter pruning is considered (adding pattern variations would worsen the complexity further). It often takes hours to evaluate just one configuration (that is, training the pruned network and then testing it).

CoCo-Tune is a compiler-based framework that shortens the time by up to 180×. Its inspiration comes from software engineering. More specifically, it explores composability, a property (fundamental in software engineering) that we discovered in the training of a collection of pruned CNN models. The basic observation is that two CNNs in the promising subspace often differ in only some layers, and the training results of the common layers can be reused across networks to save some training time. More generally, CoCo-Tune views the networks to search as compositions of a set of building blocks (a block is a sequence of CNN layers). It pretrains (some of) these building blocks and assembles them into the to-be-explored networks.

To identify the best set of building blocks to pretrain, CoCo-Tune uses a novel algorithm that represents all layers of all to-be-explored networks as a sequence of symbols, applies the hierarchical compression algorithm Sequitur17 to produce a context-free grammar (CFG), and uses the grammar to quickly find the most reusable building blocks.
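To make the notion of a "most reusable building block" concrete, the sketch below counts how many candidate networks share each contiguous run of layers. It deliberately replaces Sequitur and its grammar with brute-force enumeration, which is only practical for tiny inputs; the layer symbols and the scoring rule (layers of training saved by reuse) are assumptions for illustration.

```python
from collections import Counter


def block_counts(networks, min_len=2):
    """Count, for each contiguous block of layer symbols, how many networks contain it."""
    counts = Counter()
    for layers in networks:
        seen = set()
        for start in range(len(layers)):
            for end in range(start + min_len, len(layers) + 1):
                seen.add(tuple(layers[start:end]))
        counts.update(seen)  # each network contributes at most once per block
    return counts


def most_reusable_block(networks, min_len=2):
    """Pick the block whose pretraining saves the most layers of training.

    A block of length n shared by k networks is trained once instead of
    k times, saving (k - 1) * n layer trainings.
    """
    counts = block_counts(networks, min_len)
    return max(counts, key=lambda b: ((counts[b] - 1) * len(b), b))


# Hypothetical search space: each symbol is a layer plus its pruning setting,
# for example "b1" = layer b pruned with option 1.
networks = [
    ["a1", "b1", "c1", "d2"],
    ["a1", "b1", "c1", "d1"],
    ["a2", "b1", "c1", "d1"],
]

best = most_reusable_block(networks)
```

In this toy input the two-layer block shared by all three networks wins over longer blocks shared by only two, which mirrors the trade-off between block length and reuse frequency.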

We integrate the techniques into a compiler-based framework, CoCo-Tune, which, for an arbitrary CNN (in Caffe Prototxt format) and other inputs, automatically generates TensorFlow code to build teacher-student learning structures to materialize composability-based CNN pruning (see Guan et al.8 for more details).

Evaluation and demos. Results on DNNs: We evaluate CoCo-Gen on a Samsung Galaxy S10 cell phone with the latest Qualcomm Snapdragon 855 mobile platform, which consists of a Qualcomm Kryo 485 octa-core CPU and a Qualcomm Adreno 640 GPU. Six representative DNNs are used in this evaluation: VGG-16 (VGG), ResNet-50 (RNT), and MobileNet-V2 (MBNT), each trained on two datasets, ImageNet and CIFAR-10. The accompanying table characterizes these DNNs and lists the number of pruning patterns and the loss of prediction accuracy caused by our pruning. Figure 4 shows the CPU and GPU performance of CoCo-Gen compared to TFLite,6 TVM,2 and MNN.1 CoCo-Gen outperforms all other frameworks in all cases. On CPU, CoCo-Gen achieves 12× to 44.5× speedup over TFLite, 2.3× to 8.1× over TVM, and 1.9× to 15.5× over MNN. On GPU, CoCo-Gen achieves 2.5× to 20×, 4.1× to 11.4×, and 2.5× to 6.2× speedup over TFLite, TVM, and MNN, respectively. For the largest DNN (VGG) and largest dataset (ImageNet), CoCo-Gen completes CONV layers on a single input within 18.9ms on GPU, meeting the real-time requirement (usually 30 frames/sec, that is, 33ms/frame).

Table. DNNs characteristics (under kernel pattern and connectivity pruning).

Figure 4. Performance comparison: x-axis: different trained DNN models; y-axis: average DNN inference execution time on a single input.

In terms of energy consumption, CoCo-Gen uses 8.6× less energy than TVM. The power consumption rate of the entire mobile device under CoCo-Gen is 4.1W, slightly higher than the 3.8W of TVM executions (tested by Qualcomm Trepn power profiler), but its 9.3× shorter execution time leads to the large savings in energy.
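The 8.6× figure follows from energy being power multiplied by time. The short check below reproduces it from the numbers quoted above; the function and variable names are illustrative only.

```python
def energy_ratio(baseline_power_w, new_power_w, speedup):
    """Ratio of baseline energy to new energy, where energy = power * time.

    If the new system runs `speedup` times faster, then
    E_base / E_new = (P_base * speedup * t) / (P_new * t).
    """
    return (baseline_power_w * speedup) / new_power_w


# Figures quoted in the text: TVM draws 3.8W, CoCo-Gen draws 4.1W,
# and CoCo-Gen's execution time is 9.3x shorter.
ratio = energy_ratio(baseline_power_w=3.8, new_power_w=4.1, speedup=9.3)
print(f"{ratio:.1f}x less energy")
```

The slightly higher power draw costs about 8% (4.1W versus 3.8W), which is why the energy saving is a little below the 9.3× speedup.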

The results also consistently outperform a number of ASIC and FPGA solutions in both performance and energy efficiency. Figure 5 compares performance and energy efficiency against special ASIC hardware, including Google’s cloud TPU-V2 and edge TPU,7 NVIDIA Jetson AGX Xavier, Cambricon MLU-100, Eyeriss,3 and others, and compares accuracy and energy efficiency against the FPGA solution ESE9 (FPGA 2017 Best Paper Award) from DeePhi. The comparisons are on the same network models, and weight quantization is not used in the CoCo-Gen solution (Eyeriss and ESE use 12-bit fixed-point quantizations).

Figure 5. Comparison with existing ASIC and FPGA solutions.

The better results of CoCo-Gen come from three reasons. First, the compression-compilation co-design more effectively matches models with hardware. Second, smartphone chips are built with the most advanced technology (for example, 7nm, 11nm technology), while FPGA/ASIC solutions are based on older and less energy-efficient 28nm or 40nm technologies. Third, current ASIC/FPGA solutions are often optimized for a specific DNN type/size (for example, edge TPU for small-scale DNNs, Cambricon MLU-100 for large-scale DNNs), while CoCo-Gen, as a software method, can better adapt to the networks.

Real application demos: We also demonstrate the efficacy of CoCo-Gen through three interesting and key DNN applications: style transfer,5 DNN coloring,11 and super resolution.4 The style transfer model is based on a generative network23 trained on Microsoft COCO.14 DNN coloring uses the Places scene24 dataset to train a novel architecture that can jointly extract and fuse global and local features to perform the final colorization. The super resolution model mainly utilizes residual blocks with wider activation and linear low-rank convolution21 trained on the DIV2K19 dataset. With structured pruning and compiler optimization, we implement the models on a Samsung Galaxy S10 mobile phone. We demonstrate with video demos that our implementations are able to achieve real-time inference on an off-the-shelf mobile device.

Figure 6 shows sample inputs and outputs of the three applications. According to our study,18 CoCo-Gen on unpruned models can already outperform existing DNN frameworks (for example, TVM) in speed thanks to its advanced optimizations (for example, operator replacement and SIMD optimizations). When combined with pattern-based pruning, CoCo-Gen produces 4.2×, 3.6×, and 3.7× extra speedups on style transfer, coloring, and super resolution, respectively. An inference in each of the applications can complete within 75ms, demonstrating the promise of real-time performance of complex DNN applications on off-the-shelf devices. (Please see video demos of the applications on the CoCoPIE YouTube channel.)

Figure 6. Examples of style transfer, coloring, and super resolution implemented on our mobile device.


Conclusion and Future Work

This article has introduced the concept of compression-compilation co-design and how it is materialized into a software framework CoCoPIE for real-time AI on mobile devices. The promising progress opens up many potential directions for future development. We briefly discuss two of them:

The first is to expand the scope of the co-design-based optimizations. So far, the principle of compression-compilation co-design has been focused on DNN models. Besides the DNN, a real-world AI application often includes many other parts, such as data collection, data preprocessing, the use of the DNN prediction in follow-up operations, and so on. Even though the DNN may play an important role in the overall application, its optimizations may not be sufficient for the entire application to meet users’ needs. So, an important direction is how to generalize the co-design principle into holistic optimizations of entire AI-based applications.

The second is to increase the applicability of the co-design-based optimizations. This direction relates to privacy and security. As they are two important factors in many AI model constructions and deployments, how to integrate them into the co-design process is worth pursuing. For instance, model pruning typically requires access to both the models and the training datasets, but there are scenarios where datasets may not be accessible to the model optimizer due to either privacy policies or artificial boundaries among corporations. Effective ways to circumvent these roadblocks could expand the applicability of the optimizations.22 This direction also relates to the way the optimization framework delivers its service (for example, standalone software versus cloud-based service).

In summary, the research reported in this article has provided strong evidence of the promise of the co-design principle, indicating it is possible to instill AI directly on existing commodity computing devices while offering even higher speeds and better energy efficiency than special AI accelerating hardware does. The results open new opportunities for democratizing AI capability on end devices, while cautioning against the broad perception of the indispensability of special AI hardware for real-time AI on end devices. We believe these results will prompt the industry to reexamine the directions and strategies in the pursuit of mobile AI.


References

    1. Alibaba. 2019.

    2. Chen, T. et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation, 2018, 578–594.

    3. Chen, Y., Krishna, T., Emer, J., and Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In Proceedings of IEEE Intern. Solid-State Circuits Conf. Digest of Technical Papers, 2016, 262–263.

    4. Dong, C., Loy, C., He, K., and Tang, X. Learning a deep convolutional network for image super-resolution. In European Conf. Computer Vision. Springer, 2014, 184–199.

    5. Gatys, L., Ecker, A., and Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conf. Computer Vision and Pattern Recognition, 2016, 2414–2423.

    6. Google. Tensorflow lite, 2019.

    7. Google Cloud TPU. Google cloud TPU, 2017.

    8. Guan, H., Shen, X., and Lim, S. Wootz: A compiler-based framework for fast CNN pruning via composability. In Proceedings of the Programming Language Design and Implementation, 2019.

    9. Han, S. et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. FPGA, 2017, 75–84.

    10. He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. In Proceedings of the 2017 IEEE Intern. Conf. on Computer Vision. 2017, 1398–1406.

    11. Iizuka, S., Simo-Serra, E., and Ishikawa, H. Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graphics 35, 4 (July 2016).

    12. Lebedev, V. and Lempitsky, V. Fast convnets using group-wise brain damage. In Proceedings of the IEEE Conf. Computer Vision and Pattern Recognition, 2016, 2554–2564.

    13. Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. Pruning filters for efficient convnets. In Proceedings of the Intern. Conf. on Learning Representations, 2017.

    14. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, L. Microsoft coco: Common objects in context. In Proceedings in European Conf. on Computer Vision. Springer, 2014, 740–755.

    15. Ma, X. et al. PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. AAAI, 2020.

    16. Mao, H., Han, S., Pool, J., Li, W., Liu, X., Wang, Y., and Dally, W. Exploring the regularity of sparse structure in convolutional neural networks. 2017; arXiv:1705.08922.

    17. Nevill-Manning, C. and Witten, I. Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res. 7 (1997), 67–82.

    18. Niu, W., Ma, X., Lin, S., Wang, S., Qian, X., Lin, X., Wang, Y., and Ren, B. PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning. ASPLOS, 2020.

    19. Timofte, R., Agustsson, E., Gool, L., Yang, M., and Zhang, L. Ntire challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conf. Computer Vision and Pattern Recognition Workshops, 2017, 114–125.

    20. Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2016, 2074–2082.

    21. Yu, J., Fan, Y., Yang, J., Xu, N., Wang, Z., Wang, X., and Huang, T. Wide activation for efficient and accurate image super-resolution. 2018; arXiv:1808.08718.

    22. Zhan, Z. et al. Priv: A privacy-preserving deep neural network model compression framework. arXiv preprint, 2020.

    23. Zhang, H. and Dana, K. Multi-style generative network for real-time transfer. 2017; arXiv:1703.06953.

    24. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, 2014, 487–495.
