As artificial intelligence becomes a pervasive tool for the billions of Internet of Things (IoT) devices at the edge, the data movement bottleneck imposes severe limitations on the performance and autonomy of these systems. Processing-in-memory (PiM) is emerging as a way of mitigating the data movement bottleneck while satisfying the stringent performance, energy efficiency, and accuracy requirements of edge imaging applications that rely on convolutional neural networks (CNNs).
The globalization of affordable Internet access has spurred a revolution in computer architectures, characterized by the accelerated, widespread adoption of smartphones, tablets, and other smart devices, which are now commonplace.15 The rise of IoT applications in a wide range of domains (for example, personal computing, education, industry, military, healthcare, digital agriculture) brings with it the ability to integrate billions of devices on the Internet, as depicted in Figure 1.
This integration presents unprecedented challenges, such as the need for inexpensive computation and communication, capable of crunching the increasing volumes of data generated every day.
The IoT paradigm also promises a more intimate connection between the cyber and physical worlds, as data becomes a ubiquitous asset exchanged among all manner of connected (smart) devices. Moreover, the data flow is often bidirectional, taking place not just from the physical to the cyber world, but also from the cyber to (what is possible in) the physical world.
Bringing this vision to fruition will require that IoT devices exhibit AI. The most promising approaches today are based on empowering systems with the ability to learn autonomously from experience by assimilating large amounts of data—using machine learning (ML) algorithms—with a particular focus on deep learning and image inference.
The total volume of digital data created, replicated, and consumed within a year surpassed dozens of zettabytes (ZB, or 10^21 bytes) in 2020, and the International Data Corporation estimates that this number will grow to hundreds of ZB in coming years.18 The COVID-19 pandemic contributed to this figure because of widespread work-from-home mandates and a sudden increase in videoconferencing and streaming data. A significant portion of this data is consumed at the edge, often with processing performed entirely on smartphones and embedded systems. The rise of many other AI-based applications applied to the big data revolution exacerbates this problem by placing increasing stress on computing and memory systems, particularly those operating at the edge.
The case for edge computing as the enabler of sustainable AI scaling is strengthened because a sizable portion of the data generated by modern digital systems originates from sensors located at the edge—under this paradigm, data is processed where it is generated, as Figure 2 illustrates. In contrast with the conventional approach shown on the left, with the emergence of cloud-edge hierarchies, AI moves to the edge layer, shown on the right in the figure. This new paradigm creates pressure for more intensive computation on the edge-processing nodes, but it also decreases the time and energy spent communicating with the cloud and introduces data privacy and latency benefits.
To achieve these goals, computer architects and software developers must adopt a holistic vision of the combined cloud+edge system to keep the unnecessary movement of data between components to a minimum by processing data where it is generated and stored, as this is the dominant performance and energy bottleneck.9
At present, neural networks are widely used in many domains and are becoming integral components of other emerging applications, such as self-driving cars, always-on biosignal monitoring, augmented and virtual reality, critical IoT, and voice communication (which represents up to 25% of the use cases of 5G at the edge), all of which require AI algorithms to operate on high volumes of data at the edge (see Figure 3). Digital agriculture, smartphones, and smart IoT devices likewise process substantial volumes of data while running AI kernels at the edge. The systems that support these emerging applications will be expected to make decisions faster (and, often, better) than their human counterparts, with support for continuous fine-tuning of their decision-making by factoring in ever-increasing volumes of data for training and inference.
CNNs are the de facto standard for image-based decision-making tasks. These models make heavy use of the convolution and multiply-and-accumulate (MAC) operations, which represent more than 90% of the total cost of computation.11 For this reason, state-of-the-art neural network accelerators (for example, Google's Tensor Processing Unit) have focused on optimizing the performance and energy efficiency of MAC operations.
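To give a sense of why MAC operations dominate, the MAC count of a single convolutional layer can be estimated from its shape. The sketch below assumes a dense 2D convolution with stride 1 and "same" padding; the function name and the example layer shape are illustrative and are not taken from the article.

```python
def conv_macs(height, width, in_channels, out_channels, kernel_size):
    """MACs for one dense 2D convolution with stride 1 and 'same' padding.

    Each of the height * width * out_channels output values needs
    kernel_size * kernel_size * in_channels multiply-accumulate steps.
    """
    outputs = height * width * out_channels
    macs_per_output = kernel_size * kernel_size * in_channels
    return outputs * macs_per_output


# Illustrative layer: 224x224 RGB input, 64 filters of size 3x3.
print(conv_macs(224, 224, 3, 64, 3))
```

Under these assumptions a single first-layer convolution already needs tens of millions of MACs, and a deep network stacks many such layers.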
Characterizing and optimizing CNN architectures. Typical CNNs consist of hundreds of millions of parameters and require the execution of billions of operations per second to achieve real-time performance. As their accuracy improves, CNNs include more parameters and layers, becoming wider and deeper. The use of compact data representations provided by quantization mitigates some of the overhead of these more complex network architectures and allows for high degrees of parallelism and data reuse, which are especially useful in constrained processing environments.
Data quantization significantly reduces the computation and storage requirements of neural networks by decreasing the bit width of the model's weights. Quantizing these values to fewer than 8 bits while retaining accuracy, however, requires manual effort, hyperparameter tuning, and intensive retraining.
While training requires a high dynamic range, inference does not: In most cases, 2-bit to 4-bit precision achieves the desired levels of accuracy.10 Going further, it is even possible to approximate CNNs by binarizing (that is, quantizing to one bit) their input, weights, and/or activations.10,17
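As a rough illustration of what reducing bit width means, the sketch below applies symmetric uniform quantization to a list of weights. This is one common scheme chosen for simplicity, not necessarily the one used in the cited work; the function names and sample values are hypothetical, and the retraining needed to recover accuracy is not shown.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization of floats to `bits` bits (bits >= 2).

    Returns (codes, scale), where each code is an integer in
    [-(2**(bits-1) - 1), 2**(bits-1) - 1] and weight ~= code * scale.
    """
    if bits < 2:
        raise ValueError("1-bit weights are binarized (sign), not quantized here")
    max_code = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / max_code if max_abs > 0 else 1.0
    codes = [max(-max_code, min(max_code, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [-1.0, -0.2, 0.0, 0.3, 1.0]
codes, scale = quantize_uniform(weights, bits=4)
print(codes, dequantize(codes, scale))
```

With 4 bits, each weight is stored as one of 15 integer levels plus a shared scale factor, which is where the storage savings come from.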
In many practical settings, particularly edge computing, the performance and energy-efficiency benefits of binary neural networks outweigh the accuracy loss. A further benefit of binary neural networks is the ability to approximate the convolution operation required by CNNs, by combining the much more efficient bitwise XNOR and bit-counting operations.
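The XNOR and bit-counting approximation can be sketched as follows. The example assumes that values in {+1, -1} are packed one per bit (+1 as 1, -1 as 0), so that XNOR marks the positions where two vectors agree; the helper names are hypothetical.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int, one bit each (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {+1, -1} vectors stored as packed ints."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask      # 1 wherever the two signs agree
    matches = bin(xnor).count("1")        # bit count (popcount)
    return 2 * matches - n                # agreements minus disagreements


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(pack_signs(a), pack_signs(b), len(a)))
```

Each agreeing position contributes +1 and each disagreeing position contributes -1, so the result equals the ordinary dot product of the two sign vectors without any multiplications.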
Table 1 illustrates the memory and processing requirements of five widely used CNN models across a range of devices, from datacenter servers to IoT nodes. The table indicates the number of parameters and MAC operations for each network. W/A refers to the bit widths of weights and activations for each level of quantization. All networks have 224-by-224 input resolution. These networks were selected from a large set of neural network algorithms to fit the edge nodes, namely in terms of the memory required to run the models. Their size is shown as a function of weight/activation bit widths (32-bit, 2-bit, 1-bit), alongside the number of MAC operations used by each network.
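The relationship between weight bit width and model size is simple arithmetic, sketched below. The parameter count is the commonly cited approximate figure for VGG-16 (about 138 million) and is an assumption here, not a value read from Table 1; the estimate ignores activations and any layers kept at higher precision.

```python
def model_size_mb(num_params, weight_bits):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * weight_bits / 8 / 1e6


VGG16_PARAMS = 138_000_000  # assumed approximate parameter count

for bits in (32, 2, 1):
    print(bits, "bit weights:", model_size_mb(VGG16_PARAMS, bits), "MB")
```

Under this assumption, moving from 32-bit to 1-bit weights shrinks weight storage by a factor of 32.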
Specialization as the enabler of high-performance AI. Modern deep-learning algorithms have substantial computational, memory, and energy requirements, which makes their implementation on edge devices challenging. This challenge can be addressed by exploiting two unique characteristics of ML algorithms: First, each class of deep-learning algorithm relies on a limited set of specialized operations; second, in many cases these algorithms provide good accuracy even when they use low-bit-width operations.13
In recent years, several frameworks (for example, TensorFlow, PyTorch, TensorRT) helped bridge the semantic gap between the high-level description of a neural network model and its underlying hardware mapping by using specialized instructions. This is achieved by performing operations in a bulk parallel manner, minimizing memory accesses and maximizing compute resource utilization.
These observations and constraints lead to the formulation of a set of well-defined target metrics for AI at the edge:
The good old processor-centric computing model. The late 20th and early 21st centuries saw the widespread use of the processor-centric computational model. In this model, programs and data are stored in memory, and processing takes place in specialized ALUs (arithmetic logic units). Together with Moore's Law, the introduction of caches, branch predictors, out-of-order execution, multithreading, and several other hardware and software optimizations enabled a steady and largely uninterrupted series of performance improvements over the past decades.
In contrast, memory systems have improved at a much slower pace. This performance gap between the processor and the main memory—supported by the fact that these technologies are still several process-node generations apart—has given rise to a critical data movement bottleneck, dubbed the memory wall.14 Memory is the dominant performance and energy bottleneck in modern computing systems; data movement is much more expensive than computation, both in latency and energy. The data movement bottleneck will remain relevant as the number of smart devices connected to the Internet—as of this writing, already in the billions, as depicted in Figure 1—continues to grow.
A compelling possibility: processing data where it resides. As the demand for inferencing at the edge grows, accessing data more efficiently becomes increasingly relevant. The proposed improvements span data reuse by exploiting temporal and physical locality; algorithm design, with the introduction of optimized neural network topologies and the use of quantization; specialized hardware, introducing dedicated vectorized instructions designed to address the demands of these workloads (for example, SIMD MAC operations); and PiM architectures. When performed at the edge, PiM enables higher throughput for AI applications, without compromising device autonomy.17
PiM solutions differ primarily in their proximity to the data:14 In the processing-near-memory (PnM) paradigm, computation takes place close to where data resides, but in a different medium—for example, in the logic layer of a 3D-stacked memory; in contrast, the processing-using-memory (PuM) paradigm takes advantage of the storage medium's physical properties to perform computation—for example, ReRAM (resistive random access memory) or DRAM (dynamic random access memory) cells.
Processing-near-memory. 3D-stacked memories are an emerging type of memory architecture that enables the vertical stacking of memory layers on top of a logic layer. This logic layer can be designed to feature hardware support for several operations, thus enabling computation inside the memory units.19
Processing-using-memory. DRAM technology is especially well suited for supporting bitwise operations since adjacent memory rows can communicate with one another through their bitlines.
Ambit21 supports bulk bitwise majority/AND/OR/NOT functions by exploiting the analog operation of DRAM. The combination of these operations allows the design of full applications. Recent studies show that Ambit's core operating principle can be performed in commodity off-the-shelf DRAM chips with no changes to DRAM.7 Ambit improves performance by 30 to 45 times and reduces energy consumption by 25 to 60 times for the execution of bulk bitwise operations, resulting in an overall speedup in database queries of four to 12 times.
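How a majority function yields AND and OR can be shown with a small software model. The sketch assumes each DRAM row is represented as a Python integer and that a control row of all zeros or all ones is available; it models only the logic, not the DRAM commands or analog behavior, and the function names are hypothetical.

```python
def maj(a, b, c):
    """Bitwise majority of three equal-width rows, each stored as an int."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a, b):
    """AND as majority with an all-zeros control row."""
    return maj(a, b, 0)


def bulk_or(a, b, width):
    """OR as majority with an all-ones control row of the given width."""
    return maj(a, b, (1 << width) - 1)


row_a, row_b = 0b1100, 0b1010
print(bin(bulk_and(row_a, row_b)), bin(bulk_or(row_a, row_b, 4)))
```

Because the operation applies to every bit position of a row at once, a single step in this model corresponds to a bulk bitwise operation over an entire row.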
SIMDRAM8 creates an optimized graph representation of a user-defined arbitrary operation using bitwise majority and NOT operations, which can be performed using the triple-row activation command defined in Ambit. The SIMDRAM control unit orchestrates the computation from start to finish by executing the previously defined DRAM commands. SIMDRAM improves performance by 88 and 5.8 times and reduces energy consumption by 257 and 31 times, compared with CPU and GPU execution, respectively.
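To illustrate how arithmetic can be built from majority and NOT alone, the sketch below adds many pairs of numbers at once in a bit-serial fashion. It assumes a vertical layout in which row i holds bit i of every element, and it uses a standard majority-logic full adder; this decomposition is an assumption for illustration and is not necessarily the graph that SIMDRAM generates. All names are hypothetical.

```python
def maj(a, b, c):
    """Bitwise majority of three rows stored as ints."""
    return (a & b) | (b & c) | (a & c)


def to_rows(values, bits):
    """Transpose values so that rows[i] holds bit i of every lane."""
    return [sum(((v >> i) & 1) << lane for lane, v in enumerate(values))
            for i in range(bits)]


def from_rows(rows, lanes):
    """Inverse of to_rows: rebuild one integer per lane."""
    return [sum(((row >> lane) & 1) << i for i, row in enumerate(rows))
            for lane in range(lanes)]


def bit_serial_add(a_rows, b_rows, lanes):
    """Add all lanes in parallel, one bit position per step, using MAJ and NOT."""
    mask = (1 << lanes) - 1

    def inv(row):
        return ~row & mask

    carry, out = 0, []
    for a, b in zip(a_rows, b_rows):
        carry_out = maj(a, b, carry)
        # Sum bit expressed with majority and NOT only.
        out.append(maj(inv(carry_out), carry, maj(a, b, inv(carry))))
        carry = carry_out
    out.append(carry)  # final carry row
    return out


a, b = [3, 5, 7, 0], [1, 2, 7, 15]
print(from_rows(bit_serial_add(to_rows(a, 4), to_rows(b, 4), 4), 4))
```

The number of steps depends on the operand bit width rather than on the number of elements, which is why narrow, quantized data suits this style of computation.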
The DRAM-based PuM architecture pLUTo5 extends the flexibility and performance of PuM by introducing a mechanism for bulk in-DRAM value lookups. The lookups take place entirely within the DRAM subarray and therefore do not require that data be moved off-chip at any point. With pLUTo, it is possible to implement arbitrarily complex functions as table lookups (so long as the memory arrays are sufficiently large to accommodate them), while minimizing the overall movement of data. pLUTo improves performance by 33 and 8 times and reduces energy consumption by 110 and 80 times compared with CPU and GPU execution, respectively.
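The idea of replacing computation with table lookups can be modeled in a few lines. The sketch precomputes a function over every possible input of a given bit width and then answers a batch of queries by indexing; it is a software analogy only and does not model how pLUTo performs the lookups inside a DRAM subarray. The function names and the example function are hypothetical.

```python
def build_lut(func, input_bits):
    """Precompute func for every input representable in input_bits bits."""
    return [func(x) for x in range(1 << input_bits)]


def bulk_lookup(lut, inputs):
    """Replace each input with its precomputed table entry."""
    return [lut[x] for x in inputs]


# Example: a 4-bit bit-count table (16 entries).
popcount_lut = build_lut(lambda x: bin(x).count("1"), 4)
print(bulk_lookup(popcount_lut, [0, 3, 15, 8]))
```

The table grows exponentially with the input bit width (2^n entries for n-bit inputs), which is why this approach pairs naturally with low-bit-width data.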
PRIME2 and ISAAC22 are two promising neural network accelerators based on ReRAM. These proposals leverage the ReRAM crossbar array to perform matrix-vector multiplication efficiently in the analog domain. These solutions report performance and energy consumption improvements of up to 2,360 and 895 times, respectively, relative to state-of-the-art neural-processing unit designs.
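The crossbar computation can be illustrated with an idealized numerical model: input voltages drive the rows, each cell contributes a current proportional to its conductance (Ohm's law), and the currents sum along each column (Kirchhoff's current law). The sketch ignores wire resistance, noise, and the conversions between digital and analog values that real designs need; the names and values are illustrative.

```python
def crossbar_mvm(conductances, voltages):
    """Idealized crossbar: column current j = sum_i voltages[i] * conductances[i][j].

    conductances[i][j] is the cell between input row i and output column j.
    """
    cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(cols)]


weights_as_conductances = [[1, 2],
                           [3, 4]]
print(crossbar_mvm(weights_as_conductances, [1, 1]))
```

In this model every column produces its dot product simultaneously, so a full matrix-vector product takes a single read step regardless of the matrix size.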
Enabling the adoption of PiM. Because quantized neural networks rely on simple operations, they are a prime target for exploiting the use of PiM to process AI kernels at the edge. Performing more straightforward computation close to where data resides greatly reduces the overall movement of data, which improves latency, throughput, and energy efficiency. PiM can likely be applied even more efficiently to quantized and binary CNNs, which rely on XNOR and bit-count operators.
To validate the performance and energy merits of PiM, we present a quantitative analysis of the improvements to CNN image inference at the edge, offered by Ambit, a DRAM-based PuM architecture that supports bitwise logic operations.
This section evaluates the accuracy, performance, and energy costs of performing inference for binary and quantized versions of the most recent neural networks, introduced in Table 1. To this end, Table 2 shows the energy consumption of Ambit-based PiM designs to perform inference on each of these networks. These estimates account for the cost of performing the dominant MAC operations required by the convolutions. Energy is calculated both for execution on ARM CPUs at the edge (the baseline) and, using an analytical model, for PiM technology. (For full access to the original references, see https://github.com/joaodinissf/qcnn-accuracy/.)
Table 2. Energy consumed per inference for the networks in Table 1.
This shows that different accuracy-performance-energy trade-offs are possible. Table 3 is an analysis of the performance of the four most recent neural networks from Table 1, targeting 15FPS and 60FPS, for 3-, 2- and 1-bit precision, with supported inference throughput (in FPS) for Qualcomm Snapdragon 865, Intel Xeon Gold 6154, and Edge TPU baselines, compared with an Ambit-based PiM architecture. Cells with a '-' indicate lack of real-time support; cells with a '+' indicate real-time support. Empirical observations show that these quantized models ensure adequate accuracy for many applications. Ingenious quantization techniques enable further compression of even the smallest neural network models and the portability of very large networks for use in power-constrained devices, with small, tunable, accuracy losses.
Accuracy. The average accuracy loss is 6.8%, and in the best case the loss is as low as 2.2%. Where 1-bit quantization results in too great an accuracy loss, settling for 2- or 3-bit quantization often yields good accuracy.
Performance. The described Ambit-based accelerator achieves 15FPS in eight cases out of 12, and 60FPS for the other four cases. In contrast, the Qualcomm Snapdragon, Intel Xeon, and Edge TPU baselines sustain 15FPS in, respectively, zero, three, and nine cases out of 12, and 60FPS in zero, zero, and nine cases out of 12. The average speedup of the Ambit-based accelerator over each of these baselines is 58.8, 21.3, and 4.6 times.
Two key conclusions can be drawn. First, the Edge TPU can sustain a processing rate of only 3.4FPS for VGG-16, because of this network's high number of parameters. This illustrates the poor scalability of current specialized neural network accelerators to process very large networks. Second, the flexible degree of parallelism in the PiM implementation, attained through the operation of multiple subarrays in parallel, allows the inference time to scale quasi-linearly with the size of the network, which enables a near-constant inference time in the largest networks.
Energy. Average energy savings of 35.4 times were observed for 1-bit, relative to 32-bit precision. Energy gains are linearly proportional to the degree of quantization.
As the volume of data to process approaches the installed processing capacity, the successful implementation of AI at the edge will depend on the development of optimized architectures that are able to perform AI tasks while meeting strict performance and energy-efficiency requirements. After the introduction of 1,000-core processors and the expansion of levels of parallelism in the compute units, computer architects and software developers must now turn to the memory to design the next generation of high-performance and highly efficient systems. While PiM architectures have shown promise,16 enabling their adoption at the edge layer will require answers to the following open questions for manufacturers designing PiM:
PiM software development for end users includes:
Demand for AI at the edge will continue to grow. High-performance and energy-efficient PiM-based edge architectures for the processing of AI should have the following qualities:
Where do we go from here? Several PiM architectures have demonstrated the ability to perform AI tasks with unprecedented levels of efficiency by meeting the take-aways presented here. Existing PiM designs support a limited range of operations, however, and further work is necessary to meet the requirements of AI tasks.
PiM is not a silver bullet, and it will not supersede conventional computing—although it may soon become evident that computers and other smart devices benefit from incorporating both conventional processing units and PiM-enabled memories. When designing new architectures, computer architects must remain mindful that data movement is expensive in both latency and energy. Emerging technologies and architectures enable the mitigation of data movement and, as such, pave a path for the design of more efficient computing devices. The current paradigm will take us only so far; PiM presents a compelling alternative.
Acknowledgments. This work was partially supported by Instituto de Telecomunicações and Fundação para a Ciência e a Tecnologia, Portugal, under grants EXPL/EEI-HAC/1511/2021, PTDC/EEIHAC/30485/2017 and UIDB/EEA/50008/2020.
1. Bonawitz, K., Kairouz, P., McMahan, B., Ramage, D. Federated learning and privacy: building privacy-preserving systems for machine learning and data science on decentralized data. acmqueue 19, 5 (2021), 87–114; https://dl.acm.org/doi/10.1145/3494834.3500240.
2. Chi, P., Li, S., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Y., Xie, Y. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. ACM SIGARCH Computer Architecture News 44, 3 (2016), 27–39; https://dl.acm.org/doi/10.1145/3007787.3001140.
3. Devaux, F. The true processing in memory accelerator. In Proceedings of the IEEE Hot Chips 31 Symp. IEEE Computer Society, 2019, 1–24; https://ieeexplore.ieee.org/document/8875680.
4. Duarte, P., Tomas, P., Falcao, G. SCRATCH: An end-to-end application-aware soft-GPGPU architecture and trimming tool. In Proceedings of the 50th Annual IEEE/ACM Intern. Symp. Microarchitecture, 2017, 165–177; https://ieeexplore.ieee.org/document/8686531.
5. Ferreira, J.D. et al. pLUTo: Enabling massively parallel computation in DRAM via lookup tables. In Proceedings of the 55th IEEE/ACM Intern. Symp. Microarchitecture, 2022, 900–919; http://bit.ly/3I8RCKZ.
6. Fuketa, H., Uchiyama, K. Edge artificial intelligence chips for the cyberphysical systems era. Computer 54, 1 (2021), 84–88; https://ieeexplore.ieee.org/document/9321799.
7. Gao, F., Tziantzioulis, G., Wentzlaff, D. ComputeDRAM: In-memory compute using off-the-shelf DRAMs. In Proceedings of the 52nd Annual IEEE/ACM Intern. Symp. Microarchitecture, 2019, 100–113; https://dl.acm.org/doi/10.1145/3352460.3358260.
8. Hajinazar, N. et al. SIMDRAM: A framework for bit-serial SIMD processing using DRAM. In Proceedings of the 26th ACM Intern. Conf. Architectural Support for Programming Languages and Operating Systems, 2021, 329–345; https://dl.acm.org/doi/10.1145/3445814.3446749.
9. Hennessy, J.L., Patterson, D.A. A new golden age for computer architecture. Commun. ACM 62, 2 (Feb. 2019), 48–60; https://dl.acm.org/doi/10.1145/3282307.
10. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Machine Learning Research 18, (2017), 1–30; https://www.jmlr.org/papers/volume18/16-456/16-456.pdf.
11. Jouppi, N. et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the ACM/IEEE 44th Annual Intern. Symp. Computer Architecture, 2017, 1–12; https://ieeexplore.ieee.org/abstract/document/8192463/similar.
12. Kwon, Y.-C. et al. A 20nm 6GB function-in-memory DRAM, based on HBM2 with a 1.2 TFLOPS programmable computing unit using bank-level parallelism, for machine learning applications. In Proceedings of the 2021 IEEE Intern. Solid-state Circuits Conf., 350–352; https://ieeexplore.ieee.org/document/9365862.
13. Marques, J., Andrade, J., Falcao, G. Unreliable memory operation on a convolutional neural network processor. In Proceedings of the IEEE 2017 Intern. Workshop on Signal Processing Systems, 1–6; https://ieeexplore.ieee.org/document/8110024.
14. Mutlu, O., Ghose, S., Gómez-Luna, J., Ausavarungnirun, R. A modern primer on processing in memory. 2020; https://arxiv.org/abs/2012.03112.
15. Pandey, P., Pompili, D. Handling limited resources in mobile computing via closed-loop approximate computations. IEEE Pervasive Computing 18, 1 (2019), 39–48; https://ieeexplore.ieee.org/document/8705029.
16. Radojkovic, P., Carpenter, P., Esmaili-Dokht, P., Cimadomo, R., Charles, H.-P., Sebastian, A., Amato, P. Processing in memory: The tipping point. White Paper from European Technology Platform for High-performance Computing, 2021; https://bit.ly/3JSrm8H
17. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A. Enabling AI at the edge with XNOR-networks. Commun. ACM 63, 12 (Dec. 2020), 83–90; https://dl.acm.org/doi/10.1145/3429945.
18. Reinsel, D., Gantz, J., Rydning, J. The digitization of the world—from edge to core. I.D. Corp. White Paper, 2018; https://bit.ly/40H3iMl.
19. Rosenfeld, P. Performance exploration of the hybrid memory cube. Ph.D. dissertation. University of Maryland, 2014; https://user.eng.umd.edu/~blj/papers/thesis-PhD-paulr-HMC.pdf.
21. Seshadri, V. et al. Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology. In Proceedings of the 50th Annual IEEE/ACM Intern. Symp. Microarchitecture, 2017, 273–287; https://dl.acm.org/doi/10.1145/3123939.3124544.
22. Shafiee, A. et al. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News 44, 3 (2016), 14–26; https://dl.acm.org/doi/10.1145/3007787.3001139.
23. Talpes, E. et al. Compute solution for Tesla's full self-driving computer. IEEE Micro 40 (2020), 25–35; https://ieeexplore.ieee.org/document/9007413.
24. Zheng, H., Hu, H., Han, Z. Preserving user privacy for machine learning: Local differential privacy or federated machine learning? IEEE Intelligent Systems 35, 4 (2020), 5–14; https://ieeexplore.ieee.org/document/9144394.
Copyright held by owner/author. Publication rights licensed to ACM.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2023 ACM, Inc.