As artificial intelligence becomes a pervasive tool for the billions of Internet of Things (IoT) devices at the edge, the data movement bottleneck imposes severe limitations on the performance and autonomy of these systems. Processing-in-memory (PiM) is emerging as a way of mitigating the data movement bottleneck while satisfying the stringent performance, energy efficiency, and accuracy requirements of edge imaging applications that rely on convolutional neural networks (CNNs).
The globalization of affordable Internet access has spurred a revolution in computer architectures, characterized by the accelerated, widespread adoption of smartphones, tablets, and other smart devices, which are now commonplace.15 The rise of IoT applications in a wide range of domains (for example, personal computing, education, industry, military, healthcare, digital agriculture) brings with it the ability to integrate billions of devices on the Internet, as depicted in Figure 1.
Figure 1. IoT vs. non-IoT active device connections.
This integration presents unprecedented challenges, such as the need for inexpensive computation and communication, capable of crunching the increasing volumes of data generated every day.
The IoT paradigm also promises a more intimate connection between the cyber and physical worlds, as data becomes a ubiquitous asset exchanged among all manner of connected (smart) devices. Moreover, the data flow is often bidirectional, taking place not just from the physical world to the cyber world, but also from the cyber world back to the physical world, where it shapes what is possible.
Bringing this vision to fruition will require that IoT devices exhibit AI. The most promising approaches today are based on empowering systems with the ability to learn autonomously from experience by assimilating large amounts of data—using machine learning (ML) algorithms—with a particular focus on deep learning and image inference.
AI Demands at the Edge Will Grow
The total volume of digital data created, replicated, and consumed within a year surpassed dozens of zettabytes (ZB, or 10²¹ bytes) in 2020, and the International Data Corporation estimates that this number will grow to hundreds of ZB in the coming years.18 The COVID-19 pandemic contributed to this figure because of widespread work-from-home mandates and a sudden increase in videoconferencing and streaming data. A significant portion of this data is consumed at the edge, often with processing performed entirely on smartphones and embedded systems. The rise of many other AI-based applications applied to the big data revolution exacerbates this problem by placing increasing stress on computing and memory systems, particularly those operating at the edge.
The case for edge computing as the enabler of sustainable AI scaling is strengthened because a sizable portion of the data generated by modern digital systems originates from sensors located at the edge—under this paradigm, data is processed where it is generated, as Figure 2 illustrates. In contrast with the conventional approach shown on the left, with the emergence of cloud-edge hierarchies, AI moves to the edge layer, shown on the right in the figure. This new paradigm creates pressure for more intensive computation on the edge-processing nodes, but it also decreases the time and energy spent communicating with the cloud and introduces data privacy and latency benefits.
Figure 2. Migration of AI to the edge.
With the growth in data generation and the expansion of demand for AI applications at increasing rates, five goals justify moving the processing of AI to the edge:1,6,20,24
- Latency. Many applications have an interactive nature and thus cannot endure long latencies, especially for memory, storage, and network requests.
- Reliability. Communication networks are not reliable at all places, at all times. To ensure maximum uptime, AI-based decision making must not rely on always-available communication networks.
- Privacy/security. Some applications require sensitive data to be kept in a controlled local environment, avoiding its circulation to/from the cloud. Examples include medical, financial, and autonomous-driving applications, among many others.
- Bandwidth. Data that is processed near where it is collected does not need to be sent to the cloud, which reduces the overall bandwidth demand on the network and the edge-computing systems.
- Data provenance. Provenance issues may prevent data from being processed far from where it is generated. Datacenter storage may need to comply with regional data protection legislation such as General Data Protection Regulation (GDPR) in Europe and Personal Information Protection Act (PIPA) in Canada.
To achieve these goals, computer architects and software developers must adopt a holistic vision of the combined cloud+edge system to keep the unnecessary movement of data between components to a minimum by processing data where it is generated and stored, as this is the dominant performance and energy bottleneck.9
AI at the Edge: ML Solutions for Data Challenges
At present, neural networks are widely used in many domains and are becoming integral components of other emerging applications, such as self-driving cars, always-on biosignal monitoring, augmented and virtual reality, critical IoT, and voice communication (which represents up to 25% of 5G use cases at the edge), all of which require AI algorithms to operate on high volumes of data at the edge (see Figure 3). Autonomous vehicles, digital agriculture, smartphones, and smart IoT devices all process substantial volumes of data while running AI kernels at the edge. The systems that support these emerging applications will be expected to make decisions faster—and, often, better—than their human counterparts, with support for continuous fine-tuning of their decision-making by factoring in ever-increasing volumes of data for training and inference.
Figure 3. Examples of data processing with AI kernels at the edge.
CNNs are the de facto standard for image-based decision-making tasks. These models make heavy use of the convolution and multiply-and-accumulate (MAC) operations, which represent more than 90% of the total cost of computation.11 For this reason, state-of-the-art neural network accelerators (for example, Google’s Tensor Processing Unit) have focused on optimizing the performance and energy efficiency of MAC operations.
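To make that cost concrete, the following sketch (in Python, with illustrative shapes) shows how a single output element of a convolution layer reduces to a sequence of MAC operations; a full layer repeats this loop for every output position and every filter, which is where the billions of operations come from.

```python
import numpy as np

def conv_output_element(input_patch: np.ndarray, kernel: np.ndarray) -> float:
    """One output pixel of a convolution: an accumulation of products (MACs)."""
    acc = 0.0
    for c in range(kernel.shape[0]):          # input channels
        for i in range(kernel.shape[1]):      # kernel height
            for j in range(kernel.shape[2]):  # kernel width
                acc += input_patch[c, i, j] * kernel[c, i, j]  # one MAC
    return acc

patch = np.random.rand(3, 3, 3)    # 3 input channels, 3x3 receptive field
kernel = np.random.rand(3, 3, 3)   # one 3x3x3 filter
assert np.isclose(conv_output_element(patch, kernel), np.sum(patch * kernel))
```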
Characterizing and optimizing CNN architectures. Typical CNNs consist of hundreds of millions of parameters and require the execution of billions of operations per second to achieve real-time performance. As their accuracy improves, CNNs include more parameters and layers, becoming wider and deeper. The use of compact data representations provided by quantization mitigates some of the overhead of these more complex network architectures and allows for high degrees of parallelism and data reuse, which are especially useful in constrained processing environments.
Data quantization significantly reduces the computation and storage requirements of neural networks by decreasing the bit width of the model’s weights. Quantizing these values to less than 8 bits while retaining accuracy, however, requires manual effort, hyperparameter tuning, and intensive retraining.
While training requires a high dynamic range, inference does not: In most cases, 2-bit to 4-bit precision achieves the desired levels of accuracy.10 Going further, it is even possible to approximate CNNs by binarizing (that is, quantizing to one bit) their input, weights, and/or activations.10,17
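As an illustration of how such compact representations are obtained, here is a minimal sketch of post-training symmetric uniform quantization of a weight tensor to n bits. The per-tensor scale and rounding policy are simplifying assumptions; production flows typically add calibration, per-channel scales, and retraining.

```python
import numpy as np

def quantize_weights(w: np.ndarray, n_bits: int):
    """Map float weights to signed integers of n_bits with one per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1                       # 127 for 8 bits, 1 for 2 bits
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale                                    # approximate w as q * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)    # a toy conv weight tensor
for bits in (8, 4, 2):
    q, s = quantize_weights(w, bits)
    print(bits, "bits -> max abs error:", np.abs(w - q * s).max())
```

As expected, the reconstruction error grows as the bit width shrinks, which is the accuracy-versus-efficiency trade-off discussed throughout this section.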
In many practical settings, particularly edge computing, the performance and energy-efficiency benefits of binary neural networks outweigh the accuracy loss. A further benefit of binary neural networks is the ability to approximate the convolution operation required by CNNs, by combining the much more efficient bitwise XNOR and bit-counting operations.
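A sketch of that trick, under the usual assumption that weights and activations are constrained to {-1, +1} and stored as bits {0, 1}: the dot product at the heart of a binary convolution becomes a bitwise XNOR followed by a bit count (popcount).

```python
import numpy as np

def binary_dot(a_bits: np.ndarray, w_bits: np.ndarray) -> int:
    """Dot product of two +/-1 vectors given their {0,1} bit encodings."""
    n = a_bits.size
    xnor = ~(a_bits ^ w_bits) & 1          # 1 wherever the two signs agree
    matches = int(xnor.sum())              # bit count (popcount)
    return 2 * matches - n                 # matches - mismatches

a = np.random.choice([-1, 1], size=256)
w = np.random.choice([-1, 1], size=256)
a_bits = (a > 0).astype(np.uint8)
w_bits = (w > 0).astype(np.uint8)
assert binary_dot(a_bits, w_bits) == int(np.dot(a, w))
```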
Table 1 illustrates the memory and processing requirements of five widely used CNN models across a range of devices, from datacenter servers to IoT nodes. The table indicates the number of parameters and MAC operations for each network; W/A refers to the bit widths of weights and activations at each level of quantization, and all networks use a 224-by-224 input resolution. These networks were selected from a large set of neural network models as candidates to fit edge nodes, namely in terms of the memory required to run them. Their size is shown as a function of weight/activation bit widths (32-bit, 2-bit, 1-bit), as well as the number of MAC operations used by each network.
Table 1. Convolutional neural networks using the ImageNet dataset.
Specialization as the enabler of high-performance AI. Modern deep-learning algorithms have substantial computational, memory, and energy requirements, which makes their implementation on edge devices challenging. This challenge can be addressed by exploiting two unique characteristics of ML algorithms: First, each class of deep-learning algorithm relies on a limited set of specialized operations; second, in many cases these algorithms provide good accuracy even when they use low-bit-width operations.13
In recent years, several frameworks (for example, TensorFlow, PyTorch, TensorRT) helped bridge the semantic gap between the high-level description of a neural network model and its underlying hardware mapping by using specialized instructions. This is achieved by performing operations in a bulk parallel manner, minimizing memory accesses and maximizing compute resource utilization.
Efficient Edge AI: Architecting Data-Centric Systems
These observations and constraints lead to the formulation of a set of well-defined target metrics for AI at the edge:
- Accuracy. The success rate of the AI task10 (for example, image classification, object detection, sentence generation, translation).
- Throughput. The rate of processing of input data. Many real-time AI applications that support video must sustain processing rates on the order of thousands of FPS (frames per second)—for example, 2,300FPS in self-driving cars23 or hundreds to thousands of FPS in ultrasound medical devices.
- Latency. The critical path delay associated with the processing of a single input element. 5G standards define a maximum latency of one millisecond for positioning and tracking systems; self-driving cars must provide latencies within the same order of magnitude.23
- Power and energy. Most edge devices are battery-powered and maximizing battery life is a key design target. For reference, the computational system of self-driving cars requires a power supply of up to 2.5kW.23
- Data precision. AI data need not always be represented in 64- or 32-bit floating-point precision. For many inference applications, integer precision of eight bits or less suffices.4,13
The good old processor-centric computing model. The late 20th and early 21st centuries saw the widespread use of the processor-centric computational model. In this model, programs and data are stored in memory, and processing takes place in specialized ALUs (arithmetic logic units). Together with Moore’s Law, the introduction of caches, branch predictors, out-of-order execution, multithreading, and several other hardware and software optimizations enabled a steady and largely uninterrupted series of performance improvements over the past decades.
In contrast, memory systems have improved at a much slower pace. This performance gap between the processor and main memory—compounded by the fact that the two technologies remain several process-node generations apart—has given rise to a critical data movement bottleneck, dubbed the memory wall.14 Memory is the dominant performance and energy bottleneck in modern computing systems; data movement is much more expensive than computation, in both latency and energy. The data movement bottleneck will remain relevant as the number of smart devices connected to the Internet—as of this writing, already in the billions, as depicted in Figure 1—continues to grow.
A compelling possibility: processing data where it resides. As the demand for inferencing at the edge grows, accessing data more efficiently becomes increasingly relevant. The proposed improvements span data reuse by exploiting temporal and physical locality; algorithm design, with the introduction of optimized neural network topologies and the use of quantization; specialized hardware, introducing dedicated vectorized instructions designed to address the demands of these workloads (for example, SIMD MAC operations); and PiM architectures. When performed at the edge, PiM enables higher throughput for AI applications, without compromising device autonomy.17
PiM solutions differ primarily in their proximity to the data:14 In the processing-near-memory (PnM) paradigm, computation takes place close to where data resides, but in a different medium—for example, in the logic layer of a 3D-stacked memory; in contrast, the processing-using-memory (PuM) paradigm takes advantage of the storage medium’s physical properties to perform computation—for example, ReRAM (resistive random access memory) or DRAM (dynamic random access memory) cells.
Processing-near-memory. 3D-stacked memories are an emerging type of memory architecture that enables the vertical stacking of memory layers on top of a logic layer. This logic layer can be designed to feature hardware support for several operations, thus enabling computation inside the memory units.19
Processing-using-memory. DRAM technology is especially well suited for supporting bitwise operations since adjacent memory rows can communicate with one another through their bitlines.
Ambit21 supports bulk bitwise majority/AND/OR/NOT functions by exploiting the analog operation of DRAM. The combination of these operations allows the design of full applications. Recent studies show that Ambit’s core operating principle can be performed in commodity off-the-shelf DRAM chips with no changes to DRAM.7 Ambit improves performance by 30 to 45 times and reduces energy consumption by 25 to 60 times for the execution of bulk bitwise operations, resulting in an overall speedup in database queries of four to 12 times.
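The following is a functional sketch (plain Python, not a circuit or timing model) of that operating principle: simultaneously activating three DRAM rows yields their bitwise majority, and fixing the third row to all zeros or all ones turns the majority into a bulk AND or OR. NOT is shown here as a simple inversion, standing in for Ambit's dual-contact-cell mechanism.

```python
import numpy as np

def maj3(r1, r2, r3):
    """Bitwise majority of three rows (what triple-row activation produces)."""
    return (r1 & r2) | (r1 & r3) | (r2 & r3)

ROW_BITS = 8192                      # illustrative DRAM row width
A = np.random.randint(0, 2, ROW_BITS, dtype=np.uint8)
B = np.random.randint(0, 2, ROW_BITS, dtype=np.uint8)
zeros = np.zeros(ROW_BITS, dtype=np.uint8)
ones = np.ones(ROW_BITS, dtype=np.uint8)

assert np.array_equal(maj3(A, B, zeros), A & B)   # MAJ(A, B, 0) = A AND B
assert np.array_equal(maj3(A, B, ones), A | B)    # MAJ(A, B, 1) = A OR B
not_A = ones ^ A                                  # bulk NOT (dual-contact cells in Ambit)
assert np.array_equal(not_A, 1 - A)
```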
SIMDRAM8 creates an optimized graph representation of a user-defined arbitrary operation using bitwise majority (MAJ) and NOT operations, which can be performed using the triple-row activation command defined in Ambit. The SIMDRAM control unit orchestrates the computation from start to finish by executing the previously defined DRAM commands. SIMDRAM improves performance by 88/5.8 times and reduces energy consumption by 257/31 times, compared with CPU/GPU execution.
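To illustrate how an arbitrary operation can be lowered to these primitives, here is a simplified, SIMDRAM-flavored sketch of element-wise addition built only from majority and NOT, with values laid out bit-serially (bit plane i holds bit i of every element). The layout and helper names are illustrative, not SIMDRAM's actual command stream.

```python
import numpy as np

def maj3(a, b, c):
    return (a & b) | (a & c) | (b & c)

def not_(a):
    return a ^ 1

def xor_(a, b):
    # a XOR b = (a OR b) AND NOT(a AND b), built only from MAJ and NOT
    zeros, ones = np.zeros_like(a), np.ones_like(a)
    return maj3(maj3(a, b, ones), not_(maj3(a, b, zeros)), zeros)

def bit_serial_add(a_planes, b_planes):
    """Ripple-carry addition over bit planes, least-significant bit first."""
    carry = np.zeros_like(a_planes[0])
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(xor_(xor_(a, b), carry))   # sum bit of the full adder
        carry = maj3(a, b, carry)             # carry bit is a pure majority
    out.append(carry)
    return out

def to_planes(x, n_bits):
    return [((x >> i) & 1).astype(np.uint8) for i in range(n_bits)]

def from_planes(planes):
    return sum(p.astype(np.int64) << i for i, p in enumerate(planes))

x = np.random.randint(0, 256, 4096)
y = np.random.randint(0, 256, 4096)
assert np.array_equal(from_planes(bit_serial_add(to_planes(x, 8), to_planes(y, 8))), x + y)
```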
The DRAM-based PuM architecture pLUTo5 extends the flexibility and performance of PuM by introducing a mechanism for bulk in-DRAM value lookups. The lookups take place entirely within the DRAM subarray and therefore do not require that data be moved off-chip at any point. With pLUTo, it is possible to implement arbitrarily complex functions as table lookups (so long as the memory arrays are sufficiently large to accommodate them), while minimizing the overall movement of data. pLUTo improves performance by 33/8 times and reduces energy consumption by 110/80 times compared with CPU/GPU execution.
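A minimal sketch of the lookup-table idea that pLUTo accelerates: a function over a small input domain (here, 8 bits) is precomputed once into a table and then applied to bulk data as lookups instead of arithmetic. In pLUTo the gather step happens inside the DRAM subarray; this sketch only mimics it in software, and the chosen activation function is illustrative.

```python
import numpy as np

def build_lut(fn, in_bits=8):
    """Precompute fn over the entire small input domain."""
    domain = np.arange(2 ** in_bits, dtype=np.uint8)
    return fn(domain.astype(np.float32)).astype(np.uint8)

# An arbitrarily complex activation, folded into a 256-entry table.
sigmoid_lut = build_lut(lambda x: 255.0 / (1.0 + np.exp(-(x - 128.0) / 16.0)))

data = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)
out = sigmoid_lut[data]   # in pLUTo, this bulk gather is performed in the subarray
```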
PRIME2 and ISAAC22 are two promising neural network accelerators based on ReRAM. These proposals leverage the ReRAM crossbar array to perform matrix-vector multiplication efficiently in the analog domain. These solutions report performance and energy consumption improvements of up to 2,360 and 895 times, respectively, relative to state-of-the-art neural-processing unit designs.
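The crossbar principle can be summarized with an idealized model: weights are programmed as cell conductances G, inputs are applied as word-line voltages V, and each bitline current is the analog sum of per-cell products, so a matrix-vector product emerges from Kirchhoff's current law. The sketch below ignores non-idealities such as wire resistance, device variation, and ADC quantization.

```python
import numpy as np

rows, cols = 128, 64
G = np.random.uniform(1e-6, 1e-4, size=(rows, cols))   # cell conductances (siemens)
V = np.random.uniform(0.0, 0.3, size=rows)             # word-line voltages (volts)

I = G.T @ V   # bitline currents: one analog "MAC" per cell, summed on each column
```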
Enabling the adoption of PiM. Because quantized neural networks rely on simple operations, they are a prime target for exploiting the use of PiM to process AI kernels at the edge. Performing more straightforward computation close to where data resides greatly reduces the overall movement of data, which improves latency, throughput, and energy efficiency. PiM architectures can likely be applied even more efficiently to quantized and binary CNNs, which rely on XNOR and bit-count operators.
To validate the performance and energy merits of PiM, we present a quantitative analysis of the improvements to CNN image inference at the edge offered by Ambit, a DRAM-based PuM architecture that supports bitwise logic operations.
To PiM or Not to PiM: A Quantitative Analysis
This section evaluates the accuracy, performance, and energy costs of performing inference with binary and quantized versions of the neural networks introduced in Table 1. To this end, Table 2 shows the energy consumption of Ambit-based PiM designs performing inference on each of these networks. These estimates account for the cost of the dominant MAC operations required by the convolutions. Energy is calculated both for execution on ARM CPUs at the edge (the baseline) and for the PiM technology, using an analytical model. (For full access to the original references, see https://github.com/joaodinissf/qcnn-accuracy/.)
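The analytical model is, to a first order, a product of operation counts and per-operation energies. The sketch below shows the structure of such an estimate; the per-MAC energies and the MAC count are placeholders for illustration, not the measured values behind Table 2.

```python
# First-order energy estimate: total energy ~ (#MACs) x (energy per MAC).
ENERGY_PER_MAC_PJ = {
    "edge_cpu_32bit": 100.0,   # hypothetical pJ per 32-bit MAC on an edge CPU
    "pim_1bit": 1.0,           # hypothetical pJ per binary MAC performed in DRAM
}

def inference_energy_mj(num_macs: float, substrate: str) -> float:
    """Energy per inference in millijoules."""
    return num_macs * ENERGY_PER_MAC_PJ[substrate] * 1e-9   # pJ -> mJ

NUM_MACS = 1.0e9   # substitute the MAC count of a network from Table 1
for substrate in ENERGY_PER_MAC_PJ:
    print(substrate, f"{inference_energy_mj(NUM_MACS, substrate):.1f} mJ")
```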
Table 2. Energy consumed per inference for the networks in Table 1.
These results show that different accuracy-performance-energy trade-offs are possible. Table 3 analyzes the performance of the four most recent neural networks from Table 1 at 3-, 2-, and 1-bit precision, reporting the supported inference throughput (in FPS) of the Qualcomm Snapdragon 865, Intel Xeon Gold 6154, and Edge TPU baselines and of an Ambit-based PiM architecture against 15FPS and 60FPS real-time targets. Cells with a ‘-‘ indicate that real-time throughput is not met; cells with a ‘+’ indicate that it is. Empirical observations show that these quantized models ensure adequate accuracy for many applications. Ingenious quantization techniques enable further compression of even the smallest neural network models and make very large networks portable to power-constrained devices, with small, tunable accuracy losses.
Table 3. Supported inference throughput for three baselines and PiM.
Accuracy. The average accuracy loss is 6.8%, and in the best case the loss is as low as 2.2%. In cases where the accuracy of the 1-bit network may result in too great of an accuracy loss, settling for 2- or 3-bit quantization often yields good accuracy values.
Performance. The described Ambit-based accelerator achieves 15FPS in eight cases out of 12, and 60FPS for the other four cases. In contrast, the Qualcomm Snapdragon, Intel Xeon, and Edge TPU baselines sustain 15FPS in, respectively, zero, three, and nine cases out of 12, and 60FPS in zero, zero, and nine cases out of 12. The average speedup of the Ambit-based accelerator over each of these baselines is 58.8, 21.3, and 4.6 times.
Two key conclusions can be drawn. First, the Edge TPU can sustain a processing rate of only 3.4FPS for VGG-16, because of this network’s high number of parameters. This illustrates the poor scalability of current specialized neural network accelerators when processing very large networks. Second, the flexible degree of parallelism in the PiM implementation, attained by operating multiple subarrays in parallel, allows the amount of in-memory compute to scale quasi-linearly with the size of the network, which keeps inference time nearly constant even for the largest networks.
Energy. Average energy savings of 35.4 times were observed for 1-bit, relative to 32-bit precision. Energy gains are linearly proportional to the degree of quantization.
Open Challenges
As the volume of data to process approaches the installed processing capacity, the successful implementation of AI at the edge will depend on the development of optimized architectures that are able to perform AI tasks while meeting strict performance and energy-efficiency requirements. After the introduction of 1,000-core processors and the expansion of parallelism in compute units, computer architects and software developers must now turn to memory to design the next generation of high-performance and highly efficient systems. While PiM architectures have shown promise,16 enabling their adoption at the edge layer will require answers to the following open questions for manufacturers designing PiM:
- Software/hardware co-design. The integration of PiM technology and its compilers with new firmware and operating systems should be made straightforward, so that edge systems can be designed for maximal performance and energy efficiency and meet the computational demands of AI applications and use cases. PiM can be specialized for the edge by supporting common image-processing operations, which are especially relevant for performance.
- Low design complexity and cost. PiM designs should be made sufficiently low-cost to entice hardware manufacturers to integrate them into their products. This will entail the development of early-access commercial prototypes and proof-of-concept products that encourage early adopters to disseminate the technology and offset R&D costs. PiM technology is reaching a point of maturity, and it will soon make its way to self-driving cars, healthcare, digital agriculture, and other edge AI applications with massive total addressable markets.
- Multitiered PiM architectures. As PiM technology matures, it will populate multiple levels of the existing memory hierarchy with complementary processing components, each with its own benefits and drawbacks.3,12 For example, processing-in-cache architectures have been proposed, which trade DRAM’s large capacity for greater speed and support for a wider range of operations. Analogously, processing-in-storage (typically oriented toward nonvolatile memories) is well suited for the execution of simpler operations on vast volumes of data, with very high throughput.
- The compiler. PiM substrates will require custom compilers; these compilers can be a tool to aid software developers in the identification of common memory access and computational patterns, in order to automate certain steps of the compilation process to yield maximal performance (for example, the adoption of ideal data mapping or the circumvention of bottlenecking memory accesses).
PiM software development for end users includes:
- Algorithm design. Mapping existing and emerging applications to PiM substrates requires the algorithm to exploit data quantization, exposed parallelism, and the coalescing of memory accesses to offset the high-cost operations in the system, which are especially costly at the edge. The programmer is responsible for achieving the desired trade-off between performance, energy efficiency, and accuracy to satisfy the requirements of the target AI application.
- The development framework. A low-cost approach to empower the compiler with knowledge about the application’s data flow is to create an intuitive and expressive API that abstracts as many of the PiM substrate’s high-efficiency operations as possible in an easy-to-use way. It is tempting to draw an analogy between the current state of PiM and the early days of general-purpose computing on graphics processing units (GPGPU): CUDA and OpenCL were instrumental in enabling the mass-scale adoption of GPUs, and a similar API must assume that role for PiM.
- Benchmarking tools. Standard benchmarking, profiling, simulation, and analysis tools enable the comparison of different architectures and are therefore essential for the R&D stage of PiM. This is especially important for emerging applications, as is the case for many edge AI workloads. Like the role that MLPerf (the result of a collaboration of a consortium of AI leaders from academia, research labs, and industry) plays in the development of machine-learning algorithms, a set of standardized tests would allow the development of PiM to remain steadfast and open.
What the Future Will Bring and the Role of PiM
Demand for AI at the edge will continue to grow. High-performance and energy-efficient PiM-based edge architectures for the processing of AI should have the following qualities:
- Support quantized data structures. As shown exhaustively in the literature, particularly for inference, the use of reduced-bit-width representations yields only small losses in accuracy. This is particularly useful for reducing the movement of data within/from/to the memory subsystem and for increasing the degree of vectorization and parallelism.
- Support specialized instructions. Bitwise and LUT (lookup table)-based operations can be implemented in the memory substrate as row-level operations or in the logic layer, with sizable benefits for running AI kernels with low hardware complexity, high bandwidth, and high parallelism.
- Avoid the unnecessary movement of data. The compiler should be able to detect what part of the AI kernel should run on the processor and what part should run in memory. Not only are different portions of the kernel more suitable for distinct subsystems, but it is also fundamental to balance the workload among them to maximize performance. The minimization of energy consumption is another target that the computer architecture community should aim for.
- Foster a PiM ecosystem. Achieving critical mass in the adoption of PiM is a crucial milestone for its success, and reaching it requires a programmer-friendly interface, intuitive compilers, and comprehensive test suites, with a set of industry-standard benchmarks.
Where do we go from here? Several PiM architectures have demonstrated the ability to perform AI tasks with unprecedented levels of efficiency by delivering the qualities outlined above. Existing PiM designs support a limited range of operations, however, and further work is necessary to meet the requirements of AI tasks.
PiM is not a silver bullet, and it will not supersede conventional computing—although it may soon become evident that computers and other smart devices benefit from incorporating both conventional processing units and PiM-enabled memories. When designing new architectures, computer architects must remain mindful that data movement is expensive in both latency and energy. Emerging technologies and architectures enable the mitigation of data movement and, as such, pave a path for the design of more efficient computing devices. The current paradigm will take us only so far; PiM presents a compelling alternative.
Acknowledgments. This work was partially supported by Instituto de Telecomunicações and Fundação para a Ciência e a Tecnologia, Portugal, under grants EXPL/EEI-HAC/1511/2021, PTDC/EEIHAC/30485/2017 and UIDB/EEA/50008/2020.