Research Highlights
Artificial Intelligence and Machine Learning

Neural Architecture Search as Program Transformation Exploration

In this work, we express neural architecture operations as program transformations whose legality depends on a notion of representational capacity.


Abstract

Improving the performance of deep neural networks (DNNs) is important to both the compiler and neural architecture search (NAS) communities. Compilers apply program transformations in order to exploit hardware parallelism and memory hierarchy. However, legality concerns mean they fail to exploit the natural robustness of neural networks. In contrast, NAS techniques mutate networks by operations such as the grouping or bottlenecking of convolutions, exploiting the resilience of DNNs. In this work, we express such neural architecture operations as program transformations whose legality depends on a notion of representational capacity. This allows them to be combined with existing transformations into a unified optimization framework. This unification allows us to express existing NAS operations as combinations of simpler transformations. Crucially, it allows us to generate and explore new tensor convolutions. We prototyped the combined framework in TVM and were able to find optimizations across different DNNs that significantly reduce inference time: over 3× in the majority of cases. Furthermore, our scheme dramatically reduces NAS search time.

 

1. Introduction

Deep neural networks (DNNs)12 are everywhere, and there is a growing need to implement them efficiently.1,4,10 This has led to an explosion in research, from applications8 to hardware.2

Currently, there are two distinct communities optimizing DNNs for commodity devices. In one, neural architecture search (NAS) researchers explore different network models, trading off size and accuracy. In the other, compiler developers take the resulting networks and explore optimizations to deliver hardware performance. We argue that this two-stage deployment process is artificial and can be unified. This paper shows that program and neural architecture transformations can be interleaved, delivering significant performance improvements and greater expressivity.

Current Limitations.

NAS researchers assume the compiler is a black box bundled with the hardware, while compiler writers assume that the network architecture is set in stone. NAS researchers can discover good networks but are limited to a set of pre-implemented operations; compiler writers can efficiently exploit hardware structure but miss larger scale optimization opportunities.

NAS designs are not guaranteed to be correct and have to be separately evaluated through either a full retraining process or on a smaller proxy task. This training process severely limits the search space and can render large scale searches intractable.

What we want is the best of both worlds. We wish to combine neural architecture and compiler optimization in a unified framework. In this paper, we recast neural architecture search as program transformation exploration. By including transformations such as grouping and bottlenecking into the compiler optimization search space, we not only leverage the extensive history of program transformation research, but also discover new forms of neural architecture reduction that would not have been available to us otherwise.

Our approach.

Program transformations are necessarily restricted as they must be safe. Our solution is to unlock the space of neural transformations by introducing a new safety metric based on Fisher Potential. It is a compile-time, cheap-to-compute metric that is able to reject transformations that would incur accuracy losses, eliminating the need to train while searching. This unification allows the exploration of a space that leads to new operators with efficient implementations. It also shows that neural architecture options that previously required the engineering efforts of experts to develop, such as spatial bottlenecking, can be expressed as compositions of more fundamental transformations and discovered automatically.

2. Overview

2.1 Models and Code

The fundamental building block of a DNN is the tensor convolution, a generalization of a basic convolution.

Basic Convolution.

Consider the first row in Figure 1. This shows a basic convolution where each element in the 2D output matrix is the weighted sum of the corresponding and neighboring elements of the 2D input matrix. The weights are stored in a small 2D filter or weight matrix W whose size corresponds to the size of the neighborhood. This is shown both diagrammatically on the left and as code on the right.

Figure 1.  The relation between neural architecture components, code and transformations. The first row shows a basic convolution. The second shows a tensor convolution. In the third row, we see a program transformation: loop interchange. The fourth row shows a neural architecture transformation: bottlenecking where the number of weights is divided by B.
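For concreteness, the following minimal Python sketch spells out the basic convolution just described (stride one, no padding; the array names and shapes are illustrative assumptions, not the code from Figure 1):

import numpy as np

def basic_conv(inp, w):
    # Each output element is the weighted sum of a Kh x Kw neighborhood
    # of the input, using the small 2D weight matrix w.
    H, W = inp.shape
    Kh, Kw = w.shape
    out = np.zeros((H - Kh + 1, W - Kw + 1))
    for h in range(out.shape[0]):
        for wi in range(out.shape[1]):
            for kh in range(Kh):
                for kw in range(Kw):
                    out[h, wi] += w[kh, kw] * inp[h + kh, wi + kw]
    return out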

Tensor Convolution.

A tensor convolution is a generalization of the basic convolution, as shown in row 2 of Figure 1. The input is now a 3D tensor with a new dimension, Ci, referred to as the input channels. Similarly, the output is expanded by the number of output channels (Co). The weight matrix is now a 4D tensor of Co×Ci weight matrices, each with two spatial dimensions (height, width). Two new loops scan the channels, the outermost, Co, shown by an arrow, while each of the tensors has the appropriate extra dimensions.
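As a sketch of the corresponding loop nest, the Python below adds the two channel loops on top of the basic convolution (again stride one and no padding; this mirrors the structure of Figure 1 row 2 rather than reproducing its exact code):

import numpy as np

def tensor_conv(inp, w):
    # inp: (Ci, H, W); w: (Co, Ci, Kh, Kw); returns (Co, OH, OW).
    Ci, H, W = inp.shape
    Co, _, Kh, Kw = w.shape
    out = np.zeros((Co, H - Kh + 1, W - Kw + 1))
    for co in range(Co):                  # outermost loop: output channels
        for ci in range(Ci):              # input channels
            for h in range(out.shape[1]):
                for wi in range(out.shape[2]):
                    for kh in range(Kh):
                        for kw in range(Kw):
                            out[co, h, wi] += w[co, ci, kh, kw] * inp[ci, h + kh, wi + kw]
    return out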

2.2 Models and Code Transformations

Code Transformation.

Loop interchange changes the order of computation and memory access without affecting the values computed. In row 2 of Figure 1, interchanging the outer loop iterators gives the code in row 3 with the Co and Ci loops interchanged. We can represent this using polyhedral notation, [Co, Ci] → [Ci, Co], or TVM-like syntax: .interchange(CI, CO). This is shown diagrammatically with the arrow denoting the new outermost, Ci, loop order. From a neural architecture perspective nothing has changed; the number of computations and memory accesses remains the same, as do the values of the output matrix. From a code perspective, the memory access behavior is significantly altered, affecting execution time.

Model Transformation.

A popular choice in NAS for reducing convolutional complexity is bottlenecking.13 For a convolution with Co filters, we choose a bottlenecking factor B. This reduces the size of the weight and output tensors and reduces the range of the outermost loop iterator by a factor B. Again, this can be represented as [Co] → [Co < Co/B] or .bottleneck(B). This is shown in row 4 of Figure 1. From a NAS point of view, this is a standard operation reducing complexity with a minimal effect on representational capacity. From a program transformation point of view, it is illegal, as the computed values are changed.
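As an illustrative sketch of the effect (assuming the tensor_conv layout above and that B divides Co), bottlenecking simply shrinks the weight and output tensors along with the range of the outermost loop:

import numpy as np

def bottlenecked_conv(inp, w, B):
    # Keep only the first Co/B filters: the outermost loop range, the weight
    # tensor, and the output tensor all shrink by a factor of B.
    Ci, H, W = inp.shape
    Co, _, Kh, Kw = w.shape
    assert Co % B == 0
    w = w[: Co // B]
    out = np.zeros((Co // B, H - Kh + 1, W - Kw + 1))
    for co in range(Co // B):             # reduced outermost range
        for ci in range(Ci):
            for h in range(out.shape[1]):
                for wi in range(out.shape[2]):
                    for kh in range(Kh):
                        for kw in range(Kw):
                            out[co, h, wi] += w[co, ci, kh, kw] * inp[ci, h + kh, wi + kw]
    return out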

2.3 Combined Space

If we consider swapping a full convolution for a reduced convolution as a program transformation, we can combine them. For example, we could apply loop interchange to the code in Figure 1 row 4 ([Co < Co/B, Ci] → [Ci, Co < Co/B]). With Ci now the outermost iterator, we can reapply bottlenecking, giving input channel bottlenecking. Such an optimization is both semantically invalid and unavailable in existing neural architecture search spaces. However, given the robustness of neural networks to noise, in some cases it may preserve representational capacity just as well as output channel bottlenecking. It is only through the lens of program transformations over loop nests that we are able to automatically access such operators.
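A sketch of the resulting operator (our illustrative Python, assuming the same layout as the sketches above): after interchange, the now-outermost ci loop is bottlenecked, so only the first Ci/B input channels contribute to every output channel:

import numpy as np

def input_channel_bottlenecked_conv(inp, w, B):
    # Interchange makes ci the outermost loop; bottlenecking then reduces
    # its range, so only Ci/B input channels are read.
    Ci, H, W = inp.shape
    Co, _, Kh, Kw = w.shape
    assert Ci % B == 0
    out = np.zeros((Co, H - Kh + 1, W - Kw + 1))
    for ci in range(Ci // B):             # bottlenecked, now-outermost loop
        for co in range(Co):
            for h in range(out.shape[1]):
                for wi in range(out.shape[2]):
                    for kh in range(Kh):
                        for kw in range(Kw):
                            out[co, h, wi] += w[co, ci, kh, kw] * inp[ci, h + kh, wi + kw]
    return out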

3. Neural Architecture Search

Starting from an overall network skeleton, NAS attempts to design one or more cells that slot into different locations within the skeleton. Each cell is described as a DAG whose nodes are intermediate feature maps (or tensors) and whose edges represent possible operations (for example, convolution or tensor product). The task is then to find the best DAG (or cell) that can be slotted into the skeleton and trained on a given dataset. Recent work takes cells as predefined options and attempts to select where best to place each of the cells in the skeleton14,16 to match specific constraints.

3.1 Neural Architecture Components

The convolution operation dominates computational complexity and is the primary component of interest in NAS.5,18

Standard Convolution.

Figure 1 row 2 shows the standard convolution operation, in which an input volume of size Hi×Wi×Ci is convolved with a set of Co filters of size Kh×Kw×Ci, each producing an individual feature map in the output. The convolution operation is

\[
\forall c_o, w_o, h_o:\quad O(c_o, w_o, h_o) \;=\; \sum_{c_i}^{C_i} \sum_{k_h}^{K_h} \sum_{k_w}^{K_w} I(c_i,\, w_o + k_w,\, h_o + k_h) \cdot K(c_o, c_i, k_w, k_h).
\]

Bottlenecked Convolution.

A popular choice for reducing convolutional complexity is bottlenecking7 as shown in Figure 1 row 4. A bottlenecking factor B is selected, reducing the number of weights/filters to Co/B. The bottlenecked convolution operation is

\[
\forall c_o < \tfrac{C_o}{B},\; w_o, h_o:\quad O(c_o, w_o, h_o) \;=\; \sum_{c_i}^{C_i} \sum_{k_h}^{K_h} \sum_{k_w}^{K_w} I(c_i,\, w_o + k_w,\, h_o + k_h) \cdot K(c_o, c_i, k_w, k_h).
\]

 

Grouped Convolution.

Here, the Ci-channel input is split along the channel dimension into G groups, each of which has Ci/G channels. Each group is independently convolved with its input split, producing Co/G channels which are concatenated along the channel dimension. Let O = [O1; …; OG], i.e., each slice Og, g ∈ 1..G, is concatenated to form O. Ig is similarly defined. Group convolution is then as follows:

\[
\forall g,\; c_o < \tfrac{C_o}{G},\; w_o, h_o:\quad O_g(c_o, w_o, h_o) \;=\; \sum_{c_i}^{C_i/G} \sum_{k_h}^{K_h} \sum_{k_w}^{K_w} I_g(c_i,\, w_o + k_w,\, h_o + k_h) \cdot K_g(c_o, c_i, k_w, k_h).
\]

This reduces the number of basic convolutions from Co×Ci to G × Co/G × Ci/G = (Co×Ci)/G, reducing the number of operations used by a factor of G. Note that we can think of standard convolution as grouped convolution with G=1.
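The following Python sketch makes the slicing explicit (an illustration under the assumption that G divides both Ci and Co, with a weight tensor of shape (Co, Ci/G, Kh, Kw)):

import numpy as np

def grouped_conv(inp, w, G):
    # Split the Ci input channels into G groups, convolve each group with
    # its own Co/G filters, and concatenate the results along the channel axis.
    Ci, H, W = inp.shape
    Co, Cig, Kh, Kw = w.shape
    assert Ci % G == 0 and Co % G == 0 and Cig == Ci // G
    out = np.zeros((Co, H - Kh + 1, W - Kw + 1))
    for g in range(G):
        inp_g = inp[g * (Ci // G):(g + 1) * (Ci // G)]   # input slice I_g
        for co in range(g * (Co // G), (g + 1) * (Co // G)):
            for ci in range(Ci // G):
                for h in range(out.shape[1]):
                    for wi in range(out.shape[2]):
                        for kh in range(Kh):
                            for kw in range(Kw):
                                out[co, h, wi] += w[co, ci, kh, kw] * inp_g[ci, h + kh, wi + kw]
    return out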

Depthwise Convolution.

Notice that if the number of groups is equal to the number of input channels and the number of output channels, G=Ci=Co, then there is a single, 2D convolutional filter for each input channel Ci. This is known as depthwise convolution and is a special case of group convolution.

3.2 NAS Search Space Example

Typically, the NAS heuristic finds the best cell to fill in the skeleton by choosing appropriate operations. Figure 2 gives an example cell from NAS-Bench-201.5 The skeleton is ResNet-like with 5 cells in series. Each cell contains 4 interconnected nodes (A, B, C, D). Between nodes, downsampling takes place where the spatial dimensions are halved in size while doubling the channel depth. This restricts the types of operations available on each edge. This gives a total of 15625 possible cells, which captures most of the available options within cell-based NAS techniques, each of which the authors train exhaustively on various datasets.

Figure 2.  NAS. Edges represent operations that transform intermediate states taken from a list of options specified by the designer.

Rather than designing cells by choosing which of a predetermined list of options is best, we instead wish to choose a sequence of transformations from an original model which will allow us to step through the cell design space. We describe the program transformations we use in the next section.

4. Program Transformations

Due to the restricted, static, convex, and affine nature of tensor convolutions, it is natural to describe program transformations of them in the polyhedral model, which consists of three main components:

The domain is a collection of the possible statement instances that occur within the iteration space of a set of loop bounds. We can represent each statement instance with a multidimensional co-ordinate corresponding to the iterator values of the loops that surround it.

The set of accesses comprises affine mappings of the iteration space to memory. Two statement instances have a dependence ordering between them if they access the same memory location and at least one of them is a write.

The schedule assigns a timestamp to each statement instance, dictating the order in which they are executed. Different schedules for the same program represent possible program transformations.

To illustrate this, consider the implementation of the 1×1 convolution at the start of a residual block listed in Algorithm 1. We can describe the domain as follows:

\[
\begin{aligned}
S_1 &: \{\, (c_o, h, w) \mid 0 \le c_o < C_o \,\wedge\, 0 \le h < H \,\wedge\, 0 \le w < W \,\}\\
S_2 &: \{\, (c_o, h, w, c_i) \mid 0 \le c_o < C_o \,\wedge\, 0 \le h < H \,\wedge\, 0 \le w < W \,\wedge\, 0 \le c_i < C_i \,\}
\end{aligned}
\]

We can also describe the schedule as follows:

\[
\begin{aligned}
T_{S_1}(c_o, h, w) &= (c_o, h, w)\\
T_{S_2}(c_o, h, w, c_i) &= (c_o, h, w, c_i)
\end{aligned}
\]

We now briefly discuss the transformations we consider.

Algorithm 1. 
Naive implementation of 1×1 tensor convolution.
for (co = 0; co < Co; co++)
  for (h = 0; h < H; h++)
    for (w = 0; w < W; w++) {
S1    O[co][h][w] = 0.;
      for (ci = 0; ci < Ci; ci++)
S2      O[co][h][w] +=
          W[co][ci][0][0] *  /* single spatial tap of the 1x1 kernel */
          I[ci][h][w];
    }

Loop Interchange.

Interchanging two nested loops involves applying a permutation to the schedule of the enclosed statements. For example, for statement S1 of Algorithm 1, interchanging the h and w loops can be expressed simply as:

\[
T_{S_1}(c_o, h, w) = (c_o, w, h)
\]

Strip-Mining.

This is performed by mapping an iterator into two new iterators whose combined range is that of the original. A constant strip-mining factor is selected, which forms the range of the new inner loop. The outer loop range is that of the original divided by the strip-mining factor. For example, to strip-mine the inner ci loop of Algorithm 1 by a factor of 32, we have

\[
T_{S_2}(c_o, h, w, c_i) = \big(c_o,\, w,\, h,\, \lfloor c_i / 32 \rfloor,\, c_i \bmod 32\big)
\]
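As an illustrative sketch (our Python rendering of Algorithm 1 rather than its C form, with the 1×1 weights stored as a 2D array), strip-mining replaces the single ci loop with an outer loop over strips and an inner loop within each strip:

import numpy as np

def conv1x1_strip_mined(I, W, factor=32):
    # 1x1 convolution with the ci loop strip-mined: ci is recovered from an
    # outer iterator over Ci/factor strips and an inner iterator within the strip.
    Ci, H, Wd = I.shape
    Co = W.shape[0]
    assert Ci % factor == 0
    O = np.zeros((Co, H, Wd))
    for co in range(Co):
        for h in range(H):
            for w in range(Wd):
                for ci_o in range(Ci // factor):     # outer: Ci / factor iterations
                    for ci_i in range(factor):       # inner: the strip
                        ci = ci_o * factor + ci_i    # original iterator
                        O[co, h, w] += W[co, ci] * I[ci, h, w]
    return O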

 

Tiling.

Tiling is a combined transformation consisting of strip-mining followed by interchange. For further detail, we point the reader to the excellent polyhedral literature.6,15

4.1 Legality

Legality in the polyhedral framework is determined by preservation of data dependences. If there is a data dependence between two dynamic statement instances under the original program schedule, the relative ordering between these statements must be preserved under the new transformed schedule. Given a linear transformed schedule of the iteration space I in matrix form T and a dependence matrix D, a transformation is legal iff:

\[
\forall\, i, j,\, S_1, S_2:\quad D_{ij} \in d_{S_1,S_2} \;\Longrightarrow\; T(i) \prec T(j)
\]

which is to say that the lexicographical ordering of the iteration instances is preserved.
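A small illustrative check of this condition (a simplification assuming uniform dependences expressed as constant distance vectors, not the general polyhedral test):

import numpy as np

def lex_positive(v):
    # A vector is lexicographically positive if its first non-zero entry is > 0.
    for x in v:
        if x != 0:
            return x > 0
    return False

def schedule_is_legal(T, distance_vectors):
    # A linear schedule transformation T is legal if every dependence distance
    # vector remains lexicographically positive after transformation, i.e.
    # dependent instances keep their relative order.
    return all(lex_positive(np.asarray(T) @ np.asarray(d)) for d in distance_vectors)

# Example: interchanging two loops is illegal for a dependence with
# distance (1, -1), since the transformed distance (-1, 1) is negative.
interchange = [[0, 1], [1, 0]]
print(schedule_is_legal(interchange, [(1, -1)]))   # False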

5. Unified Space

Our approach to joining NAS and compiler optimizations is to describe convolutional alternatives as polyhedral program transformations and develop a novel legality criterion.

5.1 Extending the Polyhedral Model

Bottlenecking.

The bottleneck transformation is a reduction of the outer iterator in the domain node by a parameterized constant factor B. Let co be the outermost iterator and J = [ci, h, w, kh, kw] the enclosed inner iterators, so that [co, J] spans the domain of the tensor loop nest; then bottlenecking can be expressed as

\[
T_S(c_o, J) = (c_o, J) \;\big|\; c_o < C_o / B
\]

The value B is constrained such that Co mod B = 0. This gives a factor B reduction in computation. Figure 1 row 4 shows an example of bottlenecking.

Grouping.

Grouping can be thought of as tiling the two outer iterators by a common factor and then discarding one of the iterators. The original iterator domains must both be divisible by the grouping factor G. Let J = [h, w, kh, kw], so that [co, ci, J] spans the domain; then we can define grouping as

\[
T_S(c_o, c_i, J) = \big(g,\; c_o / G,\; c_i / G,\; J\big)
\]

to give the code in Algorithm 2. Note that, as Co, Ci, and G are compile-time constants, each of the loop bounds is affine. Examining the code, we see that each slice g of the output array refers only to the corresponding slices of the weight and input arrays.

Algorithm 2. 
Grouping transformation of tensor convolution.
for (g=0; g<G; g++)
  for (co=Co/G*g; co<Co/G*(g+1); co++)
    for (ci=Ci/G*g; ci<Ci/G*(g+1); ci++)
      for (oh=0; oh<OH; oh++)
        for (ow=0; ow<OW; ow++)
          for (kh=0; kh<KH; kh++)
            for (kw=0; kw<KW; kw++)
              O[co][oh][ow] +=
                        W[co][ci][kh][kw] *
                        I[ci][oh+kh][ow+kw];
Algorithm 3. 
Depthwise transformation of tensor convolution.
for (g=0; g<Co; g++)
  for (oh=0; oh<OH; oh++)
    for (ow=0; ow<OW; ow++)
      for (kh=0; kh<KH; kh++)
        for (kw=0; kw<KW; kw++)
          O[g][oh][ow] +=
                    W[g][kh][kw] *
                    I[g][oh+kh][ow+kw];

Depthwise.

Depthwise convolutions can be considered a special case of group convolutions where the number of groups equals the number of output channels, G = Co. For this transformation to be possible, the number of input and output channels must be equal, Co = Ci.

This means the two inner loops will have trip counts of 1, as Co/G = Ci/G = 1, and

\[
T_S(c_o, c_i, J) = (g, 1, 1, J)
\]

which can be trivially simplified to

\[
T_S(c_o, c_i, J) = (g, J)
\]

5.2 Fisher Potential as a Legality Check

The transformations described above fail to preserve traditional program semantics. We instead state that a schedule transformation for a neural network with respect to a task is legal if the final classification accuracy on the held out data is the same, or similar to within a small δ. Training all possible transformed networks to determine accuracy, however, would be prohibitively expensive.

To address this, we employ Fisher Potential14 as a pre-training legality check. It is a cheap-to-compute measure that is effective at rejecting any architectures whose final test accuracy is significantly below the original starting point, without needing to perform training. Informally, Fisher Potential is, locally, the total information that each loop nest (layer) contains about class labels under a simplifying assumption of conditional independence.

Formal Definition.

For a single channel, c, of a convolutional filter in a network, consider that for some input minibatch of N examples, its outputs form an N×W×H tensor, where W and H are the channel's spatial width and height. We refer to this tensor as the activation A of this channel and denote the entry corresponding to example n in the minibatch at location (i, j) as Anij. As the network has a loss function ℒ, we can get the gradient of the loss with respect to this activation channel A. Let us denote this gradient as g and index it as gnij. The channel c error, Δc, can then be computed by

\[
\Delta_c = \frac{1}{2N} \sum_{n}^{N} \left( \sum_{i}^{W} \sum_{j}^{H} A_{nij}\, g_{nij} \right)^{2}.
\]

This gives us a filter-wise score for a particular channel. In order to gauge the sensitivity of the full convolution, we sum Δc over each channel (as in Turner et al.14):

\[
\Delta_l = \sum_{c_o}^{C_o} \Delta_{c_o}
\]

This score is summed for each of the convolutional blocks in the network. For an original network and a proposed alternative architecture, we reject the proposal if its score is below that of the original.

To summarize, the Fisher Potential of a proposed network architecture is the sum of the Fisher Information for each layer when given a single random minibatch of training data, as in Turner et al.14
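A minimal Python sketch of this computation, assuming the per-layer activations and gradients for a single minibatch have already been collected (for example, via framework hooks); the function and variable names are illustrative:

import numpy as np

def channel_delta(A, g):
    # A, g: (N, W, H) activations and loss gradients for one channel.
    # delta_c = 1/(2N) * sum_n ( sum_{i,j} A_nij * g_nij )^2
    N = A.shape[0]
    per_example = (A * g).reshape(N, -1).sum(axis=1)
    return (per_example ** 2).sum() / (2.0 * N)

def layer_delta(acts, grads):
    # Sum the channel scores over the Co channels of one convolution.
    # acts, grads: (N, Co, W, H)
    return sum(channel_delta(acts[:, c], grads[:, c]) for c in range(acts.shape[1]))

def fisher_potential(layers):
    # layers: iterable of (acts, grads) pairs, one per convolutional block.
    return sum(layer_delta(a, g) for a, g in layers)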

Example.

Let us consider candidate networks designed from NAS-Bench-201,5 as shown in Figure 3. Each point represents a different neural architecture, of which there are 15625. The y-axis shows final CIFAR-10 top-1 error (lower is better), and the x-axis shows the Fisher Potential assigned to the model at initialization (higher is better). We can see that, without requiring any training, Fisher Potential is able to filter out poorly performing architectures, visible in the cluster of low-scoring networks on the left with poor final errors. Many good networks are also discarded; this is unfortunate but acceptable for our scenario, since the space is large and densely populated with networks.

Figure 3.  Fisher Potential as a rejection filter for invalid architectures.

5.3 Expressive Power

Neural Architecture Search techniques rely on hand-designed exploration spaces; the operations within these spaces are themselves engineered by hand. For example, a recent paper found that bottlenecking could be applied in the spatial domain.11 This can now be added to a list of candidate operations. However, spatial bottlenecking is automatically captured within our framework as a combination of existing transformations. As an example, consider the convolution in row 2 of Figure 1. We use the shorthand notation [i] →B(b) [i(b)] to denote bottlenecking domain i by factor b, and [i,j] →int [j,i] to denote interchange. Then the spatial bottleneck operation can be constructed as the following sequence of transformations:

\[
\begin{aligned}
T_S:\; [C_o, C_i, H, W, K_h, K_w] &\xrightarrow{\text{int.}} [H, W, C_o, C_i, K_h, K_w]\\
&\xrightarrow{B(b)} [H(b), W, C_o, C_i, K_h, K_w]\\
&\xrightarrow{\text{int.}} [W, H(b), C_o, C_i, K_h, K_w]\\
&\xrightarrow{B(b)} [W(b), H(b), C_o, C_i, K_h, K_w]\\
&\xrightarrow{\text{int.}} [C_o, C_i, H(b), W(b), K_h, K_w]
\end{aligned}
\]
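To make the composed operator concrete, the sketch below shows the literal outcome of this sequence (our illustration, assuming the same layout as the earlier sketches): both spatial iteration domains are reduced by b, while the channel loops are untouched:

import numpy as np

def spatially_bottlenecked_conv(inp, w, b):
    # Both spatial domains are bottlenecked by b, so only an (OH/b) x (OW/b)
    # region of the output is computed; channel dimensions are unchanged.
    Ci, H, W = inp.shape
    Co, _, Kh, Kw = w.shape
    OH, OW = H - Kh + 1, W - Kw + 1
    out = np.zeros((Co, OH // b, OW // b))
    for co in range(Co):
        for ci in range(Ci):
            for h in range(OH // b):          # bottlenecked spatial range
                for wi in range(OW // b):
                    for kh in range(Kh):
                        for kw in range(Kw):
                            out[co, h, wi] += w[co, ci, kh, kw] * inp[ci, h + kh, wi + kw]
    return out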

The expressiveness of the polyhedral model, equipped with these new transformations, means that there are novel transformations that can be derived from this unified framework.

6. Implementation and Setup

We expressed our new transformations as polyhedral transformations and implemented the resulting operators in TVM.3

Transformation Space: An overview of the transformations used in this paper is given in Table 1. The first group are standard program transformations available within TVM. To these, we add our two neural architecture transformations. Finally, for GPU platforms, we enable GPU mapping transformations.

Table 1. 
TVM optimization operations
Optimization    Description

Program Transformations
  reorder       Interchange nested loops
  tile          Cache and register blocking
  unroll        Loop unrolling
  prefetch      Memory coalescing between threads
  split         Divide iteration into multiple axes
  fuse          Combine two axes into one

Neural Architecture Transformations
  bottleneck    Reduce domain by factor B
  group         Slice and offset two loops by factor G

Mapping to GPU
  blockIdx      Block-wise parallelism
  threadIdx     Threads within blocks
  vthread       Striding thread access

Baseline TVM: TVM allows users to implement schedules for each new operator by hand. Due to the large number of operators involved, for each device we use TVM’s default schedules; automatic schedule design19 has yet to be incorporated into TVM. We then enable auto-tuning of parameter values within the schedule to find best performance.

Search: Our current search process is relatively naive. We enumerate random sequences of transformations through TVM, and generate 1000 configurations of the resulting operations for each network. We then check which candidates pass our implementation of the Fisher Potential legality test and select the best performing one.
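In outline, the search loop looks like the following Python sketch; the callables it takes are placeholders standing in for our pipeline (sequence generation and application through TVM, the Fisher Potential check, and on-device timing), not a published API:

def search(network, candidate_sequences, fisher_potential, apply_through_tvm, measure_time):
    # Naive search: apply each candidate transformation sequence, reject any
    # whose Fisher Potential falls below that of the original network, and
    # keep the fastest surviving configuration.
    baseline = fisher_potential(network)
    best, best_time = None, float("inf")
    for seq in candidate_sequences:
        candidate = apply_through_tvm(network, seq)
        if fisher_potential(candidate) < baseline:
            continue                          # fails the legality check
        t = measure_time(candidate)
        if t < best_time:
            best, best_time = candidate, t
    return best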

Platforms: We evaluate on an ARM A57 mobile CPU (mCPU), and an Nvidia 128-Core Maxwell mobile GPU (mGPU) available on the Jetson Nano board. We also evaluate an Intel Core i7 (CPU) and an Nvidia 1080Ti (GPU).

Models: We evaluate our methodology on three popular networks: ResNet-34,7 ResNeXt-29,17 and DenseNet-161.9 These networks were chosen to represent a range of convolutional architectures, from standard 3×3 convolutions in ResNet-34 to grouped convolutions in ResNeXt and a heavy reliance on 1×1 convolutions in DenseNet.

Comparison: The models are implemented with each operation written as a TVM Tensor Expression,3 which is an einsum-style syntax for expressing tensor computations. This is lowered to TVM IR, where our transformations can be employed. This allows for a fair comparison of each approach.
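For illustration, a small Tensor Expression example in Python: a 1×1 convolution defined with te.compute and manipulated with two schedule primitives. This assumes a TVM release in which the te schedule API (create_schedule, split, reorder) is available; it is an illustrative sketch, not the schedules used in our experiments:

import tvm
from tvm import te

CI, CO, H, W = 64, 64, 56, 56                      # illustrative sizes

I = te.placeholder((CI, H, W), name="I")
Wt = te.placeholder((CO, CI), name="Wt")           # 1x1 convolution weights
ci = te.reduce_axis((0, CI), name="ci")
O = te.compute(
    (CO, H, W),
    lambda co, h, w: te.sum(I[ci, h, w] * Wt[co, ci], axis=ci),
    name="O",
)

s = te.create_schedule(O.op)
co, h, w = O.op.axis
co_o, co_i = s[O].split(co, factor=8)              # strip-mine the co axis
s[O].reorder(h, co_o, w, co_i)                     # interchange loops
print(tvm.lower(s, [I, Wt, O], simple_mode=True))  # inspect the lowered IR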

For each network, we consider three approaches. First, we compile the model with TVM using its default schedule (labeled TVM). We report the best performance achieved after auto-tuning. Next, we use BlockSwap14 as NAS to compress the modifiable convolutions in the network, followed by compilation with TVM (labeled NAS). Finally, we apply our unified approach as described in the previous section (labeled Ours).

7. Results

7.1 CIFAR-10 Network Performance

The results for each network are shown in Figure 4. The combination of TVM and NAS is a strong baseline; however, for each of the models, our unified method is able to generate models with improved hardware performance. In general, there is more performance to gain on the mGPU than on the other platforms, as the relaxed memory pressure from smaller designs is of increasing importance there.

Figure 4.  End-to-end performance for several networks on different hardware devices on CIFAR-10.

ResNet-34: NAS is able to find a modest 1.12× speedup over TVM on the GPU platform, increasing to 2× on the i7 CPU, showing the power of compressed convolutions. Our unified approach is able to find further improvement, giving 2× and 3× speedups respectively. The improvement is more dramatic on the smaller platforms, with 5× and 10× speedups on the mCPU and mGPU.

ResNeXt-29-2x64d: NAS is unable to find any improvement here due to the already highly compact structure of the network: there are simply no NAS options that improve performance over the TVM baseline across all platforms. Despite this, our approach is able to find small improvements of 1.3× and 1.1× on the CPU and GPU platforms by combining NAS and program transformations. This increases to 1.4× on the embedded CPU and 7× on the mGPU.

DenseNet-161: The impact of NAS is much more varied here. While it can find a 2.2× improvement on the CPU, it has a negligible improvement over TVM on the GPU. It also struggles on the mCPU but finds 6× on the mGPU. Our approach is able to improve by over 3× for both server platforms, but only finds 1.2× on the mCPU while achieving 10× on the mGPU.

7.2 Analysis

Accuracy. For all CIFAR-10 networks pictured in Figure 4, changes in accuracy were less than 1% in absolute difference. The ResNet-34 in Figure 5 had an original ImageNet Top-1 and Top-5 accuracy of 73.2% and 91.4% respectively. The final network compiled with our transformations had a Top-1 and Top-5 accuracy of 73.4% and 91.4%; it was slightly more accurate than the original but considerably faster. To give a broader overview, we present the accuracy of several ImageNet models in Figure 7, which shows that for each model accuracy degradation is small or non-existent.

Figure 5.  Exploring different sequences of transformations for an individual layer of ResNet-34 on the Intel Core i7 CPU.

Size: One benefit of these neural transformations is their effect on model size, both in terms of weights and runtime memory usage. We found that CIFAR-10 networks could be compressed in size by 2–3×. Likewise, the ImageNet ResNet-34 in Figure 5 was compressed from 22M parameters to 9M without a loss in accuracy.

Search time: Our search process involves suggesting configurations and rejecting or accepting them based on Fisher Potential. Since Fisher Potential is extremely cheap to compute, the search time overhead introduced by our method is small, less than 5 minutes on a CPU. During this time we were able to explore 1000 different configurations, discarding approximately 90% of the candidate transformation sequences through the Fisher Potential legality check.

7.3 Transformation Sequence Case Studies

There were three particular transformation sequences that dominated the list of best-performing transformations. In neural architecture terms, the operation resulting from each of these sequences is a new convolution-like operator, previously unavailable to NAS.

Sequence 1 applies grouping to the kernels over the spatial domain of the input. These are then concatenated to form one output. The exact sequence is: [split → interchange → group → interchange → fuse].

Sequence 2 is an operator in which the output channels have been unrolled by a factor of 16 and the remaining iteration domain has then been grouped by a factor G=2. The exact sequence is: [unroll → group → interchange].

Sequence 3 is an operator devised by splitting up the iteration domain of the output channels and applying a different level of channel grouping to each new domain (e.g., G=2 on the first half, G=4 on the second half). The exact sequence is: [split → group → interchange → group].

Each of these sequences is applicable across all networks and may be more widely applicable as standard tensor convolutions for other networks.

7.4 Exploring Layer-Wise Optimizations

In Figure 5, we examine the impact of our sequences layer by layer on one network and one platform: ResNet-34 on the i7. The exact configurations mirror the experiment in the original TVM paper.3 While we achieve a 3× improvement overall, as shown in Figure 4, the speedup achieved varies considerably across layers. In 4 of the 11 layers, no performance improvement is found, as Fisher Potential marks these individual layers as extremely sensitive to compression. Simple grouping with a factor G=2 is able to give around 2× speedup across 7 of the 11 layers. Sequence 1 is able to give a small improvement in most cases, particularly layer 11, where spatial reduction yields new parallelization opportunities. Sequence 2 provides a slight improvement in many cases, and through the later layers it is able to offer larger speedups through improved data reuse. Sequence 3 is the best option for most early layers but suffers compared to the other sequences in the later layers.

7.5 Alternative NAS comparison

To provide an alternative evaluation against an existing NAS technique, we re-implement FBNet16 using the convolutional blocks available in our NAS space and our three baseline networks as the skeletons into which the selected cells are inserted. Figure 6 shows the performance of the resulting networks on CIFAR-10 relative to TVM, NAS, and our approach on the Intel Core i7 for each network. In each case, FBNet does provide a modest improvement over NAS. However, it is worth noting that this involves an expensive training step at each stage of evaluation (requiring 3 GPU days per network). Our approach is able to consistently improve over FBNet, with no training required.

Figure 6.  Intel Core i7 performance of FBNet on the three networks. FBNet is able to improve over NAS at a large training cost. Our approach outperforms FBNet with no training required.

7.6 ImageNet Network Performance

We also applied our technique to the ImageNet classification dataset to show that the method transfers beyond CIFAR-10. In Figure 7, we show the networks resulting from running our method on ResNet-18, ResNet-34, DenseNet-161, DenseNet-201, and DenseNet-169, with their inference times recorded on the Intel i7 CPU. For each network, accuracy is within 2%, and there are significant gains in inference time.

Figure 7.  Accuracy vs inference time (log scale) of our approach (Ours) compared to TVM (Original + TVM) when applied to different variants of ResNet and DenseNet on the ImageNet dataset. Our approach gives a significant reduction in inference time.

7.7 Interpolating Between Models

The additional expressive power of our method allows us to explore the search space of networks in a more fine-grained manner than traditional NAS techniques. This is because NAS techniques simply choose operations from a list, whereas we generate new ones. To illustrate this, consider Figure 8, where we plot the accuracy and inference time of a ResNet-34 with two BlockSwap-based models (the blue points labeled NAS-A and NAS-B). We can explore alternatives via a series of parametrized transformations in our framework, yielding the points in red. Each of these points is a new possible block type that would not be accessible to a traditional NAS technique unless explicitly written by a human designer. During this interpolation, we are also able to find a Pareto-optimal point.

Figure 8.  Two NAS models (the blue points labeled NAS-A and NAS-B) composed of grouped blocks (with g=2 and g=4 respectively) can be chained together by a series of parametrized transformations in our framework, yielding the points in red. Each point is the mean of three training runs, with error bars.

8. Conclusion and Future Work

This paper presents a new unified program transformation approach to optimizing convolutional neural networks by expressing neural model operations as formal program transformations. We develop a new legality check based on Fisher Potential and show that different types of convolution can be described as transformations of more fundamental building blocks. We implemented this approach in TVM and showed that our combined approach significantly outperforms both TVM and state-of-the-art NAS. We eliminate the need to train while searching, dramatically reducing search time. Future work could consider further transformations in the neural architecture space, enabling a more extensive analysis, and could expand the possibilities beyond convolutional networks.

    References

    • 1. Chen, T. et al. BenchNN: On the broad potential application scope of hardware neural network accelerators. In Intern. Symp. on Workload Characterization, 2012.
    • 2. Chen, T. et al. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the Intern. Conf. on Architectural Support for Programming Languages and Operating Systems, 2014.
    • 3. Chen, T. et al. TVM: An automated end-to-end optimizing compiler for deep learning. In USENIX Symp. on Operating Systems Design and Implementation, 2018.
    • 4. Chowdhery, A. et al. Visual wake words dataset. arXiv preprint arXiv:1906.05721, 2019.
    • 5. Dong, X. and Yang, Y. NAS-Bench-201: Extending the scope of reproducible neural architecture search. In Intern. Conf. on Learning Representations, 2020.
    • 6. Grosser, T. et al. Polly-polyhedral optimization in LLVM. In Intern. Workshop on Polyhedral Compilation Techniques, 2011.
    • 7. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, 2016.
    • 8. Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
    • 9. Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, 2017, 4700–4708.
    • 10. Huang, Y. et al.  GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, 2019.
    • 11. Peng, J. et al. Accelerating deep neural networks with spatial bottleneck modules. arXiv preprint arXiv:1809.02601, 2018.
    • 12. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks 61, (2015), 85–117.
    • 13. Tan, M. et al.  MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, 2019.
    • 14. Turner, J. et al. BlockSwap: Fisher-guided block substitution for network compression on a budget. In Intern. Conf. on Learning Representations, 2020.
    • 15. Vasilache, N., Bastoul, C., and Cohen, A. Polyhedral code generation in the real world. In Intern. Conf. on Compiler Construction, 2006.
    • 16. Wu, B. et al. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, 2019.
    • 17. Xie, S. et al. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
    • 18. Ying, C. et al. NAS-Bench-101: Towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635, 2019.
    • 19. Zheng, L. et al. Ansor: Generating High-Performance Tensor Programs for Deep Learning. arXiv preprint arXiv:2006.06762, 2020.
