Abstract
Improving the performance of deep neural networks (DNNs) is important to both the compiler and neural architecture search (NAS) communities. Compilers apply program transformations in order to exploit hardware parallelism and memory hierarchy. However, legality concerns mean they fail to exploit the natural robustness of neural networks. In contrast, NAS techniques mutate networks by operations such as the grouping or bottlenecking of convolutions, exploiting the resilience of DNNs. In this work, we express such neural architecture operations as program transformations whose legality depends on a notion of representational capacity. This allows them to be combined with existing transformations into a unified optimization framework. This unification allows us to express existing NAS operations as combinations of simpler transformations. Crucially, it allows us to generate and explore new tensor convolutions. We prototyped the combined framework in TVM and were able to find optimizations across different DNNs, that significantly reduce inference time — over 3× in the majority of cases. Furthermore, our scheme dramatically reduces NAS search time.
Introduction
Deep neural networks (DNNs)^{12} are everywhere, and there is a growing need to implement them efficiently^{1}^{,}^{4}^{,}^{10} This has led to an explosion in research from application^{8} to hardware.^{2}
Currently, there are two distinct communities optimizing DNNs for commodity devices. In one, neural architecture search (NAS) researchers explore different network models trading off size and accuracy. In the other, compiler developers take the resulting networks and explore optimizations to deliver hardware performance. We argue that this twostage deployment process is artificial and can be unified. This paper shows that program and neural architecture transformations can be interleaved delivering significant performance improvement and greater expressivity.
Current Limitations.
NAS researchers assume the compiler is a black box bundled with the hardware, while compiler writers assume that the network architecture is set in stone. NAS researchers can discover good networks but are limited to a set of preimplemented operations; compiler writers can efficiently exploit hardware structure but miss larger scale optimization opportunities.
NAS designs are not guaranteed to be correct and have to be separately evaluated through either a full retraining process or on a smaller proxy task. This training process severely limits the search space and can render large scale searches intractable.
What we want is the best of both worlds. We wish to combine neural architecture and compiler optimization in a unified framework. In this paper, we recast neural architecture search as program transformation exploration. By including transformations such as grouping and bottlenecking into the compiler optimization search space, we not only leverage the extensive history of program transformation research, but also discover new forms of neural architecture reduction that would not have been available to us otherwise.
Our approach.
Program transformations are necessarily restricted as they must be safe. Our solution is to unlock the space of neural transformations by introducing a new safety metric based on Fisher Potential. It is a compiletime, cheaptocompute metric that is able to reject transformations that would incur accuracy losses, eliminating the need to train while searching. This unification allows the exploration of a space that leads to new operators with efficient implementations. It also shows that neural architecture options that previously required the engineering efforts of experts to develop, such as spatial bottlenecking, can be expressed as compositions of more fundamental transformations and discovered automatically.
Overview
Models and Code
The fundamental building block of a DNN is the tensor convolution, a generalization of a basic convolution.
Basic Convolution.
Consider the first row in Figure 1. This shows a basic convolution where each element in the 2D output matrix is the weighted sum of the corresponding and neighboring elements of the 2D input matrix. The weights are stored in a small 2D filter or weight matrix W whose size corresponds to the size of the neighborhood. This is shown both diagrammatically on the left and as code on the right.
Tensor Convolution.
A tensor convolution is a generalization of the basic convolution as shown in row 2 of Figure 1. The input is now a 3D tensor with a new dimension, ${C}_{i}$, referred to as the input channels. Similarly, the output is expanded by the number of output channels (${C}_{o}$). The weight matrix is now a 4D tensor of ${C}_{o}\times {C}_{i}$ channels with two spatial dimensions (height, width). Two new loops scan the channels, the outermost, ${C}_{i}$ shown by an arrow, while each of the tensors arrays have appropriate extra dimensions.
Models and Code Transformations
Code Transformation.
Loop interchange changes the order of computation and memory access without affecting the values computed. In row 2 of Figure 1, interchanging the outer loop iterators gives the code in row 3 with the ${C}_{o}$ and ${C}_{i}$ loops interchanged. We can represent this using polyhedral notation $[{C}_{i},{C}_{o}]\mapsto [{C}_{o},{C}_{i}]$ or TVMlike syntax: .interchange(CI,CO)
. This is shown diagrammatically with the arrow denoting the new outermost, ${C}_{o}$ loop order. From a neural architecture perspective nothing has changed; the number of computations and memory accesses remains the same, as do the values of the output matrix. From a code perspective, the memory access behavior is significantly altered, affecting execution time.
Model Transformation.
A popular choice in NAS for reducing convolutional complexity is bottlenecking.^{13} For a convolution with ${C}_{o}$ filters, we choose a bottlenecking factor $B$. This reduces the size of the weight and output tensors and reduces the range of the outermost loop iterator by a factor $B$. Again this can be represented as $\left[{C}_{o}\right]\mapsto [{C}_{o}^{\prime}<{C}_{o}/B]$ or .bottleneck(B)
. This is shown in row 4 of Figure 1. From a NAS point of view, this is a standard operation reducing complexity with a minimal effect on representational capacity. From a program transformation point of view, this is illegal as the computed values are changed.
Combined Space
If we consider swapping a full convolution for a reduced convolution as a program transformation, we can combine them. For example, we could apply loop interchange to the code in Figure 1 row 4, ($[{C}_{o}^{\prime}<{C}_{o}/B,{C}_{i}]\mapsto [{C}_{i},{C}_{o}^{\prime}<{C}_{o}/B]$). With ${C}_{i}$ now the outermost iterator, we can reapply bottlenecking, giving input channel bottlenecking. Such an optimization is both semantically invalid, and also unavailable in existing neural architecture search spaces. However, given the robustness of neural networks to noise, in some specific cases it may be just as representationally preserving as output channel bottlenecking. It is only through the lens of program transformations over loop nests that we are able to automatically access such operators.
Neural Architecture Search
Starting from an overall network skeleton, NAS attempts to design one or more cells that slot into different locations within the skeleton. Cells are described as a DAG with nodes as intermediate feature maps (or tensors) and edges representing possible operations (for example, convolution or tensor product). The task is then to find the best DAG (or cell) that can be slotted into the skeleton and trained on a given dataset. Recent work takes cells as predefined options and attempts to select where best to place each of the cells in the skeleton^{14}^{,}^{16} to match specific constraints.
Neural Architectures Components
The convolution operation dominates computational complexity and is the primary component of interest in NAS.^{5}^{,}^{18}
Standard Convolution.
Figure 1 row 2 shows the standard convolution operation, in which an input volume of size ${H}_{i}\times {W}_{i}\times {C}_{i}$ is convolved with a set of ${C}_{o}$ filters of size ${K}_{h}\times {K}_{w}\times {C}_{i}$, each producing an individual feature map in the output. The convolution operation is
$\begin{array}{c}\forall {c}_{o},{w}_{o},{h}_{o}\phantom{\rule{0ex}{0ex}}O({c}_{o},{w}_{o},{h}_{o})=\phantom{\rule{0ex}{0ex}}\\ \sum _{{c}_{i}}^{{C}_{i}}\sum _{{k}_{h}}^{{K}_{h}}\sum _{{k}_{w}}^{{K}_{w}}I({c}_{i},{w}_{o}+{k}_{w},{h}_{o}+{k}_{h})*K({c}_{o},{c}_{i},{k}_{w},{k}_{h}).\end{array}$ (1)Bottlenecked Convolution.
A popular choice for reducing convolutional complexity is bottlenecking^{7} as shown in Figure 1 row 4. A bottlenecking factor $B$ is selected, reducing the number of weights/filters to ${C}_{o}/B$. The bottlenecked convolution operation is
$\begin{array}{c}\forall {c}_{o}^{\prime}<\frac{{C}_{o}}{B},{w}_{o},{h}_{o}\phantom{\rule{0ex}{0ex}}O({c}_{o}^{\prime},{w}_{o},{h}_{o})=\phantom{\rule{0ex}{0ex}}\\ \sum _{{c}_{i}}^{{C}_{i}}\sum _{{k}_{h}}^{{K}_{h}}\sum _{{k}_{w}}^{{K}_{w}}I({c}_{i},{w}_{o}+{k}_{w},{h}_{o}+{k}_{h})*K({c}_{o}^{\prime},{c}_{i},{k}_{w},{k}_{h}).\end{array}$ (2)Grouped Convolution.
Here, the ${C}_{i}$ channel input is split along the channel dimension into $G$ groups, each of which has ${C}_{i}/G$ channels. Each group is independently convolved with its input split, producing ${C}_{o}/G$ channels which are concatenated along the channel dimension. Let $O=[{O}_{1}^{\prime};\dots ,{O}_{G}^{\prime}]$ i.e. each slice ${O}_{g}^{\prime},g\in 1,\dots G$ is concatanated to form $O$. ${I}_{g}^{\prime}$ is similarly defined. Group convolution is then as follows:
$\begin{array}{c}\forall {c}_{o}^{\prime},{w}_{o},{h}_{o}\phantom{\rule{0ex}{0ex}}O({c}_{o}^{\prime},{w}_{o},{h}_{o})=\phantom{\rule{0ex}{0ex}}\\ \sum _{{c}_{i}^{\prime}}^{{C}_{i}^{\prime}}\sum _{{k}_{h}}^{{K}_{h}}\sum _{{k}_{w}}^{{K}_{w}}I({c}_{i}^{\prime},{w}_{o}+{k}_{w},{h}_{o}+{k}_{h})*K({c}_{o}^{\prime},{c}_{i}^{\prime},{k}_{w},{k}_{h}).\end{array}$ (3)This reduces the number of basic convolutions from ${C}_{o}\times {C}_{i}$ to $G\times {C}_{o}/G\times {C}_{i}/G=({C}_{o}\times {C}_{i})/G$ reducing the number of operations used by a factor of $G$. Note that we can think of standard convolution as grouped convolution with $G=1$.
Depthwise Convolution.
Notice that if the number of groups is equal to the number of input channels and the number of output channels, $G={C}_{i}={C}_{o}$, then there is a single, 2D convolutional filter for each input channel ${C}_{i}$. This is known as depthwise convolution and is a special case of group convolution.
NAS Search Space Example
Typically, the NAS heuristic finds the best cell to fill in the skeleton by choosing appropriate operations. Figure 2 gives an example cell from.^{5} The skeleton is ResNetlike with 5 cells in series. Each cell contains 4 interconnected nodes (A,B,C,D). Between nodes, downsampling takes place where the spatial dimensions are halved in size while doubling the channel depth. This restricts the types of operations available on each edge. This gives a total of 15625 possible cells, which captures most of the available options within cellbased NAS techniques, each of which the authors train exhaustively on various datasets.
Rather than designing cells by choosing which of a predetermined list of options is best, we instead wish to choose a sequence of transformations from an original model which will allow us to step through the cell design space. We describe the program transformations we use in the next section.
Program Transformations
Due to the restricted, static, convex and affine nature of tensor convolutions, it is natural to describe program transformations of them in the polyhedral model. which consists of three main components:
The domain is a collection of the possible statement instances that occur within the iteration space of a set of loop bounds. We can represent each statement instance with a multidimensional coordinate corresponding to the iterator values of the loops that surround it.
A set of accesses are affine mappings of the iteration space to memory. Two statement instances have a dependence ordering between them if they have accesses to the same memory location and at least one of them is a write.
A schedule assigns a timestamp to each statement instance, dictating the order in which they are executed. Different schedules for the same program represent possible program transformations.
To illustrate this, consider the implementation of the $1\times 1$ convolution at the start of a residual block listed in Algorithm 1. We can describe the domain as follows:
$\begin{array}{c}\begin{array}{ccc}S1({c}_{o},h,w)& & 0\le {c}_{o}<{C}_{o}\wedge 0\le h<H\wedge \\ & & 0\le w<W\\ S2({c}_{o},h,w,{c}_{i})& & 0\le {c}_{o}<{C}_{o}\wedge 0\le h<H\wedge \\ & & 0\le w<W\wedge 0\le {c}_{i}<{C}_{i}\end{array}\end{array}$We can also describe the schedule as follows:
$\begin{array}{c}\phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}{T}_{S1}({c}_{o},h,w)=({c}_{o},h,w)\\ \phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}{T}_{S2}({c}_{o},h,w,{c}_{i})=({c}_{o},h,w,{c}_{i})\end{array}$We now briefly discuss the transformations we consider.

Loop Interchange.
Interchanging two nested loops involves applying a permutation to the schedule of the enclosed statements. For example, in the algorithm shown in 1 and a statement $S1$ we can express this as simply as:
${T}_{S1}({c}_{o},h,w)=({c}_{o},w,h)$StripMining.
This is performed by mapping an iterator into two new iterators whose combined range is that of the original. A constant stripmining factor is selected which forms the range of the new inner loop. The outer loop range is that of the original divided by the stripmine factor. For example, to strip mine the inner ${c}_{i}$ loop of Algorithm 1 we have
${T}_{S2}({c}_{o},h,w,{c}_{i})=({c}_{o},w,h,{c}_{i}/32,{c}_{i}mod32)$Tiling.
Tiling is a combined transformation consisting of stripmining followed by interchange. For further detail, we point the reader to the excellent polyhedral literature.^{6}^{,}^{15}
Legality
Legality in the polyhedral framework is determined by preservation of data dependences. If there is a data dependence between two dynamic statement instances under the original program schedule, the relative ordering between these statements must be preserved under the new transformed schedule. Given a linear transformed schedule of the iteration space $I$ in matrix form $T$ and a dependence matrix $D$, then a transformation is legal iff:
$\forall i,j,S1,S2,D\phantom{\rule{0ex}{0ex}}i\to j\in {d}_{S1,S2}\to T\left(i\right)\preccurlyeq T\left(j\right)$which is to say that the lexicographical ordering of the iteration instances is preserved.
Unified Space
Our approach to joining NAS and compiler optimizations is to describe convolutional alternatives as polyhedral program transformations and develop a novel legality criterion.
Extending the Polyhedral Model
Bottlenecking.
The bottleneck transformation is a reduction in the outer iterator in the domain node by parameterized constant factor $B$. Let the vector $J$ denote the iterators spanning the domain of the tensor loop nest, $J=[{c}_{o},{c}_{i},h,w,{k}_{h},{k}_{w}]$ where ${c}_{o}$ is the outermost iterator and ${J}^{\prime}=[{c}_{i},h,w,{k}_{h},{k}_{w}]$ be the enclosed inner iterators, then bottlenecking can be expressed as
${T}_{S}({c}_{o},{J}^{\prime})=({c}_{o}^{\prime},{J}^{\prime}){c}_{o}^{\prime}<{C}_{o}/B$The value $B$ is constrained such that ${C}_{o}modB\equiv 0$. This gives a factor $B$ reduction in computation. Figure 1 row 4 shows an example of bottlenecking.
Grouping.
Grouping can be thought of as tiling the two outer iterators by a common factor and then discarding one of the iterators. The original iterator domains must both be divisible by the grouping factor $G$. Let ${J}^{\prime}=[{c}_{i},{J}^{\prime \prime}]$, such that $J=[{c}_{o},{c}_{i},{J}^{\prime \prime}]$ then we can define grouping as
${T}_{S}({c}_{o},{c}_{i},{J}^{\prime \prime})=(g,{c}_{o}/G,{c}_{i}/G,{J}^{\prime})$to give the algorithm in Algorithm 2. Note as ${C}_{o},{C}_{i}$ and $G$ are compiletime known constants each of the loop bounds is affine. Examining the code we see that each slice $g$ of the output array refers only to the corresponding slice of the weight and input array.


Depthwise.
Depthwise convolutions can be considered as a special case of group convolutions where the group size equals the number of output channels, ${C}_{o},G={C}_{o}$. For this transformation to be possible, the number of input and output channels must be equal, ${C}_{o}={C}_{i}$.
This means the two inner loops will have strip counts of 1 as ${C}_{o}/G={C}_{i}/G=1$ and
${T}_{S}({c}_{o},{c}_{i},{J}^{\prime \prime})=(g,1,1,{J}^{\prime})$which can be trivially simplified to
${T}_{S}({c}_{o},{c}_{i},{J}^{\prime \prime})=(g,{J}^{\prime})$Fisher Potential as a Legality Check
The transformations described above fail to preserve traditional program semantics. We instead state that a schedule transformation for a neural network with respect to a task is legal if the final classification accuracy on the held out data is the same, or similar to within a small $\delta $. Training all possible transformed networks to determine accuracy, however, would be prohibitively expensive.
To address this, we employ Fisher Potential^{14} as a pretraining legality check. It is a cheaptocompute measure that is effective at rejecting any architectures whose final test accuracy is significantly below the original starting point, without needing to perform training. Informally, Fisher Potential is, locally, the total information that each loop nest (layer) contains about class labels under a simplifying assumption of conditional independence.
Formal Definition.
For a single channel, $c$, of a convolutional filter in a network, consider that for some input minibatch of $N$ examples, its outputs are an $N\times W\times H$ tensor where $W$ and $H$ are the channel’s spatial width and height. We refer to this tensor as the activation $A$ of this channel and denote the entry corresponding to example $n$ in the minibatch at location $(i,j)$ as ${A}_{nij}$. As the network has a loss function $\mathcal{L}$, then we can get the gradient of the loss with respect to this activation channel $\frac{\partial \mathcal{L}}{\partial A}$. Let us denote this gradient as $g$ and index it as ${g}_{nij}$. The channel $c$ error, ${\Delta}_{c}$ can then be computed by
${\Delta}_{c}=\frac{1}{2N}\sum _{n}^{N}(\u2013\sum _{i}^{W}\sum _{j}^{H}{A}_{nij}{g}_{nij}{)}^{2}.$ (4)This gives us a filterwise score for a particular channel. In order to gauge the sensitivity of the full convolution, we sum ${\Delta}_{c}$ over each channel (as in^{14}):
$\Delta l=\sum _{{c}_{o}}^{{C}_{o}}{\Delta}_{{c}_{o}}$ (5)This score is summed for each of the convolutional blocks in the network. For an original network and a proposed alternative architecture, we reject the proposal if its score is below that of the original.
To summarize, the Fisher Potential of a proposed network architecture is the sum of the Fisher Information for each layer when given a single random minibatch of training data, as performed in.^{14}
Example.
Let us consider, candidate networks designed from NASBench201^{5} as shown in Figure 3. Each point represents a different neural architecture of which there are 15625. The $y$axis shows final CIFAR10 top1 error (lower is better), and the $x$axis shows the Fisher Potential assigned to the model at initialization (higher is better). We can see that without requiring any training, Fisher Potential is able to filter out poorlyperforming architectures, visible in the cluster of low scoring networks on the left with poor final errors. Many good networks are also discarded — this is unfortunate but acceptable for our scenario, since the space is large and densely populated with networks.
Expressive Power
Neural Architecture Search techniques rely on handdesigned exploration spaces. themselves are engineered by hand. For example, a recent paper found that bottlenecking could be applied in the spatial domain.^{11} This can now be added to a list of candidate operations. However, spatial bottlenecking is automatically captured within our framework as the combination of existing transformations. As an example, consider the convolution in row 2 of Figure 1. We use the shorthand notation $\left[i\right]\stackrel{B\left(b\right)}{\to}\left[i\left(b\right)\right]$ to denote bottlenecking domain $i$ by factor $b$ and $[i,j]\stackrel{\text{int}}{\to}[j,i]$ to denote interchange. Then the spatial bottleneck operation can be constructed as the following sequence of transformations:
$$\begin{array}{c}{T}_{S}:[{C}_{o},{C}_{i},H,W,{K}_{h},{K}_{w}]\stackrel{\text{int.}}{\to}\\ \phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}[H,W,{C}_{o},{C}_{i},{K}_{h},{K}_{w}]\stackrel{\text{B(b).}}{\to}\\ \phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}[H\left(b\right),W,{C}_{o},{C}_{i},{K}_{h},{K}_{w}]\stackrel{\text{int.}}{\to}\\ \phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}[W,H\left(b\right),{C}_{o},{C}_{i},{K}_{h},{K}_{w}]\stackrel{\text{B(b).}}{\to}\\ \phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}[W\left(b\right),H\left(b\right),{C}_{o},{C}_{i},{K}_{h},{K}_{w}]\stackrel{\text{int.}}{\to}\\ \phantom{\rule{0ex}{0ex}}\phantom{\rule{0ex}{0ex}}[{C}_{o},{C}_{i},H\left(b\right),W\left(b\right),{K}_{h},{K}_{w}]\end{array}$$The expressiveness of the polyhedral model, equipped with these new transformations, means that there are novel transformations that can be derived from this unified framework.
Implementation and Setup
We expressed our new transformations as polyhedral transformations and implemented the resulting operators in TVM.^{3}
Transfromation Space: An overview of the transformations used in this paper is given in Table 1. The first four are standard program transformations available within TVM. To these, we add our two neural architecture transformations. Finally, for GPU platforms, we enable GPU mapping transformations.
Optimization  Description 

Program Transformations  
reorder  Interchange nested loops 
tile  Cache and register blocking 
unroll  Loop unrolling 
prefetch  Memory coalescing between threads 
split  Divide iteration into multiple axes 
fuse  Combine two axes into one 
Neural Architecture Transformations  
bottleneck  Reduce domain by factor $B$ 
group  Slice and offset two loops by factor $G$ 
Mapping to GPU  
blockIdx  Blockwise parallelism 
threadIdx  Threads within blocks 
vthread  Striding thread access 
Baseline TVM: TVM allows users to implement schedules for each new operator by hand. Due to the large number of operators involved, for each device we use TVM’s default schedules; automatic schedule design^{19} has yet to be incorporated into TVM. We then enable autotuning of parameter values within the schedule to find best performance.
Search: Our current search process is relatively naive. We enumerate random sequences of transformations through TVM, and generate 1000 configurations of the resulting operations for each network. We then check which candidates pass our implementation of the Fisher Potential legality test and select the best performing one.
Platforms: We evaluate on an ARM A57 mobile CPU (mCPU), and an Nvidia 128Core Maxwell mobile GPU (mGPU) available on the Jetson Nano board. We also evaluate an Intel Core i7 (CPU) and an Nvidia 1080Ti (GPU).
Models: We evaluate our methodology on three popular networks: ResNet34,^{7} ResNeXt29,^{17} and DenseNet161.^{9} These networks were chosen to represent a range of convolutional architectures, from standard $3\times 3$ convolutions in ResNet34 to grouped convolutions in ResNeXt and a heavy reliance on $1\times 1$ convolutions in DenseNet.
Comparison: The models are implemented with each operation written as a TVM Tensor Expression,^{3} which is an einsumstyle syntax for expressing tensor computations. This is lowered to TVM IR, where our transformations can be employed. This allows for a fair comparison of each approach.
For each network, we consider three approaches. First we compile the model using TVM using its default schedule (labeled TVM). We report the best performance achieved after autotuning. Next, we use BlockSwap^{14} as NAS to compress the modifiable convolutions in the network, followed by compilation with TVM (labeled NAS). Finally, we apply our unified approach as described in the previous section (labeled Ours).
Results
CIFAR10 Network Performance
The results for each network are shown in Figure 4. The combination of TVM and NAS is a strong baseline, however, for each of the models our unified method is able to generate models with improved hardware performance. In general there is more performance to gain on the mGPU than other platforms, as relaxed memory pressure from smaller designs is of increasing importance.
ResNet34: NAS is able to find a modest improvement of 1.12$\times $ speedup over TVM on the GPU platform, increasing to 2$\times $ on i7 CPU, showing the power of compressed convolutions. Our unified approach is able to find further improvement giving 2$\times $ and 3$\times $ speedup respectively. The improvement is more dramatic on the smaller platforms with 5$\times $ and 10$\times $ speedups on the mCPU and mGPU.
ResNext292x64d: NAS is unable to find any improvement here due to the already highly compact structure of the network. There are simply no NAS options that improve performance over the TVM baseline across all platforms. Despite this our approach is able to find small improvements of 1.3$\times $ and 1.1$\times $ on the CPU and GPU platforms by combining NAS and program transformations. This increases to 1.4$\times $ on the embedded CPU and 7$\times $ on the mGPU.
DenseNet161: The impact of NAS is much more varied here. While it can find 2.2$\times $ improvement on the CPU, it has a negligible improvement over TVM on the GPU. It also struggles on the mCPU but finds 6$\times $ on the mGPU. Our approach is able to improve by over 3$\times $ for both server platforms, but only finds 1.2$\times $ on the mCPU while acheiving 10$\times $ on the mGPU.
Analysis
Accuracy. For all CIFAR10 networks pictured in Figure 4, changes in accuracy were less than 1% in absolute difference. The ResNet34 in Figure 5 had an original ImageNet Top1 and Top5 accuracy of 73.2% and 91.4% respectively. The final network compiled with our transformations had a Top1 and Top5 accuracy of 73.4% and 91.4%; it was slightly more accurate than the original but considerably faster. To give a broader overview, we present the accuracy of several ImageNet models in Figure 7, which shows that for each model accuracy degradation is small or nonexistent.
Size: One benefit of these neural transformations is their effect on model size, both in terms of weights and runtime memory usage. We found that CIFAR10 networks could be compressed in size by $2\u20133\times $. Likewise, the ImageNet ResNet34 in Figure 5 was compressed from 22M parameters to 9M without a loss in accuracy.
Search time: Our search process involves suggesting configurations and rejecting or accepting them based on Fisher Potential. Since Fisher Potential is extremely cheap to compute, the search time overhead introduced by our method is small, less than 5 minutes on a CPU. During this time we were able to explore 1000 different configurations, discarding approximately 90% of the candidate transformation sequences through the Fisher Potential legality check.
Transformation Sequence Case Studies
There were 3 particular transformation sequences that dominated the list of best performing transformations. In neural architecture terms, the resulting operations of each of these sequences of transformations is a new convolutionlike operator, previously unavailable to NAS.
Sequence 1
applies grouping to the kernels over the spatial domain of the input. These are then concatenated to form one output. The exact sequence is: [split $\to $ interchange $\to $ group $\to $ interchange $\to $ fuse]
.
Sequence 2
is an operator in which the output channels have been unrolled by factor 16 and then the remaining iteration domain has been grouped by factor $G=2$. The exact sequence is: [unroll $\to $ group $\to $ interchange]
Sequence 3
is an operator that was devised by splitting up the iteration domain of the output channels, and applying different levels of channel grouping to each new domain (e.g. $G=2$ on first half, $G=4$ on the second half). The exact sequence is: [split $\to $ group $\to $ interchange $\to $ group]
. Each of them is applicable across all networks and may be more widely applicable as standard tensor convolutions for other networks.
Exploring LayerWise Optimizations
In Figure 5, we examine the impact of our sequences layerbylayer on one network on one platform: ResNet34 on the i7. The exact configurations mirror the experiment in the original TVM paper.^{3} While we achieve a 3x improvement overall as shown in Figure 4, the speedup achieved varies considerably across layers. In 4 of the 11 layers no performance improvement is found, as Fisher Potential marks these individual layers to be extremely sensitive to compression. Simple grouping with a factor $G=2$ is able to give around 2$\times $ speedup across 7 of the 11 layers. Sequence 1
is able to give a small improvement in most cases — particularly layer 11 — where spatial reduction yields new parallelization opportunities. Sequence 2
provides a slight improvement in many cases, and through the later layers it is able to offer larger speedups through improved data reuse. Sequence 3
is the best option for most early layers but suffers compared to other sequences in the later layers.
Alternative NAS comparison
To provide an alternative evaluation against an existing NAS technique, we reimplement FBNet^{16} using the convolutional blocks available in our NAS space, and our three baseline networks as the skeletons into which the selected cells are inserted. Figure 6, shows the performance of resulting networks on CIFAR10 relative to TVM, NAS and our approach on the Intel Core i7 for each network. In each case, FBNet does provide a modest improvement over NAS. However, it is worth noting that this involves an expensive training step at each stage of evaluation (requiring $\sim $3 GPU days per network). Our approach is able to consistently improve over FBNet, with no training required.
ImageNet Network Performance
We also applied our technique to the ImageNet classification dataset, to show that the method transfers beyond CIFAR10. In Figure 7, we first show the resultant networks for running our method on ResNet18, ResNet34, DenseNet161, DenseNet201, and DenseNet169, with their inference times recorded on the Intel i7 CPU. For each network, accuracy is within 2%, and there are significant gains in inference time.
Interpolating Between Models
The additional expressive power of our method allows us to explore the search space of networks in a more finegrained manner than traditional NAS techniques. This is because NAS techniques simply choose operations from a list, whereas we generate new ones. To illustrate this consider Figure 8 where we plot the accuracy and inference time of a ResNet34 with two BlockSwapbased models (the blue points labeled NASA and NASB). We can explore alternatives by a series of parametrized transformations in our framework, yielding the points in red. Each of these points is a new possible blocktype that would not be accessible to a traditional NAS technique unless explicitly written by the human designer. During this interpolation we are also able to find a Pareto optimal point.
Conclusion and Future Work
This paper presents a new unified program transformation approach to optimizing convolution neural networks by expressing neural model operations as formal program transformations. We develop a new legality check based on Fisher potential, and show that different types of convolution can be described as transformations of more fundamental building blocks. We implemented this approach in TVM and show that our combined approach significantly outperforms both TVM and stateoftheart NAS. We eliminate the need to train while searching, dramatically reducing search time. Future work could consider further transformations in the neural architecture space in order for a more extensive analysis, and to expand the possibilities beyond that of just convolutional networks.
Join the Discussion (0)
Become a Member or Sign In to Post a Comment