Communications of the ACM

Review articles

Transformers Aftermath: Current Research and Rising Trends



Natural language processing (NLP) is a main subject of artificial intelligence research, on par with computer vision and scene understanding. Knowledge is encoded and transferred among individuals through text, following a formally defined structure. Nonetheless, distinct languages, context, and the subtle particularities of different communication channels are complex challenges that researchers must cope with. Hence, the task of general language modeling and understanding was divided into multiple subtasks, for example, question answering, image captioning, text summarization, machine translation, and natural language generation. Recently, attention mechanisms became ubiquitous among state-of-the-art approaches, allowing models to selectively attend to different words or sentences in order of relevance. The goal of this review is to gather and analyze current research efforts that improve on, or provide alternatives to, attention mechanisms, categorize trends, and extrapolate possible research paths for future works.


Recurrent Neural Networks (RNNs) were broadly adopted by the NLP community and achieved important milestones.37 However, it is computationally expensive to encode long-term relations among words in a sentence, or among sentences in a document, using RNNs. In tasks such as text generation, encoding these dependencies is fundamental, and the inference time may become prohibitively slow. Pursuing a solution for these limitations, the seminal work of Vaswani et al.33 engineered the Transformers architecture. The core idea was to bring attention mechanisms to the fore, discarding recurrence. Soon after, the Transformers became the de facto backend model for most NLP tasks. From the original December 2017 publication to date, there have been over 4,000 citations and remarkable work on the concept. Due to its early success, the community focus became the development of larger models, containing scaled-up Transformers and longer context windows, leading to a shift from task-specific training procedures to more general language modeling. On the other hand, although current models are capable of generating text with unprecedentedly low perplexity, they rely mostly on statistical knowledge contained in the text corpora, without encoding the actual meaning behind words. Hence, the produced text has surprisingly middling segments contrasting with meaningless word sequences. This lack of meaning cannot be trivially solved through attention alone.

Current Research

Most NLP tasks can be designed under a sequence-to-sequence modeling framework. For instance, a sequence of image pixels is transformed into a caption composed of a sequence of words, or a sentence in a given input language is translated to a corresponding sentence in an output language. In both examples, the input tokens (for example, pixels or words) must be mapped to the output tokens (words) by similarity. Sequence-to-sequence modeling is defined as estimating the probability distribution of the next output token yn given the previous output tokens y = (y1, y2, …, yn-1) and a variable-length input sequence x = (x1, x2, …, xn). In practice, the mapping is not done directly and the model also yields intermediate representations z = (z1, z2, …, zn), which, ideally, encode the context that the input tokens are exposed to.5 Such intermediate representations are referred to as context vectors in the NLP literature, and the two-step procedure as encoder-decoder methods.

Initially, the community leaned toward recurrent models for both decoders and encoders, mainly RNNs.8 Part of the reviewed works rely on sentence-level embeddings,24 but here we assume that the set of input tokens is a vector of words encoded in a continuous representation.22 Further, for the sake of clarity, we will mostly provide examples based on machine translation, which was the main target for the original Transformers.

Recurrent Neural Networks. An RNN receives the input sequence x and processes it in a linear fashion by updating its hidden state hi at every timestep i. The updating procedure follows hi = f (hi−1, xi), where f is an activation function that provides non-linearity4 and xi is the input token received at timestep i. In short, regarding NLP tasks, RNNs learn to predict the next input xi+1 given the distribution of previous ones. Nonetheless, RNN-based models have clear shortcomings. For one, they rely on a unidirectional pipeline, in which the network has access to the full context only at the end of a sentence, while earlier iterations have no information on the incoming tokens. The solution, prior to the Transformers, was to use bidirectional models stacking multiple layers of RNNs, as in the well-known work of Bahdanau et al.2 on attention-based methods for NLP. Yet, activations for earlier tokens had a tendency to fade out through the pipeline, and long-term relations were lost.33 As an example, in the sentence "The student failed the class due to stress," the token "stress" is correlated to the "student" and not to the "class," but the influence of the token "student" on the hidden state may have vanished by the time the model reaches the "stress" token. In order to solve this short-memory constraint (vanishing gradients), RNNs enhanced by forgetting mechanisms (units that manipulate the hidden state non-sequentially) gained prominence. Most reviewed approaches were based on either LSTMs31 or GRUs.5 Unfortunately, for all the aforementioned recurrent models, the time to train increases exponentially as the size of the context vector increases. In turn, constraining its size means that bigger context windows end up having worse representations, that is, the information must be compressed more aggressively.
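The recurrent update can be illustrated with a minimal sketch. It assumes f is a tanh applied to a linear map of the previous state and the current token; the names rnn_step and run_rnn and the toy weights are hypothetical, chosen only to show how the contribution of an early token shrinks in later hidden states.

```python
import math

def rnn_step(h_prev, x, W_h, W_x):
    """One recurrent update h_i = f(h_{i-1}, x_i), assuming f = tanh of a linear map."""
    size = len(h_prev)
    h_new = []
    for r in range(size):
        total = sum(W_h[r][c] * h_prev[c] for c in range(size))
        total += sum(W_x[r][c] * x[c] for c in range(len(x)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(xs, h0, W_h, W_x):
    """Process the tokens strictly left to right and return every hidden state."""
    states = [h0]
    for x in xs:
        states.append(rnn_step(states[-1], x, W_h, W_x))
    return states

# Toy weights (hypothetical values, for illustration only).
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
# Only the first token carries signal; the rest are zeros.
tokens = [[1.0], [0.0], [0.0], [0.0]]
states = run_rnn(tokens, [0.0, 0.0], W_h, W_x)

for i, h in enumerate(states):
    print(i, [round(v, 4) for v in h])
```

With these weights the magnitude of the hidden state decays at every step after the first token, which is the fading effect described above. The loop also makes the sequential dependency explicit: step i cannot start before step i−1 has finished.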

From encoder-decoder to attention. Alongside RNNs, recent sequence-to-sequence models were based on the encoder-decoder architecture.4 As a general overview of such methods, we provide a brief explanation. The encoder is an RNN, fed with the set of input tokens x. It encodes x in a context vector, represented by its hidden state at the current timestep i, where hi ∈ (h0, h1, …, hn). The context vector changes over time but keeps a fixed length throughout the whole encoding process, while the size of the input set is variable. Similarly, the decoder is also an RNN, but instead of a sequence of tokens, it is fed the last hidden state hn from the encoder. It yields a set of output tokens (y1, y2, …, yn) generated one by one, given p(y|dj, hn), where dj is the decoder's hidden state at timestep j. This standard pipeline is illustrated in Figure 1(a).

Figure 1. Difference between encoder-decoder methods (a) without and (b) with attention. Notice that the circles represent the same set of weights changing at different timesteps.

Currently, the correlation between context and output is augmented through attention. Within the attention framework, the encoder yields the full set of hidden state vectors at all iterations, instead of only the last one. Afterward, a classifier (ranker) is applied to the yielded set in order to measure the relevance of each hidden state with respect to each output token. Having a measure of relevance is the main aspect of attention mechanisms. More formally:


where l is typically a softmax function, and hi × l (wij) is the weighted context vector for the decoder hidden state dj. This enhanced pipeline is depicted in Figure 1(b).
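A minimal sketch of this relevance-weighting step follows. It assumes a plain dot product as the ranker that scores each encoder state hi against the decoder state dj (the source does not fix the scoring function), and it uses softmax for l; the function name attend and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(encoder_states, decoder_state):
    """Return (weights, context) for one decoder hidden state d_j."""
    # Relevance score w_ij of every encoder state h_i with respect to d_j.
    # The dot product is an assumption; other rankers are possible.
    scores = [sum(a * b for a, b in zip(h, decoder_state)) for h in encoder_states]
    weights = softmax(scores)  # l(w_ij)
    dim = len(encoder_states[0])
    # Weighted context vector: sum over i of h_i * l(w_ij).
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend(encoder_states, [2.0, 0.0])
print([round(w, 4) for w in weights], [round(c, 4) for c in context])
```

The decoder state in this toy case points along the first feature, so the first and third encoder states receive most of the weight and dominate the context vector.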

Transformers. In 2017, Vaswani et al.33 proposed the Transformers model, arguing that there is no need for recurrence or forgetting in sequence-to-sequence modeling. Transformers are a simpler architecture composed of a stack of encoders and a stack of decoders. Each encoder is identical in structure to the other encoders, without weight sharing, and the same holds for decoders. In the context of machine translation, the topmost decoder uses a feedforward neural network (FNN) to provide logits, which feed a softmax layer to yield a probability distribution for the next output token.6

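Formally, in the Transformers architecture, context is represented by the following simplified set of formulas, called Scaled Dot-Product Attention:

$$ \mathrm{score}(q_i, k) = \frac{q_i k^{\top}}{\sqrt{d_k}} \tag{2} $$

$$ W(q_i, k) = \mathrm{softmax}\big(\mathrm{score}(q_i, k)\big) \tag{3} $$

$$ z_i = W(q_i, k)\, v \tag{4} $$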

where the vector of inputs x is projected into three new vector spaces: queries (decoder), keys and values (encoder), represented by q, k and v. Keys can be seen as feature labels, while the values are their potentials regarding x. Queries define which features the previous decoder demanded. In addition, dk stands for the dimensionality of the vector of keys. These projections are made with different trainable weight matrices, which are randomly initialized and learned during training. In the first step, represented by Equation 2, the query from token i goes through the dot-product with k, containing all keys. This operation yields the similarity of vector qi with regard to each of the keys. Hence, the higher the dot-product, the more informative (relevant) the given key (feature) is to the proposed query. Next, through softmax, the method guarantees that the relevance weights are positive and sum to 1, yielding a probability distribution over all keys with regard to q. This distribution indexes the values vector.

In other words, the softmax uses the exponential function to increase the gap between the weights of relevant keys and less relevant ones, avoiding ambiguity. The resulting set of weights W (qi, k) is multiplied by the vector of values (Equation 4) to construct the context vector z, composed of weighted values.

Contrary to additive attention, which weights the values by the sum of q and k, multiplicative attention can be defined using matrices. Hence, it is faster to calculate on current hardware, but the dot products grow large in magnitude as dk gets larger, saturating the softmax function. This is the reasoning behind the normalizing factor 1/√dk in Equation 2. Also, during training the decoder receives the target sentence as input. Future tokens are masked out, so it has no access to tokens that would be unavailable outside the training scope. For simplicity, we refer to this set of masked inputs as M (y) and abstract away secondary steps (residual connections and positional encoding). Figure 2 further illustrates the Transformers' main modules.
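The three steps (scaled dot product, softmax, weighting of the values) and the masking of future tokens can be sketched as follows. This is an illustrative sketch, not the reference implementation: it assumes the square-root scaling of the original Transformers paper, skips the trainable projections by using the same toy vectors as queries, keys and values, and the function name and the causal flag are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values, causal=False):
    """Return (context vectors z, attention weights) for every query."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Similarity of q_i with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            # Mask future positions so that position i only sees positions <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        # Positive weights that sum to 1.
        weights = softmax(scores)
        # Context vector: weighted sum of the values.
        z = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        outputs.append(z)
        all_weights.append(weights)
    return outputs, all_weights

# Toy sequence of three 2-dimensional token vectors (no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens, causal=True)
for row in weights:
    print([round(w, 3) for w in row])
```

With the causal mask, the first position can only attend to itself, so its weight row is exactly [1.0, 0.0, 0.0]; later rows spread their weight over the positions seen so far.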

Figure 2. An example of Transformers composed of two encoders and two decoders. Notice that the decoders receive the context—projected in two vectors v and k—from the topmost encoder.

Advantages of attention mechanisms. Instead of increasing the length of the context vector, Transformers selectively look for the most informative tokens at each timestep, through multiple forms of attention. There is self-attention on both the encoder and the decoder, and there is a global encoder-decoder attention. Self-attention is an attention mechanism that correlates different positions of the same sequence. Through self-attention, each cell of the context vector is informed by all previous inputs, resulting in a large receptive field over the whole sentence.8 It was a paradigm shift from RNNs, since it backpropagates gradients for the whole sentence instead of one input token at a time, therefore reducing the number of computational steps that information has to flow through in order to have an impact. Self-attention further provides a higher degree of interpretability. Interpretability is a measure of how understandable the model and its predictions are to a human.11 Black box models, such as the weights of a deep neural network (DNN), lack semantic interpretation. In contrast, some ablation studies on the original Transformers point out that the language model exhibits a behavior related to both the syntactic and semantic structure of the sentences. Pushing the degree of interpretability of DNN models is in itself a huge advantage in favor of attention. Additionally, the number of sequential operations in a recurrent layer increases linearly with the input, whereas a self-attention layer connects all positions in a constant number of sequential steps. Thus, self-attention layers are faster than recurrent layers when the input sequence is shorter than the context vector, which happens quite often in practice.33

The original Transformers also implements multihead attention: each attention head gives a different set of weights, similarly to an ensemble model. Assuming a sentence as input, one head could attend to the relation between subject and action, while others attend to adjectives and pronouns. Initially, the number of heads was chosen arbitrarily, yet pruning and selection methods are currently available.34 Since RNNs update their hidden state one input at a time, sequentially, they are inherently slow to train. In contrast, Transformers have dependencies between inputs in the self-attention layer, but the FNN is shared by all context vectors and can therefore be run in parallel, as can each of the attention heads.
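The independence of the heads can be illustrated with a simplified sketch. It assumes each head works on its own slice of the feature dimension and omits the learned per-head projections and the final output projection of the actual architecture; the function names and toy input are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for lists of vectors."""
    d_k = len(k[0])
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def split_heads(x, num_heads):
    """Give each head its own contiguous slice of the features (a simplification)."""
    d = len(x[0])
    if d % num_heads != 0:
        raise ValueError("feature size must be divisible by the number of heads")
    size = d // num_heads
    return [[row[h * size:(h + 1) * size] for row in x] for h in range(num_heads)]

def multi_head_self_attention(x, num_heads):
    # Each head is computed independently, so this loop could run in parallel.
    heads = [attention(part, part, part) for part in split_heads(x, num_heads)]
    # Concatenate the per-head outputs of every token.
    return [sum((head[t] for head in heads), []) for t in range(len(x))]

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_self_attention(x, num_heads=2)
print([[round(v, 3) for v in row] for row in out])
```

Because no head reads the output of another head, the heads have no sequential dependency among themselves, which is what makes the parallel execution mentioned above possible.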

Next, we give an overview of the main milestones achieved due to the Transformers architecture and its self-attention mechanism. Yet, some alternative methods were also impactful in the last couple of years, such as the Conv Seq2Seq10 model. Comparing RNNs to Convolutional Networks (CNNs) for sequence-to-sequence modeling, CNNs are more parallelizable and map the context into a layer-based hierarchical structure, in which long dependencies are naturally captured by higher layers. Besides the architecture, Gehring et al.10 also proposed an alternative attention method, multi-step attention, in which each decoder receives a matrix of attention weights from the previous decoder and then calculates its own, instead of sharing the same attention weights matrix across all decoders. As an early example of a mixed approach, the Universal Transformers8 (UT) combined the parallelism of the Transformers with the recurrent inductive bias of RNNs, which seems to be better suited to a range of sequence-to-sequence problems. At each recurrent timestep, UT applies a self-attention mechanism and generates a context vector that attends to all input tokens. Afterward, it applies a transition function to the next timestep, instead of the next encoder or decoder.

Unsupervised transformers. Language modeling is a key task for NLP, requiring the detection of very long-term dependencies. Given x as input, language modeling can be defined as the conditional probability p(xi) = p(xi | (x1, x2, …, xi−1), θ), where θ is the set of model weights, and the context (x1, x2, …, xi−1) is encoded as a vector z. Semi-supervised approaches use unsupervised pre-training (with language modeling as the objective) and then fine-tune the model on a task-specific dataset. An advantage of unsupervised pre-training is that it implicitly adds a regularization step, enabling better generalization.24 In addition, by removing the need for human labeling, it allows for the use of larger datasets. Transferring entire pretrained models among distinct works, instead of only word embeddings, was a huge shift in the recent NLP literature.29

Two earlier examples of successful semi-supervised approaches are ELMo23 and the OpenAI Generative Pretrained Transformers24 (GPT). In particular, GPT is based entirely on the Transformers architecture. GPT's authors assume that since the objective is to predict the next token, the encoder stack can be discarded, and the decoders are directly correlated to the task. Contrary to other semi-supervised methods, the GPT model does not change its architecture for each subtask; rather, it converts the input and output into an ordered sequence that the model can process, relying solely on the language model yielded by the training procedure. For example, in question answering, the model concatenates the given document, the question, and the set of answers, delimiting the input (document + question) by a $, followed by the output (answers). Then, GPT predicts a probability distribution over all possible answers. It was the first task-agnostic model able to outperform state-of-the-art methods.24 Task agnostic refers to maintaining a single model architecture across all tasks. For example, the work of Radford et al.25 used "TL;DR" as a delimiter between input and target on summarization tasks.
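The input conversion can be sketched in a few lines. This is only an illustration of the idea of flattening a task into one ordered sequence: the helper names, the whitespace joining, and the example texts are assumptions, and the actual tokenization and special tokens used by GPT are not reproduced here.

```python
def format_qa_input(document, question, answer, delimiter="$"):
    """Flatten one (document, question, answer) triple into a single sequence."""
    return f"{document} {question} {delimiter} {answer}"

def format_candidates(document, question, answers, delimiter="$"):
    """Build one sequence per candidate answer; a language model would score each."""
    return [format_qa_input(document, question, a, delimiter) for a in answers]

def format_summarization_input(article, delimiter="TL;DR"):
    """Summarization prompt: the model is expected to continue after the delimiter."""
    return f"{article} {delimiter}"

document = "Rome is the capital of Italy."
question = "What is the capital of Italy?"
candidates = format_candidates(document, question, ["Rome", "Milan"])
for sequence in candidates:
    print(sequence)
print(format_summarization_input(document))
```

Only the text layout changes from task to task; the model that consumes these sequences stays the same, which is the sense of task agnostic used above.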

Unidirectional language models such as GPT provide narrow self-attention modules,9 that is, they apply a left-to-right architecture where tokens can only attend to previous tokens. In contrast, another variation of the original Transformers, named Bidirectional Encoder Representations from Transformers9 (BERT), used bidirectional unsupervised pre-training. It hides parts of the input using a mask and forces the model to fill these gaps relying solely on context. Consequently, it allows the current token to attend to both previous (left-to-right) and future (right-to-left) tokens. This bidirectional context-awareness has been explored for LSTMs as well, in the aforementioned work of Peters et al.23 on ELMo. BERT is still present as part of state-of-the-art models to date. Yet, the modern BERT approach has changed over time. In the original paper there are two losses: masked language modeling and next sentence prediction (NSP). NSP is a binary classification loss to predict whether two sentences follow each other in a document, which has proven to be inefficient.39 Nonetheless, modeling the relationship between sentences is an important aspect of language understanding.15 As an alternative to NSP, Lan et al.15 proposed a sentence-order prediction loss, in which positive examples are two consecutive segments in a document, just like in NSP, but negative examples are the same two segments in inverse order, lacking coherence. Thus, the model is led into learning textual coherence features.
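Both pretraining signals can be illustrated by how their training examples are built. The sketch below is a simplification under stated assumptions: every selected token is replaced by a [MASK] string (the original BERT recipe also mixes in random and unchanged tokens), the 15% masking rate is used as a default, and the function names are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens; labels keep the originals to be predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover this token from context
        else:
            masked.append(tok)
            labels.append(None)   # no loss is computed at this position
    return masked, labels

def sentence_order_examples(first_segment, second_segment):
    """Sentence-order prediction: original order is positive, swapped order is negative."""
    return [((first_segment, second_segment), 1),
            ((second_segment, first_segment), 0)]

tokens = "the student failed the class due to stress".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
print(sentence_order_examples("She opened the door.", "Then she walked in."))
```

A masked position can draw on tokens to its left and to its right, which is the bidirectional context described above. The sentence-order negatives reuse the same two segments, so topic cues alone cannot separate them from the positives.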

Radford et al.25 argue that focusing on the development of larger task-specific datasets will be a hard path due to the scale to which the current models are conditioned, and that the answer is to develop new unsupervised models through multitask learning. Methods that either fine-tune, or are solely trained, on task-specific datasets are limited to the presented context, while completely unsupervised training leads to general architectures, conditioned to much richer language models. Recently, the authors proposed the GPT-2 model, arguing that Transformers are flexible enough to allow for task-agnostic models given enough data. In order to justify such claims, they applied their language model to various tasks in a zero-shot setting (no fine-tuning) and, quite remarkably, achieved state-of-the-art performance on some of them.

What Lies Beyond?

The objective of this review is to explore the recent approaches that use attention mechanisms to map long-term relations between tokens (words, sentences, documents) in NLP tasks. We applied a variation of the systematic review process,14 by hard constraining the set of articles to those that either cite the original Transformers, improve on benchmark results, or had a widely known contribution (number of citations) to the state of the art. After reviewing recent work, our final corpus comprises 485 articles from 2018 and 2019, separated into nine categories based on the employed methods. Figure 3 details the proposed categorization of the reviewed literature. Although multitask learning can be framed as regularization, we separate it from works that target other forms of regularization, since it forms a main cluster among them. The same holds for multilingual pretraining. Works on data augmentation without new method proposals were discarded, due to their reliance on the state of the art.

Figure 3. Impact of Transformers on the NLP literature.

The review process converged on two main open issues that go beyond the capabilities of current attention mechanisms: commonsense reasoning and multitask learning. For the former, even the GPT-2 language model yields plausible yet meaningless text, in the sense that there is no understanding of words, just statistical knowledge of the distribution of these words in the training set. This statement is further supported by RNNs outperforming Transformers on simple tasks when the sentence length during testing differs too much from the lengths used during training,8 and by non-literal meaning being completely missed by current solutions. To address this limitation, we point to the trend of using Knowledge Graphs (KGs) within unsupervised pretraining. The interpretability of KGs is explicit, while attention mechanisms may yield questionable explanations.13 In addition, KGs allow for transfer learning and the use of smaller datasets, since part of the domain's features are previously encoded in the graph itself.

Another plausible alternative to the laborious task of developing larger datasets is to explore data augmentation and increase data diversity instead of volume. For example, Yu et al.40 applied a data augmentation technique for question answering that consists of translating the context to other languages, providing another perspective on the same sentence (paraphrasing). However, data augmentation tends to be task-specific, and generating large synthetic datasets may be infeasible.32

With respect to multitask learning, to date, semi-supervised methods still outperform fully unsupervised ones given the same model size. Nonetheless, the scores achieved by GPT-2 on multiple NLP tasks point to unsupervised training being a core step toward multitask models. The authors of GPT-2 even speculate that, given sufficient capacity, the language model starts to infer the task itself. It is a reasonable speculation, since the global minimum for both supervised and unsupervised training procedures is the same. Supervision just constrains the search space. On the other hand, unsupervised scenarios are much slower to train and have no guarantee of convergence. Despite the aforementioned achievements, GPT-2 performance is close to random on some tasks.

We frame multitask learning as the closest milestone after unsupervised pretraining. As an example of applicability, BERT learns universal word representations that can be used for various tasks, yet, after fine-tuning this generalization is lost due to overfitting. Multitask learning is a plausible solution for such cases.17 Figure 4 provides an overview of a complete NLP model, based on promising research trends. Next, we detail the most successful solutions proposed so far: enhancing the original Transformers and exploring domain knowledge.

Figure 4. Architecture for a complete and up-to-date NLP model.

Enhancing the Transformers. The Transformer-XL6 model achieved better language modeling results than both the Transformers and RNNs, by being able to map longer dependencies among input tokens. The authors argue that the main limitation of the original Transformers on context-heavy tasks is encoding all the context information in a fixed-size vector. Just splitting the context itself may break important dependencies due to incorrect boundaries. Thus, the authors proposed to cache the current context representation and propagate information from previous segments to further recurrent steps. Nonetheless, models combining recurrence and Transformers are a minority in the researched corpus; the state of the art is heavily reliant on the original BERT as backend model.

BERT-based models. In their work on RoBERTa, Liu et al.20 argue that the original BERT was undertrained and could reach state-of-the-art performance by increasing training time alone. The main proposed changes to the training procedure were the use of bigger batches and the removal of the NSP step, both of which became standard in the subsequent literature. Specifically, larger batches as a way to improve training efficiency were first proposed, within the Transformers context, by You et al.39 Their layer-wise adaptive large batch optimization allowed for batch sizes of 32868 sequences of tokens, reaching the memory limit of their test TPU. Remarkably, fully utilizing the hardware reduced their training time from 3 days to 76 minutes.

A multitask alternative was proposed in the work of Liu et al.18 on the Multi-Task DNN (MT-DNN). It is a BERT-based model, composed of a set of shared layers and a set of task-specific sub-layers. The input sequence x is fed to the shared layers, and a bidirectional encoder captures contextual information through self-attention. Afterward, both the contextual information and x are used as input to the task-specific layers. BERT-based models can even be extrapolated beyond the scope of NLP. For instance, the ViLBERT model21 combined language modeling with visual inputs, aiming at applications such as image captioning. The authors propose a two-stream model, one stream for text and the other for images, interacting through co-attention transformer layers. Co-attention refers to exchanging key and value vectors among different attention heads (multi-head attention), mixing contextual features captured in both the language and the visual streams.

Pretraining. Among pretraining objectives, the two most successful38 approaches are Autoregressive (AR) language modeling and Autoencoding (AE), exemplified by GPT and BERT, respectively. AR language modeling seeks to estimate the probability distribution of a text corpus, while AE reconstructs the original data from partial input. The advantage of the latter is that it enables bidirectional context encoding. In turn, artificially masking out input tokens results in a discrepancy between unsupervised pre-training and fine-tuning (no masks). Combining the two approaches is a recent trend that achieved new milestones in the works of Yang et al.38 and Song et al.30

Yang et al.38 propose the XLNet to learn the dependency log p(x|U), where U is a subset of tokens in x that encodes its context. Differently from AR models, it can map dependencies of x and U regardless of the order. In other words, instead of adopting left-to-right or right-to-left models, it learns with respect to all possible permutations of tokens in x. XLNet outperformed the original BERT and other contemporary models on 18 out of 20 proposed tasks.
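One way to picture learning over permutations is through the visibility pattern that a given factorization order induces. The sketch below is an illustration under assumptions, not XLNet's actual two-stream implementation: for a hypothetical order, a position may use as context only the positions that come earlier in that order, regardless of where they sit in the sentence.

```python
def permutation_mask(order):
    """mask[i][j] is True when position i may attend to position j.

    `order` lists the positions in the sequence in which they are predicted.
    """
    n = len(order)
    rank = {pos: step for step, pos in enumerate(order)}
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]

# The identity order recovers the usual left-to-right (autoregressive) mask.
left_to_right = permutation_mask([0, 1, 2])

# A shuffled order lets a token condition on tokens to its right as well.
shuffled = permutation_mask([2, 0, 3, 1])
for row in shuffled:
    print(row)
```

In the shuffled order, position 1 is predicted last and can therefore see positions 0, 2 and 3, that is, context on both sides, while the objective still predicts one token at a time as in AR modeling.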

Song et al.30 propose a successful training procedure for encoder-decoder based natural language generation (NLG), which is a key application of language modeling, widely impacted by Transformers.25 Their model is the MAsked Sequence to Sequence pretraining (MASS). Both BERT and GPT-2 train encoders and decoders separately, while MASS trains both jointly. The encoder receives x with a randomly chosen set of consecutive tokens masked. Next, the decoder receives the complementary view: the tokens that the encoder saw are masked, while the tokens that were masked for the encoder are now available, so the decoder must learn the context only from them. Consequently, the encoder is forced to extract more context information in its hidden state to aid the decoder. Also, the decoder relies only on context; thus, it must be capable of language understanding, that is, encoding the meaning of words.
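The complementary masking can be sketched by constructing one training example. This is a simplified illustration: the span position is passed in explicitly instead of being sampled, the decoder input is reduced to the shifted fragment, and the function name and the [MASK] string are assumptions.

```python
MASK = "[MASK]"

def mass_example(tokens, start, length):
    """Build (encoder_input, decoder_input, target) for one masked span."""
    end = start + length
    # The encoder sees the sentence with one consecutive span hidden.
    encoder_input = tokens[:start] + [MASK] * length + tokens[end:]
    # The decoder must generate exactly the span the encoder could not see.
    target = tokens[start:end]
    # The decoder input is the target shifted right; tokens that were visible
    # to the encoder are not given to the decoder.
    decoder_input = [MASK] + target[:-1]
    return encoder_input, decoder_input, target

tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
encoder_input, decoder_input, target = mass_example(tokens, start=2, length=3)
print(encoder_input)
print(decoder_input)
print(target)
```

Since the decoder never receives the unmasked part of the sentence directly, whatever it needs to know about that part has to come through the encoder's representation, which is the joint-training pressure described above.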

Larger models. A milestone in the trend of training ever larger models was achieved by NVIDIA in the Megatron-LM29 paper. The authors noticed that precise placement of the normalization step was important in very large models. By manipulating the residual connections and the placement of normalization layers, their model showed monotonically increasing performance as it grew larger. For faster training, each attention head was processed on a different GPU, enabling a new level of parallelism. Due to the scalability of their model, they achieved state-of-the-art results on GLUE tasks by increasing the size of the available BERT model from 340M up to 3.9B parameters.

In contrast to the parallel nature of the Megatron-LM model, ALBERT15 provides a method to reduce the number of parameters without an equivalent loss in performance. By improving the model scaling rather than its size, it is the current state of the art on the GLUE36 benchmark, while having fewer parameters than the original BERT. Two methods are applied: sharing parameters across all layers, which enables deeper networks without inflating the number of parameters, and decomposing the word embedding matrix into smaller matrices, so that the embedding size no longer has to match the hidden layer size. Notably, although ALBERT has fewer parameters than BERT, it is more computationally expensive due to its architecture.
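The effect of both methods on the parameter count can be checked with simple arithmetic. The figures below (a vocabulary of 30,000 tokens, hidden size 768, embedding size 128, and 7 million parameters per layer over 12 layers) are illustrative assumptions rather than numbers taken from the reviewed paper, and the helper names are hypothetical.

```python
def embedding_parameters(vocab_size, hidden_size, embedding_size=None):
    """Parameters of the word embedding, optionally factorized into two matrices."""
    if embedding_size is None:
        # Single matrix: vocabulary x hidden size.
        return vocab_size * hidden_size
    # Factorized: (vocabulary x embedding size) + (embedding size x hidden size).
    return vocab_size * embedding_size + embedding_size * hidden_size

def encoder_parameters(per_layer, num_layers, shared):
    """Stored parameters of the encoder stack, with or without cross-layer sharing."""
    return per_layer if shared else per_layer * num_layers

full = embedding_parameters(30_000, 768)
factorized = embedding_parameters(30_000, 768, embedding_size=128)
print(full, factorized)  # 23040000 3938304

unshared = encoder_parameters(7_000_000, 12, shared=False)
shared = encoder_parameters(7_000_000, 12, shared=True)
print(unshared, shared)  # 84000000 7000000
```

Sharing reduces what has to be stored, not what has to be computed: a shared layer is still applied once per depth level, which is consistent with the observation that fewer parameters do not imply cheaper computation.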

Two models stand out in size. Turing-NLG is a 17-billion-parameter language model by Microsoft, which follows a similar training procedure to Megatron-LM, but on a larger scale. Following the architecture proposed for GPT-2, Brown et al.3 propose GPT-3, a 175-billion-parameter autoregressive model, which was a huge increase in size from the previous largest model in the literature. It was widely evaluated on many NLP tasks in three scenarios: few-shot, one-shot, and zero-shot. One-shot allows a single demonstration of the task, while zero-shot allows only natural language instructions, forcing the model to rely solely on the pretraining. In the less constrained few-shot scenario, GPT-3 was able to surpass the state of the art, composed mostly of fine-tuned models, in some of the tasks. Remarkably, on the NLG task, GPT-3 produced news articles (up to 500 words) that were hard to distinguish from news written by humans.

Domain knowledge. To date, NLP is mostly centered around algorithms that can be trained on available task-specific labeled and unlabeled training samples.1 In contrast, humans rely on past structured knowledge of the world when facing new challenges. Recently, the search for models that encode the meaning of text, in tasks such as reading comprehension, led the NLP community toward task-agnostic models. The popularity of multitask benchmarks (GLUE) is a consequence of this change in focus. Properly encoding domain knowledge is important for creating task-agnostic models that actually map meaning, instead of an "empty" statistical prediction.7 We identified two promising trends to provide domain knowledge on top of the training procedure: KGs and Knowledge Distillation (KD).

Knowledge graphs. To provide prior facts for DNNs, the community resorts to KGs. Within the KG framework, knowledge means an organized set of world information, and prior facts are mapped as a graph of object state changes. For instance, facts can be stored as triplets of <object, relation, object>, as in <Italy, capital, Rome>, where the objects are the vertices of a KG and the relations are its edges.1 As a clear advantage, when a DNN has access to prior knowledge it can be trained with less labeled data. Moreover, by leveraging prior world knowledge the model can avoid storing these straightforward correlations, and spend trainable parameters on complex statistical reasoning.
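The triplet representation maps naturally onto a small graph structure. The sketch below is a toy illustration: the class name, its methods, and the single-object-per-relation restriction are assumptions made for brevity, not the data model of any reviewed work.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy store of <object, relation, object> triplets."""

    def __init__(self):
        # Vertices are objects; each labeled edge is a relation.
        self._edges = defaultdict(dict)

    def add(self, subject, relation, obj):
        """Insert one fact; a later fact with the same key overwrites it."""
        self._edges[subject][relation] = obj

    def query(self, subject, relation):
        """Look up a prior fact, or return None when it is unknown."""
        return self._edges.get(subject, {}).get(relation)

    def triplets(self):
        """List all stored facts as (subject, relation, object) tuples."""
        return [(s, r, o) for s, rels in self._edges.items() for r, o in rels.items()]

kg = KnowledgeGraph()
kg.add("Italy", "capital", "Rome")
kg.add("Rome", "located_in", "Italy")
print(kg.query("Italy", "capital"))
print(kg.triplets())
```

A fact stored this way can be looked up explicitly instead of being memorized in the model weights, which is the parameter-saving argument made above.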

Combining attention and KGs, Annervaz et al.1 proposed the extraction of knowledge through self-attention, in order to reduce the attention search space. They trained a multitask model following the supervised learning paradigm, using max p (y|x, xw, θ) as the objective function, where θ is the set of model weights, and the optimization process also factors in the world knowledge input xw. It outperformed similar architectures trained fully on labeled data. Das et al.7 highlight another key advantage of KGs: enabling the use of the graph theory toolbox. They propose a model based on KGs for reading comprehension. The KG is framed as a dynamic memory model that forms a graph correlating objects to their locations in the world and clustering them by spatial proximity. For each input sentence, the model queries the state of all objects and propagates changes to all other nodes. After a series of ablation studies, the authors show that their KG encodes enough knowledge priors to achieve state-of-the-art performance. Notably, the model learned commonsense constraints from data, without manual interference. An alternative way to impose priors on the training was proposed by Shirish Keskar et al.,28 in the CTRL model. Their target application is NLG. It is a 1.63B-parameter pre-trained Transformer, conditioned on control codes. These codes refer to actual labels defining additional features of text segments, for instance, text style, topic, date, or entities' names.

Knowledge distillation. New neural architectures constantly surpass previous ones, and the complexity of such models increases just as fast. As a result, these networks are impractical in mobile or real-time scenarios, which represent many real-world applications with limited resources.32 KD is the process of compressing the knowledge of a huge model (teacher) into a lighter representation (student) while minimizing the performance loss. Unlike KGs, KD transfers knowledge by teaching the student model to mimic the teacher and yield a similar prediction. In general form, the student model optimizes the negative log-likelihood; given a set of trainable student parameters θs and the set of teacher parameters θt, the loss function is:19


where (x,y) is the set of training instances (inputs and ground truth labels) and Q(y|x, θt) is the distribution yielded by the teacher model over the whole training set. When KD is based on an ensemble, the distribution is given by the average of all n teacher models.
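For reference, one common way to write a distillation objective of this kind, using the symbols defined in the text, is sketched below. It writes the student's trainable parameters as θs (to keep them distinct from the teacher's θt) and introduces y' as an assumed symbol ranging over candidate outputs. It is a form consistent with the surrounding definitions, not necessarily the exact formula of the cited work.

```latex
\mathcal{L}_{KD}(\theta_s)
  = -\sum_{(x,\,y)} \sum_{y'} Q(y' \mid x, \theta_t)\,\log p(y' \mid x, \theta_s),
\qquad
Q(y' \mid x) = \frac{1}{n}\sum_{i=1}^{n} Q\bigl(y' \mid x, \theta_t^{(i)}\bigr)
\quad \text{for an ensemble of } n \text{ teachers.}
```

Minimizing this quantity pushes the student distribution p toward the teacher distribution Q on every training input.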

Following the framework proposed by You et al.39 for training BERT models, DistilBERT27 applies knowledge distillation to the pre-training step. The key idea is to yield a pre-trained BERT model with reduced size: the student retains 97% of the teacher's performance on downstream tasks while being 40% smaller. The method uses a loss function composed of three terms: masked language modeling (as in the original BERT), a distillation loss, and a cosine distance that aligns the context vectors of the student and teacher models.
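The three-term loss can be sketched for a single masked position as follows. This is a plain-Python illustration under stated assumptions, not the DistilBERT implementation: the temperature, the equal term weights, and the function name `combined_loss` are hypothetical, and the vectors are toy values.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature that softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy of predicted_probs against target_probs."""
    return -sum(
        t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0
    )


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def combined_loss(student_logits, teacher_logits, true_index,
                  student_hidden, teacher_hidden,
                  temperature=2.0, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of masked-LM, distillation, and cosine alignment terms."""
    one_hot = [1.0 if i == true_index else 0.0
               for i in range(len(student_logits))]
    mlm = cross_entropy(one_hot, softmax(student_logits))
    distill = cross_entropy(softmax(teacher_logits, temperature),
                            softmax(student_logits, temperature))
    align = cosine_distance(student_hidden, teacher_hidden)
    w_mlm, w_distill, w_align = weights
    return w_mlm * mlm + w_distill * distill + w_align * align


# A student that agrees with the teacher should score lower than one that does not.
good = combined_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], 0, [1.0, 0.0], [1.0, 0.0])
bad = combined_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], 0, [0.0, 1.0], [1.0, 0.0])
print(good < bad)
```

Keeping the three terms as separate functions makes it easy to ablate any one of them by setting its weight to zero.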

Shallow models. Tang et al.32 argue that shallow neural models are not obsolete and can achieve results on par with their very deep counterparts through KD. They experiment using the original BERT model fine-tuned for machine translation as a teacher, and slightly augment the dataset with synthetic data. The key difference is the use of a single-layer bidirectional LSTM as the student model, which not only has far fewer parameters but is also architecturally different. These discrepancies indicate that KD is indeed a model-agnostic procedure, having direct impact only on the objective function. More remarkably, the shallow bidirectional LSTM achieved results comparable to ELMo while having 100 times fewer parameters and 15-fold faster inference.

Ensembles. In the work of Liu et al.,18 an ensemble of MT-DNNs reached state-of-the-art performance on GLUE tasks. Although ensembles provide improved generalization and good benchmark scores, they are inadequate for some applications due to the huge size of the complete model. For example, an ensemble of GPT-2 models would require an unreasonable number of parameters by current standards. Therefore, Liu et al.17 applied KD to generate a single model capable of maintaining the ensemble scores on 7 out of 9 GLUE tasks, outperforming previous single models by a large margin. The authors apply the technique proposed by Hinton et al.12 to distill the knowledge from the MT-DNN ensemble. Initially, they trained one ensemble model for each task and created soft targets for each of the training instances. Soft targets are the average of the predictions of all the individual DNNs in the ensemble. Next, the soft targets are used to inform the student model about how the ensemble generalizes.12
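The soft-target computation described above amounts to averaging the class distributions predicted by the ensemble members for each training instance. A minimal sketch, assuming three hypothetical ensemble members and a three-class task (all numbers are made up for illustration):

```python
def soft_targets(ensemble_predictions):
    """Average the class distributions predicted by each ensemble member."""
    n = len(ensemble_predictions)
    num_classes = len(ensemble_predictions[0])
    return [
        sum(member[c] for member in ensemble_predictions) / n
        for c in range(num_classes)
    ]


# One training instance, three assumed ensemble members, three classes.
predictions = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
]
targets = soft_targets(predictions)
print(targets)
```

Unlike a one-hot label, the averaged distribution also tells the student how plausible the ensemble finds the wrong classes, which is the generalization signal being distilled.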

We compare current state-of-the-art models by GLUE score and number of parameters in the accompanying table. Enabling the training of models such as GPT-3, two orders of magnitude larger than the already massive BERT, is the latest milestone achieved in recent literature. The table shows large models at the top of current benchmarks, ALBERT being the exception. Novel benchmarks that improve upon GLUE have been released for future research: SuperGLUE35 proposes harder language understanding tasks, and XGLUE16 enables cross-lingual model evaluation and training by providing labels for every task in multiple languages. Lan et al.15 suggest, as an alternative avenue, exploring additional representation power, that is, engineering the self-supervised losses and proxy tasks in order to map additional dimensions of the data. Given the score achieved by ALBERT, parameter-sharing methods can greatly improve the per-parameter efficiency of BERT-based models while also imposing additional regularization. Moreover, Tang et al.32 argue that shallow models, such as LSTMs, are capable of more expressiveness than they currently yield if subjected to more robust training procedures and KD. Notably, most models listed in the table are based on the original BERT architecture. Yet XLNet combined the advantages of two successful approaches, BERT and GPT-2, indicating that AR and AE models should be explored equally in hybrid solutions.

Table. Comparing Transformer-based models by score on the GLUE benchmark. Scores are the higher of those reported in the listed papers and on GLUE's leaderboard. Number of parameters from Sanh et al.27 when not available in the original paper. Score* refers to averaged scores over a subset of the GLUE tasks.

A clear open issue regards commonsense reasoning. As noted by Brown et al.,3 increasing model size shows diminishing returns on commonsense reasoning tasks. Das et al.7 argue that KGs are a valid solution for augmenting models with commonsense. There is a clear research opportunity in augmenting the proposed KG with other types of object relations besides world location. With respect to data augmentation, adding finer-grained metadata (for example, control codes28) to each sequence imposes commonsense reasoning on current models. However, adding prior knowledge to the data could also be harmful. Imposing biases on the raw data, assuming a pretrained model, and then proceeding through multiple training routines for fine-tuning may yield unexpected outcomes with regard to ethical concerns.26 Therefore, the degree of interpretability associated with novel architectures is a valuable criterion, as happened with the Transformer's self-attention.
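As a simple picture of what per-sequence metadata can look like, the sketch below prefixes a text segment with bracketed tokens built from a metadata dictionary. The token format, the field names, and the function `add_control_codes` are hypothetical; the actual control-code vocabulary of CTRL is not reproduced here.

```python
def add_control_codes(text, metadata):
    """Prefix `text` with one bracketed token per metadata field (sorted by key)."""
    prefix = " ".join(
        f"<{key}={value}>" for key, value in sorted(metadata.items())
    )
    return f"{prefix} {text}" if prefix else text


example = add_control_codes(
    "The capital of Italy is Rome.",
    {"topic": "geography", "style": "encyclopedic"},
)
print(example)
# <style=encyclopedic> <topic=geography> The capital of Italy is Rome.
```

Because the codes are part of the input sequence, any bias in how they are assigned is carried into training, which is the risk raised in the paragraph above.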

Sample efficiency is another concern, since pretraining requires more text than a human would have access to in a lifetime.3 In turn, the performance in most few-shot scenarios is far lower than human baselines. This further highlights the need for additional knowledge encoding methods and remains an open issue. Consequently, few-shot learning is a major research trend and is likely to remain a main focus of the community in the years to come. On the other hand, as argued by Brown et al.,3 zero- or one-shot scenarios are the closest benchmarks to actual humans.

This article has no intention to lessen the groundbreaking contributions of the current literature to NLP research. From attention to unsupervised training, works such as the original Transformer, BERT, and GPT-2 were a huge step toward NLU. Instead, we highlight open issues and propose challenges that must be faced henceforward. Attention, mainly self-attention, is indeed a standard in the current literature, being part of every recent model reviewed. Nonetheless, to achieve NLU with meaningful models, attention is not enough. Future research efforts can either explore multitask ensemble knowledge in a lighter format or use structured knowledge priors. Extracting knowledge from ensembles relies mostly on KD, which can also be used in a self-learning paradigm, with the ensemble being both the teacher and the student. On the other hand, structured knowledge could be encoded in a plethora of ways, yet we argue that KGs still have much untapped potential. As a final insight, we highlight the potential of older methods, such as LSTMs and GRUs, to perform on par with DNNs through multitask learning, knowledge transfer, and knowledge priors. We conjecture that DNNs themselves can still benefit further from these techniques.

1. Annervaz, K.M., Chowdhury, S.B.R. and Dukkipati, A. Learning beyond datasets: Knowledge graph augmented neural networks for natural language processing. NAACL-HLT, 2018.

2. Bahdanau, D., Cho, K. and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014).

3. Brown, T.B. et al. Language models are few-shot learners; arXiv:2005.14165 (2020).

4. Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of the 2014 Workshop on Syntax, Semantics and Structure in Statistical Translation. Association for Computational Linguistics, Doha, Qatar, 103–111.

5. Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation; arXiv:1406.1078 (2014).

6. Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. ACL, 2019.

7. Das, R., Munkhdalai, T., Yuan, X., Trischler, A. and McCallum, A. Building dynamic knowledge graphs from text using machine reading comprehension; arXiv:1810.05682 (2018).

8. Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J. and Kaiser, L. Universal transformers; arXiv:1807.03819 (2018).

9. Devlin, J., Chang, M-W., Lee, K. and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conf. North American Chapter of the ACL: Human Language Technologies 1. Association for Computational Linguistics, Minneapolis, MN, 4171–4186.

10. Gehring, J., Auli, M., Grangier, D., Yarats, D. and Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the 34th Intern. Conf. Machine Learning 70. JMLR.org, 2017, 1243–1252.

11. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., and Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. 51, 5, Article 93 (Aug. 2018).

12. Hinton, G., Vinyals, O. and Dean, J. Distilling the knowledge in a neural network. In Proceedings of the 2015 NIPS Deep Learning and Representation Learning Workshop.

13. Jain, S. and Wallace, B.C. Attention is not explanation. NAACL-HLT, 2019.

14. Kitchenham, B. and Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering. Technical Report. Keele University, 2007.

15. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the Intern. Conf. Learning Representations, 2020.

16. Liang, Y. et al. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. To be published.

17. Liu, X., He, P., Chen, W. and Gao, J. Improving multi-task deep neural networks via knowledge distillation for natural language understanding; arXiv:1904.09482 (2019).

18. Liu, X., He, P., Chen, W. and Gao, J. Multi-task deep neural networks for natural language understanding. ACL, 2019.

19. Liu, Y., Che, W., Zhao, H., Qin, B. and Liu, T. Distilling knowledge for search-based structured prediction. In Proceedings of the 56th Annual Meeting of the ACL 1. Association for Computational Linguistics, 2018, Melbourne, Australia, 1393–1402.

20. Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019).

21. Lu, J., Batra, D., Parikh, D. and Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks; arXiv:1908.02265 (2019).

22. Mikolov, T., Chen, K., Corrado, G.S. and Dean, J. Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013).

23. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K. and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conf. North American Chapter of the ACL: Human Language Technologies 1. Association for Computational Linguistics, New Orleans, LA, 2227–2237.

24. Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. Improving language understanding by generative pre-training, 2018.

25. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).

26. Rajani, N.F., McCann, B., Xiong, C. and Socher, R. Explain yourself! Leveraging language models for commonsense reasoning. ACL, 2019.

27. Sanh, V., Debut, L., Chaumond, J. and Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter; arXiv:1910.01108 (2019).

28. Keskar, N.S., McCann, B., Varshney, L.R., Xiong, C. and Socher, R. CTRL: A conditional transformer language model for controllable generation; arXiv:1909.05858 (2019).

29. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J. and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism; arXiv:1909.08053 (2019).

30. Song, K., Tan, X., Qin, T., Lu, J. and Liu, T-Y. MASS: Masked sequence to sequence pre-training for language generation. ICML, 2019.

31. Sutskever, I., Vinyals, O. and Le, Q.V. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 2014, 3104–3112.

32. Tang, R., Lu, Y., Liu, L., Mou, L., Vechtomova, O. and Lin, J. Distilling task-specific knowledge from BERT into simple neural networks; arXiv:1903.12136 (2019).

33. Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 5998–6008.

34. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. ACL, 2019.

35. Wang, A. et al. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. CoRR abs/1905.00537 (2019).

36. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR abs/1804.07461 (2018).

37. Wu, Y. et al. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016).

38. Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R. and Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. NeurIPS, 2019.

39. You, Y. et al. Large batch optimization for deep learning: Training BERT in 76 minutes. In Proceedings of the 2019 Intern. Conf. Learning Representations.

40. Yu, A.W. et al. QANet: Combining local convolution with global self-attention for reading comprehension; arXiv:1804.09541 (2018).

Eduardo Souza Dos Reis is a researcher at Softwarelab, Unisinos, Brazil.

Cristiano André Da Costa is a professor at Softwarelab, Unisinos, Brazil.

Diórgenes Eugênio Da Silveira is a researcher at Softwarelab, Unisinos, Brazil.

Rodrigo Simon Bavaresco is a researcher at Softwarelab, Unisinos, Brazil.

Rodrigo Da Rosa Righi is an assistant professor at Softwarelab, Unisinos, Brazil.

Jorge Luis Victória Barbosa is a professor at Softwarelab, Unisinos, Brazil.

Rodolfo Stoffel Antunes is an assistant professor at Softwarelab, Unisinos, Brazil.

Márcio Miguel Gomes is a researcher at Softwarelab, Unisinos, Brazil.

Gustavo Federizzi is a senior manager at Dell Inc., Brazil.

a. The full set of articles can be accessed at:

b. Available at

Copyright held by authors/owners.
Request permission to (re)publish from the owner/author

The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.

