
Communications of the ACM

Review articles

Transformers Aftermath: Current Research and Rising Trends


Credit: Serp / Shutterstock

Natural language processing (NLP) is a central subject in artificial intelligence research, on par with computer vision and scene understanding. Knowledge is encoded and transferred among individuals through text, following a formally defined structure. Nonetheless, distinct languages, contexts, and the subtle particularities of different communication channels are complex challenges that researchers must cope with. Hence, the task of general language modeling and understanding was divided into multiple subtasks, such as question answering, image captioning, text summarization, machine translation, and natural language generation. Recently, attention mechanisms became ubiquitous among state-of-the-art approaches, allowing models to selectively attend to different words or sentences in order of relevance. The goal of this review is to gather and analyze current research efforts that improve on, or provide alternatives to, attention mechanisms, categorize trends, and extrapolate possible research paths for future work.



Recurrent neural networks (RNNs) were broadly adopted by the NLP community and achieved important milestones.37 However, encoding long-term relations among words in a sentence, or among sentences in a document, is computationally expensive with RNNs. In tasks such as text generation, encoding these dependencies is fundamental, and inference may become prohibitively slow. Pursuing a solution to these limitations, the seminal work of Vaswani et al.33 engineered the Transformer architecture. The core idea was to make attention mechanisms central, discarding recurrence altogether. Soon after, the Transformer became the de facto backbone model for most NLP tasks. From the original December 2017 publication to date, there have been over 4,000 citations and remarkable work building on the concept. Due to this early success, the community's focus became the development of larger models, containing scaled-up Transformers and longer context windows, leading to a shift from task-specific training procedures to more general language modeling. On the other hand, although current models can generate text with unprecedented perplexity, they rely mostly on statistical knowledge contained in the text corpora, without encoding the actual meaning behind words. Hence, the produced text has surprisingly fluent segments contrasting with meaningless word sequences. This lack of meaning cannot be trivially solved through attention alone.
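The attention mechanism at the heart of the Transformer can be stated compactly: each query is compared against every key, the resulting scores are normalized with a softmax, and the values are averaged accordingly. The following is a minimal illustrative sketch of scaled dot-product attention in NumPy; the function and variable names are our own, not the reference implementation of Vaswani et al.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) query, key, and value matrices.
    d_k = Q.shape[-1]
    # Every query attends to every key in a single matrix product,
    # with no recurrence over positions; scaling by sqrt(d_k) keeps
    # the logits in a range where the softmax is not saturated.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Toy example: 3 tokens with 4-dimensional representations.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

Because the score matrix is computed for all position pairs at once, the cost is quadratic in sequence length, which is precisely the bottleneck that much of the research surveyed below attempts to reduce.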


