Show It or Tell It? Text, Visualization, and Their Combination

When communicating information, language should be considered as co-equal with visualization.

Posted Oct 1 2023

Introduction
Key Insights
Striking a Balance Between Text and Vis
Case Study: Comparisons
Text Alone
Preferences or Literacy?
Cognitive Models for Combining Text and Vis
Language as a User Interface for Vis
Conclusion
Acknowledgments
References
Author
Footnotes

The field of information visualization studies how visual representations can express relationships among abstract data. Visualization (Vis) is often compared with alternative forms of presentation, such as tables of numbers. Although research assumes that visualizations are embedded in context—within newspapers, textbooks, social media posts, and presentation slides—the composition and placement of the language used in charts is usually an afterthought. For example, Apple recently released a set of user-interface guidelines which include patterns for designing charts.² This anatomy of a chart contains a carefully designed layout but entirely omits guidelines for the placement of the title and textual annotations. This omission reflects the assumptions within the field of information visualization, including many of its textbooks.

Key Insights

The interplay between language and visualizations for communication is not well understood.
The field of data visualization should treat textual content as a primary design element.
Recent advances in automatically generating visual content using language commands are likely to transform how visualizations are created.

These assumptions persist despite the fact that the seminal work of Borkin et al.⁴ showed that the language components of visualization are of key importance. That study compared a very large number of infographic designs, finding that the written text was the more memorable part of the visualization.

This article argues that language should be considered as co-equal with visualization when communicating information. This is a rather radical statement for the visualization community; that said, there has been a recent surge of interest in this topic, including a new workshop on natural language and visualization.³⁰ This article is an attempt to bring some of the questions and the results to a wider audience.^a

The remainder of this article discusses these main themes:

Combining Text and Vis: How much text should be placed on a visualization and where should it go? What should that text consist of? How can results from linguistics be integrated into this research?
Text Alone: Empirical work in visualization should, as a standard practice, compare charts against a baseline of no visualizations at all—a baseline of expressing the same information on the chart in language. Empirical evidence suggests there is a significant minority of people who tend to prefer no visualizations.
Literacy: The visualization field has theories of visual literacy, but it should incorporate theories of reading literacy as well.
Cognitive Models: The field does not use verified cognitive theories of how combinations of language and visualizations are read, perceived, and understood.
Language as the UI for Vis: The rather spectacular advances that are happening in natural language processing may have major impacts on how visualizations are created in the future.

Striking a Balance Between Text and Vis

Within the visualization community, Scott McCloud’s brilliant book, Understanding Comics,²¹ is an inspirational classic. To illustrate the tradeoffs between what is depicted in images versus words, McCloud introduces a running example in Chapter 5. In the first view, the message is expressed only with pictures, no words. The image shows the scene, the mood, and the action, freeing up the words to express something else, such as the inner thoughts of the character in the scene. In the reverse case, McCloud shows a comic consisting only of words above empty spaces. In this comic, the words carry the weight of describing the scene, action, and character’s internal state. When images are added, they can zoom in to show just a piece of the action or to convey the mood.

McCloud shows that if the image takes on one part of the description, the text is freed up to show some other content and vice versa. This framework can be applied to research questions about how text and visualization should be combined. We can break this further into:

What is the nature of the text that should appear on the visualization?
Where should it be placed?
How much is too much? And how do the visual and the language components interact?

The following subsections describe research addressing each of these questions.

Where? Kim et al.¹⁶ investigated the question: How do captions influence what people take away from charts? To better observe the influence of the text, they developed a method to determine which parts of a univariate line chart are most visually salient (see Figure 1, left). After finding the salient regions (the “where”), Kim et al.¹⁶ created captions that corresponded to each of these salient parts of the line chart (for instance, a sharp peak). The experimenters wanted to know if the caption content influences what people take away from the charts and if that text can override the most visually salient parts.

Figure 1. (Left) A study of where on a chart the most visually salient components are, from Kim et al.¹⁶ (Middle) Four levels of semantic description, from low (1) to high (4), from Lundgard and Satyanarayan.¹⁹ (Right) A schematic showing how much text to place on a chart, based on Stokes and Hearst.³³

The experimenters found that if the captions referred to the most salient parts of the chart, the most salient parts were recalled. But if the caption referred to the parts of the chart that were not the most visually salient, people recalled the parts called out by the text rather than the most salient regions; in other words, the text overruled the visuals.

But in the final case, if the caption referred to something not visually salient at all, then what participants recalled was more influenced by the chart. (This work was recently verified and extended.⁶)

These findings suggest a complex relationship between the effects of the visuals and the effects of the text. There appears to be a tipping point at which the visuals begin to have more sway than the text.

What kind? Lundgard and Satyanarayan investigated the question: What kind of language do blind and low-vision (BLV) people prefer for describing charts, compared with sighted people?¹⁹

In this study, the experimenters first asked participants to write descriptions of charts. They then analyzed these texts and identified four levels of semantics, L1 through L4 (see Figure 1, middle). The lowest level, L1, describes the components of the chart, while the highest level, L4, describes external contextualizing information not visible on the chart. These semantic levels categorize “what kind” of text is used to describe visualizations.

Interestingly, the BLV participants preferred different kinds of textual information than sighted people. In particular, the majority of BLV readers opposed high-level L4 expression, which by contrast was favored by the majority of sighted readers; the converse was true for low-level L1 language. These results are important in themselves for informing how to write alternative text for accessibility purposes.
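To make the lowest semantic level concrete, the following minimal sketch generates an L1-style description (chart components only) from a chart specification. The `ChartSpec` class, its field names, the sentence template, and the sample values are hypothetical illustrations and are not taken from the Lundgard and Satyanarayan study. Higher levels are omitted because L4, as described above, depends on context that is not visible on the chart.

```python
from dataclasses import dataclass


@dataclass
class ChartSpec:
    """Hypothetical minimal specification of a univariate line chart."""
    chart_type: str
    title: str
    x_label: str
    y_label: str
    x_range: tuple
    y_range: tuple


def describe_l1(spec: ChartSpec) -> str:
    """Return an L1-style sentence: chart components only, no trends or context."""
    return (
        f"A {spec.chart_type} titled '{spec.title}'. "
        f"The x-axis shows {spec.x_label} from {spec.x_range[0]} to {spec.x_range[1]}; "
        f"the y-axis shows {spec.y_label} from {spec.y_range[0]} to {spec.y_range[1]}."
    )


# Made-up example values for illustration only.
spec = ChartSpec("line chart", "Monthly rainfall", "month", "rainfall (mm)", (1, 12), (0, 200))
print(describe_l1(spec))
```

A description like this could serve as a starting point for alternative text, to be extended or trimmed depending on the audience.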

How much? Armed with “where” and “what kind”, I and several collaborators conducted a study that asked, “How much?”, as in how much text is too much for annotation as an overlay on a chart³⁵ (see Figure 1, right).

Working with univariate line charts, we systematically varied chart design from all chart and no text, all the way to all text and no chart, as shown in Figure 2. We created the charts by first finding the visually salient components as in Kim et al.¹⁶ and then labeling those components with the different semantic levels as in Lundgard and Satyanarayan,¹⁹ varying them for a controlled experiment with crowd-workers. We assessed these designs in two ways: with preference questions and according to what information people took away from the charts.

In terms of preference, a majority of participants preferred more text context (type C in Figure 2). In a subsequent analysis,³³ we found that although text can at first glance make the chart appear more cluttered, in actuality, this extra context was helpful and preferred, so long as the text was relevant and not redundant. This study also found the surprising result that 14% of participants ranked choice D, the all-text paragraph in Figure 2, as their top choice.

Figure 2. Four charts, from no text annotations (A) to all text (D), used to compare participant preferences, from the study of Stokes and Hearst.³³

In terms of what information participants took away from the charts, our findings were:

How much: Use relevant text; do not worry extensively about the clutter issue.
Where: The best position depends on the type of semantic content (level) being shown.
What kind: The best semantic level depends on the message being conveyed.

In summary, this study found that more text was better. That said, more research is needed to look at more complex and diverse chart types.

Case Study: Comparisons

The differences and dependencies between language and visualization can be understood through a case study of comparisons. Comparisons have been studied by both linguistics and visualization research. Both fields recognize the challenge of comparisons and both fields see this construct as being expressed in a diverse manner. On the linguistics side, for instance, Friedman¹⁰ states, “The comparative is a difficult structure to process for both syntactic and semantic reasons. Syntactically, the comparative is extraordinarily diverse.” On the visualization side, Gleicher¹² writes: “Supporting comparison is a common and diverse challenge in visualization.”

Despite this commonality, their methods for addressing comparisons are quite divergent. The difficulties in linguistics are the variation in expression, the challenge of determining what entities are being compared, and what those relationships are. The following statements present two ways of expressing comparisons, taken from a camera reviews collection.¹⁵ The syntactic structure and lexical choices between just these two examples are very different:

“I felt more comfortable with XTi and some of my friends felt more comfortable with D80.”

“On the other hand I actually prefer the D80 handling with smaller lenses, which is what’s on my camera 80% of the time.”

By contrast, the visualization literature assumes the entities and relationships being compared are known; the literature instead asks how to show those relationships and how to make them scale. A single sentence of language can only compare a few things at a time, but visualization compares dozens, hundreds, or thousands of items at once.

Comparisons with vague modifiers. In the field of cognitive linguistics, Schmidt et al.²⁸ examine the question: How do people decide what “tall” means? What is “tall” vs. “not tall”? The answer, determined empirically and with modeling, is that it depends on the distribution of the data points. For instance, a step function yields more agreement than an exponential drop-off (see Figure 3).

Figure 3. (Top) From Schmidt et al.,²⁸ the degree of agreement between human judges for what items are considered “tall” varies based on the distribution of those items. (Bottom) Using this result to label charts in a conversational interface, as described in Hearst et al.¹⁴

My colleagues and I used this result to determine how to show visualizations in response to natural-language comparisons containing vague modifiers like “tall” and “cheap.”¹⁴ We used the results of Schmidt et al.²⁸ to determine which bars to highlight for a response to a superlative comparison question, such as “Show the heights of the tallest buildings.” For instance, for the exponential drop off, the cognitive linguistics model shows us which bars we should highlight, depending on the shape of the curve (see Figure 3).
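One simple way to let the shape of the distribution decide which bars count as “tallest” is a largest-gap heuristic: sort the values and cut at the biggest drop between neighbors. This sketch is an illustrative stand-in only. It is not the cognitive linguistics model of Schmidt et al.²⁸ and not the method used in the system described above; the function name and the sample heights are made up.

```python
def highlight_tallest(values):
    """Return the indices of the items treated as 'tall'.

    Heuristic (an assumption, not the published model): sort the values in
    descending order and cut at the largest drop between neighbors.
    """
    if len(values) < 2:
        return list(range(len(values)))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    # Drop between each pair of consecutive values in descending order.
    gaps = [values[order[k]] - values[order[k + 1]] for k in range(len(order) - 1)]
    cut = gaps.index(max(gaps))  # first position of the largest drop
    return sorted(order[: cut + 1])


# Step-like distribution: a clear break separates two tall items from the rest.
step = [828, 800, 310, 305, 300, 298]
# Smooth, exponential-like drop-off: only the first value stands apart.
smooth = [800, 400, 200, 100, 50, 25]

print(highlight_tallest(step))
print(highlight_tallest(smooth))
```

For the step-like data the cut falls at the obvious break. For the smooth drop-off the largest gap is a much weaker signal, which fits the observation that human judges agree less on such distributions.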

Comparison questions in chat interfaces. My colleagues and I did another study to assess a related question: how to show visualizations in a conversational interface for an intelligent assistant like Siri or Alexa.¹³ The goal was to determine what kind of visual context people prefer after they ask a comparison question with a simple answer, such as “Which Olympic sport has the tallest players: rowing or swimming?”

We found that many people preferred bar charts, so long as the chart did not get too long. However, we also found that 41% of participants did not want to see any chart at all in this context. They preferred text alone. Statements by participants showed that those who preferred bar charts preferred seeing the data points in the context of other bars. People who preferred text said that it is precise and not overly complicated. A few participants switched from preferring text to preferring charts, when the situation merited it.

In summary, comparisons are a good case study for delving deeply into questions about the differences between visualizing a concept versus expressing it in language; findings from cognitive and computational linguistics can help shed light.

Text Alone

The previous section recounted two cases in which a sizable minority of participants preferred text without a chart. This is surprising from the perspective of information visualization research. Although not common, other examples can be found in the literature in which the text-alone condition was tested.

For instance, McKenna et al.²² conducted a study comparing different ways of presenting a scrollytelling design (a design in which visualizations appear dynamically within the text as the user scrolls down a long Web document), including one design with no visualizations at all. Surprisingly, a notable minority of 10% of participants said they preferred this no-visual condition.

In an investigation by Ottley et al.,²³ researchers experimented with methods to help explain Bayesian reasoning. They compared text alone, visualization alone, and the two juxtaposed. They found that visualization was not more accurate than text for this purpose. They also found that when text and visualizations were presented together, participants did not seem to take advantage of the distinct affordances of each.

The last example is the well-known “Explaining the Gap” paper by Kim et al.¹⁷ Its goal was to compare how well people recalled data depending on whether they had to first predict a trend or not. Their experiment design included a comparison between text-only and visualization-only conditions. The authors had three major findings with respect to text. First, presenting data as text helps people recall those values better than with a visualization. Second, the visualizations were better than text at helping people recall trends. Third, the aid to prediction was found for the visualization but not for the text.

In summary, studies provide strong evidence that a text-only variant should be tested when assessing the efficacy of a design.³⁴

Preferences or Literacy?

One explanation for why people prefer text alone over visualization is personal preference; it could be that some people prefer reading while others prefer visual information. But this raises the question of why people prefer one over the other.

Within education circles, there is much discussion of math literacy (or numeracy) and computational literacy.³⁸ Within the visualization community, the notion of visualization literacy has recently interested researchers. Solen³² defines it as “…the ability to critically interpret and construct visualizations.” An explanation for the differences in preferences could be that some people have not learned how to interpret charts, or have not had enough practice interpreting them, to be “fluent” at reading them, which results in their not preferring them.

However, visualization research rarely considers the flip side: the role of reading literacy, the original meaning of the word. Reading expert Maryanne Wolf explains what is understood about cognition and reading, and relates it to the importance of fluency in reading.³⁹ Wolf notes that true literacy can be achieved only when readers become expert (that is, fluent) readers. Wolf opens her book with, “We were never born to read”, meaning that although most humans innately learn spoken language, reading is not innate. She points out that to learn to read, special pathways need to be formed across many different brain regions that were not evolved for reading.

Wolf points out that researchers have gathered extensive evidence that the processing of words occurs in the parafovea, before the word is directly fixated on. An expert reader uses peripheral vision to pick up on visual characteristics such as word shape. This peripheral vision does not usually indicate the word’s meaning, but it can approximate the general shape of what is to come. Wolf notes that this preview of what lies ahead on the line contributes to fluent reading. Wolf also talks about why fluent reading is so important—it gives enough time to the executive system to direct attention where it is most needed—to infer, understand, and predict. In other words, to think while you are reading. Thus, literacy with fluent reading opens the door to developing new understanding while reading.

A growing trend in news reporting and scholarly publishing is to insert visualizations within the body of text paragraphs; Figure 4 shows a constructed example. This practice does not take into account how integrating visualizations within text can disrupt fluent reading. In looking at the figure, consider how easy or difficult it is to read the text. Do you read the paragraph straight through, or does your eye dart from the visualizations to the text, and back again in an erratic manner?

Figure 4. Example of visualizations inserted within a paragraph that most likely impede fluent reading; text adapted from Wikipedia.³⁷

There is evidence that unexpected images within a paragraph can disrupt fluent reading. Consider hyperlinks. Fitzsimmons et al.⁹ found that readers focus on hyperlinks when skimming, and they tend to use these links as markers for important parts of the text. Similarly, studies show that emoji icons embedded within text can slow down reading.³,⁷ Thus, given the prior results, it is likely that embedded visualizations as shown in Figure 4 will have deleterious effects on fluent reading.

In summary, to shine a light on the reasons for people’s preferences for text versus visualization, future work should consider both reading and visualization literacy and fluency when combining language and charts.

Cognitive Models for Combining Text and Vis

Cognitive models are used to understand why and how mental processes work and to aid in formulating predictions, such as how a person viewing a visualization will interpret it. For instance, Padilla et al.²⁴ used a cognitive model to identify the specific process underlying why people misinterpreted hurricane forecast visualizations. However, within the visualization field, no commonly used cognitive model exists to shed light on the question of how text and visualizations are mentally processed when combined. In the Bayesian reasoning experiment described above, the authors noted that the field of visualization does not have sophisticated guidelines for understanding how to combine the two modalities.²³

Mayer²⁰ has done extensive research on combining language and visuals for the purposes of education. This work considers the placement and modality of text (written or audio) within images in the limited context of physical process explanations.

The cognitive theory that Mayer employs is the dual-channel model,²⁰ which assumes separate cognitive systems or channels for processing pictorial versus verbal information. It assumes each channel has limited capacity, and meaningful learning involves actively building connections between the two.

From the field of journalism, Sundar³⁶ presents three main cognitive model theories. The first is the dual-channel theory just mentioned, which states there are two cognitive subsystems for language versus image, and they operate independently when coding information into memory. The next is cue-summation theory, which posits that when text and visuals are presented together, text provides additional learning cues, particularly at memory retrieval time. The third is the limited capacity information processing theory, which states that combining multiple modalities overwhelms the system. Together, these theories cover all of the cases: Text plus visuals are either independent, additive, or interfering. There does not seem to be any consensus or even strong evidence for which is correct.

In summary, there is no widely accepted cognitive model for how text and images are perceived together, which may cause empirical results to be less predictive than if such a model existed. One remedy is for the visualization community to engage with cognitive scientists on this important and underexplored question.

Language as a User Interface for Vis

There has been extensive prior work on incorporating natural language processing (NLP) into information visualization systems.³¹ This includes using language as a query against data to create a visualization²⁹ and using natural language to build and refine designs of visualizations.⁸,¹¹ However, these systems use technology that predates the recent advances in NLP. They often consist of a software pipeline of diverse algorithms, often including tokenizers, part-of-speech taggers, syntax parsers, entity recognizers, and a variety of semantic analyzers. Each stage requires its own hand-labeled training data and format, and information is lost from one stage in the pipeline to the next. These systems are either limited in scope or are unable to get robust coverage of the possible ways to express relevant concepts.

Large generative language models (LLMs), such as GPT-3 and T5, are transforming the fields of both NLP and computer vision. Although the models are very large in terms of input data and parameters trained, their architectures are in some sense simple compared to the NLP pipelines of the past. In the new transformer-based or diffusion-based approaches, one model is used for a wide variety of applications. For instance, T5 is trained on many tasks simultaneously, with the input represented as a textual description of the task.²⁶ For some tasks, training is self-supervised, meaning that the training phase does not require hand-labeled examples. These systems require enormous amounts of data and compute power to train on, but when that training is complete, the resulting models can be used as is or fine-tuned on a specific problem, often with few labeled training examples. Today, LLMs allow for language to be used as the interface to generate general images. Some of the new models train simultaneously on text and image input, creating a model that represents the two modalities jointly. Applications such as DALL-E 2, Imagen, and Midjourney can produce photo-realistic images in response to textual prompts and have become a popular way for non-technical users to generate sophisticated, humorous, and surreal images. For instance, Figure 5 shows the output of DALL-E 2²⁷ in response to the text input, “Interior of a library filled with books, with a stock line chart on an easel in the center, oil painting.”

Figure 5. Image generated with the help of DALL-E 2²⁷ in response to the text prompt: “Interior of a library filled with books, with a stock line chart on an easel in the center, oil painting.”

LLMs are also having great success at automatic code suggestions based on natural language commands, as seen for instance in GitHub Copilot, based on Codex.⁵ These LLMs can aid in the creation of programmatically defined visualizations. For example, when Copilot is given the commented command:

# plot this as a bar chart with bars colored blue unless the car name starts with "M"

Copilot responds with:

plt.bar(cars, values, color=['blue' if not car.startswith("M") else 'red' for car in cars])

An entire matplotlib program can be quickly specified to both make up data and plot it on a chart. Copilot, as part of GitHub, is already used by hundreds of thousands of programmers, many of whom think it enhances their productivity⁴⁰ (although studies suggest that its use can result in programming errors or security flaws²⁵).
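As a rough illustration of such a program, here is a minimal hand-written sketch (not Copilot output) that makes up data and applies the coloring rule from the prompt above. The car names, values, function names, axis labels, and output file name are invented for the example, and the plotting function assumes matplotlib is installed.

```python
def bar_colors(cars):
    """Blue for every car unless its name starts with 'M', which gets red."""
    return ["blue" if not car.startswith("M") else "red" for car in cars]


def plot_cars(cars, values, path="cars.png"):
    """Draw the bar chart and save it to `path` (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(cars, values, color=bar_colors(cars))
    ax.set_xlabel("Car")
    ax.set_ylabel("Miles per gallon (made-up values)")
    fig.savefig(path)
    plt.close(fig)


# Made-up data for illustration only.
cars = ["Mazda", "Ford", "Mini", "Toyota"]
values = [30, 25, 28, 33]
print(bar_colors(cars))
```

Calling `plot_cars(cars, values)` writes the chart to `cars.png`. Keeping the color rule in its own function makes it easy to check the generated logic separately from the rendering.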

As the models improve, it will become more feasible to express what is wanted using natural language than by writing commands, code, or even using a graphical user interface (GUI). There has been a long-running debate about which is better for exploring, analyzing, and visualizing data: using a GUI or writing code. People’s preference depends on which tool they are most comfortable with, and many experts use a combination of the two.¹ The very rapid uptake and popularity of tools like Copilot suggests that the answer in the long run is going to be: not GUI, not code, but language. We will simply speak or type how we want the data to be visualized, perhaps augmented with pointing or gesturing. (It will be important to have the models trained on high-quality designs so that they do not reproduce designs with poor usability.)

This preference for using language as the user interface is not new. In the late 1990s a search engine called Ask Jeeves that purported to let people enter natural language questions rather than type keywords was enormously popular, despite being brittle and having low coverage.¹⁸ Ask Jeeves employed people to create enormous databases of questions and answers; it took 20 years of development before major search engines could perform this task reliably.

It is important to draw attention to the many known problems and concerns with LLMs. First, they are trained on huge collections of data; if care is not taken, they repeat the biases and injustices that are inherent in those collections. Another major problem is that the field does not really understand how they work, and furthermore, the results they produce cannot be predicted or explained in a way that makes sense to people. The third problem is that they are far from perfect, and they produce compelling output without having what we would consider an understandable internal representation of what it is they are producing. They are huge, not available to all researchers or users due to their size, and they are costly to train in terms of compute time, and to a lesser degree, energy consumption (although some of these drawbacks may subside due to research efforts). There are concerns about how the building of models from other people’s intellectual property perhaps violates their ownership rights. Perhaps the biggest drawback of all is how these models can contribute to misinformation and make it very hard to determine what information is generated by humans versus by computer software.

In summary, although today's large language models are still far from being fully usable for generating visualizations, they are likely to significantly transform how we generate them in the future.

Conclusion

This article has advocated for and described research about the complex interactions when combining text with visualizations, the importance of considering text alone when assessing visualization designs, the need for better cognitive models that combine reading and understanding visualizations, and a future projection of language as the user interface for visualizations.

There are many other topics in this space that are not covered here, including the role of bias and slant, misinformation and deception, multi-lingual text, visualizing text itself, linking interactive visualizations within documents, and spoken versus written text in combination with visualization. The field is ripe for innovation and increased understanding.

Acknowledgments

I thank David Ebert, Hendrick Strobelt, and Danielle Szafir for inviting me to give the IEEE Vis 2022 keynote talk, upon which this article is based, and Chase Stokes, Cindy Xiong, Aiman Gaba, and Huichen (Will) Wang for feedback on the talk. Finally, I thank the anonymous reviewers for their feedback and suggestions for improvement of this manuscript.

This work is licensed under a Creative Commons Attribution-NoDerivs International 4.0 License.

DOI

10.1145/3593580

October 2023 Issue

Published: October 1, 2023

Vol. 66 No. 10

Pages: 68-75
