
Language Models Wrestle with Gaps in Understanding

Language models seem to be more than stochastic parrots. Does this knowledge stop them from making mistakes, or do they need more help?


If you want a job done well, you are probably better off not using a language model to do it. Thanks to the internal connections they create from terabytes of data ingested during pretraining, they produce results that can seem like rudimentary reasoning. Yet even a simple change to a query, such as switching the order of items, can cause the AI to provide different, sometimes wildly incorrect answers.

“People have called it ‘jagged intelligence’: it works when it works,” said Subbarao Kambhampati, professor of computer science at Arizona State University’s Fulton School of Engineering.

Yet there are hints that language models are, at least sometimes, doing more than basic statistical matching. The internal structures that language models build seem to improve their overall accuracy, although they do not prevent them from making egregious mistakes all too often. Ben Levinstein, associate professor of Philosophy at the University of Illinois at Urbana-Champaign, is among those who take the view that these AI systems are creating reasonable world models. These models help boost performance even though the pretraining regime is not designed to create them; it is only designed to find the most probable sequence of words generated one by one in response to a prompt.

“They aren’t master world models. They are abstracted hypotheses about how some components of the world work that help the language model decide what to say,” Levinstein explained, noting that current research points to these basic models working alongside shallower heuristics that arise from the statistical nature of the AI.

Harvard University postdoctoral fellow Keyon Vafa and coworkers at the Massachusetts Institute of Technology (MIT) and Cornell University trained a language model on New York taxi journeys to see if it would build an internal model of the streets of Manhattan. To some extent it did, delivering usable route plans. But a graph created from analysis of the language model’s internal state showed flyovers and direct connections between streets that do not exist in the real world. These phantom connections led to the AI “hallucinating” impossible routes when the prompt included closures and diversions.

Model capacity and training focus both seem to play a role in how well a language model can build logical representations. MIT Ph.D. student Charles Jin used the code from simple robot-control programs written in Karel, together with their inputs and outputs, to train a language model. The results paralleled the evolution of language models themselves. Tests after the early training steps showed the model doing little more than babbling; it spat out sequences of random instructions that did not work.

As training reached the midway point, the model seemed to acquire the correct syntax for the language, but it still failed to generate programs that controlled the virtual robot properly. Then, about three-quarters of the way through training, it seemed to build a model of the language’s semantics good enough to generate correct programs in response to more than 90% of prompts. Even so, the question remains whether language models are doing more than implementing huge lookup tables.

To settle this question, researchers have come up with ways to probe changes in language models’ internal “state of mind.” Because of the huge number of weights involved in generating each token the model outputs, these probes themselves rely on machine learning techniques to reduce the information to a set of human-readable states, such as, in Jin’s work, the direction in which the simulated robot is facing. This training can lead to false positives, because the probe may learn the task by itself rather than revealing the language model’s operation. Jin took the approach of tampering with the semantics of Karel to see if the probe changed its behavior to match. It did not, implying the language model itself was keeping track of the robot’s position and direction as it moved according to the program statements.
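The gist of the probing approach can be sketched in a few lines. In the minimal sketch below, a simple linear classifier is trained to predict an interpretable property, loosely analogous to the robot’s facing direction in Jin’s experiments, from a model’s hidden states. The arrays are placeholder data standing in for captured activations and ground-truth labels; this is not Jin’s actual code or data.

    # A minimal sketch of a probing classifier (placeholder data, not Jin's code).
    # The idea: if a simple probe can predict an interpretable property (here, a
    # hypothetical "robot facing direction") from a language model's hidden
    # states, that property is plausibly encoded in those states.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Placeholder data: in real work these would be hidden states captured while
    # the model processes Karel programs, paired with the ground-truth robot
    # direction at each step (0 = north, 1 = east, 2 = south, 3 = west).
    hidden_states = rng.normal(size=(2000, 768))   # (examples, hidden_dim)
    directions = rng.integers(0, 4, size=2000)     # ground-truth labels

    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, directions, test_size=0.2, random_state=0)

    # Keep the probe deliberately simple (linear) so that high accuracy is
    # evidence the information sits in the representation, not in the probe.
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("probe accuracy:", probe.score(X_test, y_test))

Keeping the probe simple is what makes a positive result informative: a linear classifier has little capacity to learn the task on its own, so accuracy well above chance suggests the information is already present in the model’s representation, which is exactly the concern Jin’s semantics-tampering check was designed to rule out.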

More evidence of the AI going beyond statistics to implement some basic reasoning comes from attempts to identify the path of information through the stack of feedforward layers used by all language models. Scientists in a group led by Tel Aviv University assistant professor of computer science Mor Geva looked at how signals propagate in a series of logical hops through the model’s stack of neuronal layers. They tested it using prompts like “find the mother of the singer of the hit song Superstition.”

The probes showed language models can readily find Stevie Wonder’s name by the end of the first hop. The process fails on the second hop when the information from the first hop does not propagate quickly enough, so the wrong answer is delivered at the output. With more layers, the chances of success improve, but Geva’s group found that bigger models tend to use the first half of their layer stack to extract a single fact, no matter how many layers they have available in total. “There is nothing that pushes the model to do it with fewer layers,” she said, which seems to limit how many connections a model can make in a single pass through its internal, latent space.
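One widely used way to peek at this layer-by-layer progress is the so-called “logit lens”: project each intermediate layer’s hidden state through the model’s output head and see which token that layer currently favors. The sketch below applies the idea to GPT-2 via the Hugging Face transformers library purely as an illustration; the Geva group’s actual models and probing methodology differ.

    # A rough sketch of inspecting what a model "has resolved" at each layer,
    # in the spirit of the logit-lens technique (GPT-2 is a small stand-in here;
    # it is not the model or method used by Geva's group).
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "The singer of the hit song Superstition is"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Project each layer's hidden state at the final position through the
    # unembedding matrix to see which token that layer currently favors.
    for layer, hidden in enumerate(outputs.hidden_states):
        logits = model.lm_head(model.transformer.ln_f(hidden[:, -1, :]))
        top = tokenizer.decode(logits.argmax(dim=-1).tolist())
        print(f"layer {layer:2d}: {top!r}")

In multi-hop terms, this kind of readout shows roughly where in the stack an intermediate fact such as the singer’s name becomes available, and therefore how many layers remain for the second hop.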

One way to give the models more time to come up with the right result is to use chain-of-thought (CoT) prompting. This decomposes multi-hop problems into a sequence of simpler requests that the language model has a better chance of answering correctly. Traditionally, this kind of prompting has been a manual process. OpenAI’s “Strawberry,” also known as OpenAI o1, instead uses a second language model to decompose a request into a sequence of CoT prompts. This model responds to failures in intermediate steps by backtracking and generating alternative paths to solving a problem.
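In its manual form, the decomposition can be as simple as chaining two prompts and feeding the answer to the first into the second. The sketch below illustrates that pattern; query_llm() is a hypothetical placeholder for whichever model API is being used, not a real library call.

    # A minimal sketch of decomposing a multi-hop question into a chain of
    # simpler prompts. query_llm() is a hypothetical placeholder for a call to
    # an actual language model API.
    from typing import List

    def query_llm(prompt: str) -> str:
        """Placeholder: send `prompt` to a language model and return its answer."""
        raise NotImplementedError("wire this up to a real model API")

    def answer_multi_hop(question_steps: List[str]) -> str:
        """Answer each step in turn, feeding earlier answers into later prompts."""
        answers: List[str] = []
        for step in question_steps:
            prompt = step.format(*answers)
            answers.append(query_llm(prompt))
        return answers[-1]

    # Example usage (requires a working query_llm), replacing the single
    # hard query "Who is the mother of the singer of the hit song Superstition?":
    # answer_multi_hop([
    #     "Who sang the hit song Superstition? Answer with just the name.",
    #     "Who is the mother of {0}? Answer with just the name.",
    # ])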

Kambhampati argues that the language models involved in a system like OpenAI o1 cannot provide guarantees of correctness. He sees the combination of language models with symbolic tools as a way of delivering more reliable results. An example of this “LLM-Modulo” architecture, named after a technique used by formal satisfiability solvers, is Google DeepMind’s AlphaProof.

To solve Math Olympiad problems, the developers trained AlphaProof to write proofs in the formal language Lean. The verification engine designed for Lean then checked each solution, forcing the model to generate new attempts after every rejection until a working proof emerged. Kambhampati sees similar systems being used for more general program synthesis and for planning, since these applications, too, can harness formal verification tools and solvers. However, the additional tools would have to be tuned to the target application, and the architecture is not a good fit for chatbots.
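The overall shape of such a system is a generate-and-verify loop: the language model proposes candidates, an external sound checker accepts or rejects them, and critiques are folded back into the next attempt. The sketch below is only a schematic of that loop under assumed placeholder functions; in AlphaProof’s case the verifier role is played by Lean’s proof checker, and the real training and search machinery is far more elaborate.

    # A schematic sketch of an LLM-Modulo-style loop. Both generate_candidate()
    # and verify() are hypothetical placeholders: the first stands for a call to
    # a language model, the second for a formal checker (e.g., a Lean proof
    # checker in a theorem-proving setting).
    from typing import List, Optional, Tuple

    def generate_candidate(problem: str, feedback: List[str]) -> str:
        """Placeholder: ask a language model for a candidate solution,
        including critiques from previously rejected attempts."""
        raise NotImplementedError

    def verify(problem: str, candidate: str) -> Tuple[bool, str]:
        """Placeholder: run a formal checker; return (ok, critique)."""
        raise NotImplementedError

    def solve(problem: str, max_attempts: int = 20) -> Optional[str]:
        feedback: List[str] = []
        for _ in range(max_attempts):
            candidate = generate_candidate(problem, feedback)
            ok, critique = verify(problem, candidate)
            if ok:
                return candidate          # only verified solutions are returned
            feedback.append(critique)     # backtrack and try again with critique
        return None                       # no guaranteed-correct answer found

The correctness guarantee comes entirely from the verifier: the language model is free to guess, but nothing it produces is returned to the user until the external checker has signed off on it.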

For chatbots, retrieval-augmented generation (RAG) can act as an external aid and is already in widespread commercial use. RAG works on the assumption that hallucinations will be less common if the LLM is constrained to generate text using only the facts stored in a local knowledge base, rather than relying on the data it ingested during training.

RAG remains a long way from preventing language models from delivering fictional answers. The interface to the knowledge base is very similar to that used to interpret and store text learned during pretraining: a vector into a huge multidimensional space. There is no guarantee of extracting the right data if several elements sit close to each other in that vector space. To deal with that, some researchers are trying to use a language model’s internal signals coupled with external tools to check consistency between the model’s decisions and what it retrieves from the knowledge base.
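The core retrieval step is easy to state, which also makes the failure mode easy to see: documents and query are embedded into the same vector space, and whatever lands nearest to the query gets pasted into the prompt. The bare-bones sketch below illustrates this; embed() is a placeholder for a real embedding model, and production systems use a vector database rather than brute-force search.

    # A bare-bones sketch of the retrieval step in RAG. embed() is a placeholder
    # for a real embedding model; the point is that retrieval is just
    # nearest-neighbor search in a shared vector space.
    from typing import List
    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder: return an embedding vector for `text`."""
        raise NotImplementedError

    def retrieve(query: str, docs: List[str], k: int = 3) -> List[str]:
        """Return the k documents closest to the query by cosine similarity."""
        q = embed(query)
        scored = []
        for doc in docs:
            d = embed(doc)
            sim = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
            scored.append((sim, doc))
        scored.sort(reverse=True)
        # If several entries sit close together in the vector space, nothing
        # here guarantees the *right* one ranks first.
        return [doc for _, doc in scored[:k]]

    def build_prompt(query: str, docs: List[str]) -> str:
        context = "\n".join(retrieve(query, docs))
        return ("Answer using only the facts below.\n\n"
                f"Facts:\n{context}\n\nQuestion: {query}\nAnswer:")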

Design decisions seem to exacerbate the problem, which in turn suggests tactics for reducing errors. By selecting only the most probable token at the end of each cycle, the commonly used greedy-decoding method may make the situation worse compared to a system that considers a wider range of options. One way to spot such errors is to probe the language model’s state at runtime.
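The sketch below illustrates the difference, again using GPT-2 as a small, convenient stand-in for the much larger systems discussed here: greedy decoding keeps only the single most probable next token, while the runner-up candidates, and how close their probabilities sit, hint at how unsure the model is at that step.

    # A small sketch contrasting greedy decoding with a look at the wider
    # next-token distribution (GPT-2 is just a convenient stand-in).
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    inputs = tokenizer("The capital of Australia is", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]        # next-token logits
    probs = torch.softmax(logits, dim=-1)

    # Greedy decoding keeps only the single most probable token...
    greedy_id = int(probs.argmax())
    print("greedy pick:", tokenizer.decode([greedy_id]), float(probs[greedy_id]))

    # ...but the runner-up candidates, and how close their probabilities are,
    # carry information about how unsure the model is at this step.
    top = torch.topk(probs, k=5)
    for p, tok_id in zip(top.values.tolist(), top.indices.tolist()):
        print(f"{tokenizer.decode([tok_id])!r}: {p:.3f}")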

Ph.D. student Hadas Orgad, in Yonatan Belinkov’s group at Israel’s Technion, probed the way internal states changed as the model generated factual answers. The researchers found significant changes in the calculated probabilities of candidate words when the model lacked confidence in an answer, and they could use the probe to recover the correct words and terms. The bad news for hallucination mitigation in general is that work by Orgad and others shows the probability patterns change depending on how the model encodes its knowledge. That will make it hard to build a white-box lie detector that covers all situations.

“It is really hard for models to express their uncertainty about an output accurately,” Geva said. At the same time, the differences in representation may provide avenues to use more fine-grained approaches to detecting errors. “It will be interesting to think about different classes of hallucination,” she added.

There may be a deeper issue: whether probes are exploring the attributes computer scientists believe they are. The instruction tuning that follows pretraining may exacerbate the problem. Human operators may inadvertently reward responses that sound right without promoting truthfulness, to the point where current interpretations of what a model “believes” to be fact no longer correspond to what humans understand to be facts.

“Does it make sense to say that ChatGPT thought it was true? Or that it said it because it was what I thought I wanted to hear? And what do these questions mean? These aren’t even human minds,” Levinstein said.

Levinstein takes the view that work in this area may profit from greater interactions between computer scientists and philosophers who look more closely at what constitutes truth and falsity in the data language models store. That may, in turn, yield better signals that language models can use to indicate when they risk making a mistake, so users can decide whether to trust their answers or not.

Further Reading

  • Biran, E., Gottesman, D., Yang, S., Geva, M., and Globerson, A.
    Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 14113–14130. arXiv preprint, arxiv.org/abs/2406.12775
  • Jin, C. and Rinard, M.
    Emergent Representations of Program Semantics in Language Models Trained on Programs. Proceedings of the 41st International Conference on Machine Learning, PMLR 235 (2024). arXiv preprint, arxiv.org/abs/2305.11169
  • Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L., and Murthy, A.
    Position: LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks. Proceedings of the 41st International Conference on Machine Learning, PMLR 235 (2024). arXiv preprint, arxiv.org/abs/2402.01817
  • Levinstein, B.A. and Herrmann, D.A.
    Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks. Philosophical Studies (2024). arXiv preprint, arxiv.org/abs/2307.00175
  • Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpektor, I., Kotek, H., and Belinkov, Y.
    LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. arXiv preprint, arxiv.org/abs/2410.02707
