As all readers of this essay know, I am not in any way expert in machine learning (ML) and large language models (LLMs), so my descriptions and observations are, at best, lightweight cartoons of what is actually going on. Please keep that in mind as you read.
Some of you may remember Spock’s death in Star Trek II (Wrath of Khan) and the brief scene where Spock mind-melds with Dr. McCoy: Spock says “remember” while depositing his katra in McCoy’s brain in anticipation of self-sacrifice to save the starship Enterprise. As I read about yet another new breakthrough in artificial intelligence (AI) from Google Research, I thought of that scene. The new idea, christened “TITAN”, is for an ML system to continue learning while in use, after training.a Ironically, part of the innovation is learning to forget. Humans forget. One of the pioneers of AI, A.M. Turing Award recipient Edward Feigenbaum, emulated this property with his Elementary Perceiver and Memorizer (EPAM),b which exhibited difficulty recalling information it had earlier ingested as new information arrived.
The idea that practice makes perfect formed the basis of a wonderful science fiction novel called The Practice Effect by David Brin,c in which devices improved in quality through use. If not used, they degraded. People would hire other people to use their clothes, furniture, and other artifacts to maintain or improve their quality. TITANs evidently can improve with use.
The TITAN paper is richly supported by 132 references, illustrating the pace at which research in AI is proceeding. In particular, the new TITAN model has layers designed to add memory to the process of operating LLMs. These new layers continue to modify the weights of the nominal trained model while it is in use. The paper references another important piece of work on test time training (TTT) that details ways in which a trained model and its associated layers and weights can reflect the model’s learnings (and forgettings).d One motive for forgetting is to conserve memory needed for continued use. The weight adjustments act to remember “surprises” containing the most information in the information-theoretic sense, and to save space by forgetting weights that are no longer relevant to output production.
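To give a flavor of the idea (and only a flavor), here is a minimal sketch of a memory module whose weights change at inference time: a large prediction error (a “surprise”) gates how strongly the weights move, while a decay term slowly forgets associations that no longer matter. This is not the TITAN or TTT code; the names (`lr`, `decay`, the gating formula) are my own illustrative assumptions.

```python
import numpy as np

class TestTimeMemory:
    """Toy associative memory updated while the model is in use (illustrative only)."""

    def __init__(self, dim, lr=0.5, decay=0.01):
        self.W = np.zeros((dim, dim))  # memory weights, adjusted during use
        self.lr = lr                   # step size for test-time updates
        self.decay = decay             # forgetting rate applied every step

    def step(self, key, value):
        """Ingest one (key, value) pair seen at inference time and update the memory."""
        pred = self.W @ key                     # what the memory currently recalls for this key
        error = value - pred                    # prediction error
        surprise = np.linalg.norm(error)        # bigger error = more informative, more surprising
        gate = surprise / (1.0 + surprise)      # bounded gate in [0, 1): surprises drive learning
        # Forget a little of everything, then write the surprising new association.
        self.W = (1.0 - self.decay) * self.W + self.lr * gate * np.outer(error, key)
        return surprise

# Example: the first exposure to a pair is surprising; repeats become less so.
mem = TestTimeMemory(dim=4)
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
print(mem.step(k, v))  # large surprise on first exposure
print(mem.step(k, v))  # smaller surprise once the association has been stored
```

The point of the sketch is only the shape of the mechanism: updates during use, driven by how surprising the input is, balanced against a deliberate forgetting term that conserves memory.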
Assuming this learning process directly influences the foundational model, and assuming the model is in use concurrently by multiple parties, one wonders how the updates contributed by many concurrent users can coherently shape the evolution of the model. This makes me think of federated learning and the desirable ability to integrate parallel model learnings from multiple, independently running instances of the model.
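For intuition, a merge of per-instance updates might look something like federated averaging: each instance accumulates its own memory-layer delta against a shared base, and a coordinator periodically folds the deltas back in, weighted by how much use each instance saw. This is a hypothetical sketch of that aggregation step, not something described in the TITAN paper; `merge_memory_deltas` and its weighting scheme are assumptions for illustration.

```python
import numpy as np

def merge_memory_deltas(base_memory, instance_memories, counts):
    """Average per-instance memory weights back into a shared copy, weighted by usage."""
    total = sum(counts)
    merged_delta = np.zeros_like(base_memory)
    for mem, n in zip(instance_memories, counts):
        merged_delta += (n / total) * (mem - base_memory)  # each instance's learned change
    return base_memory + merged_delta

# Example: three instances drift differently; the busier instance counts more.
base = np.zeros((2, 2))
a = base + np.array([[0.2, 0.0], [0.0, 0.0]])
b = base + np.array([[0.0, 0.4], [0.0, 0.0]])
c = base + np.array([[0.0, 0.0], [0.1, 0.0]])
print(merge_memory_deltas(base, [a, b, c], counts=[100, 50, 50]))
```

Whether anything this simple would preserve coherence across many concurrent users is exactly the open question.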
Apart from the major breakthrough, I also came away impressed by the degree of computational sophistication found in the reference papers. Very clever proofs are offered to show that efficient and parallel computation can be used to achieve the same results as slower serial methods.
We already know that LLMs are regularly subject to jailbreaking prompts (getting around intended constraints for outputs) and hallucinations. Training with bad information can lead to bad (for example, counterfactual) output. Imagine a learning chatbot that ingests bad information deliberately offered during use. Just as there are various attacks against software-based systems, one might imagine deliberate poisoning of a learning model. Makes me think of Domain Name System cache poisoning!
For the same reasons we have had to learn to detect various kinds of attack against other software systems upon which we have become reliant, this new capability of improving models through use may also need guarding.