
Communications of the ACM

ACM Careers

Decomposing Language Models Into Understandable Components


Researchers present activating dataset examples and downstream logit effects from 90 learned dictionaries.

Credit: Anthropic

Neural networks are trained on data, not programmed to follow rules. With each step of training, millions or billions of parameters are updated to make the model better at tasks, and by the end, the model is capable of a dizzying array of behaviors. We understand the math of the trained network exactly — each neuron in a neural network performs simple arithmetic — but we don't understand why those mathematical operations result in the behaviors we see.
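The "simple arithmetic" each neuron performs can be made concrete. A minimal sketch, assuming a standard artificial neuron (weighted sum plus bias, followed by a ReLU nonlinearity; the names and numbers are illustrative, not from the paper):

```python
def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of inputs plus a bias,
    passed through a ReLU nonlinearity. The math is exact and fully
    understood, even though the trained network's behavior is not."""
    pre_activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, pre_activation)  # ReLU

# Example: two inputs, hand-picked weights.
activation = neuron([1.0, 2.0], [0.5, -0.25], 0.1)
```

Every operation here is transparent; the mystery is why billions of such operations, with learned weights, produce the behaviors we observe.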

Those of us trying to understand artificial neural networks have unusual experimental access: we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
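Both kinds of access can be sketched in a few lines. The toy layer below records every neuron's activation on a forward pass and, optionally, silences chosen neurons (an ablation); the function and weights are hypothetical, for illustration only:

```python
def forward(x, weights, silence=None):
    """Run one toy feedforward ReLU layer, recording every neuron's
    activation. `silence` is an optional set of neuron indices to
    clamp to zero, modeling an ablation intervention."""
    activations = []
    for i, w_row in enumerate(weights):
        a = max(0.0, sum(xi * wi for xi, wi in zip(x, w_row)))
        if silence and i in silence:
            a = 0.0  # intervene: silence this neuron
        activations.append(a)
    return activations

weights = [[1.0, -1.0], [0.5, 0.5]]
normal = forward([2.0, 1.0], weights)                # record all activations
ablated = forward([2.0, 1.0], weights, silence={0})  # silence neuron 0
```

Comparing `normal` and `ablated` on many inputs is the kind of causal probe that is routine for artificial networks and nearly impossible in biological ones.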

In our latest paper, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations.
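The idea that a feature is a linear combination of neuron activations can be illustrated directly. The paper's machinery learns these directions with dictionary learning; the sketch below assumes an already-learned dictionary and only shows the decomposition itself, with hypothetical numbers:

```python
def reconstruct(feature_acts, dictionary):
    """Express a neuron-activation vector as a weighted sum of learned
    feature directions (the rows of `dictionary`). Each feature is a
    pattern spread across neurons, not a single neuron."""
    n = len(dictionary[0])
    out = [0.0] * n
    for f, direction in zip(feature_acts, dictionary):
        for j in range(n):
            out[j] += f * direction[j]
    return out

# Two hypothetical feature directions in a 3-neuron activation space.
dictionary = [[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]]

# A sparse set of feature activations reproduces an activation pattern
# that no individual neuron captures on its own.
acts = reconstruct([2.0, 1.0], dictionary)
```

Because each feature direction spans several neurons, analyzing features rather than neurons can yield units that align better with interpretable behavior.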

From Anthropic

