Author

Hakaze Cho

Affiliation

Japan Advanced Institute of Science and Technology

Beijing Institute of Technology

Date

2024/11/13


What is meant by Mechanistic over Interpretability?

(I don't like the strict taxonomy in this section, but to explain everything clearly, let's first accept it and then forget about it.)

Basically, when we talk about interpretability, we want to know, from some observations, how a machine-learning model produces its output. Intuitively, for example, if we can find which part of the input the model focuses on, we can understand how the model reaches its output, and how much that output can be trusted. Such a technique is called Test-sample Saliency-based Interpretation: by visualizing saliency maps, i.e., plotting the saliency scores as heatmaps, users can easily understand why the model makes a decision by examining the most salient part(s) [1].
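As a concrete, hedged illustration, here is a minimal gradient-times-input saliency sketch for a text classifier in PyTorch. The checkpoint name is only a placeholder for any HuggingFace-style sequence classifier, and real saliency methods [1] are more elaborate than this.

```python
# A minimal sketch of gradient-x-input saliency for a text classifier.
# The checkpoint name is a placeholder; any HuggingFace-style classifier works.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "textattack/bert-base-uncased-SST-2"   # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
model.eval()

enc = tokenizer("The movie was surprisingly good.", return_tensors="pt")

# Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
pred = logits.argmax(dim=-1).item()

# Backpropagate the predicted-class logit and score each token by |grad * embedding|.
logits[0, pred].backward()
saliency = (embeds.grad * embeds).sum(dim=-1).abs().squeeze(0)

for tok, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()),
                      saliency.tolist()):
    print(f"{tok:>12s}  {score:.4f}")
```

Plotting these per-token scores as a heatmap over the input gives exactly the kind of saliency map shown in the figure below.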

There are various methods to calculate such saliency scores [1]. Besides this kind of inference-time characterization, some training-based methods, such as influence functions, identify which training instances (or units of training instances) in the training set are responsible for the model's prediction on a specific test example [1].
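For intuition only, here is a heavily simplified, TracIn-style sketch of training-data attribution: it scores training examples by the dot product between their loss gradients and the test example's loss gradient. A proper influence function additionally involves an inverse-Hessian-vector product, which is omitted here; `model`, `loss_fn`, and the data are assumed to be whatever you already have.

```python
# A TracIn-style simplification of influence functions: score each training
# example by the dot product between its loss gradient and the test example's
# loss gradient. (The full influence function also applies an inverse Hessian,
# omitted here.) `model`, `loss_fn`, and the data are assumed to be yours.
import torch

def flat_grad(model, loss):
    # Gradient of `loss` w.r.t. all trainable parameters, flattened into one vector.
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, train_examples, test_example):
    x_test, y_test = test_example
    g_test = flat_grad(model, loss_fn(model(x_test), y_test))
    scores = []
    for x, y in train_examples:
        g_train = flat_grad(model, loss_fn(model(x), y))
        # A large positive score suggests this training example moved the
        # parameters in a direction that also lowers the test loss.
        scores.append(torch.dot(g_test, g_train).item())
    return scores
```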

Traditional Interpretability Regards DL Models as Inseparable. The above introduction clarifies that traditional interpretability research views the model as an indivisible whole (a black box). These studies do not focus on the internal operational dynamics of the model but instead infer the model's functioning based solely on its external responses, even if such inferences can be highly engineered: training examples are carefully constructed, and detailed metrics are designed.

However, the insights that such a methodology can provide are limited, especially when it comes to improving the model's design and training process. We do not know which part of the model is responsible for a behavior or how the model actually works internally, which makes it difficult to refine or edit the model precisely. In the past, such shortcomings were acceptable: when we found that the model's behavior did not meet expectations, we could simply retrain a model with new settings. However, with the emergence of large language models, everything has changed: we now tend to make minor adjustments rather than completely rebuild the entire model, due to the enormous expense.

Mechanistic Interpretability Disassembles DL Models. From the aforementioned motivation, a new type of interpretation is emerging: it decomposes the model into understandable pipelines (circuits), for example, a grammar understanding → semantic understanding → task alignment pipeline for some downstream tasks [3]. This decomposition-style research methodology is called mechanistic interpretability, analogous to the process in mechanics of breaking down a machine into its individual components. It allows us, when the model's behavior does not meet expectations, to correct a very small part—much like a mechanical engineer replacing a damaged component with a new one, rather than rebuilding the entire machine from scratch. For example, when you find that your model is generating harmful outputs, you might choose to suppress the output of these offensive tokens in the language model head [5, 6], rather than spending several months retraining it on an expensive dataset that has been purged of harmful samples.
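As a toy illustration of the "suppress it at the LM head" option (not the actual editing methods of [5, 6]), the sketch below simply masks the logits of a banned token list during greedy decoding. The model name and the banned words are placeholders, and real detoxification is of course more subtle than banning subwords.

```python
# A toy sketch of suppressing unwanted tokens at the LM head during greedy
# decoding: just mask their logits before picking the next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

banned_ids = {i for w in ["badword1", "badword2"]      # placeholder word list
              for i in tokenizer.encode(" " + w)}

ids = tokenizer("The internet argument quickly became", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits[0, -1]              # LM-head output at the last position
        logits[list(banned_ids)] = -float("inf")       # suppress the offensive tokens
        next_id = logits.argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
print(tokenizer.decode(ids[0]))
```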

An example of Saliency map. Borrowed from [2].


A sample of circuits from a mechanistic interpretability research. Borrowed from [4].

A sample of circuits from a mechanistic interpretability research. Borrowed from [4].


Methodology Paradigm

Before starting, you need to choose the model inference behavior you are interested in, such as “in-context learning” or “knowledge evoking”. You should fully understand this behavior to the extent that you know what intriguing characteristics it has, such as “in-context learning is not sensitive to label correctness”. This will guide you in designing experiments, and if your experimental results can explain these characteristics, you will be greatly encouraged.
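For instance, before any circuit work, you might first reproduce the characteristic itself. Below is a minimal, hedged sketch of such a check for the label-correctness example: compare in-context accuracy with gold demonstration labels against randomly assigned ones. `query_model`, the prompt template, and the label set are placeholders for whatever setup you already use.

```python
# A hedged sketch for the label-correctness example: compare in-context accuracy
# with gold demonstration labels vs. randomly assigned ones.
import random

LABELS = ["positive", "negative"]

def build_prompt(demos, query):
    # Format (placeholder) demonstrations followed by the unlabeled query.
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
    return "\n\n".join(lines + [f"Review: {query}\nSentiment:"])

def icl_accuracy(query_model, demos, test_set, randomize_labels=False):
    used = [(x, random.choice(LABELS)) if randomize_labels else (x, y) for x, y in demos]
    correct = sum(query_model(build_prompt(used, x)) == y for x, y in test_set)
    return correct / len(test_set)

# If these two numbers are close, the "insensitive to label correctness"
# characteristic holds for your model and task:
#   icl_accuracy(query_model, demos, test_set, randomize_labels=False)
#   icl_accuracy(query_model, demos, test_set, randomize_labels=True)
```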

1. Circuit Identification

As a research hypothesis, you need to propose a reasonable inference circuit framework, allowing you to seek evidence for it. This step often requires inspiration, which you typically cannot obtain by just sitting in front of a computer in the lab. Go sightseeing, sing, or relax in a hot spring—good ideas often come when you're relaxed. But if you're still running out of ideas, some methodologies can also be useful: consider how humans solve your task and draw an analogy to model computation. Note: while applying human thought processes to models often involves high bias, it's still better than having no inspiration at all.

For example, to solve a counting problem in a vision-language task shown in the figure below, you might expect the model to decompose the inference process into counting triangles (4) and circles (2) to arrive at the answer of 6, but the model might actually count blue objects (3) and green objects (3); it could even count in columns (2, 2, 2). In other words, the circuit derived from human intuition may not faithfully reflect the model's actual computation process. I love saying this: an end-to-end model does not operate according to human reasoning. This is also why I don't trust most studies on logical reasoning in language models.

An example of an item counting task.


What is a good circuit? (1). Your circuit should be sufficient and necessary: the circuit you propose should be able to complete the task on its own, without any redundant modules. (2). Your circuit should not violate mathematics or the model's feedforward process. If you claim that in a decoder Transformer, earlier tokens absorb the semantics of later tokens, it is highly likely to be incorrect—unless you provide very strong evidence. (3). Your circuit should be convincing to humans. A circuit that seems completely non-functional is often not a good circuit. (4). Your circuit shouldn't be overly complex. Models are generally trained by gradient descent, so you can't expect GD to learn highly complex circuits within the model. However, if you can provide strong evidence proving the existence of a complex circuit, that would also be very interesting, and even more exciting if you could point out how the circuit is trained (like [14], one of my favorite papers).
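Regarding criterion (1), necessity is often probed by knocking a component out and watching the behavior change. Below is a minimal, hedged sketch that zero-ablates one GPT-2 attention head via a forward pre-hook on the attention output projection; `LAYER`, `HEAD`, and the prompt are placeholders, and a real study would sweep over heads and evaluate a proper task metric.

```python
# A minimal necessity check: zero-ablate one GPT-2 attention head and see how
# the prediction changes. LAYER, HEAD, and the prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

LAYER, HEAD = 5, 1
head_dim = model.config.n_embd // model.config.n_head

def ablate_head(module, args):
    # The input of the attention output projection (c_proj) is the concatenation
    # of all head outputs; zero the slice belonging to HEAD.
    hidden = args[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,) + args[1:]

handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(ablate_head)
with torch.no_grad():
    ids = tokenizer("The Eiffel Tower is located in", return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]
handle.remove()
print(tokenizer.decode([logits.argmax().item()]))   # compare against the un-ablated run
```

A sufficiency check is the mirror image: ablate everything outside the proposed circuit and verify that the task still succeeds.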

2. Circuit Measurement

Existence Evidence. You must prove that your circuit really exists. Generally speaking, your complete inference circuit will consist of several parts, and first you can verify the existence of each part individually. Your evidence can be posterior, meaning that if you find output results consistent with the inference steps you expect, you can basically claim that those inference steps exist. It can also be prior, meaning that you directly find a consistent form in the model's feedforward computation (for example, when you claim that the model copies the previous token at some step, and you indeed find an attention matrix concentrated on the diagonal offset by one position); in that case, your inference steps are also validated.
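As a hedged sketch of such "prior" evidence, the code below scans every GPT-2 attention head for a previous-token pattern, i.e., attention mass concentrated on the offset-1 diagonal. The prompt and the 0.5 threshold are arbitrary illustrative choices.

```python
# Scan every GPT-2 attention head for a previous-token pattern: average
# attention mass on the offset-1 diagonal of each head's attention matrix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

ids = tokenizer("Mechanistic interpretability decomposes a model into circuits.",
                return_tensors="pt").input_ids
with torch.no_grad():
    attentions = model(ids, output_attentions=True).attentions  # one (1, heads, seq, seq) per layer

for layer, a in enumerate(attentions):
    # Mean attention that each position pays to the position right before it, per head.
    prev_mass = a[0].diagonal(offset=-1, dim1=-2, dim2=-1).mean(dim=-1)
    for head, m in enumerate(prev_mass.tolist()):
        if m > 0.5:   # arbitrary threshold; tune for your model
            print(f"layer {layer:2d}, head {head:2d}: previous-token mass {m:.2f}")
```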

Component Connection. After verifying the independent existence of each part, it is necessary to connect them into a complete circuit. The simplest way is to locate the layer indices in the neural network for different steps, thus naturally connecting the parts according to the network's inference process. This is generally satisfactory, but you need to be aware that, due to the effect of multi-head attention, the steps you locate may be non-local: a function could be executed by multiple layers, and a single layer may perform multiple functions by multiplexing different subspaces of the hidden space [4]. Additionally, residual connections allow information to skip layers—for example, the attention mechanism of the 4th layer might access the attention output of the 1st layer [8]. Therefore, we need to identify the input and output spaces of each module to anchor them together. This is highly challenging in real Transformer models; I am working on it.

A sample of non-local function modules. The intensity is represented by the shading of the color. Borrowed from [4].

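A common way to locate the layer indices for a step is activation patching. The hedged sketch below copies the clean run's residual-stream state (at the last token position only) into a corrupted run, one layer at a time, and reports how much each patch restores the clean answer. The prompts, the target token, and the single-position patch are all illustrative simplifications.

```python
# Layer localization by activation patching: patch the clean run's hidden state
# into a corrupted run, one layer at a time, and watch the clean answer recover.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

clean = tokenizer("The Eiffel Tower is in the city of", return_tensors="pt").input_ids
corrupt = tokenizer("The Colosseum is in the city of", return_tensors="pt").input_ids
target = tokenizer(" Paris")["input_ids"][0]

with torch.no_grad():
    # hidden_states[i+1] is (roughly) the residual stream after block i; the very
    # last entry is taken after the final layer norm, which is fine for a rough scan.
    clean_hidden = model(clean, output_hidden_states=True).hidden_states

def patch_layer(layer_idx):
    def hook(module, inputs, output):
        patched = output[0].clone()
        patched[:, -1, :] = clean_hidden[layer_idx + 1][:, -1, :]   # overwrite last position
        return (patched,) + output[1:]
    return hook

for layer_idx, block in enumerate(model.transformer.h):
    handle = block.register_forward_hook(patch_layer(layer_idx))
    with torch.no_grad():
        logits = model(corrupt).logits[0, -1]
    handle.remove()
    print(f"layer {layer_idx:2d}: P(' Paris') = {logits.softmax(-1)[target].item():.3f}")
```

If several adjacent layers each partially restore the answer, that is exactly the non-locality discussed above.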

Characterization. Simply proving the existence of your circuit does not offer particularly profound insights. Therefore, beyond proving existence, you are expected to examine the finer properties of each part of the circuit. This kind of fine-grained measurement is highly customized (every paper uses different methods, which is part of the charm of MI research), but if you lack inspiration, you can at least apply the following measurements. (1). Which part of the hidden space do these mechanisms act upon? We can assume that the model encodes some key features along certain subspaces (directions [10] or manifolds [11]) of the hidden space. Therefore, by identifying the scope of influence of each circuit step, we can discover these directions, which helps to investigate the superposition or multiplexing [12, 13] of hidden spaces. (2). How are these mechanisms conducted? For example, certain behaviors (such as the aforementioned “previous-token copying” operation) can be triggered by specific patterns in the attention scores. Precisely locating the detailed model operations that lead to these behaviors is crucial for further understanding and intervening in them. (3). Which part of the model is responsible for each step? I recommend a highly fine-grained investigation: it's best to be as precise as identifying specific attention heads or FFN modules, rather than just recognizing which network layer performs the corresponding operation.
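For point (1), a common first step is to estimate a candidate feature direction and check whether projections onto it separate held-out examples. The sketch below uses a simple difference-of-means direction over last-token hidden states at one GPT-2 layer; the layer index, the prompts, and the sentiment feature are placeholders, and serious work would use proper probes and controls.

```python
# A hedged sketch for point (1): a difference-of-means feature direction in the
# hidden space of one layer, evaluated by projecting held-out examples onto it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()
LAYER = 6   # an arbitrary intermediate layer

def last_token_state(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, output_hidden_states=True).hidden_states[LAYER][0, -1]

pos = ["This movie is wonderful.", "I really enjoyed the meal."]   # placeholder data
neg = ["This movie is terrible.", "I really hated the meal."]

direction = (torch.stack([last_token_state(t) for t in pos]).mean(0)
             - torch.stack([last_token_state(t) for t in neg]).mean(0))
direction = direction / direction.norm()

for text in ["The concert was fantastic.", "The concert was awful."]:
    print(text, f"projection = {float(last_token_state(text) @ direction):+.2f}")
```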

Tools