Topic: Measuring and modeling the ICL and IWL training dynamics of toy models.

Setup

(The experimental variables are highlighted in red.)

This study observes the training dynamics of in-context learning (ICL) in a simplified model on classification tasks, so we first define the model and the task.

Model. A two-layer Transformer, as shown in the figure. Inner dimension = 128, token embedding dimension = 63, positional embedding dimension = 65 (one-hot). “We use a deep classifier to ensure perfect IWL is feasible.”

Task. An $L$-way classification task over an item set sampled from a 63-dimensional space. In detail:

  1. Step 1: Class Centroid Generation. For each class $k$ of the $K$ ($K \geqslant L$) classes, sample a centroid vector $\mu_k$ from the 63-dimensional $\mathcal{N}(0, 1/63)$. Note: the number of classes is $K$, not $L$.
  2. Step 2: Class Frequency Weighting. Each class $k$ is assigned a frequency weight $f(k)\sim k^{-\color{#d44c47}{\alpha}}$, which controls how often items of that class appear in the final sampled set.
  3. Step 3: Item Sampling. For each class $k$, items are sampled as $\tilde{x}_k = \frac{\mu_k+{\color{#d44c47}\epsilon}\,\eta}{\sqrt{1+{\color{#d44c47}\epsilon}^{2}}}$, where $\eta$ is a fresh noise vector and the denominator keeps the item scale comparable to the centroids; these items form the item set. Note: the inter-class frequency ratio follows the weights from Step 2.
  4. Step 4: Label Assignment. Each class $k$ is assigned one of the $L$ labels. Note: since $K \geqslant L$, two classes may share the same label.
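The four steps above can be sketched in code. This is a minimal illustration, not the authors' implementation: the value of $K$, the distribution of $\eta$ (taken here to match the centroid distribution), and the round-robin label rule are all assumptions.

```python
import math
import random

def make_dataset(K=64, L=32, D=63, alpha=0.0, eps=0.1, seed=0):
    """Sketch of Steps 1-4. K, eta's distribution, and the label rule are assumptions."""
    rng = random.Random(seed)

    # Step 1: one centroid per class, each coordinate ~ N(0, 1/D).
    centroids = [[rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)]
                 for _ in range(K)]

    # Step 2: rank-frequency weights f(k) ~ k^{-alpha}, normalized to sum to 1.
    raw = [(k + 1) ** (-alpha) for k in range(K)]
    freqs = [w / sum(raw) for w in raw]

    # Step 4: map each class to one of the L labels; K >= L, so labels repeat.
    labels = [k % L for k in range(K)]

    # Step 3: noisy item around centroid k, rescaled to preserve scale.
    def sample_item(k):
        eta = [rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)]
        return [(centroids[k][i] + eps * eta[i]) / math.sqrt(1.0 + eps ** 2)
                for i in range(D)]

    return centroids, freqs, labels, sample_item
```

Classes for each training example are then drawn with probabilities `freqs`, so the inter-class ratio from Step 2 is realized at sampling time rather than by fixing per-class item counts.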

Training Input. The two-layer Transformer is trained on a mixture of ICL-promoting sequences (fraction $p_C$) and IWL-promoting sequences (the remaining $1 - p_C$).
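One plausible way to realize this mixture, assuming (as is common in such setups) that an ICL-promoting sequence is one whose context contains exemplars of the query's class, while an IWL-promoting sequence draws context classes independently of the query. The function names and the half-context burst size are hypothetical.

```python
import random

def make_sequence(sample_item, sample_class, label_of, N=8, p_C=0.8, rng=None):
    """Build one sequence: N (item, label) context pairs plus a query.

    Hypothetical construction: with probability p_C the query's class also
    appears in the context (answerable in-context); otherwise the context is
    uninformative about the query, so only in-weights knowledge helps."""
    rng = rng or random.Random(0)
    query_class = sample_class()
    if rng.random() < p_C:
        # ICL-promoting: half the context shows the query's class.
        classes = [query_class] * (N // 2) + \
                  [sample_class() for _ in range(N - N // 2)]
        rng.shuffle(classes)
    else:
        # IWL-promoting: context classes drawn independently of the query.
        classes = [sample_class() for _ in range(N)]
    context = [(sample_item(c), label_of(c)) for c in classes]
    return context, sample_item(query_class), label_of(query_class)
```

Plugging in the dataset's `sample_item`, a class sampler weighted by the Step 2 frequencies, and the class-to-label map yields a stream of training sequences with the desired ICL/IWL ratio.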

Train. 20k iterations. “For training, we use a batch size of 128 and vanilla SGD with learning rate 0.01”.

“We use $L = 32, N = 8, D = 63, ε = 0.1, α = 0$ unless otherwise specified.”
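Collecting the quoted defaults in one place, plus the vanilla SGD update for reference. The config keys are my naming; $K$ is left unspecified in the source, and the model and gradient computation are out of scope here.

```python
# Quoted defaults from the setup above (K is not given in the source).
CFG = dict(L=32, N=8, D=63, eps=0.1, alpha=0.0,
           batch_size=128, lr=0.01, iterations=20_000)

def sgd_step(params, grads, lr=CFG["lr"]):
    """Vanilla SGD: no momentum, no weight decay, no schedule."""
    return [p - lr * g for p, g in zip(params, grads)]
```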

Model used in this paper.


Evaluation. ICL and IWL accuracies are evaluated separately over the course of training.