Topic: Measuring and modeling the ICL and IWL training dynamics of toy models.

Setup

(The experimental variables are highlighted in red.)

This study observes the training dynamics of in-context learning (ICL) in a simplified model on classification tasks, so we first define the model and the task.

Model. A two-layer Transformer, as shown in the figure. Inner dimension = 128, token embedding dimension = 63, positional embedding dimension = 65 (one-hot). “We use a deep classifier to ensure perfect IWL is feasible.”

Task. An $L$-way classification task over an item set sampled from a 63-dimensional space. In detail:

  1. Step 1: Class Centroid Generation. For each class $k$ of the $K$ ($K \geqslant L$) classes, sample a centroid vector $\mu_k$ from the 63-dimensional $\mathcal{N}(0, 1/63)$. Note: the number of classes is $K$, not $L$.
  2. Step 2: Class Frequency Weighting. Each class $k$ is assigned a frequency weight $f(k)\sim k^{-\color{#d44c47}{\alpha}}$, which controls how often items of that class appear in the final sampled set.
  3. Step 3: Item Sampling. For each class $k$, items are sampled as $\tilde{x}_k = \frac{\mu_k+{\color{#d44c47}\epsilon}\,\eta}{\sqrt{1+{\color{#d44c47}\epsilon}^{2}}}$, where $\eta$ is a fresh noise vector and the denominator keeps the item scale comparable to the centroids; these items form the item set. Note: the inter-class frequency ratio follows the weights from Step 2.
  4. Step 4: Label Assignment. Each class $k$ is assigned one of the $L$ labels. Note: since $K \geqslant L$, two classes may share the same label.
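The four steps above can be sketched in code. This is a minimal illustration, not the authors' implementation: the value of $K$, the distribution of $\eta$ (taken here to match the centroid distribution), and the round-robin label rule are all assumptions.

```python
import math
import random

def make_dataset(K=64, L=32, D=63, alpha=0.0, eps=0.1, seed=0):
    """Sketch of Steps 1-4. K, eta's distribution, and the label rule are assumptions."""
    rng = random.Random(seed)

    # Step 1: one centroid per class, each coordinate ~ N(0, 1/D).
    centroids = [[rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)]
                 for _ in range(K)]

    # Step 2: rank-frequency weights f(k) ~ k^{-alpha}, normalized to sum to 1.
    raw = [(k + 1) ** (-alpha) for k in range(K)]
    freqs = [w / sum(raw) for w in raw]

    # Step 4: map each class to one of the L labels; K >= L, so labels repeat.
    labels = [k % L for k in range(K)]

    # Step 3: noisy item around centroid k, rescaled to preserve scale.
    def sample_item(k):
        eta = [rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)]
        return [(centroids[k][i] + eps * eta[i]) / math.sqrt(1.0 + eps ** 2)
                for i in range(D)]

    return centroids, freqs, labels, sample_item
```

Classes for each training example are then drawn with probabilities `freqs`, so the inter-class ratio from Step 2 is realized at sampling time rather than by fixing per-class item counts.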

Training Input. The two-layer Transformer is trained on a mixture of ICL-promoting sequences (fraction $p_C$) and IWL-promoting sequences (the remaining $1 - p_C$).
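One plausible way to realize this mixture, assuming (as is common in such setups) that an ICL-promoting sequence is one whose context contains exemplars of the query's class, while an IWL-promoting sequence draws context classes independently of the query. The function names and the half-context burst size are hypothetical.

```python
import random

def make_sequence(sample_item, sample_class, label_of, N=8, p_C=0.8, rng=None):
    """Build one sequence: N (item, label) context pairs plus a query.

    Hypothetical construction: with probability p_C the query's class also
    appears in the context (answerable in-context); otherwise the context is
    uninformative about the query, so only in-weights knowledge helps."""
    rng = rng or random.Random(0)
    query_class = sample_class()
    if rng.random() < p_C:
        # ICL-promoting: half the context shows the query's class.
        classes = [query_class] * (N // 2) + \
                  [sample_class() for _ in range(N - N // 2)]
        rng.shuffle(classes)
    else:
        # IWL-promoting: context classes drawn independently of the query.
        classes = [sample_class() for _ in range(N)]
    context = [(sample_item(c), label_of(c)) for c in classes]
    return context, sample_item(query_class), label_of(query_class)
```

Plugging in the dataset's `sample_item`, a class sampler weighted by the Step 2 frequencies, and the class-to-label map yields a stream of training sequences with the desired ICL/IWL ratio.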

Train. 20k iterations. “For training, we use a batch size of 128 and vanilla SGD with learning rate 0.01”.

“We use $L = 32, N = 8, D = 63, ε = 0.1, α = 0$ unless otherwise specified.”
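Collecting the quoted defaults in one place, plus the vanilla SGD update for reference. The config keys are my naming; $K$ is left unspecified in the source, and the model and gradient computation are out of scope here.

```python
# Quoted defaults from the setup above (K is not given in the source).
CFG = dict(L=32, N=8, D=63, eps=0.1, alpha=0.0,
           batch_size=128, lr=0.01, iterations=20_000)

def sgd_step(params, grads, lr=CFG["lr"]):
    """Vanilla SGD: no momentum, no weight decay, no schedule."""
    return [p - lr * g for p, g in zip(params, grads)]
```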

Model used in this paper.


Evaluation. ICL and IWL accuracies are evaluated separately over the course of training.