2026-05-29

When Cross-Tokenizer Distillation Breaks Math: X-Token Recovers the 6× Cliff

llm, distillation, tokenizers, nvidia, knowledge-distillation

Knowledge distillation (KD) transfers a larger teacher model's output distribution to a smaller student. Standard KD minimizes KL divergence between per-token probability distributions, which requires the teacher and student to share a tokenizer. If you commit to a Llama-family student, you cannot train against a stronger Qwen or Phi teacher, each of which uses a different vocabulary, without bridging the two vocabularies.
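To make the shared-tokenizer requirement concrete, here is a minimal sketch of a per-token KL loss in plain Python. The three-token vocabulary and the logits are made up for illustration; the point is only that the two distributions must be indexed by the same vocabulary for the sum to be defined.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the SAME index set."""
    if len(p) != len(q):
        # With different tokenizers, index i means different strings
        # (and the sizes usually differ), so the sum is not meaningful.
        raise ValueError("teacher and student vocabularies differ in size")
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a shared 3-token vocabulary (illustrative values only).
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, 0.0]
loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
print(f"KD loss at this position: {loss:.4f}")
```

When the vocabularies differ, even equal-sized ones, index `i` refers to different strings on each side, which is the gap the rest of this post is about.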

X-Token, from NVIDIA researchers Sharath Turuvekere Sreenivas, Pavlo Molchanov, and colleagues, addresses this with a sparse projection matrix and two complementary loss formulations. The paper uses a Llama-3.2-1B student to validate the approach across three teachers: Llama-3.2-3B (same tokenizer, for reference), Qwen3-4B, and Phi-4-mini-Instruct.

The Uncommon-Token Failure

The current state of the art, GOLD, partitions tokens into a 1-to-1 string-matched common set (trained with standard KL) and an uncommon remainder (matched via rank-sorted L1 distance). This partition works when tokenizers fragment text similarly. It breaks when they diverge.
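The string-matched partition can be sketched on toy data. The vocabularies below are hypothetical (they are not the real Llama or Qwen3 vocabularies), and the helper names are illustrative; the sketch only shows how tokens that one tokenizer packs and the other splits end up in the uncommon set.

```python
# Hypothetical toy vocabularies, chosen so the teacher has single digits only.
student_vocab = ["the", "cat", "+", "2", "0", "1", "20", "201", "199"]
teacher_vocab = ["the", "cat", "+", "0", "1", "2", "9"]

def partition(student_vocab, teacher_vocab):
    """Split student tokens into string-matched (common) and unmatched (uncommon)."""
    teacher_set = set(teacher_vocab)
    common = [t for t in student_vocab if t in teacher_set]
    uncommon = [t for t in student_vocab if t not in teacher_set]
    return common, uncommon

def is_multi_digit(token):
    """True for ASCII numerals of length two or more, e.g. '20' or '201'."""
    return token.isascii() and token.isdigit() and len(token) > 1

common, uncommon = partition(student_vocab, teacher_vocab)
stranded = [t for t in uncommon if is_multi_digit(t)]

print("common:  ", common)
print("uncommon:", uncommon)
print("multi-digit numerals left without a string match:", stranded)
```

In this toy setup every multi-digit numeral is stranded, even though each one is trivially expressible as a sequence of teacher tokens.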

Qwen3 splits multi-digit numerals digit-by-digit ("201" becomes ["2","0","1"]) while Llama packs them as single tokens. In the GOLD partition, all 1,100 multi-digit Llama numerals land in the uncommon set. Two effects degrade these tokens:

Rank-based matching pairs each student token with a teacher token of similar rank, ignoring semantic correspondence. "201" gets matched to unrelated teacher tokens like special characters.
Suppressive gradients from the common-KL term propagate through the full-vocabulary softmax. Even though uncommon tokens do not appear explicitly in the loss, the normalization drives down their probability mass.
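The second effect can be checked with a small worked example. Assume, as a simplification, that the common-KL term is a sum over common indices of p_c * log(p_c / q_c), with q taken from the student's full-vocabulary softmax. The gradient with respect to an uncommon logit z_u is then q_u times the teacher mass on the common set, which is positive, so gradient descent lowers z_u. The logits and probabilities below are made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def common_kl(student_logits, teacher_probs, common_idx):
    """KL restricted to common indices; student probs come from the FULL softmax."""
    q = softmax(student_logits)
    return sum(teacher_probs[i] * math.log(teacher_probs[i] / q[i])
               for i in common_idx if teacher_probs[i] > 0)

def grad_wrt_logit(student_logits, teacher_probs, common_idx, j):
    """Analytic d(loss)/d(z_j) = q_j * sum_c p_c - p_j * [j is common]."""
    q = softmax(student_logits)
    mass = sum(teacher_probs[i] for i in common_idx)
    return q[j] * mass - (teacher_probs[j] if j in common_idx else 0.0)

# Index 3 plays the role of an uncommon token such as "201":
# it has no matched teacher probability and never appears in the sum.
student_logits = [1.0, 0.5, 0.0, 2.0]
teacher_probs = [0.5, 0.3, 0.2, 0.0]
common_idx = [0, 1, 2]

grad_uncommon = grad_wrt_logit(student_logits, teacher_probs, common_idx, 3)
lr = 1.0
new_logit = student_logits[3] - lr * grad_uncommon  # one plain SGD step

print(f"gradient on the uncommon logit: {grad_uncommon:.4f}")
print(f"logit before: {student_logits[3]:.4f}, after: {new_logit:.4f}")
```

The gradient is strictly positive whenever the student puts any mass on the uncommon token, so each step pushes that token down even though it never appears in the loss.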

The result: GSM8k drops to 2.56, below even the undistilled Llama-1B base (5.69), compared to 12.89 for same-tokenizer KD from a weaker teacher (Llama-3.2-3B). Continued pre-training without any teacher reaches 10.25. The partition-based hybrid loss, in other words, actively hurts math performance when critical tokens fall into the unmatched subset.

The Fix: Two Loss Modes, One Projection Matrix

X-Token introduces a sparse projection matrix W that maps between student and teacher vocabularies. It is constructed deterministically in two passes: exact-match (canonicalized string equality) and multi-token rules (re-tokenizing each student token under the teacher tokenizer with exponential weight decay). The matrix is truncated to top-4 teacher tokens per student token.
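A rough sketch of the two-pass construction follows. Several details are assumptions made for illustration and are not taken from the paper: the canonicalization rule (stripping leading-space markers), the decay base of 0.5, the row normalization, and `greedy_tokenize`, which is a hypothetical stand-in for the real teacher tokenizer. W is stored sparsely as a dict of rows.

```python
def canonicalize(token):
    """Assumed canonical form: drop common leading-space markers."""
    return token.lstrip("Ġ▁ ")

def greedy_tokenize(text, vocab):
    """Hypothetical stand-in for the teacher tokenizer: greedy longest match."""
    pieces, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                pieces.append(text[i:i + n])
                i += n
                break
        else:
            return None  # this toy vocabulary cannot cover the text
    return pieces

def build_projection(student_vocab, teacher_vocab, decay=0.5, top_k=4):
    """Return W as {student_token: {teacher_token: weight}} with rows summing to 1."""
    teacher_canon = {}
    for t in teacher_vocab:
        teacher_canon.setdefault(canonicalize(t), t)
    W = {}
    for s in student_vocab:
        key = canonicalize(s)
        if key in teacher_canon:                          # pass 1: exact match
            W[s] = {teacher_canon[key]: 1.0}
            continue
        pieces = greedy_tokenize(key, set(teacher_canon))  # pass 2: multi-token rule
        if not pieces:
            continue
        row = {}
        for pos, piece in enumerate(pieces):
            t = teacher_canon[piece]
            row[t] = row.get(t, 0.0) + decay ** pos       # earlier pieces weigh more
        top = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        norm = sum(w for _, w in top)
        W[s] = {t: w / norm for t, w in top}
    return W

W = build_projection(["Ġthe", "2", "201"], ["the", "2", "0", "1"])
for s, row in W.items():
    print(repr(s), {t: round(w, 3) for t, w in row.items()})
```

Under these assumptions "201" gets a row over "2", "0", and "1" with geometrically decreasing weights, while exact matches get a single entry of weight 1.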

Two loss formulations target different mismatch regimes:

P-KL removes the partition entirely. It projects the student distribution into teacher vocabulary space through W and applies KL divergence directly. This recovers signal for uncommon but critical tokens - "201" is correctly routed to teacher tokens ["2","0","1"] through the projection.
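As a single-position toy, the projection step might look like the sketch below. The KL direction (teacher against projected student), the epsilon smoothing, and the hard-coded W rows are assumptions for illustration; sequence alignment between the two tokenizations is ignored entirely.

```python
import math

def project(student_probs, W):
    """Push student mass into teacher space: q_T[t] = sum_s q_S[s] * W[s][t]."""
    out = {}
    for s, p in student_probs.items():
        for t, w in W.get(s, {}).items():
            out[t] = out.get(t, 0.0) + p * w
    return out

def p_kl(teacher_probs, student_probs, W, eps=1e-9):
    """KL(teacher || projected student), computed in teacher vocabulary space."""
    projected = project(student_probs, W)
    return sum(p * math.log(p / (projected.get(t, 0.0) + eps))
               for t, p in teacher_probs.items() if p > 0)

# Hypothetical sparse rows of W: "201" routes to the teacher's digit tokens.
W = {
    "201": {"2": 4 / 7, "0": 2 / 7, "1": 1 / 7},
    "the": {"the": 1.0},
}

# The teacher is about to spell a number starting with "2".
teacher_probs = {"2": 0.9, "the": 0.1}

student_good = {"201": 0.9, "the": 0.1}  # mass on the packed numeral
student_bad = {"201": 0.1, "the": 0.9}   # mass elsewhere

loss_good = p_kl(teacher_probs, student_good, W)
loss_bad = p_kl(teacher_probs, student_bad, W)
print(f"loss with mass on '201': {loss_good:.3f}")
print(f"loss with mass on 'the': {loss_bad:.3f}")
```

The student that favors "201" gets the lower loss, so the numeral receives a direct learning signal instead of being left to rank matching.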

H-KL keeps the partition structure but relaxes matching. Each student token is paired with its top-ranked teacher mapping under W, expanding the common set beyond strict string equality. This preserves sharper identity-aligned supervision when the partition is structurally sound.
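One way to picture the relaxed matching is the sketch below. It pairs each student token with its highest-weight teacher token under W. The collision rule (when two student tokens point at the same teacher token, keep the one with the larger weight) is an assumption added here to keep the pairing 1-to-1; the source does not say how collisions are resolved. The W rows are hypothetical.

```python
def relaxed_pairs(W):
    """Pair each student token with its top-weighted teacher token.

    Assumption: on a collision, the student token with the larger weight
    keeps the teacher token and the other stays in the uncommon set.
    """
    best = {}  # teacher token -> (weight, student token)
    for s, row in W.items():
        if not row:
            continue
        t, w = max(row.items(), key=lambda kv: kv[1])
        if t not in best or w > best[t][0]:
            best[t] = (w, s)
    return {s: t for t, (w, s) in best.items()}

# Hypothetical rows of W for three student tokens.
W = {
    "Ġthe": {"the": 1.0},                        # differs only by a space marker
    "2": {"2": 1.0},                             # exact string match
    "201": {"2": 4 / 7, "0": 2 / 7, "1": 1 / 7}, # multi-token mapping
}
teacher_vocab = {"the", "2", "0", "1"}

strict_common = {s for s in W if s in teacher_vocab}
pairs = relaxed_pairs(W)

print("strict string match:", sorted(strict_common))
print("relaxed pairs:      ", pairs)
```

In this toy case the relaxed pairing recovers "Ġthe", which strict string equality misses, while "201" still has no clean single-token partner, the situation where P-KL is the better fit.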

The selection between them uses a coverage audit: P-KL when critical token categories (multi-digit numerals) fall outside the common set; H-KL when alignment is preserved.

Results

| Setting | GSM8k | Avg. (5 benchmarks) |
| --- | --- | --- |
| No distillation (Llama-1B base) | 5.69 | 33.96 |
| Continued pre-training (no teacher) | 10.25 | 36.63 |
| Same-tokenizer KD (Llama-3B) | 12.89 | 38.40 |
| GOLD (Qwen-4B, cross-tokenizer) | 2.56 | 35.03 |
| X-Token P-KL (Qwen-4B) | 15.54 | 38.85 |
| X-Token H-KL (Phi-4-mini) | 19.11 | 39.18 |
| Multi-teacher (Phi-mini + Llama-3B) | 20.39 | 40.48 |

P-KL on Qwen3-4B improves GSM8k roughly 6× over GOLD (2.56 to 15.54) and beats same-tokenizer KD from Llama-3B on math (15.54 vs. 12.89). The average across the five benchmarks rises by 3.82 points over GOLD (35.03 to 38.85). H-KL on Phi-4-mini adds +0.52 over GOLD on the same teacher pair, where the partition remains structurally sound.

The multi-teacher setup combines Phi-4-mini (H-KL) with Llama-3B (same-tokenizer KL) under static weighting and reaches 40.48 average - exceeding the best single-teacher result by +1.3. Teacher complementarity drives the gain: Phi-4-mini contributes math and reasoning, Llama-3B contributes commonsense knowledge. Combining two reasoning-heavy cross-tokenizer teachers (Phi-4-mini + Qwen-4B) performs worse, suggesting overlapping strengths dilute the signal.
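The headline figures in the two paragraphs above can be re-derived from the reported scores. The snippet below only repeats that arithmetic; the dictionary keys are shorthand invented here.

```python
# Reported scores, copied from the results table.
gsm8k = {"gold_qwen": 2.56, "pkl_qwen": 15.54, "same_tok_kd": 12.89}
avg = {"gold_qwen": 35.03, "pkl_qwen": 38.85, "hkl_phi": 39.18, "multi": 40.48}

ratio = gsm8k["pkl_qwen"] / gsm8k["gold_qwen"]   # P-KL vs. GOLD on GSM8k
avg_gain = avg["pkl_qwen"] - avg["gold_qwen"]    # P-KL vs. GOLD on the average
multi_gain = avg["multi"] - avg["hkl_phi"]       # multi-teacher vs. best single teacher

print(f"GSM8k ratio, P-KL over GOLD:    {ratio:.2f}x")
print(f"average gain, P-KL over GOLD:   {avg_gain:+.2f}")
print(f"average gain, multi over H-KL:  {multi_gain:+.2f}")
```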

Dynamic Loss Scaling

X-Token also introduces dynamic KD/CE scaling. Instead of fixed loss weights, each training step rescales the KD term to match the current cross-entropy magnitude using a stop-gradient ratio. This outperforms three static weight configurations across KD-heavy, balanced, and CE-heavy regimes on a 3,000-step sweep.
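A scalar sketch of the rescaling idea, assuming the combined objective is CE plus a scaled KD term and that the scale is the ratio CE / KD computed from detached values. The function names and the epsilon are illustrative. In an autograd framework the ratio would be computed on values cut off from the graph (for example via `Tensor.detach()` in PyTorch) so that no gradient flows through the scale itself.

```python
def dynamic_kd_scale(ce_value, kd_value, eps=1e-8):
    """Ratio of the current loss magnitudes, treated as a constant (stop-gradient)."""
    return ce_value / (kd_value + eps)

def combined_loss(ce_value, kd_value):
    """CE plus the KD term rescaled to the current CE magnitude."""
    scale = dynamic_kd_scale(ce_value, kd_value)
    return ce_value + scale * kd_value, scale

def combined_grad(ce_grad, kd_grad, scale):
    """Because the scale is a constant, the gradient is just a weighted sum."""
    return [g_ce + scale * g_kd for g_ce, g_kd in zip(ce_grad, kd_grad)]

# Toy values: the KD term is 40x smaller than CE at this step.
total, scale = combined_loss(ce_value=2.0, kd_value=0.05)
grad = combined_grad(ce_grad=[0.1, -0.2], kd_grad=[0.01, 0.02], scale=scale)

print(f"scale applied to KD: {scale:.2f}")
print(f"combined loss:       {total:.4f}")
print("combined gradient:  ", [round(g, 4) for g in grad])
```

The effect is that the KD term contributes at the same magnitude as CE at every step, whatever the raw values happen to be, which removes one hand-tuned weight from the recipe.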

Why This Matters

Cross-tokenizer distillation removes teacher tokenizer lock-in. A practitioner can, in principle, distill from any strong teacher regardless of vocabulary (Qwen into Llama, Phi into Gemma) rather than being restricted to same-family models. Because the projection matrix W is built from tokenizer rules with no training required, the X-Token losses can stand in for a standard KD loss without an extra alignment stage.

The loss mode selection is operationalized through a simple coverage scan: count how many tokens in critical categories (digits, punctuation, multi-byte) land in the common set. If multi-digit numerals all fall out, use P-KL. If alignment is clean, H-KL preserves sharper per-pair supervision.
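A coverage scan along these lines could look like the sketch below. The category predicates, the 0.5 threshold, and the toy vocabularies are all assumptions for illustration; the source only states that the decision depends on whether critical categories land in the common set.

```python
import string

def coverage_audit(student_vocab, teacher_vocab, categories):
    """Fraction of student tokens per category that string-match a teacher token."""
    teacher_set = set(teacher_vocab)
    report = {}
    for name, predicate in categories.items():
        members = [t for t in student_vocab if predicate(t)]
        covered = sum(1 for t in members if t in teacher_set)
        report[name] = covered / len(members) if members else 1.0
    return report

def choose_loss(report, critical, threshold=0.5):
    """Hypothetical rule: P-KL if any critical category is mostly uncovered."""
    return "P-KL" if any(report[c] < threshold for c in critical) else "H-KL"

categories = {
    "multi_digit": lambda t: t.isascii() and t.isdigit() and len(t) > 1,
    "punctuation": lambda t: len(t) > 0 and all(ch in string.punctuation for ch in t),
}

student_vocab = ["the", ",", ".", "7", "20", "201"]
digit_splitting_teacher = ["the", ",", ".", "2", "0", "1", "7"]
aligned_teacher = ["the", ",", ".", "7", "20", "201"]

report_split = coverage_audit(student_vocab, digit_splitting_teacher, categories)
report_aligned = coverage_audit(student_vocab, aligned_teacher, categories)

print(report_split, "->", choose_loss(report_split, ["multi_digit"]))
print(report_aligned, "->", choose_loss(report_aligned, ["multi_digit"]))
```

The audit only needs the two vocabularies, so it can run before any training compute is spent.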

The paper, code, and prebuilt projection matrices are available on arXiv.[^1]

[^1]: Sharath Turuvekere Sreenivas et al. "X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation." arXiv:2605.21699, 2026.