When Cross-Tokenizer Distillation Breaks Math: X-Token Recovers the 6× Cliff

Knowledge distillation transfers a larger teacher model's output distribution to a smaller student. Standard KD uses KL divergence over per-token probability distributions, which requires the teacher and student to share a tokenizer. If you commit to a Llama-family student, you cannot train against a stronger Qwen or Phi teacher, whose tokenizers use different vocabularies, without bridging the two.
X-Token, from NVIDIA researchers Sharath Turuvekere Sreenivas, Pavlo Molchanov, and colleagues, addresses this with a sparse projection matrix and two complementary loss formulations. The paper uses a Llama-3.2-1B student to validate the approach across three teachers: Llama-3.2-3B (same tokenizer, for reference), Qwen3-4B, and Phi-4-mini-Instruct.
The Uncommon-Token Failure
The current state of the art, GOLD, partitions tokens into a 1-to-1 string-matched common set (trained with standard KL) and an uncommon remainder (matched via rank-sorted L1 distance). This partition works when tokenizers fragment text similarly. It breaks when they diverge.
Qwen3 splits multi-digit numerals digit-by-digit ("201" becomes ["2","0","1"]) while Llama packs them as single tokens. In the GOLD partition, all 1,100 multi-digit Llama numerals land in the uncommon set. Two effects degrade these tokens:
- Rank-based matching pairs each student token with a teacher token of similar rank, ignoring semantic correspondence. "201" gets matched to unrelated teacher tokens like special characters.
- Suppressive gradients from the common-KL term propagate through the full-vocabulary softmax. Even though uncommon tokens do not appear explicitly in the loss, the normalization drives down their probability mass.
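The second effect can be checked numerically. The following is a minimal sketch, not GOLD's implementation: it assumes a four-token toy vocabulary, a KL term summed only over the common indices, and a student softmax normalized over the full vocabulary. All names (`common_kl`, `grad_wrt_logit`) and all numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def common_kl(logits, teacher_probs, common_idx):
    """KL(teacher || student) summed over common indices only.
    The student softmax is still normalized over the full vocabulary."""
    p = softmax(logits)
    return sum(
        teacher_probs[i] * (math.log(teacher_probs[i]) - math.log(p[i]))
        for i in common_idx
        if teacher_probs[i] > 0
    )

def grad_wrt_logit(logits, teacher_probs, common_idx, k, h=1e-6):
    """Central finite-difference gradient of the loss w.r.t. logit k."""
    up = list(logits)
    up[k] += h
    down = list(logits)
    down[k] -= h
    return (common_kl(up, teacher_probs, common_idx)
            - common_kl(down, teacher_probs, common_idx)) / (2 * h)

# Toy setup (assumed): indices 0-2 are common tokens, index 3 is uncommon.
logits = [1.0, 0.5, -0.2, 0.8]
teacher_probs = [0.6, 0.3, 0.1, 0.0]
common_idx = [0, 1, 2]
uncommon = 3

g = grad_wrt_logit(logits, teacher_probs, common_idx, uncommon)
p_u = softmax(logits)[uncommon]
teacher_mass = sum(teacher_probs[i] for i in common_idx)

# Analytically the gradient equals p_u * (teacher mass on common tokens),
# which is positive, so gradient descent lowers the uncommon token's logit.
print(g, p_u * teacher_mass)
```

The uncommon token never appears in the loss sum, yet its logit receives a positive gradient proportional to its current probability, so every descent step pushes it down.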
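The result: GSM8k drops to 2.56, compared to 12.89 for same-tokenizer KD from a weaker teacher (Llama-3B) and 10.25 for continued pre-training without any teacher. That is also below the 5.69 of the undistilled Llama-1B base, so distilling from the stronger teacher leaves the student worse at math than using no teacher at all. The partition-based hybrid loss degrades GSM8k scores when critical tokens fall into the unmatched subset.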
The Fix: Two Loss Modes, One Projection Matrix
X-Token introduces a sparse projection matrix W that maps between student and teacher vocabularies. It is constructed deterministically in two passes: exact-match (canonicalized string equality) and multi-token rules (re-tokenizing each student token under the teacher tokenizer with exponential weight decay). The matrix is truncated to top-4 teacher tokens per student token.
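The two-pass construction can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the canonicalization rule, the decay base of 0.5, the row normalization, and the character-level toy teacher tokenizer are all hypothetical choices, and W is stored as a dict of sparse rows rather than a real sparse matrix.

```python
def canonicalize(token):
    # Assumed canonical form: map common word-boundary markers to a space.
    return token.replace("Ġ", " ").replace("▁", " ")

def build_projection(student_vocab, teacher_vocab, teacher_tokenize,
                     decay=0.5, top_k=4):
    """Return W as {student_token: {teacher_token: weight}}."""
    teacher_by_text = {canonicalize(t): t for t in teacher_vocab}
    W = {}
    for s_tok in student_vocab:
        text = canonicalize(s_tok)
        # Pass 1: exact match on the canonicalized string.
        if text in teacher_by_text:
            W[s_tok] = {teacher_by_text[text]: 1.0}
            continue
        # Pass 2: re-tokenize under the teacher tokenizer; later pieces
        # receive exponentially smaller weight (decay ** position).
        weights = {}
        for pos, piece in enumerate(teacher_tokenize(text)):
            weights[piece] = weights.get(piece, 0.0) + decay ** pos
        # Keep only the top_k teacher tokens, then renormalize the row.
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        norm = sum(w for _, w in top)
        W[s_tok] = {t: w / norm for t, w in top} if norm > 0 else {}
    return W

def toy_teacher_tokenize(text):
    # Hypothetical teacher that splits every string into single characters,
    # mimicking digit-by-digit numeral splitting.
    return list(text)

student_vocab = ["201", "the", "2"]
teacher_vocab = ["2", "0", "1", "the"]
W = build_projection(student_vocab, teacher_vocab, toy_teacher_tokenize)
print(W["201"])  # weights decay across "2", "0", "1"
```

With a real tokenizer the re-tokenized pieces are teacher vocabulary entries by construction; the toy tokenizer here only guarantees that for digit strings.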
Two loss formulations target different mismatch regimes:
P-KL removes the partition entirely. It projects the student distribution into teacher vocabulary space through W and applies KL divergence directly. This recovers signal for uncommon but critical tokens: "201" is correctly routed to teacher tokens ["2","0","1"] through the projection.
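A minimal sketch of the projection step, assuming forward KL (teacher against projected student), a small epsilon for numerical safety, and a hand-written toy W; the paper's exact formulation may differ, and the function names are hypothetical.

```python
import math

def project(student_probs, W):
    """Push student probability mass into teacher vocabulary space via W."""
    projected = {}
    for s_tok, p in student_probs.items():
        for t_tok, w in W.get(s_tok, {}).items():
            projected[t_tok] = projected.get(t_tok, 0.0) + p * w
    return projected

def p_kl(teacher_probs, student_probs, W, eps=1e-12):
    """KL(teacher || projected student); the direction is an assumption."""
    q = project(student_probs, W)
    return sum(
        t * (math.log(t) - math.log(q.get(tok, 0.0) + eps))
        for tok, t in teacher_probs.items()
        if t > 0
    )

# Toy projection rows (assumed): "201" routes to the teacher's digit tokens.
W = {
    "201": {"2": 4 / 7, "0": 2 / 7, "1": 1 / 7},
    "the": {"the": 1.0},
}
student_probs = {"201": 0.7, "the": 0.3}

projected = project(student_probs, W)
matched_loss = p_kl(projected, student_probs, W)  # teacher equals projection
mismatched_loss = p_kl({"2": 0.1, "the": 0.9}, student_probs, W)
print(projected, matched_loss, mismatched_loss)
```

Because each row of this toy W sums to 1, the projected vector is itself a probability distribution over teacher tokens, and the mass the student places on "201" reaches the teacher's digit tokens instead of being excluded from the loss.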
H-KL keeps the partition structure but relaxes matching. Each student token is paired with its top-ranked teacher mapping under W, expanding the common set beyond strict string equality. This preserves sharper identity-aligned supervision when the partition is structurally sound.
The selection between them uses a coverage audit: P-KL when critical token categories (multi-digit numerals) fall outside the common set; H-KL when alignment is preserved.
Results
| Setting | GSM8k | Avg. (5 benchmarks) |
|---|---|---|
| No distillation (Llama-1B base) | 5.69 | 33.96 |
| Continued pre-training (no teacher) | 10.25 | 36.63 |
| Same-tokenizer KD (Llama-3B) | 12.89 | 38.40 |
| GOLD (Qwen3-4B, cross-tokenizer) | 2.56 | 35.03 |
| X-Token P-KL (Qwen3-4B) | 15.54 | 38.85 |
| X-Token H-KL (Phi-4-mini) | 19.11 | 39.18 |
| Multi-teacher (Phi-4-mini + Llama-3B) | 20.39 | 40.48 |
P-KL on Qwen3-4B improves GSM8k 6× over GOLD (2.56 to 15.54) and beats same-tokenizer KD from Llama-3B on math (15.54 vs. 12.89). The overall average gain over GOLD is +3.82 points (35.03 to 38.85). H-KL on Phi-4-mini adds +0.52 over GOLD on that teacher pair (a comparison not shown in the table), where the partition remains structurally sound.
The multi-teacher setup combines Phi-4-mini (H-KL) with Llama-3B (same-tokenizer KL) under static weighting and reaches a 40.48 average, exceeding the best single-teacher result (39.18) by +1.3. Teacher complementarity drives the gain: Phi-4-mini contributes math and reasoning, Llama-3B contributes commonsense knowledge. Combining two reasoning-heavy cross-tokenizer teachers (Phi-4-mini + Qwen3-4B) performs worse, suggesting overlapping strengths dilute the signal.
Dynamic Loss Scaling
X-Token also introduces dynamic KD/CE scaling. Instead of fixed loss weights, each training step rescales the KD term to match the current cross-entropy magnitude using a stop-gradient ratio. This outperforms three static weight configurations across KD-heavy, balanced, and CE-heavy regimes on a 3,000-step sweep.
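Why the ratio needs a stop-gradient can be seen with toy scalar losses. The sketch below assumes the scale is CE divided by KD and that the total loss is CE plus the scaled KD term; both are readings of the description above rather than the paper's stated formula. The functions `ce` and `kd` are made-up quadratics in a single parameter, and "stop-gradient" is emulated by freezing the scale as a constant.

```python
def ce(theta):
    # Hypothetical cross-entropy as a function of one parameter.
    return (theta - 1.0) ** 2 + 1.0

def kd(theta):
    # Hypothetical KD loss, on a smaller numeric scale than CE.
    return 0.1 * (theta + 2.0) ** 2 + 0.05

def grad(f, x, h=1e-6):
    """Central finite-difference derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

theta0 = 0.5

# Stop-gradient emulation: compute the ratio once and treat it as a constant.
scale = ce(theta0) / kd(theta0)

def total_frozen(theta):
    return ce(theta) + scale * kd(theta)

def total_live(theta):
    # Same expression without stop-gradient: the ratio depends on theta,
    # so (ce / kd) * kd cancels to ce and the KD signal disappears.
    return ce(theta) + (ce(theta) / kd(theta)) * kd(theta)

g_frozen = grad(total_frozen, theta0)
g_live = grad(total_live, theta0)
print(scale * kd(theta0), ce(theta0))  # scaled KD matches CE magnitude
print(g_frozen, g_live)
```

With the scale frozen, the gradient is the CE gradient plus a rescaled KD gradient. Without the stop-gradient, the total collapses algebraically to twice the CE loss and the teacher contributes nothing to the update. In an autograd framework the same effect would come from computing the ratio on detached values.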
Why This Matters
Cross-tokenizer distillation removes teacher tokenizer lock-in. A practitioner can distill from any strong teacher regardless of vocabulary (Qwen into Llama, Phi into Gemma) rather than being restricted to same-family models. The projection matrix W is constructed from tokenizer rules with no training required, which lets the X-Token losses slot in where a standard KD loss would be used.
The loss mode selection is operationalized through a simple coverage scan: count how many tokens in critical categories (digits, punctuation, multi-byte) land in the common set. If multi-digit numerals all fall out, use P-KL. If alignment is clean, H-KL preserves sharper per-pair supervision.
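A coverage scan along these lines might look like the sketch below. The category regexes, the 0.5 decision threshold, and the toy vocabularies are assumptions made for illustration; the text above only specifies the two extremes (all multi-digit numerals missing, or alignment clean). Matching here is plain string equality against the teacher vocabulary.

```python
import re

def coverage_audit(student_vocab, teacher_vocab, categories):
    """For each category, count student tokens that also exist verbatim
    in the teacher vocabulary. Returns {name: (covered, total)}."""
    teacher_set = set(teacher_vocab)
    report = {}
    for name, pattern in categories.items():
        members = [t for t in student_vocab if re.fullmatch(pattern, t)]
        covered = sum(1 for t in members if t in teacher_set)
        report[name] = (covered, len(members))
    return report

def choose_loss(report, critical="multi_digit", threshold=0.5):
    # Hypothetical rule: fall back to P-KL when coverage of the critical
    # category drops below the threshold, otherwise keep H-KL.
    covered, total = report[critical]
    if total and covered / total < threshold:
        return "P-KL"
    return "H-KL"

categories = {
    "multi_digit": r"\d{2,}",
    "single_digit": r"\d",
    "punctuation": r"[^\w\s]+",
}
student_vocab = ["7", "42", "201", ",", "the"]
teacher_vocab = ["7", "4", "2", "0", "1", ",", "the"]

report = coverage_audit(student_vocab, teacher_vocab, categories)
print(report, choose_loss(report))
```

In this toy case neither "42" nor "201" exists in the teacher vocabulary, so the multi-digit category has zero coverage and the rule selects P-KL.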
The paper, code, and precomputed projection matrices are available on arXiv.[^1]
[^1]: Sharath Turuvekere Sreenivas et al. "X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation." arXiv:2605.21699, 2026.