ICML 2026

IDLM: Inverse-distilled Diffusion Language Models

David Li¹* Nikita Gushchin²,³* Dmitry Abulkhanov Eric Moulines¹,⁴ Ivan Oseledets³,² Maxim Panov¹ Alexander Korotin²,³

¹Mohamed Bin Zayed University of AI ²Applied AI Institute ³AXXX ⁴EPITA *Equal contribution

IDLM distills pretrained Diffusion Language Models into few-step generators, reducing inference steps by 4x-64x while preserving generation quality.

Overview diagram contrasting Diffusion Language Models with Inverse-distilled Diffusion Language Models. Standard DLMs produce high-quality samples but require many inference steps. IDLM trains a distilled generator that preserves the teacher's behavior while enabling fast inference.

Key Innovations

Inverse distillation for tokens. IDLM adapts inverse distillation from continuous diffusion models to discrete language generation.
A valid optimization target. The method addresses non-uniqueness in the inverse objective by establishing a uniqueness guarantee for the discrete setting.
Stable training relaxations. Gradient-stable relaxations make backpropagation through discrete token spaces practical.
Fast few-step sampling. Experiments across multiple DLM teachers report a 4x-64x reduction in inference steps while maintaining entropy and generative perplexity.
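The two quality metrics named above can be illustrated with their textbook definitions. The sketch below is a minimal illustration, not the paper's evaluation code: it assumes entropy means the Shannon entropy of a sample's empirical token distribution, and generative perplexity means the exponentiated mean negative log-likelihood under some external evaluator model. The function names, the example sentence, and the log-probabilities are all hypothetical.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (in nats) of the empirical token distribution of one sample."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) of a sample under an evaluator model."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical generated sample and hypothetical per-token log-probabilities
# that an evaluator model might assign to it.
sample = ["the", "cat", "sat", "on", "the", "mat"]
log_probs = [math.log(0.25)] * len(sample)

sample_entropy = token_entropy(sample)
sample_ppl = perplexity(log_probs)
```

Read together, the two numbers guard against opposite failure modes: very low perplexity with collapsed entropy usually signals repetitive text, so a distilled generator is expected to keep both close to the teacher's values.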

Abstract

Diffusion Language Models have become a promising route for text generation, but their iterative sampling process can make inference slow. IDLM extends inverse distillation to this discrete setting, using a pretrained DLM as the teacher and training a student generator for substantially fewer sampling steps.

The paper tackles the theoretical issue of non-unique inverse objectives and the practical difficulty of unstable discrete-space gradients. Its solution combines a uniqueness result with differentiable training relaxations, yielding a post-training framework that accelerates DLM inference while preserving core quality metrics.
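This page does not state which relaxations the paper uses. As a generic illustration of why a relaxation is needed at all, the sketch below replaces a hard token choice (an argmax, which has no useful gradient) with a temperature-controlled softmax mixture of embeddings, which varies smoothly with the logits. The embedding table, the logits, and the function names are assumptions made for this example and should not be read as the IDLM training procedure.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a lower temperature gives a sharper distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def soft_embedding(logits, embedding_table, temperature=1.0):
    """Probability-weighted average of embedding rows (a 'soft' token)."""
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 0.5, -1.0]

smooth = soft_embedding(logits, embedding_table, temperature=1.0)
sharp = soft_embedding(logits, embedding_table, temperature=0.01)
```

As the temperature approaches zero, the soft embedding approaches the embedding of the highest-scoring token, so the relaxation recovers the discrete choice in the limit while staying differentiable at any positive temperature.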

Text Generation: DLM vs. IDLM

Animated comparison in which a DLM and IDLM perform the same generation task, with IDLM finishing in far fewer model calls. The DLM side uses 64 sampling steps, while the IDLM side produces the completed text after only 4 distilled model calls.

IDLM keeps the discrete diffusion teacher's quality target, but compresses sampling into a few-step generator for substantially faster inference.
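The 64-step versus 4-step comparison can be made concrete by counting model calls in a toy sampler. The sketch below assumes a masked-diffusion-style loop that unmasks an equal share of positions per step; the teachers covered in the paper may sample differently. `stub_denoiser` is a hypothetical stand-in for a network forward pass and produces placeholder tokens only.

```python
import math

MASK = "<mask>"

def stub_denoiser(tokens):
    """Hypothetical stand-in for one model call: proposes a token per position."""
    return [f"tok{i}" for i in range(len(tokens))]

def sample(seq_len, num_steps):
    """Unmask a sequence over num_steps model calls; returns (tokens, calls)."""
    tokens = [MASK] * seq_len
    calls = 0
    for step in range(num_steps):
        proposals = stub_denoiser(tokens)
        calls += 1
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        steps_left = num_steps - step
        # Reveal an equal share of the remaining masked positions at each step.
        k = math.ceil(len(masked) / steps_left)
        for i in masked[:k]:
            tokens[i] = proposals[i]
    return tokens, calls

teacher_tokens, teacher_calls = sample(seq_len=64, num_steps=64)
student_tokens, student_calls = sample(seq_len=64, num_steps=4)
speedup_in_calls = teacher_calls / student_calls
```

With these settings the toy teacher spends 64 calls and the toy student 4, a 16x reduction in calls, which sits inside the 4x-64x range reported above. The sketch says nothing about sample quality, which is what the distillation itself has to preserve.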

BibTeX

@article{li2026idlm,
  title={IDLM: Inverse-distilled Diffusion Language Models},
  author={Li, David and Gushchin, Nikita and Abulkhanov, Dmitry and Moulines, Eric and Oseledets, Ivan and Panov, Maxim and Korotin, Alexander},
  journal={arXiv preprint arXiv:2602.19066},
  year={2026}
}