Diffusion Large Language Model
4/3/26
Compared with ARMs (autoregressive models)
LLaDA-V
MMaDA, Princeton, SEED
Fast-dLLM, NVIDIA: introduces a KV cache
Very fast, but output quality is only average
Traditional autoregressive language models generate text one word (or token) at a time. This sequential process can be slow, and it can limit the quality and coherence of the output.
Diffusion models work differently. Instead of predicting text directly, they learn to generate outputs by refining noise step by step. This means they can iterate on a solution very quickly and correct errors during the generation process, which helps them excel at tasks like editing, including in the context of math and code.
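The step-by-step refinement described above can be sketched as iterative unmasking: start from a fully masked sequence and commit the most confident proposals on each pass. This is only a toy illustration. `toy_denoiser` is a hypothetical stand-in for a trained network (it simply copies from a known target and assigns random confidences), and the names `MASK` and `diffusion_decode` are assumptions, not part of any of the models listed here.

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a trained model: for every masked
    position, propose a token together with a confidence score."""
    return {i: (target[i], rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def diffusion_decode(target, steps=4, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * len(target)            # start from "pure noise"
    per_step = -(-len(target) // steps)      # ceil(len / steps)
    history = []
    for _ in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break
        # Commit only the most confident proposals; the rest stay masked
        # and are re-predicted on the next pass.
        best = sorted(proposals, key=lambda i: proposals[i][1],
                      reverse=True)[:per_step]
        for i in best:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history

if __name__ == "__main__":
    final, trace = diffusion_decode("the cat sat on the mat".split(), steps=3)
    for state in trace:
        print(" ".join(state))
```

Unlike left-to-right decoding, the positions filled in each pass can be anywhere in the sequence, and several are filled at once.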
Drawbacks
- No KV cache
- Fixed output length
Approach to variable length: Block Diffusion
Bidirectional attention can see the context on both sides, which makes editing a very natural fit
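A rough sketch of how block-wise generation can give variable-length output: blocks are produced left to right, each block is denoised in a few parallel steps, and decoding stops once a block contains an end-of-sequence token. Everything here is illustrative. `toy_fill` is a hypothetical denoiser that copies from a known reference, and `block_size`, `steps_per_block`, and `EOS` are assumed names rather than parameters of a real implementation.

```python
import random

MASK, EOS = "<mask>", "<eos>"

def toy_fill(block, truth, rng, k):
    """Hypothetical denoiser: reveal up to k masked slots of `block`,
    copying from `truth` (a real model would predict these tokens)."""
    masked = [i for i, t in enumerate(block) if t == MASK]
    for i in rng.sample(masked, min(k, len(masked))):
        block[i] = truth[i]

def block_diffusion_decode(reference, block_size=4, steps_per_block=2, seed=0):
    rng = random.Random(seed)
    # Pad with EOS so the final block always has a full set of targets.
    n_blocks = -(-(len(reference) + 1) // block_size)
    padded = reference + [EOS] * (n_blocks * block_size - len(reference))
    per_step = -(-block_size // steps_per_block)
    out, blocks_used = [], 0
    for start in range(0, len(padded), block_size):   # autoregressive over blocks
        truth = padded[start:start + block_size]
        block = [MASK] * block_size
        for _ in range(steps_per_block):               # parallel within a block
            toy_fill(block, truth, rng, per_step)
        out.extend(block)
        blocks_used += 1
        if EOS in block:                               # length decided at run time
            break
    return out[:out.index(EOS)], blocks_used

if __name__ == "__main__":
    text, n = block_diffusion_decode("diffusion models can emit text in blocks".split())
    print(n, "blocks:", " ".join(text))
```

Because earlier blocks are final once emitted, their attention keys and values could in principle be cached, which is one reason this layout is attractive.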
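To make the difference concrete, here is a minimal sketch of the two attention patterns as boolean masks, where `mask[i][j]` is true when position `i` may attend to position `j`. The helper names are hypothetical and no particular framework is assumed.

```python
def causal_mask(n):
    """Autoregressive pattern: position i sees only positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Diffusion-style pattern: every position sees the whole sequence."""
    return [[True] * n for _ in range(n)]

def visible_positions(mask, i):
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]

if __name__ == "__main__":
    n = 4
    print("causal, position 1:       ", visible_positions(causal_mask(n), 1))
    print("bidirectional, position 1:", visible_positions(bidirectional_mask(n), 1))
```

When a token in the middle of a sequence is rewritten, the bidirectional pattern lets the model condition on the text that follows it as well as the text that precedes it.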
Acceleration
- Cache
- Sampling
Trades a small loss in accuracy for a large gain in speed
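One way the sampling side of this trade-off might look is threshold-based parallel decoding: on each step, commit every masked position whose confidence clears a threshold, falling back to the single best position when none does. The simulation below is a toy under loud assumptions: the confidences are made up, the rule that confidence rises as more tokens are revealed (`boost`) is invented for illustration, and the accuracy cost of committing low-confidence tokens is not modeled.

```python
def count_decode_steps(base_conf, threshold, boost=0.5):
    """Count decoding steps needed to reveal every position.

    base_conf: made-up per-position confidences at the start.
    boost:     assumed gain in confidence once all other tokens are known.
    """
    n = len(base_conf)
    revealed = set()
    steps = 0
    while len(revealed) < n:
        frac = len(revealed) / n
        conf = {i: min(1.0, c + boost * frac)
                for i, c in enumerate(base_conf) if i not in revealed}
        ready = [i for i, c in conf.items() if c >= threshold]
        if not ready:
            # Nothing is confident enough: commit only the best candidate.
            ready = [max(conf, key=conf.get)]
        revealed.update(ready)
        steps += 1
    return steps

if __name__ == "__main__":
    base = [0.95, 0.6, 0.7, 0.4, 0.85, 0.5]
    print("strict threshold 0.9:", count_decode_steps(base, 0.9), "steps")
    print("loose threshold 0.6: ", count_decode_steps(base, 0.6), "steps")
```

A looser threshold finishes in fewer passes, but in a real model it also commits tokens the network is less sure about, which is where the small accuracy loss would come from.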
ARM
- Quantization, which likewise costs some accuracy
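The accuracy cost of quantization can be seen in a small round trip. This is a minimal sketch of symmetric per-tensor int8 quantization in plain Python, with hypothetical helper names and made-up weights; production schemes (per-channel scales, calibration, and so on) are more involved.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate floats."""
    return [x * scale for x in q]

def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

if __name__ == "__main__":
    weights = [0.12, -0.4, 0.33, 1.0]                  # made-up example values
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    print("int8 codes:", q)
    print("max error: ", max_abs_error(weights, restored))
```

Each restored value is off by at most half a quantization step (`scale / 2`), so storage and compute get cheaper at the price of a bounded rounding error.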
Long sequences
Sampling strategies
Autoregressive
Semi-autoregressive (Block Diffusion)
LLaDA
Problems:
- scaling
- Sequence length is hard to scale during training
