Generative Models

David Liu · 12/15/25

The basic goal of AI is to develop intelligent machines.

This consists of many sub-goals:

• Perception
- Multimodal foundation models learn to answer questions about images (and text in images)
- Diffusion models can be used as zero-shot classifiers
• Reasoning
- LLMs are also (unexpectedly) good at certain reasoning tasks
- cf. Chain-of-Thought Prompting (an example of in-context learning)
• Control / Motion / Manipulation
- DayDreamer learns a generative model of experiences for RL, i.e. a World Model, without simulation
- Quadruped robot learns to walk in under 1 hour
• Planning
- LLMs are already being used for grounded planning for embodied agents, cf. LLM-Planner
• Communication
- Communication comprises the comprehension and generation of human language.
- Large language models (LLMs) excel at both
- (Even though they are most often trained autoregressively, i.e. to generate a next word, given the previous ones)
• Creativity
- Text-to-image models [Midjourney’s Discord server has 18 million members (1.7 million were online this morning)]
- Text-to-music models [MusicGen can condition on text and an audio sample]
• Learning

Q: What does Generative AI have to do with any of these goals?

A: It’s making inroads into all of them.

ARM (autoregressive model): next token prediction
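
To make next token prediction concrete, here is a minimal sketch using a count-based bigram model. This is a deliberate simplification: a real ARM conditions on all previous tokens, while this toy conditions only on the last one. The function names, the `<s>` / `</s>` boundary tokens, and the tiny corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def fit_bigram(corpus):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Estimate p(next | prev) by relative frequency."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def sequence_log_prob(counts, sentence):
    """Chain rule: log p(x) = sum over t of log p(x_t | x_{t-1}).
    Raises KeyError for transitions never seen in training (no smoothing)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(next_token_probs(counts, prev)[nxt])
               for prev, nxt in zip(tokens, tokens[1:]))

def greedy_generate(counts, max_len=10):
    """Generate left to right, always taking the most frequent next token."""
    out, prev = [], "<s>"
    for _ in range(max_len):
        prev = counts[prev].most_common(1)[0][0]
        if prev == "</s>":
            break
        out.append(prev)
    return " ".join(out)

corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = fit_bigram(corpus)
print(next_token_probs(model, "the"))
print(sequence_log_prob(model, "the cat sat"))
print(greedy_generate(model))
```

Training an LLM replaces the count table with a neural network and maximizes the same kind of summed log-probability over a large corpus.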

Text-to-image

Outdated: GAN

normalizing flow

Mask-based discrete diffusion models
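
As a rough illustration of the forward (corruption) process in a mask-based discrete diffusion model, the sketch below replaces each token with a mask symbol independently with probability `t`. The linear schedule (mask probability equal to `t`), the `[MASK]` symbol, and the helper names are assumptions for illustration, not taken from a specific paper or library.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol

def forward_mask(tokens, t, rng):
    """Corrupt a sequence: each token becomes MASK independently with prob. t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def masked_targets(clean, noisy):
    """Positions the denoiser would be trained to reconstruct, with clean tokens."""
    return {i: c for i, (c, n) in enumerate(zip(clean, noisy)) if n == MASK}

rng = random.Random(0)
clean = "a photo of a small red bird".split()
for t in (0.0, 0.5, 1.0):
    noisy = forward_mask(clean, t, rng)
    print(t, noisy, len(masked_targets(clean, noisy)))
```

At `t = 0` nothing is masked and at `t = 1` everything is; a denoising network would be trained to predict the clean tokens at the masked positions for intermediate values of `t`.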

next token prediction: ARM

next frequency prediction: DM

next "set of tokens" prediction: MaskGIT
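
A minimal sketch of the idea of predicting a set of tokens per step, in the spirit of MaskGIT-style iterative parallel decoding: start fully masked, propose tokens for all masked positions, commit the most confident ones together, and repeat. The `toy_predict` function is a random stand-in for a trained bidirectional transformer, and the floor-based cosine schedule is an assumption chosen to keep the example simple.

```python
import math
import random

MASK = None  # hypothetical sentinel for a masked position

def toy_predict(tokens, vocab, rng):
    """Stand-in for a trained model: propose (token, confidence) for every
    masked position. Random here, purely for illustration."""
    return {i: (rng.choice(vocab), rng.random())
            for i, tok in enumerate(tokens) if tok is MASK}

def decode(length, vocab, steps, rng):
    """Fill a sequence in `steps` rounds, committing a set of tokens per round."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        proposals = toy_predict(tokens, vocab, rng)
        # Cosine schedule: how many positions remain masked after this step.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * step / steps))
        n_unmask = len(proposals) - keep_masked
        # Commit the most confident proposals in parallel; the rest stay masked.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_unmask]:
            tokens[i] = proposals[i][0]
    return tokens

print(decode(length=16, vocab=["a", "b", "c"], steps=8, rng=random.Random(0)))
```

With `steps=1` everything is predicted at once; with `steps` equal to the sequence length the procedure approaches one token per step, which is the autoregressive extreme in a learned rather than fixed order.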

next scale prediction: VAR

Separation of loss functions