Diffusion Models Explained¤
- **Progressive Denoising**: Learn to reverse a gradual noising process, iteratively refining random noise into coherent data
- **Stable Training**: Straightforward MSE objective with no adversarial dynamics—far more stable than GANs
- **State-of-the-Art Quality**: Achieves the highest quality generative results, powering DALL-E 2, Stable Diffusion, and Sora
- **Exceptional Controllability**: Natural framework for conditional generation, inpainting, editing, and guidance techniques
Overview¤
Diffusion models are a class of deep generative models that learn to generate data by reversing a gradual noising process. Unlike GANs which learn through adversarial training or VAEs which compress to latent codes, diffusion models systematically destroy data structure through noise addition, then learn to reverse this process for generation.
What makes diffusion models special? They solve the generative modeling challenge through an elegant two-stage process: a fixed forward diffusion that gradually corrupts data into pure noise over many timesteps, and a learned reverse diffusion that progressively denoises random samples into realistic data. This approach offers unprecedented training stability, superior mode coverage, and exceptional sample quality.
The Intuition: From Ink to Water and Back¤
Think of diffusion like watching a drop of ink dissolve in water:
- **The Forward Process** is like dropping ink into a glass of water and watching it gradually diffuse. At first, you clearly see the ink drop. Over time, it spreads and mixes until the water appears uniformly tinted—all structure is lost.
- **The Reverse Process** is like learning to run this process backwards: starting from uniformly tinted water and gradually reconstructing the original ink drop. This seems impossible by hand, but a neural network can learn the "reverse physics."
- **The Training** teaches the network to predict: "Given tinted water at some mixing stage, what did it look like one step earlier?" Repeat this prediction many times, and you recover the original ink drop from fully mixed water.
The critical insight: while the forward diffusion is fixed and simple (just add noise), the reverse process is learned and powerful. The model learns to undo corruption at every noise level, enabling generation from pure random noise.
Mathematical Foundation¤
The Forward Diffusion Process¤
The forward process defines a fixed Markov chain that gradually corrupts data \(x_0\) by adding Gaussian noise over \(T\) timesteps (typically \(T=1000\)):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$

where \(\beta_t \in (0,1)\) controls the variance of noise added at timestep \(t\). The complete forward chain factors as:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

Key property: We can sample \(x_t\) at any arbitrary timestep directly without simulating the full chain. Defining \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^t \alpha_i\):

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$

This can be reparameterized as:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
graph LR
A["x₀<br/>(Clean Data)"] -->|"β₁"| B["x₁<br/>(Slightly Noisy)"]
B -->|"β₂"| C["x₂"]
C -->|"β₃"| D["..."]
D -->|"βₜ"| E["xₜ"]
E -->|"..."| F["x_T<br/>(Pure Noise)"]
style A fill:#c8e6c9
style F fill:#ffccc9
style E fill:#fff3e0
Intuition: As \(t \to T\), the distribution \(q(x_T | x_0)\) approaches an isotropic Gaussian \(\mathcal{N}(0, \mathbf{I})\), ensuring the endpoint is tractable pure noise. The forward process is designed so that \(\bar{\alpha}_T \approx 0\).
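To make the closed-form sampling concrete, here is a minimal PyTorch sketch (the linear schedule values and tensor shapes are illustrative assumptions):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear schedule beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)       # \bar{alpha}_t, indexed 0..T-1

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in one shot: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Usage: corrupt a batch of images to random noise levels
x0 = torch.randn(8, 3, 32, 32)                  # stand-in for real data scaled to [-1, 1]
t = torch.randint(0, T, (8,))
eps = torch.randn_like(x0)
xt = q_sample(x0, t, eps)
```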
The posterior conditioned on the original data is also Gaussian with tractable parameters:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right)$$

where:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$
The Reverse Diffusion Process¤
The reverse process learns to invert the forward diffusion, starting from noise \(x_T \sim \mathcal{N}(0, \mathbf{I})\) and progressively denoising to data \(x_0\):

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

The complete generative process:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
graph RL
F["xₜ<br/>(Pure Noise)"] -->|"Neural Network<br/>Denoising"| E["xₜ₋₁"]
E -->|"Denoise"| D["..."]
D -->|"Denoise"| C["x₂"]
C -->|"Denoise"| B["x₁"]
B -->|"Denoise"| A["x₀<br/>(Generated Data)"]
style F fill:#ffccc9
style A fill:#c8e6c9
style E fill:#fff3e0
Three Equivalent Parameterizations:
Noise Prediction (most common): The network predicts the noise \(\epsilon\) that was added:

$$\epsilon_\theta(x_t, t) \approx \epsilon$$

Data Prediction: The network directly predicts the clean image:

$$\hat{x}_\theta(x_t, t) \approx x_0$$

Score Prediction: The network predicts the gradient of the log probability:

$$s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t)$$

These are mathematically equivalent through Tweedie's formula:

$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}, \qquad \hat{x}_\theta(x_t, t) = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$$
Mathematical Equivalence
Predicting noise is equivalent to predicting the score function, unifying diffusion models with score-based generative modeling. This connection reveals deep theoretical relationships between different approaches.
The ELBO Derivation¤
Diffusion models are Markovian hierarchical VAEs. The evidence lower bound decomposes as:

$$L_{\text{VLB}} = \mathbb{E}_q\!\left[ D_{\text{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^{T} D_{\text{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \right]$$

For Gaussian posteriors, the KL divergence terms simplify. The key loss term becomes:

$$L_{t-1} = \mathbb{E}_q\!\left[ \frac{1}{2\sigma_t^2} \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2 \right] + C$$

Substituting the reparameterization yields:

$$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\!\left[ \frac{\beta_t^2}{2\sigma_t^2\, \alpha_t (1-\bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]$$

Ho et al.'s key empirical finding: The simplified objective, which drops these weights, works better:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2 \right]$$
This reduces training to simple mean-squared error between predicted and actual noise!
Variance Schedules¤
The noise schedule \(\{\beta_1, \ldots, \beta_T\}\) fundamentally affects training and sampling quality:
Linear Schedule (Ho et al. 2020):

$$\beta_t = \beta_1 + \frac{t-1}{T-1}\,(\beta_T - \beta_1)$$

Typically \(\beta_1 = 0.0001\), \(\beta_T = 0.02\). Simple but can add too much noise early.

Cosine Schedule (Nichol & Dhariwal 2021):

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left( \frac{t/T + s}{1 + s} \cdot \frac{\pi}{2} \right)$$

with \(s = 0.008\). Provides smoother transitions and empirically superior performance.
Schedule Selection
The cosine schedule has become the de facto standard due to its superior empirical performance. It provides more balanced denoising across timesteps and avoids adding excessive noise in early steps.
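Below is a sketch of both schedules as they are usually implemented; the clipping of \(\beta_t\) at 0.999 is a common convention rather than something specified above:

```python
import torch

def linear_beta_schedule(T: int, beta_1: float = 1e-4, beta_T: float = 0.02) -> torch.Tensor:
    """Linearly spaced betas (Ho et al. 2020)."""
    return torch.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    """Cosine schedule (Nichol & Dhariwal 2021): define alpha_bar via a squared cosine."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - (alpha_bar[1:] / alpha_bar[:-1])      # beta_t = 1 - abar_t / abar_{t-1}
    return betas.clamp(max=0.999).float()             # clip to avoid singularities near t = T

betas = cosine_beta_schedule(1000)
alpha_bar = torch.cumprod(1 - betas, dim=0)
print(alpha_bar[-1])                                   # close to 0: pure noise at t = T
```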
Score-Based Perspective¤
The score function \(\nabla_x \log p(x)\) points toward regions of higher probability density. Score-based models train a network \(s_\theta(x, t)\) to approximate this gradient field through denoising score matching:

$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{t, x_0, x_t}\!\left[ \lambda(t) \left\| s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0) \right\|^2 \right]$$

Given the learned score, generation proceeds via Langevin dynamics:

$$x_{i+1} = x_i + \frac{\delta}{2}\, s_\theta(x_i, t) + \sqrt{\delta}\, z_i, \qquad z_i \sim \mathcal{N}(0, \mathbf{I})$$

The connection to diffusion: the score equals the negative scaled noise, \(\nabla_{x_t} \log q(x_t \mid x_0) = -\epsilon / \sqrt{1-\bar{\alpha}_t}\).
Stochastic Differential Equations¤
The continuous-time formulation generalizes discrete diffusion as an SDE:

$$dx = f(x, t)\, dt + g(t)\, dw$$

Variance Preserving (VP) SDE corresponds to DDPM:

$$dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw$$

The reverse-time SDE enables generation:

$$dx = \left[ f(x, t) - g(t)^2\, \nabla_x \log p_t(x) \right] dt + g(t)\, d\bar{w}$$

There exists an equivalent probability flow ODE:

$$\frac{dx}{dt} = f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x)$$

This ODE formulation enables exact likelihood computation and deterministic sampling.
Architecture Design¤
U-Net Backbone with Skip Connections¤
The U-Net architecture dominates diffusion models through its encoder-decoder structure with skip connections:
graph TB
subgraph "Encoder (Downsampling)"
A["Input Image<br/>256×256×3"] --> B["Conv + ResBlock<br/>128×128×128"]
B --> C["Conv + ResBlock<br/>64×64×256"]
C --> D["Conv + ResBlock<br/>32×32×512"]
D --> E["Bottleneck<br/>16×16×1024"]
end
subgraph "Decoder (Upsampling)"
E --> F["Upsample + ResBlock<br/>32×32×512"]
F --> G["Upsample + ResBlock<br/>64×64×256"]
G --> H["Upsample + ResBlock<br/>128×128×128"]
H --> I["Output<br/>256×256×3"]
end
D -.->|"Skip Connection"| F
C -.->|"Skip Connection"| G
B -.->|"Skip Connection"| H
style E fill:#fff3e0
style A fill:#e1f5ff
style I fill:#c8e6c9
Key Components:
- Contracting path: Progressive downsampling (e.g., 256→128→64→32→16) while increasing channels
- Expanding path: Upsampling reconstructs output at original resolution
- Skip connections: Critical for propagating spatial details lost in bottleneck
- ResNet blocks: \(\text{output} = \text{input} + F(\text{input}, \text{time\_emb})\)
- Group normalization: Dividing channels into groups (~32) for stability
Why U-Net for Diffusion?
- Input and output have identical dimensions (essential for iterative refinement)
- Skip connections preserve fine details through bottleneck
- Multi-scale processing captures both coarse structure and fine texture
- No information bottleneck—maintains full spatial information
Time Embedding Through Sinusoidal Encoding¤
Timestep information must flow through the network since denoising behavior depends critically on noise level. Sinusoidal position embeddings provide the standard approach:

$$\text{PE}(t, 2i) = \sin\!\left( \frac{t}{10000^{2i/d}} \right), \qquad \text{PE}(t, 2i+1) = \cos\!\left( \frac{t}{10000^{2i/d}} \right)$$

where \(t\) is the timestep, \(i\) is the dimension index, and \(d\) is the embedding dimension (typically 128-256).
Implementation:
- Embed timestep \(t\) into 128-256 dimensions via sinusoidal encoding
- Pass through 2-layer MLP projecting to model dimension
- Inject via FiLM (Feature-wise Linear Modulation): \(h_{\text{out}} = \gamma(t_{\text{emb}}) \odot h + \beta(t_{\text{emb}})\), where \(\gamma\) and \(\beta\) are learned functions of the time embedding
Time Conditioning
FiLM is the most effective injection method, allowing the network to adaptively modulate its processing for each noise level through learned affine transformations.
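A minimal sketch of sinusoidal timestep embedding plus FiLM-style injection, assuming hypothetical module names and a (B, C, H, W) feature layout:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Standard sinusoidal timestep embedding: sin/cos pairs at geometric frequencies."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]                        # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)      # (B, dim)

class FiLMBlock(nn.Module):
    """Modulate feature maps with per-channel scale/shift predicted from the time embedding."""
    def __init__(self, channels: int, emb_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(emb_dim, 2 * channels)

    def forward(self, h: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

# Usage
t_emb = sinusoidal_embedding(torch.randint(0, 1000, (4,)), dim=128)
film = FiLMBlock(channels=64, emb_dim=128)
out = film(torch.randn(4, 64, 32, 32), t_emb)
```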
Attention Mechanisms¤
Self-Attention captures long-range spatial dependencies:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right) V$$
Applied selectively due to \(O(N^2)\) complexity where \(N = H \times W\):
- Lower resolutions (16×16, 8×8): Apply multi-head self-attention
- Higher resolutions (32×32, 64×64): Skip attention (too expensive)
Cross-Attention enables conditioning on external information (e.g., text):
- Queries \(Q\): From image features
- Keys \(K\) and Values \(V\): From conditioning signal (CLIP/T5 text embeddings)
Different image regions attend to relevant text parts, enabling fine-grained control. Used extensively in Stable Diffusion.
graph LR
subgraph "Image Processing"
A["Image Features"] --> B["Query Q"]
end
subgraph "Text Conditioning"
C["Text Embeddings<br/>(CLIP/T5)"] --> D["Keys K"]
C --> E["Values V"]
end
B --> F["Cross-Attention"]
D --> F
E --> F
F --> G["Conditioned<br/>Features"]
style A fill:#e1f5ff
style C fill:#fff9c4
style G fill:#c8e6c9
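As a rough illustration, a single-head cross-attention sketch in PyTorch; real U-Nets use multi-head attention with residual connections, and the dimensions here (320-d image features, 768-d text tokens, 77 tokens) are merely typical values:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image features attend to text-encoder embeddings."""
    def __init__(self, img_dim: int, txt_dim: int, attn_dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(img_dim, attn_dim, bias=False)
        self.to_k = nn.Linear(txt_dim, attn_dim, bias=False)
        self.to_v = nn.Linear(txt_dim, attn_dim, bias=False)
        self.to_out = nn.Linear(attn_dim, img_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, H*W, img_dim) flattened spatial features
        # txt_tokens: (B, L, txt_dim), e.g. CLIP/T5 token embeddings
        q = self.to_q(img_tokens)
        k = self.to_k(txt_tokens)
        v = self.to_v(txt_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, H*W, L)
        return self.to_out(attn @ v)                                         # (B, H*W, img_dim)

# Usage: a 16x16 feature map attending to 77 text tokens
out = CrossAttention(img_dim=320, txt_dim=768)(torch.randn(2, 256, 320), torch.randn(2, 77, 768))
```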
Model Parameterization Choices¤
ε-Prediction (Noise Prediction) — Most Common:
Network predicts \(\epsilon_\theta(x_t, t) = \epsilon\) where \(x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon\)
- Loss: \(L_{\text{simple}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(x_t, t)\|^2]\)
- Advantages: Most stable, used by DDPM, Stable Diffusion, most implementations
x₀-Prediction (Data Prediction):
Network directly predicts clean image \(\hat{x}_\theta(x_t, t) = x_0\)
- Advantages: Better quality in some cases
- Disadvantages: More prone to out-of-distribution predictions requiring clipping
v-Prediction (Velocity Prediction):
Predicts velocity \(v_t = \alpha_t \epsilon - \sigma_t x_0\)
- Loss: \(L = \mathbb{E}[\|v_t - v_\theta(x_t, t)\|^2]\)
- Advantages: More balanced learning across noise levels, better numerical stability
- Usage: Imagen, Google models
Equivalence
All three parameterizations are mathematically equivalent and can be converted between each other. Most implementations default to ε-prediction, though v-prediction is gaining popularity.
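A small sketch of these conversions, derived from the reparameterization \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon\); function names are illustrative:

```python
import torch

def eps_to_x0(x_t: torch.Tensor, eps: torch.Tensor, alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """x0 = (x_t - sqrt(1 - abar) * eps) / sqrt(abar)."""
    return (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()

def x0_to_eps(x_t: torch.Tensor, x0: torch.Tensor, alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """eps = (x_t - sqrt(abar) * x0) / sqrt(1 - abar)."""
    return (x_t - alpha_bar_t.sqrt() * x0) / (1 - alpha_bar_t).sqrt()

def v_to_eps_x0(x_t, v, alpha_bar_t):
    """v = sqrt(abar) * eps - sqrt(1 - abar) * x0  =>  recover both eps and x0."""
    a, s = alpha_bar_t.sqrt(), (1 - alpha_bar_t).sqrt()
    eps = a * v + s * x_t
    x0 = a * x_t - s * v
    return eps, x0
```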
Training Process¤
The Simplified Training Objective¤
The simplified loss ignores the theoretical per-timestep weightings from the ELBO:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right], \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$$
Training Algorithm:
- Sample training image \(x_0 \sim q(x_0)\)
- Sample timestep \(t \sim \text{Uniform}(1, T)\)
- Sample noise \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\)
- Compute noisy image \(x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1-\bar{\alpha}_t} \, \epsilon\)
- Predict noise \(\hat{\epsilon} = \epsilon_\theta(x_t, t)\)
- Compute loss \(L = \|\epsilon - \hat{\epsilon}\|^2\)
- Update \(\theta\) via gradient descent
Remarkably simple: Just MSE between predicted and actual noise!
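A sketch of this loop in PyTorch, assuming a noise-prediction network called as `model(x_t, t)` and a precomputed `alpha_bars` tensor:

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alpha_bars, optimizer):
    """One DDPM training step: sample t and eps, build x_t, regress the noise with MSE."""
    B = x0.shape[0]
    alpha_bars = alpha_bars.to(x0.device)
    t = torch.randint(0, len(alpha_bars), (B,), device=x0.device)     # uniform timesteps
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps                 # forward diffusion
    loss = F.mse_loss(model(x_t, t), eps)                              # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```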
Loss Function Variants¤
Variational Lower Bound (VLB):
The full ELBO includes weighted terms for each timestep. While theoretically principled, optimizing full VLB is harder in practice.
Hybrid Objective (Nichol & Dhariwal 2021):
Combines \(L_{\text{simple}}\) for mean prediction with VLB terms for variance learning:

$$L_{\text{hybrid}} = L_{\text{simple}} + \lambda\, L_{\text{VLB}}$$

with a small weight (\(\lambda = 0.001\) in the paper) so the VLB term trains the learned variances without dominating the mean prediction.
Min-SNR-γ Weighting:
Clips the per-timestep loss weights at \(w_t = \min(\text{SNR}(t), \gamma)\), where \(\text{SNR}(t) = \bar{\alpha}_t / (1-\bar{\alpha}_t)\). Typical \(\gamma = 5\). Achieves 3.4× faster convergence by preventing over-weighting of easy timesteps.
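A sketch of the weighting as commonly applied to \(\epsilon\)-prediction losses (the division by SNR for \(\epsilon\)-prediction follows the usual formulation and should be treated as an assumption):

```python
import torch
import torch.nn.functional as F

def min_snr_loss(eps_pred: torch.Tensor, eps: torch.Tensor,
                 alpha_bar_t: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """Min-SNR-gamma weighted MSE for epsilon-prediction.

    SNR(t) = abar_t / (1 - abar_t); the clipped weight min(SNR, gamma) is divided
    by SNR when the network predicts noise rather than x0.
    """
    snr = alpha_bar_t / (1 - alpha_bar_t)                                 # (B,)
    weight = torch.minimum(snr, torch.full_like(snr, gamma)) / snr
    per_sample = F.mse_loss(eps_pred, eps, reduction="none").flatten(1).mean(dim=1)
    return (weight * per_sample).mean()
```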
Training Stability and Best Practices¤
Essential Training Practices
Exponential Moving Average (EMA): Critical for quality. Maintain a running average of the weights:

$$\theta_{\text{EMA}} \leftarrow \beta\, \theta_{\text{EMA}} + (1 - \beta)\, \theta$$

with \(\beta = 0.9999\). Always use EMA weights for inference, not raw training weights.
Gradient Clipping: Prevents exploding gradients. Clip gradient norms to 1.0.
Mixed Precision Training: FP16/BF16 provides 2-3× speedup, 40-50% memory reduction.
Normalization:
- Group Normalization: Divide channels into groups (~32) for stability
- Layer Normalization: Alternative for transformer-based models
- No Batch Normalization: Batch statistics interfere with noise conditioning
Regularization:
- Weight Decay: \(10^{-4}\) to \(10^{-6}\) with AdamW optimizer
- Dropout: Sometimes used (rate 0.1-0.3) but less common than in other architectures
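A minimal EMA helper illustrating the update above; a production version would also track buffers, and the gradient clipping mentioned earlier is the one-line `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step:

```python
import copy
import torch

class EMA:
    """Keep an exponential moving average of model weights for inference."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1 - self.decay)

# After each optimizer.step():  ema.update(model)
# At inference time, sample with ema.shadow, not the raw training weights.
```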
Hyperparameter Selection¤
Timesteps: \(T = 1000\) is standard for training. More steps provide finer granularity but slower sampling.
Noise Schedules:
- Cosine schedule outperforms linear empirically
- Critical: Ensure \(\bar{\alpha}_T \approx 0\) for pure noise at final step
Learning Rates:
- Standard: \(1 \times 10^{-4}\) to \(2 \times 10^{-4}\) with AdamW
- Sensitive domains (faces): \(1 \times 10^{-6}\) to \(2 \times 10^{-6}\)
- Use linear warmup over 500-1000 steps
Batch Sizes:
- Small images (32×32): 128-512
- Medium (256×256): 32-128
- Large (512×512): 8-32
- Use gradient accumulation to simulate larger batches
Optimizer Configuration:
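The original configuration block appears to be missing here; the following is a representative AdamW setup consistent with the values quoted above (anything not listed above is an assumption):

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)    # stand-in for the diffusion U-Net

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                        # 1e-4 to 2e-4, as noted above
    betas=(0.9, 0.999),
    weight_decay=1e-4,              # 1e-4 to 1e-6 range
    eps=1e-8,
)

# Linear warmup over the first ~1000 steps, then constant learning rate
warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
```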
Training Dynamics and Monitoring¤
Common Training Issues
Loss Plateaus: Normal behavior—loss doesn't directly correlate with quality. Monitor visual samples!
NaN Losses: Usually from exploding gradients. Enable gradient clipping and mixed precision loss scaling.
Poor Sample Quality: Check EMA is enabled, noise schedule is correct, sufficient training steps completed.
What to Monitor:
- Training Loss: Should decrease initially, then plateau
- Visual Samples: Generate every 5k-10k steps at fixed noise seeds
- FID Score: Compute on validation set every 25k-50k steps
- Gradient Norms: Should be stable, not exploding
- Learning Rate: Track warmup and decay schedules
Checkpoint Management:
- Save both regular and EMA weights
- Keep checkpoints every 50k-100k steps
- Save best checkpoint based on FID score
- Include optimizer state for resuming training
Computational Requirements¤
GPU Requirements:
- Minimum: 10GB VRAM (RTX 3080)
- Recommended: 24GB VRAM (RTX 3090/4090)
- Large-scale: 40-80GB (A100/H100)
Training Times:
- Small datasets (10k images): Days on single GPU
- Medium (100k images): Weeks on multiple GPUs
- Large-scale (millions): Months on hundreds of GPUs
- ImageNet 256×256 on 8× A100: 7-14 days
Memory Optimizations:
- Gradient Checkpointing: 30-50% memory reduction, 20% slowdown
- Mixed Precision: 40-50% memory reduction, 2-3× speedup
- Smaller Batch Sizes: Use gradient accumulation to maintain effective batch size
Sampling Methods¤
DDPM Sampling: The Iterative Reverse Process¤
The foundational DDPM sampling starts from pure noise \(x_T \sim \mathcal{N}(0, \mathbf{I})\) and iteratively denoises:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

where \(z \sim \mathcal{N}(0, \mathbf{I})\) and \(\sigma_t\) controls stochasticity.
Algorithm:
- Sample \(x_T \sim \mathcal{N}(0, \mathbf{I})\)
- For \(t = T, T-1, \ldots, 1\):
- Predict noise: \(\hat{\epsilon} = \epsilon_\theta(x_t, t)\)
- Compute mean: \(\mu_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon} \right)\)
- Sample: \(x_{t-1} = \mu_t + \sigma_t z\)
- Return \(x_0\)
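A compact sketch of this loop, assuming a noise-prediction `model(x, t)` and a `betas` schedule tensor; setting \(\sigma_t^2 = \beta_t\) is one standard choice:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, device="cpu"):
    """Ancestral DDPM sampling: start from pure noise and apply T denoising steps."""
    betas = betas.to(device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=device)                            # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)                                       # predicted noise
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            sigma = betas[t].sqrt()                                   # sigma_t^2 = beta_t
            x = mean + sigma * torch.randn_like(x)
        else:
            x = mean                                                  # no noise at the final step
    return x
```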
Characteristics:
- Stochastic: Introduces randomness at each step
- Slow: Requires \(T=1000\) neural network evaluations (seconds to minutes)
- High Quality: Excellent sample quality with sufficient steps
- 1000-2000× slower than single-pass generators like GANs
graph LR
A["x_T<br/>Pure Noise"] -->|"Denoise Step T"| B["x_{T-1}"]
B -->|"Denoise Step T-1"| C["x_{T-2}"]
C -->|"..."| D["x_t"]
D -->|"..."| E["x_1"]
E -->|"Final Denoise"| F["x_0<br/>Generated Image"]
style A fill:#ffccc9
style F fill:#c8e6c9
style D fill:#fff3e0
DDIM: Fast Deterministic Sampling¤
Paper: "Denoising Diffusion Implicit Models" (Song et al., 2021)
DDIM constructs non-Markovian forward processes sharing DDPM's marginals but enabling much larger reverse steps:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2}\, \epsilon_\theta(x_t, t) + \sigma_t z$$

When \(\sigma_t = 0\), sampling becomes fully deterministic.
Key Advantages:
- 10-50× speedup: Reduces from 1000 steps to 50-100 steps
- No retraining: Works with any pre-trained DDPM checkpoint
- Deterministic: When \(\eta = 0\), enables consistent reconstructions
- Interpolation: Meaningful latent space interpolation
Algorithm (Deterministic \(\sigma_t = 0\)):
- Sample \(x_T \sim \mathcal{N}(0, \mathbf{I})\)
- Choose subset of timesteps \(\{\tau_1, \tau_2, \ldots, \tau_S\}\) where \(S \ll T\)
- For \(i = S, S-1, \ldots, 1\):
- Predict \(x_0\): \(\hat{x}_0 = \frac{x_{\tau_i} - \sqrt{1-\bar{\alpha}_{\tau_i}} \epsilon_\theta(x_{\tau_i}, \tau_i)}{\sqrt{\bar{\alpha}_{\tau_i}}}\)
- Compute \(x_{\tau_{i-1}} = \sqrt{\bar{\alpha}_{\tau_{i-1}}} \hat{x}_0 + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}} \epsilon_\theta(x_{\tau_i}, \tau_i)\)
- Return \(x_0\)
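A deterministic (\(\eta = 0\)) DDIM sketch using the \(\hat{x}_0\) form above; `model` and `alpha_bars` are assumed as in the earlier sketches:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars, num_steps=50, device="cpu"):
    """Deterministic DDIM (eta = 0) over a strided subset of the training timesteps."""
    alpha_bars = alpha_bars.to(device)
    T = len(alpha_bars)
    taus = torch.linspace(T - 1, 0, num_steps).long()                 # e.g. 50 of 1000 steps
    x = torch.randn(shape, device=device)
    for i, t in enumerate(taus):
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = model(x, t_batch)
        a_t = alpha_bars[t]
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()           # predicted clean image
        a_prev = alpha_bars[taus[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps       # jump to the next timestep
    return x
```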
DDIM in Practice
DDIM became the standard inference method for production systems. Stable Diffusion defaults to 50 DDIM steps for a good quality/speed trade-off. Fewer steps (20-25) work for quick previews.
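In practice, most users reach for a library rather than a hand-rolled sampler. Here is an example with the Hugging Face Diffusers library referenced in the resources below; the model ID, prompt, and settings are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Load a pretrained latent-diffusion pipeline and swap in a DDIM scheduler
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    num_inference_steps=50,      # DDIM steps; 20-25 works for quick previews
    guidance_scale=7.5,          # classifier-free guidance scale
).images[0]
image.save("lighthouse.png")
```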
Advanced ODE Solvers¤
DPM-Solver (2022):
Treats diffusion sampling as solving ODEs with specialized numerical methods:
- Higher-order solver (order 2-3) with convergence guarantees
- Achieves FID 4.70 in 10 steps, 2.87 in 20 steps on CIFAR-10
- 4-16× speedup over previous samplers
DPM-Solver++ (2023):
Addresses instability with large classifier-free guidance scales:
- Uses data prediction with dynamic thresholding
- Performs well with 15-20 steps for guided sampling
- Better numerical stability than DPM-Solver
PNDM (Pseudo Numerical Methods):
Treats DDPMs as solving differential equations on manifolds:
- Generates higher quality with 50 steps than 1000-step DDIM
- 20× speedup with quality improvement
Consistency Models: One-Step Generation¤
Papers:
- "Consistency Models" (Song et al., 2023)
- "Improved Techniques for Training Consistency Models" (Song et al., 2023)
Paradigm shift: Learn a consistency function \(f\) that directly maps any point on a trajectory to its endpoint:

$$f_\theta(x_t, t) = x_\epsilon \quad \text{for all } t \in [\epsilon, T]$$

The self-consistency property:

$$f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t, t' \text{ on the same trajectory}$$
Consistency Distillation:
Train by distilling from a pre-trained diffusion model:

$$\mathcal{L}_{\text{CD}} = \mathbb{E}\!\left[ d\!\left( f_\theta(x_{t_{n+1}}, t_{n+1}),\ f_{\theta^-}\!\big(\hat{x}_{t_n}^\phi, t_n\big) \right) \right]$$

where \(\hat{x}_{t_n}^\phi\) is one step of an ODE solver from \(x_{t_{n+1}}\) using the teacher model, \(\theta^-\) is an EMA of the student weights, and \(d(\cdot,\cdot)\) is a distance metric (e.g., LPIPS).
Results:
- FID 3.55 on CIFAR-10 in one step
- Improved techniques: FID 2.51 in one step, 2.24 in two steps
- Consistency Trajectory Models (CTM): FID 1.73 in one step
- Zero-shot editing without task-specific training
Revolutionary Speed
Consistency models achieve 1000× speedup over DDPM while maintaining competitive quality. This makes diffusion viable for real-time applications.
Guidance Techniques¤
Classifier Guidance:
Modifies the score using gradients from a separately trained classifier \(p_\phi(c \mid x_t)\):

$$\tilde{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - w\, \sqrt{1-\bar{\alpha}_t}\, \nabla_{x_t} \log p_\phi(c \mid x_t)$$

where \(w\) is the guidance scale and \(c\) is the class label.

Advantages: State-of-the-art results when well-tuned.
Disadvantages: Requires training noise-aware classifiers at all noise levels.
Classifier-Free Guidance (Ho & Salimans, 2022):
Eliminates the classifier by jointly training conditional and unconditional models:

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)$$

During training, randomly drop the condition \(c\) with probability ~10%.
Advantages:
- No auxiliary classifier needed
- Often better quality than classifier guidance
- Single guidance scale \(w\) controls trade-off
- Industry standard: Used by DALL-E 2, Stable Diffusion, Midjourney, Imagen
Common Guidance Scales:
- \(w = 1.0\): No guidance (plain conditional prediction)
- \(w = 3-5\): Moderate guidance, balanced quality/diversity
- \(w = 7-8\): Standard guidance (Stable Diffusion default: 7.5)
- \(w = 10-15\): Strong guidance, high fidelity but lower diversity
- \(w > 20\): Over-guided, saturated colors, artifacts
graph LR
A["Conditional<br/>ε(x_t, c)"] --> C["Guidance<br/>Interpolation"]
B["Unconditional<br/>ε(x_t, ∅)"] --> C
C --> D["Guided Noise<br/>Prediction"]
D --> E["Denoising<br/>Step"]
style A fill:#e1f5ff
style B fill:#ffccc9
style D fill:#c8e6c9
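A sketch of the guidance computation, assuming a conditional model called as `model(x_t, t, emb)`; production code usually batches the conditional and unconditional passes into a single forward call:

```python
import torch

@torch.no_grad()
def cfg_noise_prediction(model, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: run the model twice and extrapolate away from unconditional."""
    eps_cond = model(x_t, t, cond_emb)          # conditional prediction eps(x_t, c)
    eps_uncond = model(x_t, t, uncond_emb)      # unconditional prediction eps(x_t, empty)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```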
Diffusion Model Variants¤
Latent Diffusion Models (LDM / Stable Diffusion)¤
Paper: "High-Resolution Image Synthesis with Latent Diffusion Models" (Rombach et al., 2022)
Revolutionary insight: Run diffusion in VAE latent space instead of pixel space.
Two-Stage Approach:
- Train autoencoder compressing images 8× (512×512×3 → 64×64×4)
- Run diffusion in compressed latent space
Architecture:
graph TB
A["Input Image<br/>512×512×3"] --> B["VAE Encoder"]
B --> C["Latent Space<br/>64×64×4<br/>(8× compression)"]
C --> D["Diffusion U-Net<br/>(with cross-attention)"]
E["Text Prompt"] --> F["CLIP/T5<br/>Encoder"]
F --> D
D --> G["Denoised Latent<br/>64×64×4"]
G --> H["VAE Decoder"]
H --> I["Generated Image<br/>512×512×3"]
style C fill:#fff3e0
style D fill:#e1f5ff
style I fill:#c8e6c9
Key Benefits:
- 2.7× training/inference speedup
- 1.6× FID improvement
- Massively reduced memory: ~10GB VRAM for 512×512 generation
- Cross-attention conditioning: Text embeddings guide generation
Stable Diffusion Implementation:
- 860M-parameter U-Net in latent space
- Trained on LAION-5B (5 billion text-image pairs)
- Open-source release democratized text-to-image generation
- Versions: 1.4, 1.5, 2.0, 2.1, SDXL (2.3B), SD3 (8B)
SDXL (2023):
- 2.3B parameters
- Dual text encoders (OpenCLIP + CLIP)
- Native 1024×1024 resolution
- Two-stage: base model + refiner
Stable Diffusion 3 (2024):
- Rectified Flow Transformer replacing U-Net
- Multimodal Diffusion Transformer
- Sizes: 800M, 2B, 8B parameters
- State-of-the-art text rendering in images
Impact
Latent diffusion made high-quality generation accessible. Stable Diffusion has 100M+ users and massive ecosystem of fine-tunes, LoRAs, and community tools.
Conditional Diffusion Models¤
Class-Conditional Generation:
Add the class label \(y\) as an additional conditioning input: \(\epsilon_\theta(x_t, t, y)\).
Typically implemented via:
- Class embeddings concatenated with time embeddings
- Conditional batch normalization: Modulate batch norm with class info
- Cross-attention: Attend to class token
Text-to-Image Models:
DALL-E 2 (OpenAI, 2022):
- Two-stage: CLIP prior diffusion + decoder
- CLIP prior maps text embeddings to image embeddings
- Decoder generates images from CLIP embeddings
- Up to 1024×1024 resolution
Imagen (Google, 2022):
- Frozen T5-XXL (4.6B parameters) as text encoder
- Cascaded diffusion: 64×64 → 256×256 → 1024×1024
- Key finding: Scaling text encoder improves quality more than scaling U-Net
- FID 7.27 on COCO (state-of-the-art at time)
- Not publicly released
Midjourney (2022-2024):
- Proprietary diffusion model
- Exceptional aesthetic quality
- Versions V1 → V6 (2024)
- ~15M users, multi-million dollar revenue
Cascade Diffusion Models¤
Generate through multiple resolution stages, each a separate diffusion model:
- Base model: Generate at 64×64
- Super-resolution 1: Upscale to 256×256
- Super-resolution 2: Upscale to 1024×1024
Advantages:
- Each stage focuses on different detail scales
- More efficient than single high-resolution model
- Better quality through specialized models
Disadvantages:
- Complexity of training multiple models
- Error accumulation across stages
Used in DALL-E 2 and Imagen.
Video Diffusion Models¤
Extend spatial generation to temporal dimension:
3D U-Net Architecture:
- Spatial convolutions → Spatio-temporal convolutions
- Process video as Time × Height × Width tensor
- Temporal attention captures motion
Sora (OpenAI, 2024):
- Diffusion Transformer on spacetime patches
- Videos as sequences of patches (like LLM tokens)
- Compressed into latent space via video VAE
- Up to 1 minute of 1080p video
- Variable aspect ratios
- Emergent physics understanding, object permanence
Sora 2 (2025):
- Synchronized audio generation
- Improved physics simulation
- Multi-shot video with persistent world state
- Instruction-following for complex scenes
Open-Source Video Models:
- CogVideoX: 2B and 5B parameter models
- HunyuanVideo: 7B and 14B parameters with scaling studies
- Stable Video Diffusion: Extension of Stable Diffusion
3D Diffusion Models¤
DreamFusion (Google, 2022):
Uses 2D text-to-image models as priors for 3D generation via Score Distillation Sampling (SDS):
- Initialize random 3D NeRF
- Render 2D views from random camera angles
- Apply noise and use Imagen to denoise
- Backpropagate through rendering to update NeRF
- Repeat
Key Innovation: No 3D training data required—leverages 2D diffusion models.
Capabilities:
- Text-to-3D generation
- Viewable from any angle
- Relightable
- Exportable as meshes
Stable-DreamFusion:
- Uses Stable Diffusion instead of Imagen
- Open-source implementation
- Enables text-to-3D and image-to-3D
Applications: Game assets, VR/AR content, product design, 3D scene reconstruction
Comparison with Other Generative Models¤
Diffusion vs. GANs¤
| Aspect | Diffusion Models | GANs |
|---|---|---|
| Sample Quality | State-of-the-art (FID 1.81-7.72) | High quality (FID ~2.97) |
| Training Stability | Very stable, straightforward MSE | Unstable, adversarial balancing |
| Mode Coverage | Excellent, likelihood-based | Prone to mode collapse |
| Inference Speed | Slow (25-1000 steps) | Fast (1 forward pass) |
| Training Ease | Forgiving hyperparameters | Requires careful tuning |
| Controllability | Excellent (guidance, editing) | Limited, no natural framework |
| Latent Space | No compact learned latent (noise trajectory acts as implicit latent) | Explicit latent \(z\), but no encoder |
Speed Comparison:
- GANs: ~0.01 seconds per image (single pass)
- DDPM: 10-60 seconds per image (1000 steps)
- DDIM: 1-5 seconds per image (50 steps)
- Consistency Models: 0.1-0.5 seconds per image (1-4 steps)
When to Use Each
Choose Diffusion: - Quality/diversity paramount - Training stability important - Denoising, super-resolution, inpainting tasks - Computational resources available
Choose GANs: - Real-time generation required - Single-pass critical (e.g., video style transfer) - Limited inference compute - Interactive applications
Diffusion vs. VAEs¤
| Aspect | Diffusion Models | VAEs |
|---|---|---|
| Sample Quality | Sharp, high-fidelity | Often blurry from MSE loss |
| Latent Space | No explicit low-dim latent | Explicit interpretable latent |
| Likelihood | Tractable via ODE | Explicit ELBO lower bound |
| Training | Straightforward, stable | Straightforward, very stable |
| Speed | Slow (multi-step) | Fast (single pass) |
| Representation | Implicit in noise trajectory | Explicit learned encoding |
Hybrid Approach: Latent Diffusion¤
Combines strengths of both:
- VAE: Compresses images to latent space (8× reduction)
- Diffusion: Operates in compressed latent space
Results: 2.7× speedup, better quality, significantly reduced memory.
Diffusion vs. Autoregressive Models¤
| Aspect | Diffusion | Autoregressive |
|---|---|---|
| Speed | Faster for long sequences (>256 tokens) | Faster for short (<100 tokens) |
| Quality | Best for images/audio/video | Best for text/language |
| Parallelization | Fully parallel generation | Sequential (can't parallelize) |
| Controllability | Excellent (edit any step) | Limited (left-to-right) |
| Variable Length | Fixed-length outputs | Natural variable-length |
Complementary Strengths:
- Diffusion: Images, audio, video, fixed-length sequences
- Autoregressive: Text, language, variable-length, sequential coherence
Hybrid Models:
HART (Hybrid Autoregressive Transformer):
- Autoregressive for coarse structure
- Diffusion for fine details
- 9× faster than pure diffusion
- 31% less compute
Advanced Topics¤
Flow Matching and Rectified Flow¤
Flow Matching (2022):
Trains Continuous Normalizing Flows by regressing vector fields:
- Simulation-free training—no ODE solving during training
- More stable dynamics than traditional diffusion
- Compatible with general Gaussian probability paths
- State-of-the-art ImageNet results
Rectified Flow (2023):
Learns ODEs following straight paths between noise and data:

$$\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad x_t = t\, x_1 + (1-t)\, x_0, \qquad v_\theta(x_t, t) \approx x_1 - x_0$$

where trajectories are straight lines from noise \(x_0\) to data \(x_1\).
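A sketch of the rectified-flow training objective under these assumptions (time sampled uniformly, noise as \(x_0\), data as \(x_1\)):

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x_data: torch.Tensor) -> torch.Tensor:
    """Regress the constant velocity of straight noise-to-data paths."""
    x1 = x_data                                   # data endpoint
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    x_t = t * x1 + (1 - t) * x0                   # point on the straight-line path
    target_v = x1 - x0                            # straight-path velocity is constant
    return F.mse_loss(model(x_t, t.flatten()), target_v)
```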
Reflow Operation: Iteratively straightens trajectories:
- Sample pairs \((x_0, x_1)\) from data and noise
- Train model on straight paths
- Use trained model to generate new paired data
- Retrain on straightened paths
- Repeat
Results:
- Can achieve high quality in 1-2 steps after reflow
- InstaFlow: Generates images in 0.12 seconds
- Used in Stable Diffusion 3 as core architecture
EDM: Elucidating Design Spaces¤
Papers:
- "Elucidating the Design Space" (Karras et al., 2022)
- "Analyzing and Improving Training Dynamics" (Karras et al., 2024)
EDM provides unified framework separating design choices:
Optimal Preconditioning:
Normalizing the network's inputs and outputs for better-conditioned training:

$$D_\theta(x, \sigma) = c_{\text{skip}}(\sigma)\, x + c_{\text{out}}(\sigma)\, F_\theta\!\left( c_{\text{in}}(\sigma)\, x,\ c_{\text{noise}}(\sigma) \right)$$

Karras Noise Schedule:

$$\sigma_i = \left( \sigma_{\max}^{1/\rho} + \frac{i}{N-1}\left( \sigma_{\min}^{1/\rho} - \sigma_{\max}^{1/\rho} \right) \right)^{\rho}, \qquad \rho = 7$$

Widely adopted in practice, superior to linear/cosine.
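A sketch of the schedule computation; the \(\sigma_{\min}/\sigma_{\max}\) defaults are the commonly used EDM values for images and should be treated as assumptions:

```python
import torch

def karras_sigmas(n: int, sigma_min: float = 0.002, sigma_max: float = 80.0, rho: float = 7.0):
    """Karras et al. (2022) sampling noise levels, spaced evenly in sigma^(1/rho)."""
    ramp = torch.linspace(0, 1, n)
    return (sigma_max ** (1 / rho) + ramp * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
```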
Results: FID 1.79 on CIFAR-10 (state-of-the-art)
EDM2 (2024):
- Analyzed training dynamics at scale
- Redesigned architecture with magnitude preservation
- Post-hoc EMA: Set parameters after training without retraining
- FID 1.81 on ImageNet-512 (previous record: 2.41)
Distillation for Fast Sampling¤
Progressive Distillation (Salimans & Ho, 2022):
Iteratively halves the number of steps:
- Start with 8192-step teacher
- Train 4096-step student to match
- Train 2048-step student from previous student
- Continue halving to 4 steps
Results: FID 3.0 on CIFAR-10 in 4 steps
Distribution Matching Distillation:
- One-step FID 2.62 on ImageNet
- FID 11.49 on MS-COCO
- 20 FPS generation
Simple and Fast Distillation (NeurIPS 2024):
- 1000× faster fine-tuning
- FID 4.53 in 2 steps with only 0.64 hours on A100
Diffusion Transformers (DiT)¤
Paper: "Scalable Diffusion Models with Transformers" (Peebles & Xie, 2023)
Replaces U-Net with transformer architecture:
Architecture:
- Images as patch sequences (like ViT)
- Standard transformer blocks instead of U-Net
- Adaptive Layer Norm (adaLN-Zero) for conditioning: the timestep and class embeddings regress per-block scale and shift parameters, with the residual branch's output scaling initialized to zero
Key Finding: Compute (Gflops) drives performance, not specific architecture.
Results:
- DiT-XL/2: FID 2.27 on ImageNet 256×256
- Scales better than U-Nets to billions of parameters
- Influenced Sora and Stable Diffusion 3
U-ViT (Bao et al., 2023):
Combines ViT with U-Net principles:
- Token-based processing
- Long skip connections between encoder-decoder (crucial!)
- FID 2.29 on ImageNet 256×256
Production Considerations¤
Deployment Challenges¤
Model Size:
- DDPM (256×256): ~200-500MB
- Stable Diffusion 1.5: 860M parameters (~3.5GB)
- SDXL: 2.3B parameters (~6.9GB)
- SD3: 800M to 8B parameters
Optimization Strategies:
- Model Pruning: Remove 30-50% weights with <5% quality loss
- Quantization: INT8/FP16 reduces size 2-4× with minimal quality loss
- Knowledge Distillation: Train smaller student model
Inference Optimization:
- ONNX Runtime: 10-30% speedup
- TensorRT: 2-5× speedup on NVIDIA GPUs
- torch.compile(): 10-30% speedup with PyTorch 2.0+
- Flash Attention: 2-3× speedup for attention layers
Hardware Requirements:
- Inference: Minimum RTX 3060 (12GB) for 512×512
- Recommended: RTX 4090 (24GB) or professional GPUs
- Edge Deployment: Optimized models run on mobile (SD Turbo, LCM)
Monitoring and Quality Control¤
Quality Drift:
Monitor generated samples over time for:
- Artifacts or distortions
- Color shifts
- Mode collapse
- Prompt adherence degradation
Metrics:
- FID: Track on validation set every N samples
- CLIP Score: For text-to-image, measure alignment
- Human Evaluation: A/B tests for subjective quality
- Diversity Metrics: Ensure mode coverage
A/B Testing:
Compare model versions using:
- FID/IS on held-out data
- Human preference studies (typically 1000+ comparisons)
- Production metrics (engagement, retention, quality reports)
Ethical Considerations¤
Responsible Deployment
Deepfakes and Misinformation:
Diffusion models enable photorealistic fake images/videos. Mitigation strategies:
- Watermarking generated content
- Provenance tracking (C2PA metadata)
- Detection models for synthetic content
- Usage policies and terms of service
Bias and Fairness:
Models inherit biases from training data (LAION-5B, etc.):
- Underrepresentation of minorities
- Stereotypical associations
- Geographic/cultural biases
Mitigation:
- Balanced training data curation
- Bias evaluation across demographics
- Red-teaming for harmful generations
Copyright and Attribution:
Training on copyrighted images raises questions:
- Fair use vs. infringement debates ongoing
- Artist consent and compensation
- Attribution for training data
Best practices:
- Respect opt-out requests (Have I Been Trained)
- Consider ethical training data sources
- Transparent documentation of training data
Environmental Impact:
Large-scale training requires massive compute:
- ImageNet training: 1000s of GPU-hours
- Stable Diffusion: ~150,000 A100-hours
- SDXL: ~500,000 A100-hours
Mitigation:
- Efficient architectures (Latent Diffusion)
- Distillation for deployment
- Carbon-aware training scheduling
- Renewable energy for data centers
Safety Filters and Content Moderation¤
Safety Classifiers:
Pre-deployment filters to prevent harmful content:
- NSFW Detection: Classify unsafe content
- Violence Detection: Flag graphic violence
- Hate Symbol Detection: Block extremist imagery
Prompt Filtering:
- Block harmful prompt patterns
- Detect adversarial prompts
- Rate limiting for abuse prevention
Post-Generation Filtering:
- Run safety classifier on outputs
- Block unsafe images before showing user
- Log violations for monitoring
Summary and Key Takeaways¤
Diffusion models have revolutionized generative AI through an elegant approach: learning to reverse a gradual noising process. By systematically destroying data structure through fixed forward diffusion and learning the reverse process through neural networks, these models achieve state-of-the-art quality with remarkable training stability.
Core Principles:
- Forward diffusion gradually corrupts data into pure noise over \(T\) timesteps
- Reverse diffusion learns to progressively denoise, reconstructing data from noise
- Training objective reduces to simple MSE between predicted and actual noise
- Sampling iteratively applies learned denoising, refining noise into data
Key Variants:
- DDPM: Foundational stochastic sampling with 1000 steps
- DDIM: Deterministic fast sampling reducing to 50-100 steps
- Latent Diffusion: Operates in VAE latent space for 2.7× speedup (Stable Diffusion)
- Consistency Models: One-step generation achieving 1000× speedup
- Rectified Flow: Straight-line trajectories for efficient sampling (SD3)
Architecture Innovations:
- U-Net backbone with skip connections preserving spatial details
- Sinusoidal time embeddings with FiLM conditioning
- Cross-attention enabling text-to-image and conditional generation
- Diffusion Transformers (DiT) scaling to billions of parameters
Training Best Practices:
- Use cosine noise schedule for better dynamics
- Apply EMA with decay 0.9999—critical for quality
- Gradient clipping and mixed precision for stability
- Monitor visual samples not just loss curves
- Min-SNR weighting for 3.4× faster convergence
Sampling Methods:
- DDPM: 1000 steps, highest quality, slowest
- DDIM: 50-100 steps, 10-20× speedup, deterministic
- DPM-Solver: 10-20 steps with ODE solvers
- Consistency Models: 1-4 steps, near real-time
Guidance Techniques:
- Classifier-free guidance as industry standard
- Guidance scale \(w=7-8\) balances quality and diversity
- Higher \(w\) increases fidelity, reduces diversity
When to Use Diffusion Models:
- Quality and diversity are paramount
- Denoising, super-resolution, inpainting, editing tasks
- Text-to-image, text-to-video generation
- Scientific applications (protein design, drug discovery)
- Training stability more important than inference speed
Current Landscape (2025):
- Diffusion dominates high-quality image generation
- Speed improvements enable near real-time generation
- Multimodal models handle text, image, video, 3D, audio
- Open-source ecosystem (Stable Diffusion) democratizes access
- Proprietary leaders (Midjourney, DALL-E) push quality boundaries
Future Directions:
- Further sampling efficiency improvements
- Unified architectures for multiple modalities
- Better theoretical understanding of training dynamics
- Edge deployment and mobile generation
- Scientific applications in biology, chemistry, materials science
Diffusion models represent a paradigm shift in generative modeling, offering unmatched quality, stability, and controllability. As architectures scale and sampling becomes more efficient, they will likely remain dominant for visual generation while expanding into new modalities and applications.
Next Steps¤
- Practical usage guide with implementation examples and training workflows
- Complete API documentation for DDPM, DDIM, Latent Diffusion, and variants
- Step-by-step hands-on tutorial: train a diffusion model on MNIST from scratch
- Explore Stable Diffusion, video diffusion, and state-of-the-art architectures
Further Reading¤
Seminal Papers (Must Read)¤
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics"
arXiv:1503.03585 | ICML 2015
The foundational paper introducing diffusion probabilistic models
Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models"
arXiv:2006.11239 | NeurIPS 2020
DDPM: Made diffusion models practical with simplified training objective
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations"
arXiv:2011.13456 | ICLR 2021 Outstanding Paper
Unified framework via SDEs, probability flow ODEs
Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models"
arXiv:2010.02502 | ICLR 2021
DDIM: Fast deterministic sampling without retraining
Nichol, A., & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models"
arXiv:2102.09672 | ICML 2021
Learned variances, cosine schedule, hybrid objective
Dhariwal, P., & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis"
arXiv:2105.05233 | NeurIPS 2021
Showed diffusion superiority, introduced classifier guidance
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models"
arXiv:2112.10752 | CVPR 2022
Latent diffusion / Stable Diffusion: 2.7× speedup in VAE latent space
Tutorial Papers and Surveys¤
Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., ... & Cui, B. (2023). "Diffusion Models: A Comprehensive Survey of Methods and Applications"
arXiv:2209.00796
Comprehensive 150+ page survey covering theory and applications
Cao, H., Tan, C., Gao, Z., Chen, G., Heng, P. A., & Li, S. Z. (2023). "A Survey on Generative Diffusion Models"
arXiv:2209.02646
Covers mathematical foundations, applications, and future directions
Luo, C. (2022). "Understanding Diffusion Models: A Unified Perspective"
arXiv:2208.11970
Excellent tutorial connecting ELBO, score matching, and SDEs
Important Variants and Extensions¤
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents"
arXiv:2204.06125
DALL-E 2: CLIP prior diffusion for text-to-image
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., ... & Norouzi, M. (2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding"
arXiv:2205.11487 | NeurIPS 2022
Imagen: T5 text encoder with cascaded diffusion
Ho, J., & Salimans, T. (2022). "Classifier-Free Diffusion Guidance"
arXiv:2207.12598
Industry-standard guidance without auxiliary classifiers
Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., ... & Rombach, R. (2023). "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis"
arXiv:2307.01952
SDXL: 2.3B parameter upgrade to Stable Diffusion
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., ... & Rombach, R. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis"
arXiv:2403.03206
Stable Diffusion 3: Multimodal diffusion transformer
Sampling and Acceleration¤
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps"
arXiv:2206.00927 | NeurIPS 2022
10-20 step high-quality sampling via ODE solvers
Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2023). "DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models"
arXiv:2211.01095
Improved stability for classifier-free guidance
Liu, L., Ren, Y., Lin, Z., & Zhao, Z. (2022). "Pseudo Numerical Methods for Diffusion Models on Manifolds"
arXiv:2202.09778 | ICLR 2022
PNDM: 20× speedup with quality improvement
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). "Consistency Models"
arXiv:2303.01469 | ICML 2023
One-step generation via consistency distillation
Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). "Improved Techniques for Training Consistency Models"
arXiv:2310.14189 | ICLR 2024 Oral
FID 2.51 in one step, 2.24 in two steps
Salimans, T., & Ho, J. (2022). "Progressive Distillation for Fast Sampling of Diffusion Models"
arXiv:2202.00512 | ICLR 2022
Iteratively halve sampling steps through distillation
Architecture Innovations¤
Peebles, W., & Xie, S. (2023). "Scalable Diffusion Models with Transformers"
arXiv:2212.09748 | ICCV 2023
DiT: Replaces U-Net with transformer, scales to billions of parameters
Bao, F., Nie, S., Xue, K., Cao, Y., Li, C., Su, H., & Zhu, J. (2023). "All are Worth Words: A ViT Backbone for Diffusion Models"
arXiv:2209.12152 | CVPR 2023
U-ViT: Combines ViT with U-Net skip connections
Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). "Elucidating the Design Space of Diffusion-Based Generative Models"
arXiv:2206.00364 | NeurIPS 2022
EDM: Unified framework, optimal preconditioning, FID 1.79
Karras, T., Aittala, M., Lehtinen, J., Hellsten, J., Aila, T., & Laine, S. (2024). "Analyzing and Improving the Training Dynamics of Diffusion Models"
arXiv:2312.02696 | CVPR 2024 Oral
EDM2: FID 1.81 on ImageNet-512, post-hoc EMA
Flow Matching and Optimal Transport¤
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). "Flow Matching for Generative Modeling"
arXiv:2210.02747 | ICLR 2023
Simulation-free training of continuous normalizing flows
Liu, X., Gong, C., & Liu, Q. (2023). "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow"
arXiv:2209.03003 | ICLR 2023
Straight-line ODE paths, reflow operation
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., ... & Rombach, R. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis"
arXiv:2403.03206
Stable Diffusion 3 technical report
Video and 3D Generation¤
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., ... & Salimans, T. (2022). "Imagen Video: High Definition Video Generation with Diffusion Models"
arXiv:2210.02303
Cascaded video diffusion with temporal attention
Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., ... & Ramesh, A. (2024). "Video Generation Models as World Simulators"
OpenAI Technical Report
Sora: Diffusion transformer on spacetime patches
Poole, B., Jain, A., Barron, J. T., & Mildenhall, B. (2022). "DreamFusion: Text-to-3D using 2D Diffusion"
arXiv:2209.14988
Score Distillation Sampling for 3D generation
Lin, C.-H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., ... & Fidler, S. (2023). "Magic3D: High-Resolution Text-to-3D Content Creation"
arXiv:2211.10440 | CVPR 2023
Two-stage coarse-to-fine 3D generation
Scientific Applications¤
Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., ... & Baker, D. (2023). "De novo design of protein structure and function with RFdiffusion"
Nature 620, 1089–1100
Protein design via diffusion achieving experimental validation
Corso, G., Stärk, H., Jing, B., Barzilay, R., & Jaakkola, T. (2022). "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
arXiv:2210.01776 | ICLR 2023
Molecular docking as generative modeling
Hoogeboom, E., Satorras, V. G., Vignac, C., & Welling, M. (2022). "Equivariant Diffusion for Molecule Generation in 3D"
arXiv:2203.17003 | ICML 2022
E(3)-equivariant diffusion for drug molecule design
Image Editing and Control¤
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Van Gool, L. (2022). "RePaint: Inpainting using Denoising Diffusion Probabilistic Models"
arXiv:2201.09865 | CVPR 2022
Mask-agnostic inpainting with pretrained models
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J. Y., & Ermon, S. (2022). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations"
arXiv:2108.01073 | ICLR 2022
Edit images via noise addition and denoising
Brooks, T., Holynski, A., & Efros, A. A. (2023). "InstructPix2Pix: Learning to Follow Image Editing Instructions"
arXiv:2211.09800 | CVPR 2023
Edit images from natural language instructions
Zhang, L., Rao, A., & Agrawala, M. (2023). "Adding Conditional Control to Text-to-Image Diffusion Models"
arXiv:2302.05543 | ICCV 2023
ControlNet: Spatial conditioning with edges, depth, pose
Online Resources and Code¤
Lilian Weng's Blog: "What are Diffusion Models?"
lilianweng.github.io/posts/2021-07-11-diffusion-models
Comprehensive blog post with excellent visualizations and intuitions
Yang Song's Blog: "Generative Modeling by Estimating Gradients of the Data Distribution"
yang-song.net/blog/2021/score
Deep dive into score-based models and SDEs
Hugging Face Diffusers Library
github.com/huggingface/diffusers
Production-ready implementations: DDPM, DDIM, Stable Diffusion, ControlNet
Stability AI: Stable Diffusion Official Repository
github.com/Stability-AI/stablediffusion
Official implementation of Stable Diffusion models
CompVis: Latent Diffusion Models
github.com/CompVis/latent-diffusion
Original latent diffusion implementation
Denoising Diffusion PyTorch
github.com/lucidrains/denoising-diffusion-pytorch
Clean, well-documented PyTorch implementations
Books and Comprehensive Tutorials¤
Prince, S. J. D. (2023). "Understanding Deep Learning"
Chapter on Diffusion Models | udlbook.github.io/udlbook
Excellent pedagogical treatment with visualizations
Murphy, K. P. (2023). "Probabilistic Machine Learning: Advanced Topics"
Chapter on Score-Based and Diffusion Models | MIT Press
Rigorous mathematical treatment
Hugging Face Diffusion Models Course
huggingface.co/learn/diffusion-course
Hands-on tutorials from basics to advanced topics
Ready to build with diffusion models? Start with the Diffusion User Guide for practical implementations, check the API Reference for complete documentation, or dive into the MNIST Tutorial to train your first diffusion model from scratch!