Normalizing Flows Explained¤
- Exact Likelihood: Compute exact log-likelihood through tractable Jacobian determinants, enabling precise density estimation
- Bijective Transformations: Invertible mappings allow both efficient sampling and exact inference through forward and inverse passes
- Flexible Distributions: Transform simple base distributions into complex target distributions through learned compositions
- Fast Generation: Single-pass or few-step sampling with modern architectures achieving real-time performance
Overview¤
Normalizing flows have emerged as a uniquely powerful class of generative models that provide exact likelihood computation and efficient sampling through invertible transformations. Unlike VAEs that optimize approximate lower bounds or GANs that learn implicit distributions, flows transform simple base distributions into complex data distributions via learned bijective mappings with tractable Jacobian determinants.
What makes normalizing flows special? Flows solve a fundamental challenge in generative modeling: simultaneously enabling precise density estimation and efficient sampling. By learning invertible transformations with structured Jacobians, flows:
- Compute exact likelihood for any data point without approximation
- Generate samples through fast inverse transformations
- Perform exact inference without variational bounds or adversarial training
- Train stably using straightforward maximum likelihood objectives
Recent breakthroughs in 2023-2025—including flow matching, rectified flows, and discrete flow variants—have dramatically closed the performance gap with diffusion models while maintaining the core advantages of one-step generation and stable training.
The Intuition: Probability Transformations¤
Think of normalizing flows like a sequence of coordinate transformations on a map:
- Start with simple terrain (base distribution): a flat, uniform grid easy to sample from
- Apply transformations: each step warps, stretches, and reshapes the terrain while maintaining a perfect one-to-one correspondence between original and transformed coordinates
- Track volume changes: the Jacobian determinant measures how much each region expands or contracts, ensuring probability mass is conserved
- Compose transformations: stack multiple simple warps to create arbitrarily complex landscapes (data distributions)
The critical insight: by carefully designing transformations where we can efficiently compute both the forward mapping and the volume change, we get a model that can both generate samples (apply the transformation) and evaluate probabilities (apply the inverse and account for volume changes).
Mathematical Foundation¤
The Change of Variables Formula¤
The change of variables formula serves as the cornerstone of all normalizing flow architectures. Given a random variable \(\mathbf{z}\) with known density \(p_\mathcal{Z}(\mathbf{z})\) and an invertible transformation \(\mathbf{x} = f(\mathbf{z})\), the density of \(\mathbf{x}\) becomes:

$$p_\mathcal{X}(\mathbf{x}) = p_\mathcal{Z}\big(f^{-1}(\mathbf{x})\big) \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right| = p_\mathcal{Z}(\mathbf{z}) \left| \det \frac{\partial f}{\partial \mathbf{z}} \right|^{-1}$$

Or equivalently in log space:

$$\log p_\mathcal{X}(\mathbf{x}) = \log p_\mathcal{Z}(\mathbf{z}) - \log \left| \det \frac{\partial f}{\partial \mathbf{z}} \right|$$

where \(\mathbf{z} = f^{-1}(\mathbf{x})\).
Geometric Intuition
The Jacobian determinant \(\left| \det \frac{\partial f}{\partial \mathbf{z}} \right|\) quantifies the relative change in volume of an infinitesimal neighborhood under transformation \(f\). When the transformation expands a region (\(|\det J| > 1\)), the probability density must decrease proportionally to conserve total probability mass. Conversely, contraction (\(|\det J| < 1\)) concentrates probability, increasing density.
For \(D\)-dimensional vectors, the Jacobian matrix \(J_f(\mathbf{z})\) is the \(D \times D\) matrix of partial derivatives \([\frac{\partial f_i}{\partial z_j}]\). Computing a general determinant requires \(O(D^3)\) operations, which becomes intractable for high-dimensional data like 256×256 RGB images with \(D = 196{,}608\) dimensions.
The entire field of normalizing flows revolves around designing transformations with structured Jacobians—triangular, diagonal, or block-structured matrices where determinants reduce to \(O(D)\) computations.
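To make the formula concrete, here is a minimal 1-D sanity check in PyTorch (matching the snippets used later on this page); the affine transform and the specific numbers are purely illustrative:

```python
import torch
from torch.distributions import Normal

# x = a*z + b with z ~ N(0, 1): compare the flow-style log-density
# log p(x) = log p_Z(z) - log|a| against the closed-form N(b, a^2) density.
a, b = 2.0, 1.0
base = Normal(0.0, 1.0)

x = torch.tensor(3.0)
z = (x - b) / a                                # z = f^{-1}(x)
log_det = torch.log(torch.tensor(abs(a)))      # log|det df/dz| = log|a|

log_px_flow = base.log_prob(z) - log_det
log_px_true = Normal(b, abs(a)).log_prob(x)
print(log_px_flow.item(), log_px_true.item())  # agree up to floating-point error
```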
Composing Multiple Transformations¤
A single invertible transformation typically provides limited modeling capacity. The power of flows emerges through composition: stacking \(K\) transformations:

$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z})$$

The log-likelihood decomposes additively:

$$\log p_\mathcal{X}(\mathbf{x}) = \log p_\mathcal{Z}(\mathbf{z}_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right|$$

where \(\mathbf{z}_0 = \mathbf{z}\) and \(\mathbf{z}_k = f_k(\mathbf{z}_{k-1})\) for \(k=1,\ldots,K\).
```mermaid
graph LR
Z0["z₀<br/>(Base)"] --> F1["f₁"]
F1 --> Z1["z₁"]
Z1 --> F2["f₂"]
F2 --> Z2["z₂"]
Z2 --> Dots["..."]
Dots --> FK["f_K"]
FK --> X["x<br/>(Data)"]
F1 -.->|"log|det J₁|"| LogDet1["Σ log-det"]
F2 -.->|"log|det J₂|"| LogDet1
FK -.->|"log|det J_K|"| LogDet1
style Z0 fill:#e1f5ff
style X fill:#ffe1e1
style LogDet1 fill:#fff3cd
```
Additive Structure in Log-Space
The chain rule for Jacobians states \(\det J_{f_2 \circ f_1}(\mathbf{u}) = \det J_{f_2}(f_1(\mathbf{u})) \cdot \det J_{f_1}(\mathbf{u})\), so log-determinants simply add: \(\log|\det J_\text{total}| = \sum_k \log|\det J_k|\). This ensures numerical stability and makes total computational cost \(O(KD)\) when each layer has \(O(D)\) Jacobian computation.
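As a sketch of this bookkeeping, assuming a hypothetical layer interface in which each flow layer returns its output together with its per-sample log-determinant:

```python
import torch
import torch.nn as nn

class ComposedFlow(nn.Module):
    """Stack flow layers; log-determinants add across the composition."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, z):
        log_det_total = torch.zeros(z.shape[0], device=z.device)
        for layer in self.layers:        # z_k = f_k(z_{k-1})
            z, log_det = layer(z)
            log_det_total = log_det_total + log_det
        return z, log_det_total          # x and sum_k log|det J_k|
```

The coupling layers sketched in the architecture sections below fit this interface.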
Three Requirements for Flow Layers¤
For a transformation \(f\) to be a valid flow layer, it must satisfy:
- Invertibility: \(f\) must be bijective (one-to-one and onto)
- Efficient Jacobian: \(\log \left| \det \frac{\partial f}{\partial \mathbf{z}} \right|\) must be tractable to compute
- Efficient Inverse: \(f^{-1}\) must be computable efficiently (for sampling)
Different flow architectures make different trade-offs among these requirements.
Base Distribution¤
The base distribution \(p_\mathcal{Z}(\mathbf{z})\) is typically chosen to be simple for efficient sampling:
Standard Gaussian (most common):

$$p_\mathcal{Z}(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I}) = (2\pi)^{-D/2} \exp\left(-\tfrac{1}{2}\|\mathbf{z}\|^2\right)$$

Uniform (less common):

$$p_\mathcal{Z}(\mathbf{z}) = \prod_{i=1}^{D} \mathcal{U}(z_i;\, 0, 1)$$
Flow Model Architectures¤
Workshop provides implementations of several state-of-the-art flow architectures, each with different trade-offs between expressiveness, computational efficiency, and ease of use.
1. NICE: Pioneering Coupling Layers¤
NICE (Non-linear Independent Components Estimation) introduced additive coupling layers that made normalizing flows practical for high-dimensional data.
Coupling Layer Mechanism:
Given input \(\mathbf{x} \in \mathbb{R}^D\), partition it into \((\mathbf{x}_1, \mathbf{x}_2)\) and apply:

$$\mathbf{y}_1 = \mathbf{x}_1, \qquad \mathbf{y}_2 = \mathbf{x}_2 + m(\mathbf{x}_1)$$

where \(m\) can be an arbitrary function (typically a neural network).
Key Properties:
- Volume-preserving: \(\log|\det(\mathbf{J})| = 0\) (determinant is exactly 1)
- Efficient inverse: \(\mathbf{x}_1 = \mathbf{y}_1\), \(\mathbf{x}_2 = \mathbf{y}_2 - m(\mathbf{y}_1)\)
- No Jacobian computation: The triangular structure makes the determinant trivial
- Arbitrary coupling function: \(m\) can be arbitrarily complex without affecting computational cost
```mermaid
graph TB
X["Input x"] --> Split["Partition<br/>(x₁, x₂)"]
Split --> X1["x₁<br/>(unchanged)"]
Split --> X2["x₂"]
X1 --> NN["Neural Network<br/>m(x₁)"]
NN --> Add["y₂ = x₂ + m(x₁)"]
X2 --> Add
X1 --> Concat["Concatenate"]
Add --> Concat
Concat --> Y["Output y"]
style X fill:#e1f5ff
style Y fill:#ffe1e1
style NN fill:#fff3cd
```
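A minimal additive coupling layer, sketched in PyTorch; the half-half split and the MLP used for \(m\) are illustrative choices rather than the original NICE configuration:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.d = dim // 2
        self.m = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, dim - self.d),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        y2 = x2 + self.m(x1)                                # y1 = x1, y2 = x2 + m(x1)
        log_det = torch.zeros(x.shape[0], device=x.device)  # volume-preserving
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        return torch.cat([y1, y2 - self.m(y1)], dim=1)

layer = AdditiveCoupling(dim=4)
x = torch.randn(2, 4)
y, _ = layer(x)
print(torch.allclose(layer.inverse(y), x, atol=1e-6))       # True
```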
When to Use NICE:
- Fast forward and inverse computations required
- Volume-preserving transformations are acceptable
- Starting point for understanding coupling layers
- Lower-dimensional problems (hundreds of dimensions)
2. RealNVP: Adding Scale for Expressiveness¤
RealNVP (Real-valued Non-Volume Preserving) extends NICE with affine coupling layers:

$$\mathbf{y}_1 = \mathbf{x}_1, \qquad \mathbf{y}_2 = \mathbf{x}_2 \odot \exp\big(s(\mathbf{x}_1)\big) + t(\mathbf{x}_1)$$

where \(s(\cdot)\) and \(t(\cdot)\) are neural networks outputting scale and translation, and \(\odot\) denotes element-wise multiplication.
Key Properties:
- Tractable Jacobian: \(\log|\det(\mathbf{J})| = \sum_i s_i(\mathbf{x}_1)\)
- Efficient inverse: \(\mathbf{x}_1 = \mathbf{y}_1\) and \(\mathbf{x}_2 = (\mathbf{y}_2 - t(\mathbf{y}_1)) \odot \exp(-s(\mathbf{y}_1))\)
- Alternating masks: Alternate which dimensions are transformed across layers
- Unrestricted networks: \(s\) and \(t\) are never inverted and need no tractable Jacobian of their own, so they can be arbitrarily complex ResNets
```mermaid
graph TB
X["Input x"] --> Split["Split<br/>(x₁, x₂)"]
Split --> X1["x₁<br/>(unchanged)"]
Split --> X2["x₂"]
X1 --> NN["Neural Networks<br/>s(x₁), t(x₁)"]
NN --> Scale["exp(s)"]
NN --> Trans["t"]
X2 --> Mult["⊙"]
Scale --> Mult
Mult --> Add["+ t"]
Trans --> Add
Add --> Y2["y₂"]
X1 --> Concat["Concatenate"]
Y2 --> Concat
Concat --> Y["Output y"]
style X fill:#e1f5ff
style Y fill:#ffe1e1
style NN fill:#fff3cd
```
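An affine coupling sketch along the same lines (illustrative half-half split; the tanh bound on the scale is a common stabilization trick rather than part of the original formulation):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                          # keep scales in a safe range
        y2 = x2 * torch.exp(s) + t                 # y2 = x2 * exp(s) + t
        return torch.cat([x1, y2], dim=1), s.sum(dim=1)   # log|det J| = sum_i s_i

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        s = torch.tanh(s)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=1)
```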
Multi-Scale Architecture:
RealNVP introduced hierarchical structure that revolutionized flow-based modeling:
- Squeeze operation: Reshape \(s \times s \times c\) tensors into \(\frac{s}{2} \times \frac{s}{2} \times 4c\)
- Factor out: After several coupling layers, factor out half the channels to the prior
- Continue processing: Transform remaining channels at higher resolution
This enables modeling 256×256 images by avoiding the prohibitive cost of applying dozens of layers to all 196,608 dimensions simultaneously.
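A sketch of the squeeze operation for NCHW tensors (the shape convention is assumed):

```python
import torch

def squeeze(x):
    # Fold each 2x2 spatial block into channels: (N, C, H, W) -> (N, 4C, H/2, W/2)
    n, c, h, w = x.shape
    x = x.view(n, c, h // 2, 2, w // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, 4 * c, h // 2, w // 2)

x = torch.randn(8, 3, 32, 32)
print(squeeze(x).shape)  # torch.Size([8, 12, 16, 16])
```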
When to Use RealNVP:
- Need both fast sampling and density estimation
- Working with continuous data, especially images
- Image generation tasks at moderate to high resolution
- Moderate-dimensional data (hundreds to thousands of dimensions)
3. Glow: Learnable Permutations¤
Glow extends RealNVP with three key innovations that pushed flows to state-of-the-art density estimation:
Glow Block Architecture:
Each flow step combines three layers:
```mermaid
graph TB
X["Input"] --> AN["ActNorm<br/>(Activation Normalization)"]
AN --> Conv["Invertible 1×1 Conv<br/>(Channel Mixing)"]
Conv --> Coup["Affine Coupling Layer<br/>(Transformation)"]
Coup --> Y["Output"]
style X fill:#e1f5ff
style Y fill:#ffe1e1
style AN fill:#d4edda
style Conv fill:#d1ecf1
style Coup fill:#fff3cd
```
1. ActNorm (Activation Normalization):
Per-channel affine transformation with trainable scale \(\mathbf{s}\) and bias \(\mathbf{b}\), applied identically at every spatial position \((i, j)\):

$$\mathbf{y}_{i,j} = \mathbf{s} \odot \mathbf{x}_{i,j} + \mathbf{b}$$
- Data-dependent initialization: normalize first minibatch to zero mean, unit variance
- Enables training with batch size 1 (critical for high-resolution images)
- \(\log|\det J| = H \cdot W \cdot \sum_c \log|s_c|\) for \(H \times W\) spatial dimensions
2. Invertible 1×1 Convolution:
Learned linear mixing of channels using an invertible \(c \times c\) matrix \(\mathbf{W}\):

$$\mathbf{y}_{i,j} = \mathbf{W} \mathbf{x}_{i,j}$$
- Replaces fixed permutations with learned channel mixing
- LU decomposition: \(\mathbf{W} = \mathbf{P} \cdot \mathbf{L} \cdot (\mathbf{U} + \text{diag}(\mathbf{s}))\)
- Determinant: \(\log|\det \mathbf{W}| = \sum_i \log|s_i|\) (reduced to \(O(c)\))
- Consistently improves log-likelihood over fixed reverse or shuffle permutations
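A small sketch of why the LU parameterization makes the log-determinant cheap; the matrices below are random placeholders rather than trained parameters:

```python
import torch

c = 4
P = torch.eye(c)[torch.randperm(c)]                   # fixed permutation matrix
L = torch.tril(torch.randn(c, c), -1) + torch.eye(c)  # unit lower-triangular
U = torch.triu(torch.randn(c, c), 1)                  # strictly upper-triangular
s = torch.randn(c)                                    # learned diagonal (nonzero)

W = P @ L @ (U + torch.diag(s))
print(torch.linalg.slogdet(W).logabsdet)              # equals the O(c) sum below
print(s.abs().log().sum())                            # sum_i log|s_i|
```

Each image then contributes \(H \cdot W\) copies of this per-matrix log-determinant.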
3. Affine Coupling Layer:
Similar to RealNVP but with the above improvements.
When to Use Glow:
- High-resolution image generation (256×256 and above)
- Need state-of-the-art sample quality
- Have sufficient computational resources
- Want to leverage multi-scale processing
Implementation Detail
Glow reports 3.35 bits/dimension on CIFAR-10 using 3 scales of 32 flow steps each (96 steps in total), with coupling networks of three convolutional layers and 512 hidden channels.
4. MAF: Masked Autoregressive Flow¤
MAF uses autoregressive transformations where each dimension depends on all previous dimensions, providing maximum expressiveness at the cost of sequential sampling.
Autoregressive Transformation:

$$x_i = z_i \exp(\alpha_i) + \mu_i, \qquad \mu_i = f_{\mu_i}(\mathbf{x}_{1:i-1}), \quad \alpha_i = f_{\alpha_i}(\mathbf{x}_{1:i-1})$$

where \(\mu_i\) and \(\alpha_i\) are computed by a MADE (Masked Autoencoder for Distribution Estimation) network.
MADE Architecture:
Uses masked connections to ensure autoregressive property—each output depends only on previous inputs:
```mermaid
graph TB
X1["x₁"] --> H1["h₁"]
X2["x₂"] --> H1
X2 --> H2["h₂"]
X3["x₃"] --> H2
X3 --> H3["h₃"]
H1 --> Z1["μ₁, α₁"]
H1 --> Z2["μ₂, α₂"]
H2 --> Z2
H2 --> Z3["μ₃, α₃"]
H3 --> Z3
style X1 fill:#e1f5ff
style X2 fill:#e1f5ff
style X3 fill:#e1f5ff
style Z1 fill:#ffe1e1
style Z2 fill:#ffe1e1
style Z3 fill:#ffe1e1
```
Trade-offs:
| Direction | Complexity | Use Case |
|---|---|---|
| Forward (density) | \(O(1)\) passes | All dimensions computed in parallel |
| Inverse (sampling) | \(O(D)\) passes | Sequential computation required |
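A sketch of this asymmetry, assuming a hypothetical `made` network that maps \(\mathbf{x}\) to \((\boldsymbol{\mu}, \boldsymbol{\alpha})\) while respecting the autoregressive masks:

```python
import torch

def log_density_pass(x, made):
    # Density evaluation: all mu_i, alpha_i come from a single parallel pass over x
    mu, alpha = made(x)
    z = (x - mu) * torch.exp(-alpha)
    log_det = -alpha.sum(dim=1)              # log|det dz/dx| = -sum_i alpha_i
    return z, log_det

def sample_pass(z, made):
    # Sampling: x_i depends on x_{1:i-1}, so dimensions are filled in sequentially
    x = torch.zeros_like(z)
    for i in range(z.shape[1]):
        mu, alpha = made(x)
        x[:, i] = z[:, i] * torch.exp(alpha[:, i]) + mu[:, i]
    return x
```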
When to Use MAF:
- Density estimation is the primary goal
- Sampling speed is less critical
- Tabular or low-to-moderate dimensional data
- Need highly expressive transformations
- All dimensions should interact
5. IAF: Inverse Autoregressive Flow¤
IAF is the "inverse" of MAF: it uses the same affine transformation, but \(\mu_i\) and \(\alpha_i\) are functions of the previous latent variables \(\mathbf{z}_{1:i-1}\) rather than the previous data dimensions, which flips the computational trade-offs:
Trade-offs:
| Direction | Complexity | Use Case |
|---|---|---|
| Forward (density) | \(O(D)\) passes | Sequential computation |
| Inverse (sampling) | \(O(1)\) passes | All dimensions computed in parallel |
When to Use IAF:
- Fast sampling is the primary goal
- Density estimation is secondary or not needed
- Variational inference (amortized inference in VAEs)
- Real-time generation applications
6. Neural Spline Flows¤
Neural Spline Flows use monotonic rational-quadratic splines to create highly expressive yet tractable transformations.
Rational-Quadratic Spline Transform:
Each spline maps interval \([-B, B]\) to itself using \(K\) rational-quadratic segments, parameterized by:
- \(K+1\) knot positions \(\{(x^{(k)}, y^{(k)})\}\)
- \(K-1\) interior derivative values \(\{\delta^{(k)}\}\) (boundary derivatives are fixed to 1 to match the linear tails outside \([-B, B]\))
Within segment \(k\), the transformation applies a ratio of quadratic polynomials.
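In practice, the raw outputs of the conditioning network are normalized into valid spline parameters; a sketch following the Neural Spline Flows recipe in spirit (the constants here are illustrative, and reference implementations also enforce minimum bin widths):

```python
import torch
import torch.nn.functional as F

K, B = 8, 3.0
raw = torch.randn(3 * K - 1)                       # unconstrained network output

widths = F.softmax(raw[:K], dim=0) * 2 * B         # positive bin widths summing to 2B
heights = F.softmax(raw[K:2 * K], dim=0) * 2 * B   # positive bin heights summing to 2B
derivs = F.softplus(raw[2 * K:])                   # K-1 positive interior derivatives

x_knots = -B + torch.cumsum(torch.cat([torch.zeros(1), widths]), dim=0)
y_knots = -B + torch.cumsum(torch.cat([torch.zeros(1), heights]), dim=0)
```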
Key Properties:
- Strict monotonicity: Ensures invertibility
- Smooth derivatives: No discontinuities (unlike piecewise-linear)
- Closed-form operations: Forward evaluation, analytic inverse (quadratic equation), closed-form derivative
- Universal approximation: With sufficient bins (8-16 typically suffice)
```mermaid
graph LR
Input["x"] --> Spline["Monotonic<br/>Rational-Quadratic<br/>Spline"]
Spline --> Output["y"]
Params["Knot positions<br/>Derivatives"] -.-> Spline
style Input fill:#e1f5ff
style Output fill:#ffe1e1
style Params fill:#fff3cd
```
Compared to alternatives:
- vs Affine: ~23 parameters per dimension vs 2, much more expressive
- vs Neural Autoregressive Flows: No iterative root-finding needed
- vs Flow++: No bisection algorithms required
- vs Piecewise-linear: Smooth derivatives improve optimization
Results:
- CIFAR-10: 3.38 bits/dimension using 10× fewer parameters than Glow
- Best-in-class likelihood on multiple density estimation benchmarks
When to Use Neural Spline Flows:
- Maximum expressiveness with tractability
- Density estimation on complex distributions
- Want fewer parameters than Glow
- Need smooth, differentiable transformations
Training Normalizing Flows¤
Maximum Likelihood Objective¤
Flow training optimizes the straightforward objective of maximum likelihood:

$$\max_\theta \; \mathbb{E}_{\mathbf{x} \sim p_\text{data}}\Big[\log p_\mathcal{Z}\big(f_\theta^{-1}(\mathbf{x})\big) + \log \big| \det J_{f_\theta^{-1}}(\mathbf{x}) \big|\Big]$$

Equivalently, minimize the negative log-likelihood over the training set:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \Big[\log p_\mathcal{Z}\big(f_\theta^{-1}(\mathbf{x}^{(n)})\big) + \log \big| \det J_{f_\theta^{-1}}(\mathbf{x}^{(n)}) \big|\Big]$$
This simplicity contrasts sharply with:
- GANs: Adversarial minimax optimization with mode collapse risks
- VAEs: ELBO with reconstruction-regularization trade-off
- Diffusion: Multi-step denoising with noise schedule design
Training Stability
Gradients flow through the entire composition automatically via backpropagation. Standard optimizers like Adam with learning rates around \(10^{-3}\) work reliably. A single, well-defined likelihood objective with no adversarial dynamics makes training highly stable.
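A minimal training step under this objective, assuming a `flow` whose inverse pass returns \((\mathbf{z}, \log|\det J|)\) and a factorized standard-normal base (the interface is hypothetical, matching the composition sketch above):

```python
import torch

def nll_loss(flow, base, x):
    z, log_det = flow.inverse(x)              # x -> z with log|det dz/dx|
    log_pz = base.log_prob(z).sum(dim=1)      # factorized base density
    return -(log_pz + log_det).mean()

# optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)
# loss = nll_loss(flow, torch.distributions.Normal(0.0, 1.0), batch)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```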
Critical Preprocessing Steps¤
Proper preprocessing proves essential for successful training:
1. Dequantization:
Discrete data (e.g., uint8 images) create delta peaks in continuous space, allowing flows to assign arbitrarily high likelihood to exact discrete values while ignoring intermediate regions.
```python
# Uniform dequantization (assumes x is already scaled to [0, 1] in steps of 1/256)
x_dequantized = x + torch.rand_like(x) / 256.0

# Variational dequantization (more sophisticated): an auxiliary flow proposes the
# noise conditioned on x (flow_model_for_noise is a placeholder, not a real API)
noise = flow_model_for_noise(x)
x_dequantized = x + noise / 256.0
```
2. Logit Transform:
Maps bounded \([0,1]\) data to unbounded \((-\infty, +\infty)\) space matching Gaussian priors:
```python
# Add a small constant for numerical stability
alpha = 0.05
x = alpha + (1 - 2 * alpha) * x
# Apply the logit transform; its log-det term must also be added to the likelihood
x_logit = torch.logit(x)  # log(x / (1 - x))
```
Critical Importance
Without these preprocessing steps, training diverges immediately as the model tries to match bounded data to Gaussian base distributions.
Numerical Stability Techniques¤
1. Log-Space Computation:
Never compute \(\det(J)\) directly—immediate overflow/underflow:
```python
# WRONG: the determinant itself overflows/underflows in high dimensions
det_J = torch.det(jacobian)
log_det = torch.log(det_J)

# CORRECT: compute in log space (slogdet also handles negative determinants)
sign, log_det = torch.linalg.slogdet(jacobian)

# Or, for triangular Jacobians, sum the log of the absolute diagonal:
log_det = torch.sum(torch.log(torch.abs(torch.diagonal(jacobian))))
```
2. Gradient Clipping:
Prevents exploding gradients in deep architectures (10-15+ layers):
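A typical pattern, sketched here with an illustrative `flow` model and max-norm value:

```python
# Clip the global gradient norm after loss.backward(), before optimizer.step()
torch.nn.utils.clip_grad_norm_(flow.parameters(), max_norm=1.0)
```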
3. Normalization Layers:
Batch normalization or ActNorm stabilizes intermediate representations:
```python
# ActNorm sketch: initialize per-channel scale and bias from the statistics of the
# first minibatch (compute_initial_stats is a placeholder), then treat both as
# ordinary learnable parameters
scale, bias = compute_initial_stats(first_batch)
```
4. Learning Rate Schedules:
Polynomial decay with warmup improves convergence:
```python
# Linear warmup, then decay the base LR (1e-3) by 10x over training (to ~1e-4)
warmup, total = 1000, 100_000
def lr_lambda(step):
    if step < warmup:
        return step / warmup
    return 1.0 - 0.9 * min((step - warmup) / (total - warmup), 1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```
Monitoring Training¤
Watch these metrics:
- Negative log-likelihood: Should decrease steadily
- Per-layer log-determinants: Monitor for sudden spikes (numerical issues)
- Reconstruction error: \(\|x - f(f^{-1}(x))\| < 10^{-5}\) for numerical stability
- Bits per dimension: For images, \(\text{bpd} = \frac{\text{NLL}}{D \cdot \log 2}\)
```python
# Invertibility check
x_reconstructed = flow.inverse(flow.forward(x))
recon_error = torch.mean(torch.abs(x - x_reconstructed))
assert recon_error < 1e-5, f"Poor invertibility: {recon_error}"
```
Common Pitfalls and Solutions¤
- Missing Preprocessing
  Symptom: Immediate divergence or NaN losses
  Solution: Always dequantize discrete data and apply logit transform
- Numerical Instability
  Symptom: Sudden spikes in log-determinants or NaN gradients
  Solution: Use log-space computation, gradient clipping, monitor per-layer statistics
- Poor Invertibility
  Symptom: \(\|x - f^{-1}(f(x))\| > 10^{-3}\)
  Solution: Use residual flows with soft-thresholding, reduce depth, check numerical precision
- Slow Convergence
  Symptom: Likelihood plateaus early
  Solution: Increase model capacity, add more layers, use spline flows, check preprocessing
Advanced Architectures and Recent Advances¤
Continuous Normalizing Flows (Neural ODEs)¤
Continuous flows parameterize the derivative of the hidden state with a neural network:

$$\frac{d\mathbf{z}(t)}{dt} = f_\theta\big(\mathbf{z}(t), t\big)$$

The output \(\mathbf{z}_1\) at time \(t=1\) given initial condition \(\mathbf{z}_0\) at \(t=0\) is computed using ODE solvers:

$$\mathbf{z}_1 = \mathbf{z}_0 + \int_0^1 f_\theta\big(\mathbf{z}(t), t\big)\, dt$$
Key Innovation (FFJORD):
The change in log-density follows the instantaneous change of variables formula:

$$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\operatorname{Tr}\left(\frac{\partial f_\theta}{\partial \mathbf{z}(t)}\right)$$

Hutchinson's trace estimator makes this tractable:

$$\operatorname{Tr}\left(\frac{\partial f_\theta}{\partial \mathbf{z}}\right) = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\boldsymbol{\epsilon}^\top \frac{\partial f_\theta}{\partial \mathbf{z}}\, \boldsymbol{\epsilon}\right], \qquad \mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}, \;\; \operatorname{Cov}(\boldsymbol{\epsilon}) = \mathbf{I}$$
This unbiased stochastic estimate requires only one Jacobian-vector product per sample, reducing complexity from \(O(D^2)\) to \(O(D)\).
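A sketch of the estimator using a single Rademacher probe per sample; the dynamics function below is an arbitrary stand-in:

```python
import torch

def hutchinson_trace(f, z):
    # One vector-Jacobian product instead of materializing the D x D Jacobian
    z = z.requires_grad_(True)
    eps = torch.randint(0, 2, z.shape).to(z.dtype) * 2 - 1    # +/-1 entries
    out = f(z)
    (vjp,) = torch.autograd.grad(out, z, grad_outputs=eps, create_graph=True)
    return (vjp * eps).sum(dim=-1)            # unbiased estimate of Tr(df/dz)

f = torch.nn.Linear(5, 5)                     # illustrative dynamics function
z = torch.randn(3, 5)
print(hutchinson_trace(f, z))                 # approximately Tr(W) for each sample
```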
Advantages:
- No architectural constraints: Any neural network architecture works
- Flexible expressiveness: Can model disconnected regions and sharp boundaries
- Adjoint method: Memory-efficient training (\(O(1)\) memory vs \(O(\text{depth})\))
Challenges:
- Unpredictable cost: Number of function evaluations adapts to complexity
- Stiff dynamics: Can struggle with certain distributions
- Slower than discrete flows: Requires ODE integration
When to Use:
- Modeling distributions with disconnected regions
- Physics simulation, molecular dynamics
- Scientific domains requiring flexible unrestricted networks
Residual Flows: Invertible ResNets¤
Residual flows make standard ResNet architectures \(F(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})\) invertible by constraining \(g\) to be contractive with Lipschitz constant \(L < 1\).
Invertibility via Fixed-Point Iteration:
The Banach fixed-point theorem guarantees a bijection, with the inverse of \(\mathbf{y} = \mathbf{x} + g(\mathbf{x})\) computable via the iteration:

$$\mathbf{x}^{(0)} = \mathbf{y}, \qquad \mathbf{x}^{(k+1)} = \mathbf{y} - g\big(\mathbf{x}^{(k)}\big)$$

which converges exponentially (at rate \(L\)) to the true inverse.
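A sketch of the inversion loop; the contractive map `g` below is an arbitrary stand-in:

```python
import torch

def invert_residual(g, y, num_iters=50):
    x = y.clone()
    for _ in range(num_iters):
        x = y - g(x)                            # x^{(k+1)} = y - g(x^{(k)})
    return x

g = lambda x: 0.5 * torch.tanh(x)               # contractive: Lipschitz constant <= 0.5
y = torch.randn(4, 8)
x = invert_residual(g, y)
print(torch.allclose(x + g(x), y, atol=1e-6))   # True: the forward map recovers y
```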
Spectral Normalization:
Enforces the Lipschitz constraint by normalizing each weight matrix:

$$\mathbf{W} \leftarrow c \cdot \frac{\mathbf{W}}{\|\mathbf{W}\|_2}, \qquad c < 1$$

The spectral norm \(\|\mathbf{W}\|_2\) is estimated via power iteration.
Russian Roulette Estimator:
For \(F(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})\), the log-determinant has the power series:

$$\log\left|\det\big(\mathbf{I} + J_g(\mathbf{x})\big)\right| = \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} \operatorname{Tr}\big(J_g(\mathbf{x})^k\big)$$

Rather than truncating the series (which introduces bias), the Russian roulette estimator terminates it at a random index and reweights the retained terms, keeping the estimate unbiased while the computation stays finite.
When to Use:
- High-dimensional problems (>1000 dimensions)
- Want free-form Jacobians (all dimensions interact)
- Need competitive density estimation with flexibility
- Extensions like Invertible DenseNets for parameter efficiency
Flow Matching: Simulation-Free Training¤
Flow Matching (2022) introduced a paradigm shift for training continuous normalizing flows without ODE simulation.
Key Idea:
Rather than integrating forward dynamics during training (as in Neural ODEs), perform regression on the vector field of fixed conditional probability paths.
Training Procedure:
- Given samples \(\mathbf{x}_0 \sim p_0\) and \(\mathbf{x}_1 \sim p_1\)
- Define interpolant: \(\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1\)
- Train a neural network \(\mathbf{v}_\theta(\mathbf{x}_t, t)\) to match the conditional vector field:

$$\mathcal{L}_\text{CFM}(\theta) = \mathbb{E}_{t,\, \mathbf{x}_0,\, \mathbf{x}_1}\left[\big\|\mathbf{v}_\theta(\mathbf{x}_t, t) - (\mathbf{x}_1 - \mathbf{x}_0)\big\|^2\right]$$
This is simple L2 regression requiring no simulation.
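A sketch of the resulting training loss with the linear interpolant; `v_theta`, `x0`, and `x1` are assumed to be a model and paired sample batches:

```python
import torch

def flow_matching_loss(v_theta, x0, x1):
    t = torch.rand(x0.shape[0], 1)        # t ~ U[0, 1], one per sample
    x_t = (1 - t) * x0 + t * x1           # linear interpolant
    target = x1 - x0                      # conditional velocity d x_t / d t
    return ((v_theta(x_t, t) - target) ** 2).mean()
```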
Inference:
Integrate the learned field with standard ODE solvers:

$$\frac{d\mathbf{x}_t}{dt} = \mathbf{v}_\theta(\mathbf{x}_t, t), \qquad \mathbf{x}_0 \sim p_0, \quad t: 0 \to 1$$
Optimal Transport Flow Matching:
Uses minibatch optimal transport to couple noise and data samples before interpolation, creating straighter paths that require fewer integration steps.
Results:
- State-of-the-art on ImageNet
- Better likelihood and sample quality than simulation-based methods
- Extensions to Riemannian manifolds, discrete data, video generation
When to Use:
- Training continuous flows efficiently
- Want simulation-free gradients
- Need state-of-the-art likelihood
- 2023-2024 cutting-edge research
Rectified Flows: Learning Straight Trajectories¤
Rectified Flow (2022) learns ODEs following straight-line paths connecting source and target distributions.
Training:
Given coupling between noise samples \(\mathbf{u} \sim p_0\) and data samples \(\mathbf{x} \sim p_1\):
- Linearly interpolate: \(\mathbf{x}_t = (1-t)\mathbf{u} + t\mathbf{x}\)
- Learn velocity field: \(\frac{d\mathbf{x}_t}{dt} = \mathbf{v}_\theta(\mathbf{x}_t, t)\)
Reflow Process (Key Innovation):
- Train initial model
- Generate paired samples: \((\mathbf{u}, \mathbf{x}_{\text{gen}})\) where \(\mathbf{x}_{\text{gen}} = \text{model}(\mathbf{u})\)
- Retrain on these pairs
This iterative rectification progressively straightens trajectories.
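Sampling from a (nearly) rectified model then reduces to a few explicit Euler steps; a sketch assuming a trained `v_theta`:

```python
import torch

def sample(v_theta, u, num_steps=4):
    x, dt = u, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * v_theta(x, t)        # one explicit Euler step along the path
    return x
```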
Benefits:
- Provably non-increasing convex transport costs
- One-step or few-step generation: Straight paths require minimal integration
- One reflow iteration typically suffices under realistic settings
Applications:
- Stable Diffusion 3: Uses rectified flow formulation, outperforms pure diffusion
- InstaFlow: Achieves 0.1-second generation demonstrating practical viability
- One-step generation for real-time applications
When to Use:
- Need few-step or one-step generation
- Real-time applications requiring fast inference
- Want to distill models for deployment
- State-of-the-art 2023-2024 research
Discrete Flow Matching¤
Discrete Flow Matching (2024) extends flows to discrete data (text, molecules, code) using Continuous-Time Markov Chains (CTMC).
Problem:
Traditional flows are designed for continuous data, and dequantization workarounds prove inadequate for inherently discrete data like language.
Solution:
A CTMC process over the discrete state space with learnable, time-dependent transition rates that gradually transport a simple source distribution into the data distribution.
Training:
As in continuous flow matching, the network is trained by regressing onto conditional probability velocities along prescribed probability paths, with no simulation of the Markov chain during training.
Results:
- FlowMol-CTMC: State-of-the-art molecular validity
- Code generation: 1.7B parameter model achieves 13.4% Pass@10 on HumanEval
- DNA sequence design: Dirichlet Flow Matching
When to Use:
- Text generation (alternative to autoregressive)
- Molecular generation with discrete atom types
- Code generation
- Any discrete structured data
Geometric Flows: Riemannian Manifolds¤
Riemannian Flow Matching extends flows to non-Euclidean geometries, critical for data on manifolds.
Applications:
- Molecular conformations: SE(3) equivariant flows
- Protein structures: SO(3) rotations and translations
- Robotic configurations: Configuration space manifolds
- Materials: FlowMM for crystal structure generation (3× efficiency improvement)
Key Idea:
Replace Euclidean straight-line interpolants with geodesics on the manifold \(\mathcal{M}\):

$$\mathbf{x}_t = \exp_{\mathbf{x}_0}\!\big(t \log_{\mathbf{x}_0}(\mathbf{x}_1)\big)$$

where \(\exp\) and \(\log\) denote the Riemannian exponential and logarithm maps.
When to Use:
- Data naturally lives on manifolds
- Symmetries and geometric constraints are important
- Protein design, molecular generation, materials discovery
- 3D geometry and robotics applications
Comparing Flows with Other Generative Models¤
Flows vs VAEs: Exact Likelihood vs Learned Compression¤
| Aspect | Normalizing Flows | VAEs |
|---|---|---|
| Likelihood | Exact | Approximate (ELBO lower bound) |
| Dimensionality | Input = Output | Compressed latent (\(\dim(z) \ll \dim(x)\)) |
| Sample Quality | Sharp (historically) | Blurry (reconstruction loss) |
| Training | Maximum likelihood | ELBO (reconstruction + KL) |
| Mode Coverage | Excellent (exact distribution) | Can suffer posterior collapse |
| Generation Speed | Fast (single pass) | Fast (single pass) |
| Interpretability | Limited | Compressed representations |
When to Choose:
- Use Flows when exact likelihood is essential (anomaly detection, density estimation, model comparison) or lossless reconstruction matters
- Use VAEs when compressed latent representations provide value for downstream tasks, interpretability matters, or computational constraints favor smaller latent spaces
- Hybrid f-VAEs: Combine both—VAEs with flow-based posteriors or decoders
Flows vs GANs: Mode Coverage vs Sample Quality¤
| Aspect | Normalizing Flows | GANs |
|---|---|---|
| Sample Quality | Sharp, competitive with TarFlow (2024) | Superior perceptual quality (historically) |
| Training Stability | Guaranteed convergence | Notorious instability, mode collapse |
| Likelihood | Exact computation | No likelihood evaluation |
| Mode Coverage | Complete | Suffers from mode collapse |
| Evaluation | Negative log-likelihood | FID, Inception Score |
| Consistency | Input-output consistency | May hallucinate details |
2020 Liu & Gretton Study
Empirical comparison on synthetic data showed several normalizing flows substantially outperformed WGAN in Wasserstein distance—the very metric WGAN targets. No GAN tested could model simple distributions well.
When to Choose:
- Use Flows for stable reliable training, guaranteed mode coverage, likelihood evaluation, consistent generation
- Use GANs when perceptual quality dominates all other concerns and substantial tuning expertise is available
- Modern landscape: TarFlow (2024) shows flows can match GAN quality with proper architecture
Flows vs Diffusion: Speed vs Quality with Converging Trajectories¤
Historical Trade-off:
- Diffusion: Superior sample quality, but 50-1000 iterative denoising steps
- Flows: Fast single-step sampling, but lower perceptual quality
2023-2024 Developments Disrupted This:
- Rectified flows: Straight paths enabling few-step generation
- Flow matching: Simulation-free training matching diffusion quality
- TarFlow (2024): Transformer-based flows matching diffusion quality while maintaining one-step generation
| Aspect | Normalizing Flows (Modern) | Diffusion Models |
|---|---|---|
| Sampling Speed | 1-10 steps (TarFlow, Rectified) | 50-1000 steps (DDPM) or 10-50 (DDIM) |
| Sample Quality | Matching diffusion (TarFlow 2024) | Excellent |
| Likelihood | Exact | Tractable (learned) |
| Training Stability | Stable (NLL) | Stable (denoising) |
| Jacobian Computation | Required (\(O(D^3)\) → structured) | Not required |
| Architectural Constraints | Invertibility, equal dimensions | Flexible, no constraints |
Diffusion Normalizing Flow (2021)
Demonstrated 20× speedup over standard diffusion with comparable quality by combining learnable flow-based forward processes with stochastic elements.
When to Choose:
- Use Flows for real-time applications (audio synthesis, interactive systems), exact likelihood scoring (anomaly detection), computational constraints at inference
- Use Diffusion when sample quality is paramount and computational resources permit slower generation
- Hybrid Approaches: Flow matching unifies continuous flows and diffusion under common framework
Practical Implementation Guidance¤
Framework and Package Selection¤
Modern Flow Implementations:
PyTorch Ecosystem:

- normflows (1000+ stars): Comprehensive architectures (RealNVP, Glow, NSF, MAF)
- nflows: State-of-the-art methods from the Edinburgh group (creators of spline flows)
- Zuko: Modern PyTorch implementation with a clean API

TensorFlow Probability:

- First-party flow support via composable bijectors
- Production-stable with extensive testing
- Integrates with the TensorFlow ecosystem

JAX Implementations:

- Distrax: High-performance flows with JAX transformations
- Workshop: This repository, with comprehensive flow implementations built on Flax/NNX
- Optimal for scientific computing and research
Framework Choice:
- PyTorch: Dominates academic research, excellent debugging
- TensorFlow: Production stability, enterprise deployment
- JAX: High-performance scientific computing, automatic differentiation
Architecture Selection Guide¤
```mermaid
graph TD
Start{{"What's your<br/>primary goal?"}}
Start -->|"Density Estimation"| Dense{{"Dimensionality?"}}
Start -->|"Fast Sampling"| Sample{{"Data Type?"}}
Start -->|"Both Equally"| Both["RealNVP or Glow"]
Dense -->|"Low-Med<br/>(< 100)"| MAF["MAF<br/>(Masked Autoregressive)"]
Dense -->|"High<br/>(> 100)"| Spline["Neural Spline Flows"]
Dense -->|"Very High<br/>(> 1000)"| Residual["Residual Flows or<br/>Continuous Flows"]
Sample -->|"Images"| Glow["Glow"]
Sample -->|"Tabular/Other"| IAF["IAF or RealNVP"]
Sample -->|"Real-time"| Rectified["Rectified Flow"]
style Start fill:#e1f5ff
style Dense fill:#fff3cd
style Sample fill:#fff3cd
style MAF fill:#d4edda
style Spline fill:#d4edda
style Glow fill:#d4edda
style IAF fill:#d4edda
style Both fill:#d4edda
style Residual fill:#d4edda
style Rectified fill:#d4edda
```
Recommended Hyperparameters by Task¤
Image Generation (Glow/RealNVP):
```python
config = {
    "num_scales": 3,
    "num_steps_per_scale": 12,        # typically 8-12
    "hidden_channels": 512,
    "num_layers_per_block": 3,
    "batch_size": 64,                 # 64-128
    "learning_rate": 1e-3,
    "lr_decay": "polynomial",         # decay to 1e-4
    "preprocessing": ["dequantize", "logit_transform"],
}
```
Density Estimation on Tabular Data (MAF/Neural Spline Flows):
```python
config = {
    "num_transforms": 10,             # typically 5-10
    "hidden_dims": [512, 512],        # match or exceed data dimensionality
    "num_bins": 8,                    # 8-16, for spline flows
    "batch_size": 256,
    "learning_rate": 5e-4,
    "preprocessing": ["standardization"],
}
```
Variational Inference (IAF/RealNVP):
```python
config = {
    "num_steps": 8,                   # typically 4-8
    "hidden_dims": [256, 256],
    "base_distribution": "gaussian",  # learned mean/std
    "learning_rate": 1e-3,
    "annealing_schedule": "linear",   # for the KL term
}
```
Common Implementation Pitfalls¤
- Forgetting preprocessing (dequantization, logit transform)
- Computing determinants directly instead of in log space
- Ignoring invertibility checks
- Choosing the wrong coupling architecture for the data
Summary and Key Takeaways¤
Normalizing flows provide a unique combination of exact likelihood computation, fast sampling, and stable training through invertible transformations with tractable Jacobians.
Core Principles¤
- Exact Likelihood: Flows compute exact probability through the change of variables formula, enabling precise density estimation
- Invertible Architecture: Bijective transformations allow both efficient sampling and exact inference
- Tractable Jacobians: Structured Jacobians (triangular, diagonal) reduce complexity from \(O(D^3)\) to \(O(D)\)
- Stable Training: Maximum likelihood provides a clear objective without adversarial dynamics
- Composable Design: Stack simple transformations to build arbitrarily complex distributions
Architecture Selection Matrix¤
| Architecture | Density Estimation | Fast Sampling | Best For |
|---|---|---|---|
| RealNVP | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Balanced use, images |
| Glow | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | High-res images, quality |
| MAF | ⭐⭐⭐⭐⭐ | ⭐⭐ | Density on tabular data |
| IAF | ⭐⭐ | ⭐⭐⭐⭐⭐ | Fast sampling, VI |
| Neural Spline | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Maximum expressiveness |
| Continuous | ⭐⭐⭐⭐ | ⭐⭐⭐ | No constraints, flexibility |
| Rectified | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | One-step generation |
Recent Advances (2023-2025)¤
- Flow Matching: Simulation-free training achieving state-of-the-art likelihood
- Rectified Flows: Straight paths enabling few-step generation (Stable Diffusion 3)
- TarFlow: First normalizing flow to match diffusion-level sample quality while maintaining one-step sampling
- Discrete Flows: CTMC-based flows for text, molecules, code
- Geometric Flows: Riemannian flow matching for manifold data (proteins, materials)
When to Use Normalizing Flows¤
Best Use Cases:
- Exact likelihood is essential (anomaly detection, model comparison)
- Fast generation required (real-time audio, interactive systems)
- Stable training preferred over adversarial methods
- Lossless reconstruction needed
- Mode coverage guarantees important
Avoid When:
- Maximum perceptual quality is sole objective (use GANs/diffusion)
- Compressed representations needed (use VAEs)
- Architectural flexibility critical (diffusion has fewer constraints)
- Very high dimensions with limited resources (consider latent diffusion)
Future Directions¤
- One-step generation via rectified flows and distillation
- Pyramidal structures for video and high-resolution media
- Hybrid models combining flows with diffusion, transformers
- Scientific applications in materials, proteins, molecular generation
- Geometric awareness for data on manifolds
Next Steps¤
- Practical usage guide with implementation examples and training workflows
- Complete API documentation for RealNVP, Glow, MAF, IAF, and Neural Spline Flows
- Step-by-step hands-on tutorial: train a flow model on MNIST from scratch
- Explore continuous flows, flow matching, and state-of-the-art architectures
References and Further Reading¤
Seminal Papers (Must Read)¤
Dinh, L., Krueger, D., & Bengio, Y. (2014). "NICE: Non-linear Independent Components Estimation"
arXiv:1410.8516
First practical coupling layer architecture
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). "Density estimation using Real NVP"
arXiv:1605.08803
Affine coupling layers and multi-scale architecture
Kingma, D. P., & Dhariwal, P. (2018). "Glow: Generative Flow with Invertible 1×1 Convolutions"
arXiv:1807.03039
State-of-the-art image generation with learnable permutations
Papamakarios, G., Pavlakou, T., & Murray, I. (2017). "Masked Autoregressive Flow for Density Estimation"
arXiv:1705.07057
Autoregressive flows for maximum expressiveness
Durkan, C., Bekasov, A., Murray, I., & Papamakarios, G. (2019). "Neural Spline Flows"
arXiv:1906.04032
Monotonic rational-quadratic splines for flexible transformations
Continuous and Modern Flows¤
Chen, R. T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. (2018). "Neural Ordinary Differential Equations"
arXiv:1806.07366
Continuous-time flows using ODE solvers
Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., & Duvenaud, D. (2019). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models"
arXiv:1810.01367
Tractable continuous flows with Hutchinson's estimator
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2022). "Flow Matching for Generative Modeling"
arXiv:2210.02747
Simulation-free training paradigm
Liu, X., Gong, C., & Liu, Q. (2022). "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow"
arXiv:2209.03003
Straight paths for one-step generation
Recent Advances (2023-2025)¤
Gat, I., et al. (2024). "Discrete Flow Matching"
arXiv:2407.15595
CTMC-based flows for discrete data (NeurIPS 2024 Spotlight)
Zhai, S., et al. (2024). "Normalizing Flows are Capable Generative Models (TarFlow)"
arXiv:2412.06329
First normalizing flow to match diffusion-level sample quality
Esser, P., et al. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Stable Diffusion 3)"
arXiv:2403.03206
Rectified flows in production systems
Chen, R. T. Q., & Lipman, Y. (2023). "Riemannian Flow Matching on General Geometries"
arXiv:2302.03660
Flows on manifolds for geometric data
Comprehensive Surveys¤
Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2021). "Normalizing Flows for Probabilistic Modeling and Inference"
arXiv:1912.02762 | JMLR 22(57):1-64, 2021
Comprehensive tutorial covering theory and methods
Kobyzev, I., Prince, S. J., & Brubaker, M. A. (2020). "Normalizing Flows: An Introduction and Review of Current Methods"
arXiv:1908.09257 | IEEE TPAMI 2020
Excellent introduction with taxonomy
Online Resources¤
Lilian Weng's Blog: "Flow-based Deep Generative Models"
lilianweng.github.io/posts/2018-10-13-flow-models
Comprehensive blog post with clear explanations and visualizations
Eric Jang's Tutorial
blog.evjang.com/2018/01/nf1.html
Two-part tutorial with code
UvA Deep Learning Tutorial 11
uvadlc-notebooks.readthedocs.io
Complete Colab notebooks
awesome-normalizing-flows
github.com/janosh/awesome-normalizing-flows
Curated list with 700+ papers