Introduction
Bayesian inference asks for the posterior distribution:
$$ p(z \mid x) = \frac{p(x \mid z)p(z)}{p(x)} $$
The problem is that the denominator $p(x) = \int p(x, z)\,dz$, called the evidence, is often intractable because it requires integrating over all latent variables.
Variational inference turns inference into optimization. Instead of computing the exact posterior, we choose a simpler family of distributions and find the member that is closest to the true posterior.
Approximate Posterior
Let $q_\phi(z)$ be an approximate posterior. The goal is:
Find $q_\phi(z)$ that is close to $p(z \mid x)$.
The set of candidate $q$ distributions is called the variational family. It is usually chosen to be simple, such as fully factorized Gaussians.
The approximation is only as good as the family allows.
KL Divergence
Closeness is often measured with KL divergence:
$$ KL(q(z) \,\|\, p(z \mid x)) $$
Directly minimizing this is hard because it still involves the unknown posterior. Instead, variational inference optimizes the evidence lower bound.
ELBO
The evidence lower bound (ELBO) is:
$$ ELBO(q) = \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)] $$
Because $\log p(x) = ELBO(q) + KL(q(z) \,\|\, p(z \mid x))$ and $\log p(x)$ does not depend on $q$, maximizing the ELBO is equivalent to minimizing the KL divergence from the approximate posterior to the true posterior.
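The identity behind that equivalence follows in a few lines from the definition of KL divergence and $p(z \mid x) = p(x, z) / p(x)$. This is a sketch of the standard argument, written in the notation used above:

$$
\begin{aligned}
KL(q(z) \,\|\, p(z \mid x)) &= \mathbb{E}_{q(z)}[\log q(z)] - \mathbb{E}_{q(z)}[\log p(z \mid x)] \\
&= \mathbb{E}_{q(z)}[\log q(z)] - \mathbb{E}_{q(z)}[\log p(x, z)] + \log p(x) \\
&= \log p(x) - ELBO(q)
\end{aligned}
$$

Since the KL term is non-negative, this also gives $ELBO(q) \le \log p(x)$, which is where the name "lower bound" comes from.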
Another common form is:
$$ ELBO(q) = \mathbb{E}_{q(z)}[\log p(x \mid z)] - KL(q(z) \,\|\, p(z)) $$
This form shows the tradeoff:
- Fit the observed data.
- Keep the approximate posterior close to the prior.
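As a rough numerical check of this tradeoff form, the sketch below evaluates the ELBO in closed form for a toy conjugate model. The model is an assumption of this example, not something taken from the text: prior $z \sim N(0, 1)$, likelihood $x \mid z \sim N(z, 1)$, and $q(z) = N(m, s^2)$. For this model the exact posterior is $N(x/2, 1/2)$ and the evidence is $N(x; 0, 2)$, so the bound can be compared against the true $\log p(x)$. The function names are hypothetical.

```python
import math

def elbo(x, m, s2):
    """ELBO for the toy model z ~ N(0, 1), x | z ~ N(z, 1), with q(z) = N(m, s2)."""
    # E_q[log p(x | z)]: Gaussian log-likelihood averaged over q.
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL(q(z) || p(z)) between N(m, s2) and the N(0, 1) prior.
    kl_to_prior = 0.5 * (s2 + m ** 2 - 1.0 - math.log(s2))
    return expected_log_lik - kl_to_prior

def log_evidence(x):
    """Exact log p(x) for the toy model: x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0

x = 1.5
elbo_at_posterior = elbo(x, x / 2, 0.5)  # q equal to the exact posterior
elbo_at_prior = elbo(x, 0.0, 1.0)        # q equal to the prior
gap = log_evidence(x) - elbo_at_prior    # equals KL(q || p(z | x)) for this q

print(log_evidence(x), elbo_at_posterior, elbo_at_prior, gap)
```

When $q$ matches the exact posterior the bound is tight. When $q$ is the prior, the KL penalty is zero but the data-fit term suffers, and the gap to $\log p(x)$ is strictly positive.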
Mean-Field Assumption
Mean-field variational inference assumes the approximate posterior factorizes:
$$ q(z) = \prod_i q_i(z_i) $$
This makes optimization easier, but it ignores posterior dependencies between variables. It is a useful approximation, not a law of nature.
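To make the cost of ignoring dependencies concrete, here is a small sketch under a specific assumption: the true posterior is a bivariate Gaussian with unit marginal variances and correlation `rho`. For a Gaussian target with precision matrix $\Lambda$, a standard result is that the optimal mean-field factor under $KL(q \,\|\, p)$ has variance $1 / \Lambda_{ii}$, which is smaller than the true marginal variance whenever the variables are correlated. The function name is hypothetical.

```python
def mean_field_variance(rho):
    """Variance of each optimal mean-field factor for a bivariate Gaussian
    target with unit marginal variances and correlation rho."""
    # The precision matrix of [[1, rho], [rho, 1]] has diagonal 1 / (1 - rho**2).
    precision_diag = 1.0 / (1.0 - rho ** 2)
    return 1.0 / precision_diag

true_marginal_variance = 1.0
for rho in (0.0, 0.5, 0.9):
    # The stronger the correlation, the more the factorized q shrinks.
    print(rho, true_marginal_variance, mean_field_variance(rho))
```

With no correlation the factorized approximation is exact. As the correlation grows, each factor becomes narrower than the true marginal.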
Coordinate Ascent
In classical variational inference, each factor $q_i$ can be updated while holding the others fixed. This is coordinate ascent variational inference.
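A minimal sketch of such coordinate updates, assuming the target posterior is a bivariate Gaussian with known mean `mu` and precision matrix `precision` (both made up for illustration). In that special case each factor $q_i$ stays Gaussian with fixed variance $1 / \Lambda_{ii}$, so only the factor means need to be iterated. The function name `cavi_gaussian` is hypothetical.

```python
def cavi_gaussian(mu, precision, sweeps=200):
    """Coordinate ascent on the means of q(z1) q(z2) for a 2-D Gaussian target."""
    m1, m2 = 0.0, 0.0  # arbitrary starting point
    for _ in range(sweeps):
        # Update q1 while holding q2 fixed, then q2 while holding q1 fixed.
        m1 = mu[0] - precision[0][1] / precision[0][0] * (m2 - mu[1])
        m2 = mu[1] - precision[1][0] / precision[1][1] * (m1 - mu[0])
    return m1, m2

# Target: unit marginal variances, correlation 0.9, mean (1, -1).
rho = 0.9
scale = 1.0 / (1.0 - rho ** 2)
precision = [[scale, -rho * scale], [-rho * scale, scale]]
mu = (1.0, -1.0)

m1, m2 = cavi_gaussian(mu, precision)
print(m1, m2)
```

In this example the factor means converge to the true posterior mean. Each sweep shrinks the error by a factor of `rho ** 2`, so strongly correlated targets converge more slowly.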
Modern deep learning often uses gradient-based optimization instead, especially when neural networks parameterize the variational distribution.
Connection to VAEs
Variational autoencoders use variational inference with neural networks.
- The encoder defines $q_\phi(z \mid x)$.
- The decoder defines $p_\theta(x \mid z)$.
- Training maximizes an ELBO.
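The three pieces above can be sketched without any deep learning library. The example below is an illustration only: the "encoder" and "decoder" are hypothetical one-dimensional linear maps standing in for neural networks, the prior is assumed to be $N(0, 1)$, the decoder is assumed to have unit variance, and the parameter names (`w_mu`, `w_lv`, and so on) are invented. It computes a Monte Carlo ELBO estimate using the reparameterization $z = \mu + \sigma \epsilon$; a real VAE would differentiate this quantity with respect to the encoder and decoder parameters.

```python
import math
import random

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, 1)) in closed form."""
    return 0.5 * (math.exp(log_var) + mu ** 2 - 1.0 - log_var)

def elbo_estimate(x, enc, dec, rng):
    """Single-sample ELBO estimate using z = mu + sigma * eps."""
    # Hypothetical linear encoder: mean and log-variance of q(z | x).
    mu = enc["w_mu"] * x + enc["b_mu"]
    log_var = enc["w_lv"] * x + enc["b_lv"]
    # Reparameterized sample from q(z | x).
    eps = rng.gauss(0.0, 1.0)
    z = mu + math.exp(0.5 * log_var) * eps
    # Hypothetical linear decoder: p(x | z) = N(w * z + b, 1).
    x_mean = dec["w"] * z + dec["b"]
    log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * (x - x_mean) ** 2
    return log_lik - gaussian_kl_to_standard_normal(mu, log_var)

rng = random.Random(0)
# These parameters make q(z | x) = N(x / 2, 1 / 2), the exact posterior
# for the model z ~ N(0, 1), x | z ~ N(z, 1).
enc = {"w_mu": 0.5, "b_mu": 0.0, "w_lv": 0.0, "b_lv": math.log(0.5)}
dec = {"w": 1.0, "b": 0.0}

x = 1.5
estimates = [elbo_estimate(x, enc, dec, rng) for _ in range(20000)]
average = sum(estimates) / len(estimates)
print(average)
```

Because the encoder here happens to match the exact posterior of the assumed model, the averaged estimate should sit close to the true $\log p(x)$. With a mismatched encoder it would settle below that value.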
This is the same idea in a deep learning setting: approximate a hard posterior with a tractable learned distribution.
Practical Tradeoffs
Variational inference is usually faster than sampling methods such as MCMC, especially at scale. The tradeoff is approximation bias.
Use it when:
- Exact inference is intractable.
- You need scalable approximate Bayesian inference.
- A tractable posterior approximation is acceptable.
- You can validate whether the approximation is good enough.
Be careful when well-calibrated posterior uncertainty matters. A convenient approximation, a mean-field one in particular, can understate uncertainty.
Closing
Variational inference is best understood as optimization-based approximate inference. Choose a family of distributions, optimize the ELBO, and remember that the result is an approximation whose quality depends on both the model and the variational family.