# Variational Inference

### Introduction

#### 1. Background of Bayesian methods

In the field of machine learning, frequentist approaches played a critical role in the development of early classical models. Nevertheless, Bayesian methods have become increasingly significant in the modern study of machine learning and data modelling. The simple-looking Bayes' rule

$$
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}
$$

relates the posterior $p(\theta \mid X)$ over latent variables $\theta$ to the likelihood $p(X \mid \theta)$, the prior $p(\theta)$, and the evidence $p(X)$.

#### 2. Problem with Bayesian methods: intractable integral

While the rule looks easy to understand, the numerical computation is hard in practice. One major issue is the intractable integral in the denominator, the evidence

$$
p(X) = \int p(X \mid \theta)\, p(\theta)\, d\theta,
$$

which integrates over the entire latent space. For high-dimensional $\theta$ this integral rarely has a closed form and is too expensive to evaluate numerically.
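To make this concrete, the sketch below approximates the evidence integral by brute force for a deliberately tiny conjugate model: a single observation with Gaussian likelihood of unit variance and a standard normal prior on the mean, so the true evidence is available in closed form for comparison. The model, the observation value, and the grid bounds are illustrative choices, not from the text; in realistic models $\theta$ is high-dimensional and neither the grid sum nor the closed form is available.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 1.3  # a single observed data point (illustrative)

# Evidence p(x) = ∫ p(x|theta) p(theta) dtheta, approximated by a
# midpoint Riemann sum over a 1-D grid of theta values.
lo, hi, n = -10.0, 10.0, 200_000
h = (hi - lo) / n
evidence = sum(
    normal_pdf(x, lo + (i + 0.5) * h, 1.0) * normal_pdf(lo + (i + 0.5) * h, 0.0, 1.0) * h
    for i in range(n)
)

# For this conjugate toy model the integral has a closed form:
# marginally, x ~ N(0, 1 + 1) = N(0, 2).
exact = normal_pdf(x, 0.0, 2.0)
print(evidence, exact)
```

The two numbers agree closely here, but the grid approach scales exponentially with the dimension of $\theta$, which is exactly why the integral is called intractable.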

#### 3. Main idea of variational inference

In variational inference, we avoid computing the intractable integral by approximating the posterior $p(\theta \mid X)$ with a simpler distribution $q(\theta)$ drawn from a tractable family, turning inference into an optimization problem over the parameters of $q$.

### Understanding Variational Bayesian method

In this section, we develop the theory behind variational Bayesian methods.

#### 1. Kullback-Leibler Divergence

As mentioned above, variational inference needs a distribution $q(\theta)$ that is close to the true posterior $p(\theta \mid X)$. The standard measure of closeness between two distributions is the Kullback-Leibler (KL) divergence.

KL divergence is defined as

$$
\mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid X)\big) = \int q(\theta) \log \frac{q(\theta)}{p(\theta \mid X)}\, d\theta = \mathbb{E}_{q}\left[\log \frac{q(\theta)}{p(\theta \mid X)}\right],
$$

where the expectation is taken under $q$. This form gives a useful intuition:

- if $q(\theta)$ is low, the contribution to the divergence is generally low, regardless of $p$;
- if $q(\theta)$ is high and $p(\theta \mid X)$ is high, the contribution is low;
- if $q(\theta)$ is high and $p(\theta \mid X)$ is low, the contribution is high, hence the approximation is not ideal.
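For two univariate Gaussians the KL divergence has a well-known closed form, which makes it a convenient sanity check. The sketch below compares that closed form against direct numerical integration of the definition; the particular means and standard deviations are arbitrary illustrative values.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian with standard deviation sd."""
    return math.exp(-(x - mean) ** 2 / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def kl_gauss(m_q, s_q, m_p, s_p):
    """Closed-form KL(q || p) for two univariate Gaussians."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2 * s_p ** 2) - 0.5

# Numerical check: KL(q||p) = ∫ q(t) log(q(t)/p(t)) dt via a midpoint Riemann sum.
m_q, s_q, m_p, s_p = 0.0, 1.0, 1.0, 2.0
lo, hi, n = -12.0, 12.0, 100_000
h = (hi - lo) / n
num = 0.0
for i in range(n):
    t = lo + (i + 0.5) * h
    q, p = normal_pdf(t, m_q, s_q), normal_pdf(t, m_p, s_p)
    num += q * math.log(q / p) * h

print(kl_gauss(m_q, s_q, m_p, s_p), num)
```

Both computations agree; note that the integrand is weighted by $q$, which is exactly why regions where $q$ is small contribute almost nothing to the divergence.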

Take note of the following about use of KL divergence in Variational Bayes:

KL divergence is not symmetric; it is easy to see from the formula that

$$
\mathrm{KL}(q \,\|\, p) \neq \mathrm{KL}(p \,\|\, q)
$$

in general, as the approximating distribution $q$ is usually different from the target distribution $p$. In general, we focus on approximating some regions of $p(\theta \mid X)$ as well as possible (Figure 1(a)); it is not necessary for $q$ to nicely approximate every part of $p$. As a result, $\mathrm{KL}(p \,\|\, q)$ (usually called the forward KL divergence) is not ideal: for regions we do not care about, if $p(\theta) > 0$ while $q(\theta) \to 0$, the term $p(\theta)\log\frac{p(\theta)}{q(\theta)}$ becomes very large, forcing $q$ to take a different form even if it fits well with other regions of $p$ (refer to Figure 1(b)). On the other hand, $\mathrm{KL}(q \,\|\, p)$ (usually called the reverse KL divergence) has the nice property that only regions where $q(\theta) > 0$ require $q$ and $p$ to be similar. Consequently, the reverse KL divergence is more commonly used in variational inference.

#### 2. Evidence lower bound

Usually we do not directly minimize the KL divergence to obtain a good approximating distribution. This is because computing $\mathrm{KL}(q(\theta)\,\|\,p(\theta \mid X))$ requires the posterior itself, which in turn requires the intractable evidence $p(X)$, the very quantity we are trying to avoid.

The approximation using reverse KL divergence usually gives good empirical results, even though some regions of $p(\theta \mid X)$ may be left poorly covered: the zero-forcing behaviour means $q$ tends to lock onto a single mode of the posterior.

Instead, we maximize a quantity called the evidence lower bound (ELBO), which can be derived as follows:

- By the definition of marginal probability, we have $p(X) = \int p(X, \theta)\, d\theta$. Taking the log on both sides and introducing $q(\theta)$, we have:

$$
\log p(X) = \log \int q(\theta)\, \frac{p(X, \theta)}{q(\theta)}\, d\theta \;\geq\; \int q(\theta) \log \frac{p(X, \theta)}{q(\theta)}\, d\theta = \mathbb{E}_{q}\big[\log p(X, \theta)\big] - \mathbb{E}_{q}\big[\log q(\theta)\big].
$$

- The last step follows from Jensen's inequality, which states that for a convex function $f$, we have $f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)]$; since $\log$ is concave, the inequality is reversed.

This term

$$
\mathrm{ELBO}(q) = \mathbb{E}_{q}\big[\log p(X, \theta)\big] - \mathbb{E}_{q}\big[\log q(\theta)\big]
$$

is the evidence lower bound. One can verify directly that $\log p(X) = \mathrm{ELBO}(q) + \mathrm{KL}(q(\theta)\,\|\,p(\theta \mid X))$; since $\log p(X)$ is a constant with respect to $q$, maximizing the ELBO is equivalent to minimizing the reverse KL divergence, without ever computing $p(X)$.
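A well-known identity states that $\log p(X) = \mathrm{ELBO}(q) + \mathrm{KL}(q\,\|\,p(\theta \mid X))$ for any choice of $q$. The sketch below verifies this numerically on a toy model with a single binary latent variable, where everything can be enumerated exactly; the prior, likelihood, and $q$ values are illustrative.

```python
import math

# Tiny model with one binary latent z: joint p(x, z) = p(z) p(x|z).
prior = [0.3, 0.7]
lik   = [0.8, 0.1]          # p(x | z) for the single observed x
joint = [pz * px for pz, px in zip(prior, lik)]
evidence = sum(joint)                       # p(x), tractable here by enumeration
posterior = [j / evidence for j in joint]   # p(z | x)

q = [0.5, 0.5]  # an arbitrary variational distribution over z

# ELBO(q) = E_q[log p(x, z)] - E_q[log q(z)]
elbo = sum(qz * (math.log(jz) - math.log(qz)) for qz, jz in zip(q, joint))
# KL(q || p(z|x))
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

print(math.log(evidence), elbo + kl)  # the two quantities coincide
```

Note that `elbo` and `kl` are computed from the joint and from $q$ alone, yet their sum recovers `log(evidence)` exactly, which is the whole trick: pushing the ELBO up must push the KL down.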

### General procedure

In general, variational inference starts with a family of variational distributions (such as the mean-field family described below) as the candidates for $q(\theta)$, then searches within that family for the member that maximizes the ELBO; this maximizer is taken as the approximation to the posterior.

### Mean Field Variational Family

#### 1. The "Mean Field" Assumptions

As shown above, the particular variational distribution family we use to approximate the posterior determines both the quality of the approximation and the difficulty of the optimization. The mean-field family assumes the latent variables are mutually independent, so that the variational distribution factorizes as

$$
q(\theta) = \prod_{i=1}^{m} q_i(\theta_i),
$$

where each latent variable $\theta_i$ is governed by its own factor $q_i$ with its own variational parameters.

#### 2. Derivation of optimal $q_j$

Now, in order to derive the optimal form of the distribution for a single factor $q_j$, we rewrite the ELBO, isolating the terms that depend on $q_j$:

$$
\mathrm{ELBO}(q) = \int q_j(\theta_j)\, \mathbb{E}_{-j}\big[\log p(X, \theta)\big]\, d\theta_j - \int q_j(\theta_j) \log q_j(\theta_j)\, d\theta_j + \text{const},
$$

where $\mathbb{E}_{-j}$ denotes the expectation over all factors except $q_j$. With this new expression, we can maximize the ELBO with respect to $q_j$ while holding all other factors fixed.

We take the derivative with respect to $q_j$, subject to the normalization constraint, and set it to zero, which yields

$$
\log q_j^*(\theta_j) = \mathbb{E}_{-j}\big[\log p(X, \theta)\big] + \text{const}, \qquad \text{i.e.} \qquad q_j^*(\theta_j) \propto \exp\!\big\{\mathbb{E}_{-j}\big[\log p(X, \theta)\big]\big\},
$$

where the constant absorbs the normalizer. The functional derivative in this step actually requires some knowledge of the calculus of variations, specifically the Euler-Lagrange equation.

#### 3. Variable update with Coordinate Ascent

From the equation for the optimal factor $q_j^*$, the coordinate ascent procedure is:

1. Compute values (if any) that can be directly obtained from the data and constants.
2. Initialize each variational parameter to an arbitrary value.
3. Update each variable with the step function $q_j^*(\theta_j) \propto \exp\{\mathbb{E}_{-j}[\log p(X, \theta)]\}$, holding the other factors fixed.
4. Repeat step 3 until the convergence of the ELBO.

A more detailed example of coordinate ascent will be shown in the next section with the univariate Gaussian example. Note that, in general, we cannot guarantee the concavity of the ELBO as a function of the variational parameters; hence the convergence is usually to a local maximum.
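The control flow of the procedure can be sketched as a generic coordinate-ascent loop. The helper name `coordinate_ascent` and the toy quadratic objective below are illustrative inventions, not from the text; in actual CAVI each update would be the closed-form optimal-factor step and the objective would be the ELBO.

```python
def coordinate_ascent(params, updates, objective, tol=1e-10, max_iter=1000):
    """Update one variable at a time with its closed-form argmax,
    stopping when the objective (the ELBO in CAVI) stops improving."""
    best = objective(params)
    for _ in range(max_iter):
        for name, update in updates.items():
            params[name] = update(params)  # step 3: per-variable update
        new = objective(params)
        if abs(new - best) < tol:          # step 4: ELBO convergence check
            break
        best = new
    return params, best

# Toy objective f(x, y) = -(x - y)^2 - (y - 3)^2, maximized at x = y = 3.
f = lambda p: -(p["x"] - p["y"]) ** 2 - (p["y"] - 3) ** 2
updates = {
    "x": lambda p: p["y"],            # argmax over x with y held fixed
    "y": lambda p: (p["x"] + 3) / 2,  # argmax over y with x held fixed
}
params, val = coordinate_ascent({"x": 0.0, "y": 0.0}, updates, f)
print(params, val)
```

Each inner step solves a one-dimensional maximization exactly, so the objective never decreases; this is the same monotonicity argument that guarantees the ELBO converges (to a local maximum) under CAVI.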

### Example with Univariate Gaussian

We demonstrate mean-field variational inference with a simple case: observations from a univariate Gaussian model. We first assume there are $N$ observations $x_1, \dots, x_N$ drawn i.i.d. from a Gaussian $\mathcal{N}(\mu, \tau^{-1})$ with unknown mean $\mu$ and precision $\tau$.

Here the latent variables are $\mu$ and $\tau$, on which we place conjugate priors

$$
\mu \mid \tau \sim \mathcal{N}\big(\mu_0, (\lambda_0 \tau)^{-1}\big), \qquad \tau \sim \mathrm{Gamma}(a_0, b_0),
$$

where $\mu_0, \lambda_0, a_0, b_0$ are fixed hyperparameters, so the joint distribution factorizes as $p(X, \mu, \tau) = p(X \mid \mu, \tau)\, p(\mu \mid \tau)\, p(\tau)$.

Note that sometimes some latent variables have higher priority than others; the choice of which variable to treat this way depends on the exact question at hand.

#### 1. Compute independent $q(\mu)$ and $q(\tau)$

Next, we apply the mean-field approximation $q(\mu, \tau) = q(\mu)\, q(\tau)$ and derive each optimal factor via $q_j^* \propto \exp\{\mathbb{E}_{-j}[\log p(X, \theta)]\}$.

- Compute the expression for $q^*(\mu)$:

$$
\log q^*(\mu) = \mathbb{E}_{\tau}\big[\log p(X \mid \mu, \tau) + \log p(\mu \mid \tau)\big] + \text{const} = -\frac{\mathbb{E}[\tau]}{2}\left\{\lambda_0 (\mu - \mu_0)^2 + \sum_{n=1}^{N} (x_n - \mu)^2\right\} + \text{const},
$$

so $q^*(\mu)$ is a Gaussian $\mathcal{N}(\mu_N, \lambda_N^{-1})$ with

$$
\mu_N = \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N}, \qquad \lambda_N = (\lambda_0 + N)\, \mathbb{E}[\tau].
$$

Note that here the expectation $\mathbb{E}[\tau]$ is taken under $q(\tau)$, so $q^*(\mu)$ depends on the other factor.

- Compute the expression for $q^*(\tau)$:

$$
\log q^*(\tau) = \mathbb{E}_{\mu}\big[\log p(X \mid \mu, \tau) + \log p(\mu \mid \tau)\big] + \log p(\tau) + \text{const},
$$

which works out to a Gamma distribution $\mathrm{Gamma}(a_N, b_N)$ with

$$
a_N = a_0 + \frac{N + 1}{2}, \qquad b_N = b_0 + \frac{1}{2}\, \mathbb{E}_{\mu}\left[\lambda_0 (\mu - \mu_0)^2 + \sum_{n=1}^{N} (x_n - \mu)^2\right].
$$

A closer look at the results shows that the two factors are coupled: $q^*(\mu)$ needs $\mathbb{E}[\tau]$ from $q(\tau)$, while $q^*(\tau)$ needs the first two moments of $\mu$ under $q(\mu)$. Neither can be finalized on its own, which is exactly why we update them iteratively.

#### 2. Variable update until ELBO convergence

Now that we have the forms of $q(\mu)$ and $q(\tau)$, we can apply coordinate ascent: initialize $\mathbb{E}[\tau]$, update $(\mu_N, \lambda_N)$, then update $(a_N, b_N)$ using the moments $\mathbb{E}[\mu] = \mu_N$ and $\mathbb{E}[\mu^2] = \mu_N^2 + 1/\lambda_N$, and repeat until the ELBO converges.
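The full update cycle can be sketched end to end. This assumes the standard conjugate setup $\mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})$, $\tau \sim \mathrm{Gamma}(a_0, b_0)$; the hyperparameter values, the synthetic data, and the fixed iteration count are illustrative choices rather than anything prescribed by the text.

```python
import math
import random

# Synthetic data from a Gaussian with mean 2.0 and sd 0.5 (precision 4.0).
random.seed(0)
xs = [random.gauss(2.0, 0.5) for _ in range(500)]

N = len(xs)
xbar = sum(xs) / N
sum_x = sum(xs)
sum_sq = sum(x * x for x in xs)

# Illustrative (weak) hyperparameters for the conjugate priors.
mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3

e_tau = 1.0              # initial guess for E_q[tau]
a_N = a0 + (N + 1) / 2   # fixed: does not depend on the other factor
for _ in range(100):
    # Update q(mu) = N(mu_N, 1/lam_N)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * e_tau
    # Update q(tau) = Gamma(a_N, b_N) using E[mu] = mu_N, E[mu^2] = mu_N^2 + 1/lam_N
    e_mu2 = mu_N ** 2 + 1.0 / lam_N
    b_N = b0 + 0.5 * (sum_sq - 2 * mu_N * sum_x + N * e_mu2
                      + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
    e_tau = a_N / b_N

print(mu_N, e_tau)  # posterior mean near the sample mean; E[tau] near 1/sample variance
```

With $N = 500$ observations the weak priors are swamped by the data, so the variational posterior concentrates close to the sample statistics; in practice one would also track the ELBO each cycle and stop once it stabilizes rather than running a fixed number of iterations.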