# Stats in ML: Dirichlet Distribution

## Dirichlet

Before university, I had never heard of the Dirichlet distribution. It seemed to me like a loner in the family of statistics when I first saw it. But soon I discovered that this is a huge misunderstanding. Dirichlet is way too important, too useful in the field of machine learning and statistical learning theory to be ignored by any interested scholar. I know deep down in my heart that I must dedicate one whole blog to it to stress its significance. As we walk along the path, we shall see why it is so great (in the past, present and future). Let's begin with a general definition of it.

### Formal definition

The Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\alpha})$ is a family of continuous multivariate probability distributions parameterised by a vector $\boldsymbol{\alpha}$ of positive reals. It is the *multivariate generalisation* of the Beta distribution. Dirichlet distributions are commonly used as prior distributions (e.g. for *categorical* and *multinomial* distributions) in Bayesian statistics.

### Conjugate prior and its usage

In Bayesian probability theory, the **posterior** $p(\theta \mid X)$, the **prior** $p(\theta)$ and the **likelihood** $p(X \mid \theta)$ are related via Bayes' theorem:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$

If the posterior distribution $p(\theta \mid X)$ is in the same probability distribution family as the prior $p(\theta)$, the two are called *conjugate distributions*, and the prior is the *conjugate prior* for the likelihood function $p(X \mid \theta)$.

Note that in many algorithms, we want to find the value of $\theta$ that maximises the posterior; with a conjugate prior, the posterior is available in closed form, which makes this computation tractable.

### Expression

Just as the multinomial distribution generalises the binomial distribution, the Dirichlet is the multivariate version of the Beta distribution. The Dirichlet distribution is a family of continuous multivariate probability distributions over the parameter vector $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)$ of a discrete probability distribution, where $\theta_i \ge 0$ and $\sum_{i=1}^{K} \theta_i = 1$.

The expression is then

$$\mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}, \qquad B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)},$$

where the multivariate Beta function $B(\boldsymbol{\alpha})$ acts as the **normalizer**.
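As a quick sanity check of the density formula, we can evaluate it by hand and compare against `scipy.stats.dirichlet` (the parameter and evaluation-point values below are illustrative, not from the post):

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import dirichlet

# Illustrative values: concentration parameters and a point on the simplex
alpha = np.array([2.0, 3.0, 4.0])
theta = np.array([0.2, 0.3, 0.5])  # non-negative, sums to 1

# Normalizer B(alpha) = prod_i Gamma(alpha_i) / Gamma(sum_i alpha_i)
B = np.prod(gamma(alpha)) / gamma(alpha.sum())

# Density computed directly from the formula above
pdf_manual = np.prod(theta ** (alpha - 1)) / B

# Density from scipy for comparison
pdf_scipy = dirichlet.pdf(theta, alpha)

assert np.isclose(pdf_manual, pdf_scipy)
```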

Think of $\boldsymbol{\theta}$ as the parameter vector of a multinomial distribution; the Dirichlet with parameter $\boldsymbol{\alpha}$ then expresses a belief over which multinomial $\boldsymbol{\theta}$ we are dealing with.

- Moments have explicit expressions. Writing $\alpha_0 = \sum_{i=1}^{K} \alpha_i$,

$$\mathbb{E}[\theta_i] = \frac{\alpha_i}{\alpha_0}, \qquad \mathrm{Var}[\theta_i] = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}.$$
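The closed-form moments can be checked against Monte Carlo estimates; here is a minimal sketch with an arbitrary choice of $\boldsymbol{\alpha}$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])  # illustrative concentration parameters
a0 = alpha.sum()

# Closed-form moments of Dir(alpha)
mean = alpha / a0
var = alpha * (a0 - alpha) / (a0**2 * (a0 + 1))

# Monte Carlo check with a large sample
samples = rng.dirichlet(alpha, size=200_000)
assert np.allclose(samples.mean(axis=0), mean, atol=1e-2)
assert np.allclose(samples.var(axis=0), var, atol=1e-2)
```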

To see that the Dirichlet distribution is the conjugate prior for the multinomial distribution, consider a prior $\mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$ and multinomial data $X$ with category counts $\mathbf{n} = (n_1, \dots, n_K)$:

$$p(\boldsymbol{\theta} \mid X) \propto p(X \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}) \propto \prod_{i=1}^{K} \theta_i^{n_i} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} = \prod_{i=1}^{K} \theta_i^{\alpha_i + n_i - 1}.$$

Therefore, the posterior is again a Dirichlet, namely $\mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha} + \mathbf{n})$.
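In code, this conjugacy means a Bayesian update is just vector addition of pseudo-counts and observed counts; the prior and the counts below are illustrative:

```python
import numpy as np

# Illustrative prior pseudo-counts and observed category counts
alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dir(1, 1, 1) prior
counts = np.array([5, 2, 9])        # n_i observed from multinomial data

# Conjugacy: the posterior is simply Dir(alpha + n); no integration needed
alpha_post = alpha + counts
posterior_mean = alpha_post / alpha_post.sum()
```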

Note the key points here are:

The distribution has 2 parameters: the scale (or concentration) $\alpha_0 = \sum_{i=1}^{K} \alpha_i$, and the base measure $\boldsymbol{\alpha} / \alpha_0$. A Dirichlet with small concentration favors extreme (sparse) distributions, but this prior belief is very weak and is easily overwritten by data. It can be seen as a generalization of the Beta:

- Beta is a distribution over binomials (on the interval $[0, 1]$);
- Dirichlet is a distribution over multinomials (on the so-called simplex $\{\boldsymbol{\theta} : \theta_i \ge 0,\ \sum_{i=1}^{K} \theta_i = 1\}$).
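The effect of the concentration parameter can be seen by sampling; the sketch below (with arbitrary values of $K$ and $\alpha$) shows that small symmetric concentrations produce near-corner (sparse) draws, while large ones produce near-uniform draws:

```python
import numpy as np

rng = np.random.default_rng(42)
K = 5  # illustrative number of categories

# Small symmetric concentration: samples land near the corners of the simplex
sparse = rng.dirichlet(np.full(K, 0.05), size=10_000)

# Large concentration: samples cluster near the uniform point (1/K, ..., 1/K)
dense = rng.dirichlet(np.full(K, 50.0), size=10_000)

# The largest coordinate is typically near 1 for sparse draws,
# and only slightly above 1/K for dense draws
sparse_peak = sparse.max(axis=1).mean()
dense_peak = dense.max(axis=1).mean()
```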

If we want to marginalize the parameters out (often used in ML models for parameter optimization), we can use the following formula. For an observed sequence $X$ with category counts $\mathbf{n}$,

$$p(X) = \int p(X \mid \boldsymbol{\theta})\, \mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, d\boldsymbol{\theta} = \frac{B(\boldsymbol{\alpha} + \mathbf{n})}{B(\boldsymbol{\alpha})}.$$
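Numerically, this marginal likelihood is best computed in log space with `gammaln`. As a sketch (with an illustrative prior and observation sequence), we can also verify it against the chain rule of sequential predictions, which must give the same answer:

```python
import numpy as np
from scipy.special import gammaln

def log_B(a):
    """Log of the multivariate Beta function B(a)."""
    return gammaln(a).sum() - gammaln(a.sum())

# Illustrative prior and an ordered sequence of category draws
alpha = np.array([1.0, 2.0, 3.0])
seq = [0, 0, 2, 0, 2, 1]
counts = np.bincount(seq, minlength=3)

# Closed form: log p(X) = log B(alpha + n) - log B(alpha)
log_evidence = log_B(alpha + counts) - log_B(alpha)

# Sanity check via the chain rule of sequential predictive probabilities
p_seq = 1.0
running = np.zeros(3)
for x in seq:
    p_seq *= (alpha[x] + running[x]) / (alpha.sum() + running.sum())
    running[x] += 1

assert np.isclose(np.exp(log_evidence), p_seq)
```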

If we want to make predictions via the conditional pdf of new data given previous data, we can use the following formula instead:

$$p(x_{\text{new}} = i \mid X) = \frac{\alpha_i + n_i}{\alpha_0 + n}, \qquad n = \sum_{i=1}^{K} n_i.$$
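The predictive rule is just the posterior mean of $\boldsymbol{\theta}$, i.e. smoothed relative frequencies; a minimal sketch with an illustrative prior and counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])  # illustrative symmetric prior
counts = np.array([3, 0, 7])       # observed counts, n = 10 draws

# Posterior predictive probability of each category for the next draw:
# (alpha_i + n_i) / (alpha_0 + n) -- add-alpha smoothing of frequencies
pred = (alpha + counts) / (alpha.sum() + counts.sum())
```

Note that a category never observed (here the second one) still gets nonzero predictive mass from the prior pseudo-counts.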

### Side note

The above section gives a comprehensive view of the Dirichlet distribution. However, a more widely applied technique is the **Dirichlet Process**. It is similar in spirit to the *Gaussian Process*, but places a Dirichlet-based conjugate prior on problems with multinomial likelihoods (e.g. Latent Dirichlet Allocation). We've discussed this idea in the topic modeling blog; interested readers can refer to that blog for details.
