Stats in ML: Dirichlet Distribution
Before university, I had never heard of the Dirichlet distribution. When I first saw it, it seemed like a loner in the family of statistics. But I soon discovered that this was a huge misunderstanding: the Dirichlet is far too important and too useful in machine learning and statistical learning theory to be ignored by any interested scholar. I knew deep down that I had to dedicate a whole blog post to it to stress its significance. As we walk along the path, we shall see why it is so great (in the past, present and future). Let's begin with a general definition.
Formal definition
The Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\alpha})$ is a family of continuous multivariate probability distributions parameterized by a vector $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)$ of positive reals. Its support is the $(K-1)$-dimensional probability simplex: the set of vectors $(x_1, \dots, x_K)$ with $x_i \ge 0$ and $\sum_{i=1}^K x_i = 1$.
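To make the support concrete, here is a minimal sketch using NumPy's built-in Dirichlet sampler (the parameter values are illustrative): every sample is a point on the probability simplex.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 5 samples from a Dirichlet with K = 3 components.
alpha = np.array([2.0, 3.0, 5.0])
samples = rng.dirichlet(alpha, size=5)

# Each sample lives on the probability simplex:
# all entries are non-negative and each row sums to 1.
print(samples.shape)                          # (5, 3)
print(np.allclose(samples.sum(axis=1), 1.0))  # True
```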
Conjugate prior and its usage
In Bayesian probability theory, we have $\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$.
If the posterior distribution is in the same family as the prior distribution, the prior is called a conjugate prior for that likelihood.
Note that in many algorithms, we want to find the value of the parameters that maximizes the posterior (the MAP estimate). With a conjugate prior the posterior has a closed form, so this computation becomes much easier.
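As a minimal illustration of why conjugacy helps, consider the Beta distribution (the $K = 2$ special case of the Dirichlet) as a prior on a coin's heads probability. The posterior is again a Beta, so the MAP estimate falls out in closed form with no numerical optimization. The prior parameters and flip data below are illustrative.

```python
import numpy as np

# Beta(a, b) prior on theta, the probability of heads; Beta is the
# two-component special case of the Dirichlet.
a, b = 2.0, 2.0

# Observed coin flips (1 = heads): illustrative data.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])
heads = int(flips.sum())
tails = len(flips) - heads

# Conjugacy: the posterior is Beta(a + heads, b + tails) -- no integration needed.
post_a, post_b = a + heads, b + tails

# MAP estimate in closed form (mode of the Beta posterior).
theta_map = (post_a - 1) / (post_a + post_b - 2)
print(theta_map)  # 0.7
```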
Expression
Just as the multinomial distribution generalizes the binomial, the Dirichlet is the multivariate version of the beta distribution. It is a family of continuous probability distributions over the parameters $(p_1, \dots, p_K)$ of a discrete probability distribution with $K$ outcomes.
The expression is then
$$f(p_1, \dots, p_K; \alpha_1, \dots, \alpha_K) = \frac{\Gamma\!\left(\sum_{i=1}^K \alpha_i\right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K p_i^{\alpha_i - 1}.$$
Think of $\alpha_i$ as a pseudo-count: roughly, the number of times outcome $i$ has been observed before any data arrive. The larger $\alpha_i$ is relative to the other components, the more mass the distribution places on large values of $p_i$.
- Moments have explicit expressions: writing $\alpha_0 = \sum_{i=1}^K \alpha_i$, we have $\mathbb{E}[p_i] = \frac{\alpha_i}{\alpha_0}$, $\mathrm{Var}[p_i] = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}$, and $\mathrm{Cov}[p_i, p_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)}$ for $i \neq j$.
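These closed-form moments are easy to check empirically. The sketch below (with an illustrative $\boldsymbol{\alpha}$) compares them against Monte Carlo estimates from NumPy samples:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])
alpha0 = alpha.sum()

samples = rng.dirichlet(alpha, size=200_000)

# Closed-form mean and variance of each component.
mean = alpha / alpha0
var = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1))

# Monte Carlo estimates agree to a couple of decimal places.
print(np.allclose(samples.mean(axis=0), mean, atol=1e-2))  # True
print(np.allclose(samples.var(axis=0), var, atol=1e-2))    # True
```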
To see that the Dirichlet distribution is the conjugate prior for the multinomial distribution, consider prior $p \sim \mathrm{Dir}(\boldsymbol{\alpha})$ and a multinomial likelihood with observed counts $(c_1, \dots, c_K)$:
$$P(p \mid x) \propto P(x \mid p)\, P(p) \propto \prod_{i=1}^K p_i^{c_i} \prod_{i=1}^K p_i^{\alpha_i - 1} = \prod_{i=1}^K p_i^{\alpha_i + c_i - 1}.$$
Therefore, the posterior is again a Dirichlet: $p \mid x \sim \mathrm{Dir}(\alpha_1 + c_1, \dots, \alpha_K + c_K)$.
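In code, this conjugate update is nothing more than adding the observed counts to the prior parameters; a minimal sketch with illustrative counts:

```python
import numpy as np

# Prior Dir(alpha) over the parameters of a 3-category discrete distribution.
alpha = np.array([1.0, 1.0, 1.0])   # uniform prior over the simplex

# Observed counts for each category (illustrative data).
counts = np.array([10, 3, 7])

# Conjugacy: the posterior is Dir(alpha + counts).
posterior_alpha = alpha + counts
print(posterior_alpha)               # [11.  4.  8.]

# Posterior mean: a smoothed version of the empirical frequencies.
print(posterior_alpha / posterior_alpha.sum())
```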
Note the key points here:
The distribution has 2 parameters: the scale (or concentration) $s = \sum_{i=1}^K \alpha_i$, and the base measure $m = (\alpha_1/s, \dots, \alpha_K/s)$, so that $\alpha_i = s \cdot m_i$. A Dirichlet with small concentration favors extreme (sparse) distributions, but this prior belief is very weak and is easily overwritten by data. It can be seen as a generalization of the Beta:
- Beta is a distribution over binomials (in an interval $x \in [0, 1]$);
- Dirichlet is a distribution over multinomials (in the so-called simplex $\sum_{i=1}^K x_i = 1$, $x_i \ge 0$).
If we want to marginalize the parameters out (often used in ML models for parameter optimization) we can use the following formula:
$$P(x \mid \boldsymbol{\alpha}) = \int P(x \mid p)\, P(p \mid \boldsymbol{\alpha})\, dp = \frac{\Gamma\!\left(\sum_i \alpha_i\right)}{\Gamma\!\left(n + \sum_i \alpha_i\right)} \prod_{i=1}^K \frac{\Gamma(c_i + \alpha_i)}{\Gamma(\alpha_i)},$$
where $x$ is a sequence of $n$ draws and $c_i$ is the number of times outcome $i$ occurs in it.
If we want to make a prediction via the conditional pdf of new data given previous data, we can use the following formula instead:
$$P(x_{n+1} = i \mid x_1, \dots, x_n) = \frac{c_i + \alpha_i}{n + \sum_j \alpha_j}.$$
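Both quantities are straightforward to evaluate numerically. The sketch below computes the marginal likelihood in log space via SciPy's `gammaln` (to avoid overflow in the Gamma functions) and the posterior predictive probabilities, using the same illustrative counts as before:

```python
import numpy as np
from scipy.special import gammaln

alpha = np.array([1.0, 1.0, 1.0])   # Dir(alpha) prior
counts = np.array([10, 3, 7])       # observed counts (illustrative)
n = counts.sum()

# Log marginal likelihood of the observed sequence of n draws:
# log p(x | alpha) = log Gamma(sum a) - log Gamma(n + sum a)
#                    + sum_i [log Gamma(c_i + a_i) - log Gamma(a_i)].
log_marginal = (gammaln(alpha.sum()) - gammaln(n + alpha.sum())
                + np.sum(gammaln(counts + alpha) - gammaln(alpha)))
print(log_marginal)

# Posterior predictive probability of each category for the next draw:
# p(next = i | x) = (c_i + a_i) / (n + sum_j a_j).
predictive = (counts + alpha) / (n + alpha.sum())
print(predictive)                   # sums to 1
```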
Side note
The section above gives a comprehensive view of the Dirichlet distribution. However, a more widely applied technique is the Dirichlet process. It is similar to a Gaussian process, but instead uses the Dirichlet as a conjugate prior on problems with a multinomial likelihood (e.g. Latent Dirichlet Allocation). We've discussed this idea in the topic modeling blog; interested readers can go to that blog for details.