Stats in ML: Dirichlet Distribution


Dirichlet

Before university, I had never heard of the Dirichlet distribution. When I first saw it, it seemed like a loner in the family of statistics. But I soon discovered that this was a huge misunderstanding: the Dirichlet distribution is far too important and too useful in machine learning and statistical learning theory to be ignored by any interested scholar. I knew deep down that I had to dedicate a whole blog post to it to stress its significance. As we walk along the path, we shall see why it is so great (in the past, present, and future). Let's begin with a general definition.

Formal definition

The Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\alpha})$ is a family of continuous multivariate probability distributions parameterized by a vector $\boldsymbol{\alpha}$ of positive reals. It is a multivariate generalization of the Beta distribution. Dirichlet distributions are commonly used as prior distributions (e.g., for categorical and multinomial distributions) in Bayesian statistics.

Conjugate prior and its usage

In Bayesian probability theory, the posterior $p(\theta \mid X)$, the prior $p(\theta)$, and the likelihood $p(X \mid \theta)$ are related via Bayes' theorem:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} \propto p(X \mid \theta)\, p(\theta).$$

If the posterior distribution and the prior distribution come from the same family of probability distributions, then the prior and posterior are called conjugate distributions, and the prior is the conjugate prior for the likelihood function $p(X \mid \theta)$.

Note that in many algorithms, we want to find the value of $\theta$ that maximizes the posterior (maximum a posteriori, or MAP). If the prior is some unusual distribution, we may not get an analytical form for the posterior. Consequently, more complicated optimization strategies like the interior point method may need to be applied, which can be computationally expensive. If the prior and posterior have the same algebraic form, applying Bayes' rule to find $\hat{\theta}_{\mathrm{MAP}}$ is much easier.
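As a one-line illustration (my own example, not part of the original text): with a $\mathrm{Beta}(\alpha, \beta)$ prior on a coin's bias $\theta$ and $k$ heads observed in $n$ tosses, conjugacy makes the posterior another Beta distribution, so the MAP estimate has a closed form (valid when $\alpha + k > 1$ and $\beta + n - k > 1$):

$$\theta \mid k \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k), \qquad \hat{\theta}_{\mathrm{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}.$$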

Expression

Just as the multinomial distribution generalizes the binomial distribution, the Dirichlet distribution is the multivariate version of the Beta distribution. It is a family of continuous probability distributions over a discrete probability distribution $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)$ for $K$ categories, where $\theta_i \ge 0$ for $i = 1, \dots, K$ and $\sum_{i=1}^{K} \theta_i = 1$, parameterized by $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)$ with $\alpha_i > 0$. Formally, we denote $\boldsymbol{\theta} \sim \mathrm{Dir}(\boldsymbol{\alpha})$.

The density is then

$$p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1},$$

where $B(\boldsymbol{\alpha}) = \dfrac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}$ is the normalizing constant (the multivariate Beta function).

Think of $p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$ as the probability density associated with the probability vector $\boldsymbol{\theta}$ used in a multinomial distribution, given that our Dirichlet distribution has parameter $\boldsymbol{\alpha}$.
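To make this concrete, here is a minimal sketch (the example values are my own) that evaluates the density above at one point on the simplex, both from the closed form and via `scipy.stats.dirichlet`, to check they agree:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

# Evaluate the Dirichlet density p(theta | alpha) at one point on the
# simplex, both manually and with scipy (example parameters assumed).
alpha = np.array([2.0, 3.0, 4.0])    # example parameters
theta = np.array([0.2, 0.3, 0.5])    # a point on the probability simplex

# log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)
log_B = gammaln(alpha).sum() - gammaln(alpha.sum())
log_pdf_manual = np.sum((alpha - 1.0) * np.log(theta)) - log_B

print(np.exp(log_pdf_manual))        # manual evaluation of the closed form
print(dirichlet.pdf(theta, alpha))   # scipy gives the same value
```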

  • Explicit expressions for the moments: letting $\alpha_0 = \sum_{i=1}^{K} \alpha_i$,

$$\mathbb{E}[\theta_i] = \frac{\alpha_i}{\alpha_0}, \qquad \mathrm{Var}[\theta_i] = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)}, \qquad \mathrm{Cov}[\theta_i, \theta_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)} \quad (i \neq j).$$
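A quick numerical sanity check (a sketch with assumed example parameters) compares Monte Carlo estimates against these closed forms:

```python
import numpy as np

# Compare Monte Carlo estimates of the Dirichlet mean and variance
# against the closed-form moment expressions above.
rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 4.0])
alpha0 = alpha.sum()

samples = rng.dirichlet(alpha, size=200_000)   # shape (200000, K)

print(samples.mean(axis=0))                    # ~ alpha / alpha0
print(alpha / alpha0)
print(samples.var(axis=0))                     # ~ closed-form variance
print(alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1)))
```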

To see that the Dirichlet distribution is the conjugate prior for the multinomial distribution, consider the prior $\boldsymbol{\theta} \sim \mathrm{Dir}(\boldsymbol{\alpha})$ and the likelihood $p(\mathbf{n} \mid \boldsymbol{\theta}) \propto \prod_{i=1}^{K} \theta_i^{n_i}$, where $\mathbf{n} = (n_1, \dots, n_K)$ is the observation vector recording $n_i$ successes out of $N = \sum_{i} n_i$ trials for each category $i$.

Therefore,

$$p(\boldsymbol{\theta} \mid \mathbf{n}, \boldsymbol{\alpha}) \propto p(\mathbf{n} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \propto \prod_{i=1}^{K} \theta_i^{n_i + \alpha_i - 1},$$

i.e. $\boldsymbol{\theta} \mid \mathbf{n} \sim \mathrm{Dir}(\boldsymbol{\alpha} + \mathbf{n})$. We can interpret this as: given a prior Dirichlet distribution (with parameter $\boldsymbol{\alpha}$) over the probability vector $\boldsymbol{\theta}$ for a total of $K$ categories, and an observation vector $\mathbf{n}$, the posterior belief about $\boldsymbol{\theta}$ is a new Dirichlet distribution with parameter $\boldsymbol{\alpha} + \mathbf{n}$.
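In code, the conjugate update is just vector addition; here is a minimal sketch with hypothetical counts:

```python
import numpy as np

# Conjugate update: the posterior over theta is Dir(alpha + n).
alpha_prior = np.array([1.0, 1.0, 1.0])   # symmetric prior over K = 3 categories
counts = np.array([10, 2, 5])             # hypothetical observed n_i per category

alpha_post = alpha_prior + counts          # Dir(alpha + n)

print(alpha_prior / alpha_prior.sum())     # prior mean: [1/3, 1/3, 1/3]
print(alpha_post / alpha_post.sum())       # posterior mean, pulled toward the data
```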

The key points to note here are:

  • The distribution has two parameters: the scale (or concentration) $\alpha_0 = \sum_{i=1}^{K} \alpha_i$, and the base measure $(\alpha_1/\alpha_0, \dots, \alpha_K/\alpha_0)$. A Dirichlet with a small concentration favors extreme (sparse) distributions, but this prior belief is very weak and is easily overwritten by data (illustrated in the sketch after this list).

  • It can be seen as a generalization of the Beta distribution:

    • Beta is a distribution over binomial parameters (a probability in the interval $[0, 1]$);
    • Dirichlet is a distribution over multinomial parameters (probability vectors in the so-called simplex $\Delta_{K-1} = \{\boldsymbol{\theta} \in \mathbb{R}^K : \theta_i \ge 0, \; \textstyle\sum_i \theta_i = 1\}$).
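Here is the sketch promised above (the example values are my own): drawing from Dirichlet distributions that share a uniform base measure but differ in concentration $\alpha_0$.

```python
import numpy as np

# With the same uniform base measure, a small alpha0 yields sparse/extreme
# draws, while a large alpha0 yields draws close to the base measure.
rng = np.random.default_rng(1)
base = np.ones(5) / 5                  # uniform base measure over K = 5

for alpha0 in (0.5, 5.0, 500.0):
    draw = rng.dirichlet(alpha0 * base)
    print(f"alpha0 = {alpha0:6.1f} -> {np.round(draw, 3)}")
# small alpha0: mass piles onto a few categories; large alpha0: nearly uniform
```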

If we want to marginalize the parameter $\boldsymbol{\theta}$ out (often needed in ML models for parameter optimization), we can use the following formula:

$$p(\mathbf{n} \mid \boldsymbol{\alpha}) = \int p(\mathbf{n} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, d\boldsymbol{\theta} = \frac{N!}{\prod_{i=1}^{K} n_i!} \cdot \frac{\Gamma(\alpha_0)}{\Gamma(N + \alpha_0)} \prod_{i=1}^{K} \frac{\Gamma(n_i + \alpha_i)}{\Gamma(\alpha_i)}.$$
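A minimal sketch of this computation (the helper function and example counts are my own, not from the post), evaluated stably in log space via `scipy.special.gammaln`:

```python
import numpy as np
from scipy.special import gammaln

# Dirichlet-multinomial log marginal likelihood log p(n | alpha).
def log_marginal_likelihood(counts, alpha):
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    N = counts.sum()
    return (
        gammaln(N + 1) - gammaln(counts + 1).sum()         # multinomial coefficient
        + gammaln(alpha.sum()) - gammaln(N + alpha.sum())  # Gamma(alpha0)/Gamma(N+alpha0)
        + (gammaln(counts + alpha) - gammaln(alpha)).sum() # product of Gamma ratios
    )

print(log_marginal_likelihood([10, 2, 5], [1.0, 1.0, 1.0]))
```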

If we want to make predictions via the conditional pdf of new data given previous data, we can use the following formula instead:

$$p(\tilde{x} = i \mid \mathbf{n}, \boldsymbol{\alpha}) = \int \theta_i \, p(\boldsymbol{\theta} \mid \mathbf{n}, \boldsymbol{\alpha})\, d\boldsymbol{\theta} = \frac{n_i + \alpha_i}{N + \alpha_0}.$$
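And a corresponding one-liner (hypothetical counts again): the probability that the next observation falls in category $i$ is just the posterior mean.

```python
import numpy as np

# Posterior predictive: p(next obs = i | n, alpha) = (n_i + alpha_i)/(N + alpha0).
alpha = np.array([1.0, 1.0, 1.0])
counts = np.array([10, 2, 5])

predictive = (counts + alpha) / (counts.sum() + alpha.sum())
print(predictive)            # [0.55, 0.15, 0.30]
```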

Side note

The above section gives a comprehensive view of the Dirichlet distribution. However, a more widely applied technique is the Dirichlet process. It plays a role analogous to the Gaussian process, but uses the Dirichlet distribution as a conjugate prior on problems with a multinomial likelihood (e.g., Latent Dirichlet Allocation). We've discussed this idea in the topic modeling blog; interested readers can refer to that blog for details.


Author: Zhenlin Wang
Posted on: 2021-07-21
Updated on: 2022-03-28
