
Dirichlet process

In probability theory, Dirichlet processes (after Peter Gustav Lejeune Dirichlet) are a family of stochastic processes whose realizations are probability distributions. In other words, a Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables, that is, how likely it is that the random variables are distributed according to one or another particular distribution.

The Dirichlet process is specified by a base distribution $H$ and a positive real number $\alpha$ called the concentration parameter (also known as the scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit $\alpha \to 0$, the realizations are all concentrated at a single value, while in the limit $\alpha \to \infty$ the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as $\alpha$ increases.

The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.

The Dirichlet process was formally introduced by Thomas Ferguson in 1973 and has since been applied in data mining and machine learning, among others for natural language processing, computer vision and bioinformatics.

Dirichlet processes are usually used when modelling data that tends to repeat previous values in a so-called "rich get richer" fashion. Specifically, suppose that the generation of values $X_1, X_2, \dots$ can be simulated by the following algorithm (a runnable sketch follows the list).

a) With probability $\frac{\alpha}{\alpha + n - 1}$ draw $X_n$ from $H$.

b) With probability $\frac{n_x}{\alpha + n - 1}$ set $X_n = x$, where $n_x$ is the number of previous observations of $x$. (Formally, $n_x := |\{ j : X_j = x \text{ and } j < n \}|$, where $|\cdot|$ denotes the number of elements in the set.)
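To make the "rich get richer" dynamics concrete, here is a minimal Python sketch of the algorithm above. The function name polya_urn_draws and the choice of a standard normal base distribution are illustrative assumptions for the example, not part of the definition.

```python
import random

def polya_urn_draws(n, alpha, base_draw, rng=random):
    """Simulate X_1, ..., X_n by the sequential scheme above.

    alpha     -- concentration parameter (alpha > 0)
    base_draw -- a callable returning one sample from the base distribution H
    """
    draws = []
    for i in range(1, n + 1):
        # With probability alpha / (alpha + i - 1), draw a fresh value from H.
        # Otherwise pick uniformly among the i - 1 previous values, which gives
        # each previously seen value x total probability n_x / (alpha + i - 1).
        if rng.random() < alpha / (alpha + i - 1):
            draws.append(base_draw())
        else:
            draws.append(rng.choice(draws))
    return draws

# Example usage: standard normal base distribution H, alpha = 2.
samples = polya_urn_draws(1000, alpha=2.0, base_draw=lambda: random.gauss(0.0, 1.0))
```

Note that the first value is always drawn from $H$ (the fresh-draw probability is $\alpha/\alpha = 1$ when $n = 1$), so the list of previous draws is never empty when it is reused.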
At the same time, another common model for data is that the observations $X_1, X_2, \dots$ are assumed to be independent and identically distributed (i.i.d.) according to some distribution $P$. The goal in introducing Dirichlet processes is to be able to describe the procedure outlined above in this i.i.d. model. The observations $X_1, X_2, \dots$ are not independent, since we have to consider the previous results when generating the next value. They are, however, exchangeable. This fact can be shown by calculating the joint probability distribution of the observations and noticing that the resulting formula depends only on which $x$ values occur among the observations and how many repetitions they each have. Because of this exchangeability, de Finetti's representation theorem applies, and it implies that the observations $X_1, X_2, \dots$ are conditionally independent given a (latent) distribution $P$. This $P$ is a random variable itself and has a distribution. This distribution (over distributions) is called a Dirichlet process ($\operatorname{DP}$). In summary, this means that we get an equivalent procedure to the above algorithm:

a) Draw a distribution $P$ from $\operatorname{DP}(H, \alpha)$.

b) Draw observations $X_1, X_2, \dots$ independently from $P$.

In practice, however, drawing a concrete distribution $P$ is impossible, since its specification requires an infinite amount of information. This is a common phenomenon in the context of Bayesian non-parametric statistics, where a typical task is to learn distributions on function spaces, which involve effectively infinitely many parameters. The key insight is that in many applications the infinite-dimensional distributions appear only as an intermediary computational device and are not required for either the initial specification of prior beliefs or for the statement of the final inference.
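Although an exact draw of $P$ is impossible, it can be approximated with finite information. One standard construction, the stick-breaking representation (not derived in the text above), builds $P$ as a countable mixture of point masses at atoms drawn from $H$, with weights obtained by repeatedly breaking off $\mathrm{Beta}(1, \alpha)$-distributed fractions of a unit-length stick. A minimal sketch, assuming a standard normal base distribution and truncating after a fixed number of atoms:

```python
import random

def truncated_dp_draw(alpha, base_draw, k_max=100, rng=random):
    """Approximate one draw P ~ DP(H, alpha) by truncating the
    stick-breaking construction after k_max atoms.

    Returns (atoms, weights): P is approximately
    sum_k weights[k] * delta(atoms[k]).
    """
    atoms, weights = [], []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(k_max):
        beta_k = rng.betavariate(1.0, alpha)  # Beta(1, alpha) stick fraction
        weights.append(remaining * beta_k)
        atoms.append(base_draw())
        remaining *= 1.0 - beta_k
    weights.append(remaining)  # lump the leftover mass on one extra atom
    atoms.append(base_draw())
    return atoms, weights

atoms, weights = truncated_dp_draw(alpha=2.0, base_draw=lambda: random.gauss(0.0, 1.0))
# Sampling from this approximate P yields (approximately) i.i.d. observations:
xs = random.choices(atoms, weights=weights, k=10)
```

The leftover stick length shrinks geometrically in expectation, so truncating after enough atoms discards only a negligible amount of probability mass.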

[ "Mixture model", "Cluster analysis", "Nonparametric statistics", "Bayesian probability", "Inference", "Pitman–Yor process", "dirichlet process mixture model", "Hierarchical Dirichlet process", "dirichlet process mixture", "Dependent Dirichlet process" ]