
Mahalanobis distance

The Mahalanobis distance is a measure of the distance between a point P and a distribution D, introduced by P. C. Mahalanobis in 1936. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero if P is at the mean of D, and grows as P moves away from the mean along each principal component axis. If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.

The Mahalanobis distance of an observation $\vec{x} = (x_1, x_2, x_3, \dots, x_N)^T$ from a set of observations with mean $\vec{\mu} = (\mu_1, \mu_2, \mu_3, \dots, \mu_N)^T$ and covariance matrix $S$ is defined as

$$d_M(\vec{x}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1} (\vec{x} - \vec{\mu})}.$$

Consider the problem of estimating the probability that a test point in N-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the centroid, or center of mass, of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.

For a normal distribution in any number of dimensions, the probability density of an observation is uniquely determined by the Mahalanobis distance $d$. Specifically, $d^2$ is chi-squared distributed, with as many degrees of freedom as there are dimensions. If the number of dimensions is 2, for example, the probability of a particular calculated $d$ being less than some threshold $t$ is $1 - e^{-t^2/2}$. To determine a threshold that achieves a particular probability $p$ in 2 dimensions, use $t = \sqrt{-2\ln(1-p)}$. For numbers of dimensions other than 2, the cumulative chi-squared distribution should be consulted.

In general, given a normal (Gaussian) random variable $X$ with variance $S = 1$ and mean $\mu = 0$, any other normal random variable $R$ (with mean $\mu_1$ and variance $S_1$) can be defined in terms of $X$ by the equation $R = \mu_1 + \sqrt{S_1}\,X$. Conversely, to recover a normalized random variable from any normal random variable, one can solve for $X = (R - \mu_1)/\sqrt{S_1}$. Squaring both sides and taking the square root gives an equation for a metric that looks a lot like the Mahalanobis distance:

$$d(R) = \sqrt{\frac{(R - \mu_1)^2}{S_1}} = \frac{|R - \mu_1|}{\sqrt{S_1}}.$$

The Mahalanobis distance is closely related to the leverage statistic $h$, but has a different scale: $d^2 = (N - 1)\left(h - \tfrac{1}{N}\right)$, where $N$ is the number of observations.

Mahalanobis's definition was prompted by the problem of identifying the similarities of skulls based on measurements in 1927.

Many programs and statistics packages, such as R and Python, include implementations of Mahalanobis distance.
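As a concrete illustration of the definition above, here is a minimal Python sketch that computes the Mahalanobis distance of a test point from a sample, both directly from the formula and via SciPy's scipy.spatial.distance.mahalanobis. The sample data and the test point are made up purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative sample: 200 correlated 2-D observations (made-up data).
rng = np.random.default_rng(0)
cov_true = np.array([[2.0, 0.8],
                     [0.8, 1.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=200)

# Estimate the mean and covariance matrix S from the sample.
mu = data.mean(axis=0)
S = np.cov(data, rowvar=False)   # rows are observations
S_inv = np.linalg.inv(S)         # S^{-1}

# Mahalanobis distance straight from the definition:
# d_M(x) = sqrt((x - mu)^T S^{-1} (x - mu))
x = np.array([2.0, 1.5])
diff = x - mu
d_manual = np.sqrt(diff @ S_inv @ diff)

# The same quantity via SciPy's built-in implementation, which takes
# the inverse covariance matrix as its third argument.
d_scipy = mahalanobis(x, mu, S_inv)

print(d_manual, d_scipy)   # the two values agree
```

Note that only the inverse covariance matrix enters the computation, which is why implementations typically ask for $S^{-1}$ rather than $S$ itself.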
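The chi-squared relationship can likewise be checked numerically. The following sketch uses scipy.stats.chi2 to compute the distance threshold $t$ for a target probability $p$ in any number of dimensions and, in 2 dimensions, compares it against the closed form $t = \sqrt{-2\ln(1-p)}$ from the text; the probability value 0.95 is an arbitrary example.

```python
import numpy as np
from scipy.stats import chi2

# For an N-dimensional normal distribution, d^2 is chi-squared
# distributed with N degrees of freedom, so a distance threshold
# covering probability p is the square root of the chi-squared quantile.
def mahalanobis_threshold(p, n_dims):
    """Distance t such that P(d < t) = p for an n_dims-dimensional normal."""
    return np.sqrt(chi2.ppf(p, df=n_dims))

p = 0.95

# In 2 dimensions the closed form from the text applies:
t_closed_form = np.sqrt(-2.0 * np.log(1.0 - p))
t_chi2 = mahalanobis_threshold(p, n_dims=2)
print(t_closed_form, t_chi2)   # both are approximately 2.4477

# For other dimensionalities, consult the chi-squared quantile directly:
print(mahalanobis_threshold(p, n_dims=5))
```

This is the basis of the common multivariate-outlier test: flag observations whose Mahalanobis distance exceeds the threshold for a chosen coverage probability.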

[ "Statistics", "Machine learning", "Artificial intelligence", "Pattern recognition", "mahalanobis metric", "multivariate outlier detection", "multivariate outliers" ]