Overdispersion

In statistics, overdispersion is the presence of greater variability (statistical dispersion) in a data set than would be expected based on a given statistical model. A common task in applied statistics is choosing a parametric model to fit a given set of empirical observations, which in turn requires assessing how well the chosen model fits. It is usually possible to choose the model parameters so that the theoretical population mean of the model is approximately equal to the sample mean. However, especially for simple models with few parameters, theoretical predictions may not match empirical observations for higher moments. When the observed variance is higher than the variance of the theoretical model, overdispersion has occurred. Conversely, underdispersion means that there is less variation in the data than the model predicts.

Overdispersion is a very common feature in applied data analysis because, in practice, populations are frequently heterogeneous (non-uniform), contrary to the assumptions implicit in widely used simple parametric models.

Overdispersion is often encountered when fitting very simple parametric models, such as those based on the Poisson distribution. The Poisson distribution has one free parameter, and its variance equals its mean, so the variance cannot be adjusted independently of the mean. The choice of a distribution from the Poisson family is often dictated by the nature of the empirical data. For example, Poisson regression analysis is commonly used to model count data. If overdispersion is present, an alternative model with additional free parameters may provide a better fit.
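As a minimal sketch of how such a check might look for count data, the following Python snippet compares the sample variance with the sample mean. Under a Poisson model the two should be roughly equal, so a ratio well above 1 suggests overdispersion. The function name `dispersion_index` and the example counts are illustrative assumptions and are not taken from any real data set.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance-to-mean ratio of a sequence of counts.

    For Poisson data this ratio is expected to be close to 1; values well
    above 1 point to overdispersion, values well below 1 to underdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m  # variance() uses the n - 1 denominator


# Hypothetical counts, invented for illustration only.
example_counts = [0, 0, 1, 0, 7, 2, 0, 12, 1, 0, 3, 9]

if __name__ == "__main__":
    print(f"mean = {mean(example_counts):.2f}")
    print(f"dispersion index = {dispersion_index(example_counts):.2f}")
```

A ratio near 1 does not by itself establish that a Poisson model is adequate; it is only a quick first diagnostic before considering a richer model.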
In the case of count data, a Poisson mixture model such as the negative binomial distribution can be proposed instead. Here the mean of the Poisson distribution is itself treated as a random variable, drawn in this case from a gamma distribution, which introduces an additional free parameter (the resulting negative binomial distribution is completely characterized by two parameters).

As a more concrete example, it has been observed that the number of boys born to families does not conform faithfully to a binomial distribution, as might be expected. Instead, the sex ratios of families seem to skew toward either boys or girls (see, for example, the Trivers–Willard hypothesis for one possible explanation). That is, there are more all-boy families, more all-girl families, and fewer families close to the population 51:49 boy-to-girl mean ratio than a binomial distribution predicts, so the empirical variance is larger than the binomial model specifies. In this case, the beta-binomial distribution is a popular and analytically tractable alternative to the binomial distribution, since it provides a better fit to the observed data. To capture the heterogeneity of the families, one can treat the probability parameter of the binomial model (say, the probability of a child being a boy) as itself a random variable (i.e. a random effects model), drawn for each family from a beta distribution as the mixing distribution. The resulting compound distribution (beta-binomial) has an additional free parameter.

Another common model for overdispersion, used when some of the observations are not Bernoulli, arises from introducing a normal random variable into a logistic model. Software is widely available for fitting this type of multilevel model. The model has one additional free parameter, namely the variance of the normal variable; if that variance is zero, the model reduces to the standard (undispersed) logistic regression.
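The variance inflation produced by the two mixture models described above can be written down directly. The sketch below uses the standard moment formulas for the gamma-Poisson (negative binomial) and beta-binomial distributions and checks the latter against a small simulation. The function names, the parameter values and the family size of 4 are illustrative assumptions, not estimates from real birth data.

```python
import random
from statistics import variance


def negative_binomial_variance(mu, shape):
    """Variance of a Poisson count whose mean is gamma-distributed.

    The gamma mixing distribution has mean `mu` and shape `shape`; the
    mixture has variance mu + mu**2 / shape, which always exceeds the
    Poisson variance mu.
    """
    return mu + mu ** 2 / shape


def beta_binomial_variance(n, a, b):
    """Variance of a binomial(n, p) count whose p is drawn from Beta(a, b)."""
    p = a / (a + b)                # mean success probability
    rho = 1.0 / (a + b + 1.0)      # correlation between trials in one family
    return n * p * (1.0 - p) * (1.0 + (n - 1) * rho)


def simulate_boys(n_children, a, b, n_families, rng):
    """Simulate the number of boys per family with a family-specific p."""
    counts = []
    for _ in range(n_families):
        p = rng.betavariate(a, b)  # each family gets its own probability
        counts.append(sum(rng.random() < p for _ in range(n_children)))
    return counts


if __name__ == "__main__":
    n, a, b = 4, 5.1, 4.9          # hypothetical values; mean p = 0.51
    rng = random.Random(0)
    boys = simulate_boys(n, a, b, 20_000, rng)
    print("binomial variance      :", n * 0.51 * 0.49)
    print("beta-binomial variance :", beta_binomial_variance(n, a, b))
    print("simulated variance     :", variance(boys))
```

In both formulas the extra parameter (`shape`, or the sum `a + b`) controls how far the variance exceeds that of the simpler model; as it grows large, the mixture approaches the plain Poisson or binomial case.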
