Theil–Sen estimator

In non-parametric statistics, the Theil–Sen estimator is a method for robustly fitting a line to sample points in the plane (simple linear regression) by choosing the median of the slopes of all lines through pairs of points. It has also been called Sen's slope estimator, slope selection, the single median method, the Kendall robust line-fit method, and the Kendall–Theil robust line. It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively, and after Maurice Kendall because of its relation to the Kendall tau rank correlation coefficient. This estimator can be computed efficiently, and is insensitive to outliers. It can be significantly more accurate than non-robust simple linear regression (least squares) for skewed and heteroskedastic data, and competes well against least squares even for normally distributed data in terms of statistical power. It has been called 'the most popular nonparametric technique for estimating a linear trend'. As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points (xi,yi) is the median m of the slopes (yj − yi)/(xj − xi) determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two data points have the same x coordinate. In Sen's definition, one takes the median of the slopes defined only from pairs of points having distinct x coordinates. Once the slope m has been determined, one may determine a line from the sample points by setting the y-intercept b to be the median of the values yi − mxi. The fit line is then the line y = mx + b with coefficients m and b in slope–intercept form. As Sen observed, this choice of slope makes the Kendall tau rank correlation coefficient become approximately zero, when it is used to compare the values xi with their associated residuals yi − mxi − b. Intuitively, this suggests that how far the fit line passes above or below a data point is not correlated with whether that point is on the left or right side of the data set. The choice of b does not affect the Kendall coefficient, but causes the median residual to become approximately zero; that is, the fit line passes above and below equal numbers of points. A confidence interval for the slope estimate may be determined as the interval containing the middle 95% of the slopes of lines determined by pairs of points and may be estimated quickly by sampling pairs of points and determining the 95% interval of the sampled slopes. According to simulations, approximately 600 sample pairs are sufficient to determine an accurate confidence interval. A variation of the Theil–Sen estimator, the repeated median regression of Siegel (1982), determines for each sample point (xi,yi), the median mi of the slopes (yj − yi)/(xj − xi) of lines through that point, and then determines the overall estimator as the median of these medians. It can tolerate a greater number of outliers than the Theil–Sen estimator, but known algorithms for computing it efficiently are more complicated and less practical. A different variant pairs up sample points by the rank of their x-coordinates: the point with the smallest coordinate is paired with the first point above the median coordinate, the second-smallest point is paired with the next point above the median, and so on. It then computes the median of the slopes of the lines determined by these pairs of points, gaining speed by examining significantly fewer pairs than the Theil–Sen estimator. Variations of the Theil–Sen estimator based on weighted medians have also been studied, based on the principle that pairs of samples whose x-coordinates differ more greatly are more likely to have an accurate slope and therefore should receive a higher weight.

Parent Topic

Child Topic

No Parent Topic