    A Study on Identification of High Leverage Points in Multiple Linear Regression
Citations: 0 | References: 0 | Related Papers: 20
    Abstract:
Outliers with respect to the predictor variables are called high leverage points. Observations that differ only slightly from all the others can produce large differences in the results of a regression analysis. Detecting high leverage points is therefore essential in regression analysis, as they have a large impact on the estimates and can also lead to multicollinearity problems. In this situation, robust regression procedures can be very useful for dealing with the problems that arise from the existence of high leverage points. The aim of this study is to compare the performance of three methods for detecting high leverage points. In the first stage, two well-known data sets are considered: the artificial data set generated by Hawkins, Bradu and Kass in 1984, and the stack loss data of Brownlee (1965). The second stage is a simulation study in which both clean and contaminated data were generated. The three measures considered in this study are the leverage method (twice-the-mean rule), Generalized Potentials, and the Diagnostic Robust Generalized Potentials (DRGP) approach. The results indicate that DRGP is a powerful method for detecting high leverage points compared with the other two methods, on both the well-known data sets and the simulated data.
Keywords: Leverage (statistics), Data point, Robust regression
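As a concrete illustration of the first of the three measures compared above, the sketch below computes hat values and applies the twice-the-mean rule, flagging observations with h_ii > 2p/n. The data are simulated for illustration; only the cutoff itself comes from the rule.

```python
# A minimal sketch of the twice-the-mean leverage rule.
# The data are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
X[0] = [8.0, 8.0, 8.0]                   # plant one high leverage point

Xc = np.column_stack([np.ones(n), X])    # design matrix with intercept
H = Xc @ np.linalg.solve(Xc.T @ Xc, Xc.T)
h = np.diag(H)                           # leverage values h_ii

cutoff = 2 * Xc.shape[1] / n             # twice-the-mean rule: 2p/n
print("flagged:", np.where(h > cutoff)[0])
```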
Problem statement: High leverage points are extreme outliers in the X-direction. In regression analysis, the detection of these leverage points is important because of their arbitrarily large effects on the estimates as well as the multicollinearity problems they cause. The Mahalanobis Distance (MD) has been used as a diagnostic tool for identifying outliers in multivariate analysis, measuring how far each observation lies from the bulk of the data. Since the computation of MD relies on non-robust classical estimates, the classical MD can hardly detect outliers accurately. As an alternative, Robust MD (RMD) methods such as the Minimum Covariance Determinant (MCD) and Minimum Volume Ellipsoid (MVE) estimators have been used to identify high leverage points in a data set. However, these methods tend to swamp some low leverage points even though they identify the high leverage points correctly. Since the detection of leverage points is one of the most important issues in regression analysis, it is imperative to introduce a novel detection method for high leverage points. Approach: In this study, we propose a relatively new two-step method for detecting high leverage points: the RMD (MVE) or RMD (MCD) is used in the first step to identify suspected outlying points, and in the second step the classical MD is computed from the mean and covariance of the clean subset of the data. We call this method the two-step Robust Diagnostic Mahalanobis Distance (RDMDTS); it identifies high leverage points correctly while swamping fewer low leverage points. Results: The merit of the newly proposed method was investigated extensively on real data sets and in a Monte Carlo simulation study. The results indicate that, for small sample sizes, the best detection method is (RDMDTS) (MVE)-mad, while there was not much difference between (RDMDTS) (MVE)-mad and (RDMDTS) (MCD)-mad for large sample sizes. Conclusion/Recommendations: In order to swamp fewer low leverage points as high leverage points, the proposed robust diagnostic methods, (RDMDTS) (MVE)-mad and (RDMDTS) (MCD)-mad, are recommended.
Keywords: Mahalanobis distance, Leverage (statistics)
Citations (0)
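The two-step idea above is straightforward to sketch. The fragment below is a minimal illustration, not the authors' exact procedure: it uses scikit-learn's MCD (the paper also uses MVE, which scikit-learn does not provide) and a chi-square cutoff in place of the paper's MAD-based cutoffs.

```python
# A hedged sketch of the two-step RDMDTS idea: a robust Mahalanobis
# distance flags suspects, then classical distances are recomputed
# from the clean subset only.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[:5] += 6.0                               # contaminate five points

# Step 1: squared robust distances from the MCD fit.
mcd = MinCovDet(random_state=0).fit(X)
rd2 = mcd.mahalanobis(X)
cut = chi2.ppf(0.975, df=X.shape[1])       # common chi-square cutoff
clean = rd2 <= cut

# Step 2: classical distances from the clean subset's mean/covariance.
mu = X[clean].mean(axis=0)
S = np.cov(X[clean], rowvar=False)
d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(S), X - mu)
print("high leverage candidates:", np.where(d2 > cut)[0])
```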
In small to moderate sample sizes it is important, for reasons of efficiency, to make use of all the data when there are no outliers. It is equally important to guard against the possibility of single or multiple outliers, which can have disastrous effects on normal-theory least squares estimation and inference. The purpose of this paper is to describe and illustrate the use of an adaptive regression estimation algorithm that can be used to highlight outliers, whether single or multiple of varying number. The outliers can include 'bad' leverage points. It is illustrated how 'good' leverage points are retained and 'bad' leverage points discarded. The adaptive regression estimator generalizes its high-breakdown-point adaptive location estimator counterpart and is thus expected to have high efficiency at the normal model; simulations confirm this. On the other hand, examples demonstrate that the regression algorithm highlights outliers and 'potential' outliers for closer scrutiny. The algorithm is computer intensive because it is a global algorithm designed to highlight outliers automatically. This also obviates the problem of getting stuck in 'local minima' encountered by some algorithms designed as fast search methods. Instead, the objective here is to assess all observations and subsets of observations with the intention of culling all outliers, which can make up as much as approximately half the data. It is assumed that the distributional form of the data, less outliers, is approximately normal. If this distributional assumption fails, plots can be used to indicate such failure, and transformations may be required before potential outliers are deemed outliers. A well-known data set illustrates this point.
Keywords: Sample (material), Robust regression
Citations (14)
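The adaptive estimator described above is not available in standard libraries. As a loose stand-in for the theme of assessing subsets of observations and culling outliers, the sketch below uses RANSAC, a different subset-based robust fit, purely for illustration; it is not the authors' algorithm.

```python
# RANSAC as an illustration of subset-based robust fitting: fit on
# random subsets, keep the consensus fit, and label non-conforming
# points as outliers.
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=60)
y[:6] += 15.0                             # six vertical outliers

# Default base estimator is ordinary linear regression.
ransac = RANSACRegressor(random_state=0).fit(X, y)
print("outliers:", np.where(~ransac.inlier_mask_)[0])
print("slope:", ransac.estimator_.coef_[0])
```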
The presence of outliers and multicollinearity is inevitable in real data sets, and they have an undue effect on the parameter estimation of multiple linear regression models. It is now evident that outliers in the X-direction, or high leverage points, are another source of multicollinearity. These leverage points may induce or hide near-linear dependency among explanatory variables in a data set. We call these leverages high leverage collinearity-influential observations, either enhancing or reducing multicollinearity. By proposing the High Leverage Collinearity-Influential Measure, denoted HLCIM, we study several criteria, such as sample size and the magnitude, percentage, and position of high leverage points, which cause these leverages to change the multicollinearity pattern of collinear and non-collinear data sets. Ordinary Least Squares (OLS) estimates are heavily influenced by the presence of high leverage collinearity-influential observations. To rectify this problem, two new groups of robust regression methods are proposed. In the first proposed group, the Diagnostic Robust Generalized Potentials (DRGP) based on the Minimum Volume Ellipsoid (MVE) is incorporated with different types of robust methods, such as L1, LTS, M, and MM. The new proposed methods are called GM-DRGP-L1, GM-DRGP-LTS (or Modified GM-estimator 1, MGM1), M-DRGP, MM-DRGP, and DRGP-MM. The second group of proposed robust methods is formulated by modifying the existing Generalized M-estimator known as GM6. Two new GM-estimators, which we call the Modified GM-estimator 2 and the Modified GM-estimator 3 (denoted MGM2 and MGM3, respectively), are developed. Several indicators are employed to assess the performance of existing robust methods and the newly proposed methods. The results for a real data set and a Monte Carlo simulation study reveal that the proposed MGM3 outperforms OLS and some existing robust methods. Classical multicollinearity diagnostic methods may not correctly diagnose the existence of multicollinearity in the presence of high leverage collinearity-influential observations. To remedy this problem, two different approaches are proposed for establishing robust multicollinearity diagnostic methods. In the first approach, we propose robust variance inflation factors, namely RVIF(MM) and RVIF(MGM3); the latter is based on the proposed robust coefficient of determination of MGM3. In the second approach, diagnostic robust methods are proposed, specifically the Robust Condition Number (RCN), Robust Variance Inflation Factors (RVIF), and Robust Variance Decomposition Properties (RVDP), which are based on the Minimum Covariance Determinant (MCD). The findings of this study suggest that the developed robust multicollinearity diagnostic methods are able to identify the source of multicollinearity in non-collinear data sets in the presence of high leverage collinearity-enhancing observations. For collinear data sets, in the presence of high leverage collinearity-reducing observations, the developed robust multicollinearity diagnostic methods are able to diagnose the multicollinearity pattern of the data set correctly. This thesis also addresses the problem of identifying multiple high leverage collinearity-influential observations in a data set. Since the existing collinearity-influential measures fail to identify multiple collinearity-influential observations, a new High Leverage Collinearity-Influential Measure based on DRGP, denoted HLCIM(DRGP), is proposed. The results of the study signify that this new diagnostic measure surpasses the existing measures. Furthermore, some non-parametric cutoff points for the proposed and some existing collinearity-influential measures are suggested in this thesis. High leverage points may be considered good or bad leverage points depending on their residual values. Unfortunately, researchers do not consider good leverage points to be problematic. However, these points may be collinearity-influential observations and need more attention. Regression diagnostic plots are among the easiest and most efficient tools for visualizing influential observations in a data set. Unfortunately, no existing plot in the literature identifies high leverage collinearity-influential observations. Finally, in this regard, we propose three diagnostic plots, specifically the SR(LMS)-DRGP, the DRGP-HLCIM, and the SR(LMS)-HLCIM. These new diagnostic plots serve as powerful tools for separating outliers in the y-direction from those in the X-direction and are able to identify any high leverage point that is a collinearity-influential observation.
Keywords: Leverage (statistics), Collinearity, Robust regression, Ordinary least squares, Variance Inflation Factor, Robust Statistics
Citations (1)
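The robust VIF idea is easy to illustrate. The sketch below is an assumption-laden stand-in for the thesis' RVIF, not its exact estimator: it derives a correlation matrix from the MCD covariance and reads VIFs off the diagonal of its inverse, so that high leverage collinearity-reducing points open a gap between the classical and robust VIFs.

```python
# Classical vs. MCD-based VIFs on a nearly collinear design with a
# few collinearity-reducing high leverage points.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(3)
x1 = rng.normal(size=80)
x2 = x1 + rng.normal(scale=0.1, size=80)   # nearly collinear pair
x3 = rng.normal(size=80)
X = np.column_stack([x1, x2, x3])
X[:4] = [10.0, -10.0, 0.0]                 # collinearity-reducing leverages

def vif_from_cov(S):
    d = np.sqrt(np.diag(S))
    R = S / np.outer(d, d)                 # covariance -> correlation
    return np.diag(np.linalg.inv(R))       # VIF_j = [R^{-1}]_{jj}

print("classical VIF:", vif_from_cov(np.cov(X, rowvar=False)))
print("robust VIF   :", vif_from_cov(MinCovDet(random_state=0).fit(X).covariance_))
```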
Identification and assessment of outliers have a key role in Ordinary Least Squares (OLS) regression analysis. This paper presents a robust two-stage procedure for identifying outlying observations in regression analysis. The exploratory stage identifies leverage points and vertical outliers through a robust distance estimator based on the Minimum Covariance Determinant (MCD). After deletion of these points, the confirmatory stage carries out an OLS analysis on the remaining subset of data and investigates the effect of adding the previously deleted observations back in. Cutoff points for the different diagnostics are generated by bootstrapping, and the cases are definitively labeled as good leverage, bad leverage, vertical outliers, or typical cases. The procedure is applied to four examples taken from the literature and is effective in correctly pinpointing outlying observations, even in the presence of substantial masking. It is able to identify and correctly classify vertical outliers and good and bad leverage points through the use of jackknife-after-bootstrap robust cutoff points. Moreover, its two-stage structure makes it interactive, which enables the user to reach a deeper understanding of the main features of the data set than an automatic procedure would.
Keywords: Leverage (statistics), Robust regression, Robust Statistics, Bootstrapping (finance), Jackknife resampling, Ordinary least squares, Exploratory data analysis
Citations (4)
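A compressed sketch of this two-stage logic follows. It substitutes standard chi-square and MAD-based cutoffs for the paper's bootstrap and jackknife-after-bootstrap cutoffs, and a Huber fit for its robust residuals, so it should be read as an outline of the flow rather than the published procedure.

```python
# Exploratory stage: MCD robust distances flag leverage points and a
# robust fit flags vertical outliers. Confirmatory stage: OLS on the
# remaining cases.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 2))
y = X @ [1.0, -2.0] + rng.normal(scale=0.3, size=80)
X[0] = [7, 7]                              # a leverage point
y[1] += 10                                 # a vertical outlier

rd2 = MinCovDet(random_state=0).fit(X).mahalanobis(X)
lev = rd2 > chi2.ppf(0.975, df=X.shape[1])

r = y - HuberRegressor().fit(X, y).predict(X)
mad = np.median(np.abs(r - np.median(r)))
vert = np.abs(r) / (1.4826 * mad) > 2.5    # MAD-standardized residuals

keep = ~(lev | vert)                       # confirmatory OLS on the rest
ols = LinearRegression().fit(X[keep], y[keep])
print("leverage:", np.where(lev)[0], "vertical:", np.where(vert)[0])
print("clean OLS coef:", ols.coef_)
```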
The linear regression model remains an important workhorse for data scientists. However, many data sets contain many more predictors than observations, and outliers, or anomalies, frequently occur. This paper proposes an algorithm for regression analysis, which we call "sparse shooting S", that addresses these features typical of big data sets. The resulting regression coefficients are sparse, meaning that many of them are set to zero, thereby selecting the most relevant predictors. A distinct feature of the method is its robustness with respect to outliers in the cells of the data matrix. The excellent performance of this robust variable selection and prediction method is shown in a simulation study, and a real data application on car fuel consumption demonstrates its usefulness.
Keywords: Robustness, Data set, Robust regression
Citations (25)
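Sparse shooting S itself is not implemented in common libraries. As a rough illustration of the same goal, sparse coefficients plus robustness to outliers, the sketch below combines a Huber loss with an L1 penalty; note that it only guards against rowwise (response) outliers, not the cellwise outliers the paper targets.

```python
# Sparse-and-robust regression sketch: Huber loss + L1 penalty,
# after robust (median/IQR) scaling of the predictors.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(5)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)
y[:10] += 25.0                              # gross outliers in the response

Xs = RobustScaler().fit_transform(X)        # scaling that resists outliers
model = SGDRegressor(loss="huber", penalty="l1", alpha=0.01,
                     max_iter=5000, random_state=0).fit(Xs, y)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```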
Leverage values are used in regression diagnostics as measures of influential observations in the $X$-space. Detection of high leverage values is crucial because they are responsible for misleading conclusions about the fit of a regression model, for multicollinearity problems, and for masking and/or swamping of outliers. Much work has been done on the identification of single high leverage points, and it is generally believed that this problem has been largely resolved; there is, however, no general agreement among statisticians about the detection of multiple high leverage points. When a group of high leverage points is present in a data set, the commonly used diagnostic methods fail to identify them correctly, mainly because of masking and/or swamping effects. The robust alternative methods, on the other hand, can identify the high leverage points correctly but have a tendency to identify too many low leverage points as high leverage points, which is also undesirable. An attempt has been made to find a compromise between these two approaches. We propose an adaptive method in which the suspected high leverage points are identified by robust methods and then the low leverage points (if any) are put back into the estimation data set after diagnostic checking. The usefulness of our newly proposed method for the detection of multiple high leverage points is studied on some well-known data sets and by Monte Carlo simulations. Keywords: diagnostic-robust generalized potentials, group deletion, high leverage points, masking, robust Mahalanobis distance, minimum volume ellipsoid, Monte Carlo simulation
Keywords: Leverage (statistics), Mahalanobis distance, Robust regression, Data point
Citations (66)
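The adaptive compromise above can be sketched in a few lines. The fragment below follows the generalized-potentials formula commonly given in this literature (w_ii = x_i'(X_R'X_R)^{-1}x_i, with p_ii = w_ii for suspect cases and w_ii/(1-w_ii) otherwise), using MCD in place of the MVE and a median + 3 MAD cutoff; both are assumptions, and details may differ from the paper.

```python
# DRGP-style sketch: robust distances pick a suspect group D, then
# generalized potentials computed from the remaining group R decide
# which suspects stay flagged.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 2))
X[:3] += 7.0                                 # three high leverage points

rd2 = MinCovDet(random_state=0).fit(X).mahalanobis(X)
D = rd2 > chi2.ppf(0.975, df=X.shape[1])     # suspect group
Xc = np.column_stack([np.ones(len(X)), X])
A = np.linalg.inv(Xc[~D].T @ Xc[~D])         # (X_R' X_R)^{-1}

w = np.einsum('ij,jk,ik->i', Xc, A, Xc)      # x_i' (X_R'X_R)^{-1} x_i
p = np.where(D, w, w / (1 - w))              # generalized potentials
mad = np.median(np.abs(p - np.median(p)))
flagged = p > np.median(p) + 3 * 1.4826 * mad
print("final high leverage points:", np.where(flagged)[0])
```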
Detecting outliers in a multivariate point cloud is not trivial, especially when there are several outliers. The classical identification method does not always find them, because it is based on the sample mean and covariance matrix, which are themselves affected by the outliers. That is how the outliers get masked. To avoid the masking effect, we propose to compute distances based on very robust estimates of location and covariance. These robust distances are better suited to expose the outliers. In the case of regression data, the classical least squares approach masks outliers in a similar way. Also here, the outliers may be unmasked by using a highly robust regression method. Finally, a new display is proposed in which the robust regression residuals are plotted versus the robust distances. This plot classifies the data into regular observations, vertical outliers, good leverage points, and bad leverage points. Several examples are discussed.
Keywords: Mahalanobis distance, Leverage (statistics), Robust regression, Studentized residual, Scatter plot, RANSAC, Robust Statistics
Citations (1,335)
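The proposed display is simple to reproduce in outline. The sketch below plots MAD-standardized residuals from a Huber fit (the paper uses a high-breakdown regression such as LMS) against MCD-based robust distances, with the usual chi-square and ±2.5 cutoffs separating regular observations, vertical outliers, and good and bad leverage points.

```python
# Robust residuals vs. robust distances: the diagnostic display
# described above, in simplified form.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2))
y = X @ [2.0, 1.0] + rng.normal(scale=0.5, size=100)
X[0] = [6, 6]; y[0] = X[0] @ [2.0, 1.0]      # good leverage (follows the fit)
X[1] = [6, -6]; y[1] = -20                   # bad leverage point

rd = np.sqrt(MinCovDet(random_state=0).fit(X).mahalanobis(X))
res = y - HuberRegressor().fit(X, y).predict(X)
sres = res / (1.4826 * np.median(np.abs(res - np.median(res))))

plt.scatter(rd, sres)
plt.axvline(np.sqrt(chi2.ppf(0.975, 2)), ls="--")
plt.axhline(2.5, ls="--"); plt.axhline(-2.5, ls="--")
plt.xlabel("robust distance (MCD)")
plt.ylabel("standardized robust residual")
plt.show()
```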
Parallel to the development in regression diagnostics, this paper defines good and bad leverage observations in factor analysis. Outliers are observations that deviate from the factor model, not from the center of the data cloud. The effects of each kind of outlying observation on the normal-distribution-based maximum likelihood estimator and the associated likelihood ratio statistic are studied analytically. The distinction between outliers and leverage observations also clarifies the roles of three robust procedures based on different Mahalanobis distances. All the robust procedures are designed to minimize the effect of certain outlying observations, but only the procedure with a residual-based distance properly controls the effect of outliers. Empirical results illustrate the strengths and weaknesses of each procedure and support the analytical findings. The relevance of the results to general structural equation models is discussed and formulas are provided.
Keywords: Mahalanobis distance, Leverage (statistics), Statistic, Robust regression, Robust Statistics
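The residual-based distance singled out above can be sketched as follows: after fitting a factor model, each observation is scored by a Mahalanobis-type distance of its residual e_i = x_i - mu - L f_i under the estimated unique variances. This illustrates the idea only, using scikit-learn's (non-robust) FactorAnalysis rather than the paper's robust procedures.

```python
# Residual-based distance in a factor model: score each observation
# by its residual from the fitted loadings, weighted by the unique
# variances Psi.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(8)
n, p, k = 200, 6, 2
L = rng.normal(size=(p, k))
F = rng.normal(size=(n, k))
X = F @ L.T + rng.normal(scale=0.5, size=(n, p))
X[0] += 8.0                                  # one model-deviating outlier

fa = FactorAnalysis(n_components=k).fit(X)
scores = fa.transform(X)                     # estimated factor scores f_i
resid = X - fa.mean_ - scores @ fa.components_
d2 = np.sum(resid**2 / fa.noise_variance_, axis=1)  # residual distance
print("largest residual distances:", np.argsort(d2)[-3:])
```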