Remove the Outliers and High Leverage Points From Your Model and Run the Regression Again
Robust regression
Robust regression is an alternative to least squares regression when data are contaminated with outliers or influential observations, and it can also be used for the purpose of detecting influential observations.
library(MASS)
library(foreign)
Introduction and theoretical background
Let's begin our discussion on robust regression with some terms in linear regression.
Residual: The difference between the predicted value (based on the regression equation) and the actual, observed value, i.e., \(u=Y-\hat{Y}\).
Outlier: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.
Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High leverage points can have a great amount of effect on the estimate of regression coefficients.
Influence: An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.
Cook's distance (or Cook's D): A measure that combines the information of leverage and residual of the observation.
Robust regression can be used in any situation in which you would use least squares regression. When fitting a least squares regression \[ \sum_{i=1}^n(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})^2 \looparrowright \min_{\boldsymbol\beta}, \] we might find some outliers or high leverage data points. We have decided that these data points are not data entry errors, nor are they from a different population than most of our data. So we have no compelling reason to exclude them from the analysis. Robust regression might be a good strategy since it is a compromise between excluding these points entirely from the analysis and including all the data points and treating them all equally in OLS regression. The idea of robust regression is to weigh the observations differently based on how well behaved these observations are. Roughly speaking, it is a form of weighted and reweighted least squares regression.
The rlm command in the MASS package implements several versions of robust regression. In this page, we will show \(M\)-estimation with Huber and Tukey (aka bisquare) weighting. These two are very standard. \(M\)-estimation is optimizing \[ \sum_{i=1}^n\rho(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta}) \looparrowright \min_{\boldsymbol\beta} \] with respect to the unknown \({\boldsymbol\beta}\), where the loss function \(\rho:\mathbb{R}\to\mathbb{R}\) is absolutely continuous, usually convex, with derivative \(\psi(x)=d\rho(x)/dx\) (the so-called influence function). The \(M\)-estimator of \({\boldsymbol\beta}\) based on the function \(\rho\) is the solution of the following equations \[ \sum_{i=1}^n {\bf X}_{i,\bullet}^{\top}\psi(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})={\boldsymbol 0}. \]
If we now define a weight function \(w(x)=\psi(x)/x\), the previous non-linear system of equations becomes the so-called estimating equation \[ \sum_{i=1}^n {\bf X}_{i,\bullet}^{\top}(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})w(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})={\boldsymbol 0}. \] But the weights depend on the residuals and the residuals on the weights. The equation is solved using Iteratively Reweighted Least Squares (IRLS). If \(r_i=Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta}\), then the IRLS problem is \[ \sum_{i=1}^n w(r_i^{(k-1)})r_i^2 \looparrowright \min_{\boldsymbol\beta}, \] where the superscript \((k)\) indicates the iteration number. The weights \(w(r_i^{(k-1)})\) should be recomputed after each iteration in order to be used in the next iteration. For example, the coefficient vector at iteration \(k\) is \[ \widehat{\boldsymbol\beta}^{(k)}=({\bf X}^{\top}{\bf W}^{(k-1)}{\bf X})^{-1}{\bf X}^{\top}{\bf W}^{(k-1)}{\bf Y}. \] The process continues until it converges.
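To make the IRLS recursion concrete, here is a minimal hand-rolled sketch using the Huber weight function (defined formally in the next paragraph). The function name irls_huber, the MAD-based scale estimate, the tuning constant k = 1.345 and the convergence tolerance are illustrative choices for this sketch, not a description of rlm's internals:

# Minimal IRLS sketch with Huber weights (illustration only; rlm() is the
# production implementation and handles the scale estimation more carefully).
irls_huber <- function(X, y, k = 1.345, tol = 1e-8, maxit = 50) {
  beta <- solve(t(X) %*% X, t(X) %*% y)                   # start from the OLS solution
  for (it in seq_len(maxit)) {
    r <- as.vector(y - X %*% beta)                        # current residuals
    s <- mad(r)                                           # robust scale estimate
    w <- ifelse(abs(r / s) <= k, 1, k / abs(r / s))       # Huber weights applied to r/s
    beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * y)) # one weighted least squares step
    if (max(abs(beta_new - beta)) < tol) break            # stop when coefficients settle
    beta <- beta_new
  }
  drop(beta_new)
}

Applied to a design matrix with an intercept column, this loop should land close to (though not exactly on) the rlm fits shown later, precisely because the scale handling differs.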
In Huber weighting, observations with small residuals get a weight of one and the larger the residual, the smaller the weight. This is defined by the weight function \[ w(x)=\left\{ \begin{array}{cc} 1, & |x|\leq k,\\ k/|x|, & |x|>k. \end{array} \right. \] With Tukey's bisquare weighting, all cases with a non-zero residual get down-weighted at least a little \[ w(x)=\left\{ \begin{array}{cc} [1-(x/c)^2]^2, & |x|\leq c,\\ 0, & |x|>c. \end{array} \right. \]
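For reference, the two weight functions are easy to write down and plot in R; the tuning constants k = 1.345 and c = 4.685 used below are the usual defaults giving roughly 95% efficiency under normal errors (treat this as an illustration, not a dump of the MASS internals):

# Huber and Tukey bisquare weight functions w(x) = psi(x)/x
huber_w    <- function(x, k = 1.345) ifelse(abs(x) <= k, 1, k / abs(x))
bisquare_w <- function(x, c = 4.685) ifelse(abs(x) <= c, (1 - (x / c)^2)^2, 0)
curve(huber_w(x), from = -8, to = 8, ylab = "w(x)", ylim = c(0, 1))
curve(bisquare_w(x), from = -8, to = 8, add = TRUE, lty = 2)
legend("topright", legend = c("Huber", "bisquare"), lty = c(1, 2))

The plot makes the qualitative difference visible: Huber weights never reach zero, while bisquare weights cut extreme observations off completely.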
Crime data
For our data analysis below, we will use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), percent of population with a high school education or above (pcths), percent of population living under poverty line (poverty), and percent of population that are single parents (single). It has 51 observations. We are going to use poverty and single to predict crime.
cdata <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/crime.csv")
summary(cdata)
##       sid          state        crime            murder      
##  Min.   : 1.0   ak     : 1   Min.   :  82.0   Min.   : 1.600  
##  1st Qu.:13.5   al     : 1   1st Qu.: 326.5   1st Qu.: 3.900  
##  Median :26.0   ar     : 1   Median : 515.0   Median : 6.800  
##  Mean   :26.0   az     : 1   Mean   : 612.8   Mean   : 8.727  
##  3rd Qu.:38.5   ca     : 1   3rd Qu.: 773.0   3rd Qu.:10.350  
##  Max.   :51.0   co     : 1   Max.   :2922.0   Max.   :78.500  
##                 (Other):45                                    
##     pctmetro         pctwhite         pcths          poverty     
##  Min.   : 24.00   Min.   :31.80   Min.   :64.30   Min.   : 8.00  
##  1st Qu.: 49.55   1st Qu.:79.35   1st Qu.:73.50   1st Qu.:10.70  
##  Median : 69.80   Median :87.60   Median :76.70   Median :13.10  
##  Mean   : 67.39   Mean   :84.12   Mean   :76.22   Mean   :14.26  
##  3rd Qu.: 83.95   3rd Qu.:92.60   3rd Qu.:80.10   3rd Qu.:17.40  
##  Max.   :100.00   Max.   :98.50   Max.   :86.60   Max.   :26.40  
##                                                                  
##      single     
##  Min.   : 8.40  
##  1st Qu.:10.05  
##  Median :10.90  
##  Mean   :11.33  
##  3rd Qu.:12.05  
##  Max.   :22.10  
## 
Using robust regression analysis
In most cases, we begin by running an OLS regression and doing some diagnostics. We will begin by running an OLS regression and looking at diagnostic plots examining residuals, fitted values, Cook's distance, and leverage.
summary(ols <- lm(crime ~ poverty + single, data = cdata))
## 
## Call:
## lm(formula = crime ~ poverty + single, data = cdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.14 -114.27  -22.44  121.86  689.82 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1368.189    187.205  -7.308 2.48e-09 ***
## poverty         6.787      8.989   0.755    0.454    
## single        166.373     19.423   8.566 3.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.695 
## F-statistic: 57.96 on 2 and 48 DF,  p-value: 1.578e-13
opar <- par(mfrow = c(3, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
plot(ols, which = c(4, 6))
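The diagnostic plots flag observations 9, 25 and 51. Assuming those row indices, a quick way to pull up the corresponding states (which produces the listing below) is:

cdata[c(9, 25, 51), 1:2]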
##    sid state
## 9    9    fl
## 25  25    ms
## 51  51    dc
d1 <- cooks.distance(ols)
r <- stdres(ols)
a <- cbind(cdata, d1, r)
a[d1 > 4/51, ]
##    sid state crime murder pctmetro pctwhite pcths poverty single        d1
## 1    1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3 0.1254750
## 9    9    fl  1206    8.9     93.0     83.5  74.4    17.8   10.6 0.1425891
## 25  25    ms   434   13.5     30.7     63.3  64.3    24.7   14.7 0.6138721
## 51  51    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1 2.6362519
##            r
## 1  -1.397418
## 9   2.902663
## 25 -3.562990
## 51  2.616447
We probably should drop DC to begin with since it is not even a state. We include it in the analysis only to show that it has large Cook's D and to demonstrate how it will be handled by rlm. Now we will look at the residuals. We will generate a new variable called rabs, which is the absolute value of the residuals (because the sign of the residual doesn't matter). We then print the ten observations with the highest absolute residual values.
rabs <- abs(r)
a <- cbind(cdata, d1, r, rabs)
asorted <- a[order(-rabs), ]
asorted[1:10, ]
##    sid state crime murder pctmetro pctwhite pcths poverty single
## 25  25    ms   434   13.5     30.7     63.3  64.3    24.7   14.7
## 9    9    fl  1206    8.9     93.0     83.5  74.4    17.8   10.6
## 51  51    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1
## 46  46    vt   114    3.6     27.0     98.4  80.8    10.0   11.0
## 26  26    mt   178    3.0     24.0     92.6  81.0    14.9   10.8
## 21  21    me   126    1.6     35.7     98.5  78.8    10.7   10.6
## 1    1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3
## 31  31    nj   627    5.3    100.0     80.8  76.7    10.9    9.6
## 14  14    il   960   11.4     84.0     81.0  76.2    13.6   11.5
## 20  20    md   998   12.7     92.8     68.9  78.4     9.7   12.0
##            d1         r     rabs
## 25 0.61387212 -3.562990 3.562990
## 9  0.14258909  2.902663 2.902663
## 51 2.63625193  2.616447 2.616447
## 46 0.04271548 -1.742409 1.742409
## 26 0.01675501 -1.460885 1.460885
## 21 0.02233128 -1.426741 1.426741
## 1  0.12547500 -1.397418 1.397418
## 31 0.02229184  1.354149 1.354149
## 14 0.01265689  1.338192 1.338192
## 20 0.03569623  1.287087 1.287087
Now let's run our first robust regression. Robust regression is done by iterated re-weighted least squares (IRLS). The command for running robust regression is rlm in the MASS package. There are several weighting functions that can be used for IRLS. We are going to use the Huber weights first in this example. We will then look at the final weights created by the IRLS process. This can be very useful.
summary(rr.huber <- rlm(crime ~ poverty + single, data = cdata))
## 
## Call: rlm(formula = crime ~ poverty + single, data = cdata)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -846.09 -125.80  -16.49  119.15  679.94 
## 
## Coefficients:
##             Value      Std. Error t value
## (Intercept) -1423.0373   167.5899    -8.4912
## poverty         8.8677     8.0467     1.1020
## single        168.9858    17.3878     9.7186
## 
## Residual standard error: 181.8 on 48 degrees of freedom
hweights <- data.frame(state = cdata$state, resid = rr.huber$resid, weight = rr.huber$w)
hweights2 <- hweights[order(rr.huber$w), ]
hweights2[1:15, ]
##    state      resid    weight
## 25    ms -846.08536 0.2889618
## 9     fl  679.94327 0.3595480
## 46    vt -410.48310 0.5955740
## 51    dc  376.34468 0.6494131
## 26    mt -356.13760 0.6864625
## 21    me -337.09622 0.7252263
## 31    nj  331.11603 0.7383578
## 14    il  319.10036 0.7661169
## 1     ak -313.15532 0.7807432
## 20    md  307.19142 0.7958154
## 19    ma  291.20817 0.8395172
## 18    la -266.95752 0.9159411
## 2     al  105.40319 1.0000000
## 3     ar   30.53589 1.0000000
## 4     az  -43.25299 1.0000000
We can see that, roughly, as the absolute residual goes down, the weight goes up. In other words, cases with a large residual tend to be down-weighted. This output shows us that the observation for Mississippi will be down-weighted the most. Florida will also be substantially down-weighted. All observations not shown above have a weight of 1. In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to 1, the closer the results of the OLS and robust regressions.
Adjacent, permit'south run the aforementioned model, but using the bisquare weighting function. Again, we can look at the weights.
rr.bisquare <- rlm(crime ~ poverty + single, data = cdata, psi = psi.bisquare)
summary(rr.bisquare)
## 
## Call: rlm(formula = crime ~ poverty + single, data = cdata, psi = psi.bisquare)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -905.59 -140.97  -14.98  114.65  668.38 
## 
## Coefficients:
##             Value      Std. Error t value
## (Intercept) -1535.3338   164.5062    -9.3330
## poverty        11.6903     7.8987     1.4800
## single        175.9303    17.0678    10.3077
## 
## Residual standard error: 202.3 on 48 degrees of freedom
biweights <- data.frame(state = cdata$state, resid = rr.bisquare$resid, weight = rr.bisquare$w)
biweights2 <- biweights[order(rr.bisquare$w), ]
biweights2[1:15, ]
##    state     resid      weight
## 25    ms -905.5931 0.007652565
## 9     fl  668.3844 0.252870542
## 46    vt -402.8031 0.671495418
## 26    mt -360.8997 0.731136908
## 31    nj  345.9780 0.751347695
## 18    la -332.6527 0.768938330
## 21    me -328.6143 0.774103322
## 1     ak -325.8519 0.777662383
## 14    il  313.1466 0.793658594
## 20    md  308.7737 0.799065530
## 19    ma  297.6068 0.812596833
## 51    dc  260.6489 0.854441716
## 50    wy -234.1952 0.881660897
## 5     ca  201.4407 0.911713981
## 10    ga -186.5799 0.924033113
We can see that the weight given to Mississippi is dramatically lower using the bisquare weighting function than the Huber weighting function, and the parameter estimates from these two different weighting methods differ. When comparing the results of a regular OLS regression and a robust regression, if the results are very different, you will most likely want to use the results from the robust regression. Large differences suggest that the model parameters are being highly influenced by outliers. Different functions have advantages and drawbacks. Huber weights can have difficulties with severe outliers, and bisquare weights can have difficulties converging or may yield multiple solutions.
As you can see, the results from the two analyses are fairly different, especially with respect to the coefficients of single and the constant (intercept). While normally we are not interested in the constant, if you had centered one or both of the predictor variables, the constant would be useful. On the other hand, you will notice that poverty is not statistically significant in either analysis, whereas single is significant in both analyses.
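A compact way to see the differences just discussed is to line the three fits up side by side. The snippet below uses only objects created above; the rounding is purely for display:

# Side-by-side coefficients of the OLS, Huber and bisquare fits
round(cbind(OLS = coef(ols), Huber = coef(rr.huber), bisquare = coef(rr.bisquare)), 3)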
Things to consider
- Robust regression does not address issues of heterogeneity of variance. This problem can be addressed by using functions in the sandwich package after the lm function (see the sketch after this list).
- The examples shown here have presented R code for \(M\)-estimation. There are other estimation options available in rlm and other R commands and packages: least trimmed squares using ltsReg in the robustbase package and MM-estimation using rlm.
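A minimal sketch of these alternatives, assuming the sandwich, lmtest and robustbase packages are installed (they are not used elsewhere on this page):

library(sandwich)
library(lmtest)
coeftest(ols, vcov. = vcovHC(ols, type = "HC3"))  # heteroskedasticity-robust SEs for the OLS fit
rr.mm <- rlm(crime ~ poverty + single, data = cdata, method = "MM")  # MM-estimation via rlm
library(robustbase)
ltsReg(crime ~ poverty + single, data = cdata)    # least trimmed squares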
References
- UCLA: IDRE (Institute for Digital Research and Education). Data Analysis Examples. http://www.ats.ucla.edu/stat/dae/ (accessed January 31, 2014).
- Li, G. (1985). Robust regression. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Wiley.
- Fox, J. (1997). Applied Regression Analysis, Linear Models, and Related Methods. Thousand Oaks, CA: Sage Publications.
Source: https://www2.karlin.mff.cuni.cz/~pesta/NMFM404/robust.html