Remove the Outliers and High Leverage Points From Your Model and Run the Regression Again

Robust regression

Robust regression is an alternative to least squares regression when data are contaminated with outliers or influential observations, and it can also be used for the purpose of detecting influential observations.

library(MASS)
library(foreign)

Introduction and theoretical background

Let's begin our discussion on robust regression with some terms in linear regression.

Residual: The difference between the actual, observed value and the value predicted by the regression equation, i.e., \(u=Y-\hat{Y}\).

Outlier: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, or it may indicate a data entry error or other problem.

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High leverage points can have a great amount of effect on the estimates of the regression coefficients.

Influence: An observation is said to be influential if removing it substantially changes the estimates of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.

Cook's distance (or Cook's D): A measure that combines the information of leverage and residual of the observation.

Robust regression can be used in any situation in which you would use least squares regression. When fitting a least squares regression \[ \sum_{i=1}^n(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})^2 \to \min_{\boldsymbol\beta}, \] we might find some outliers or high leverage data points. Suppose we have decided that these data points are not data entry errors, nor are they from a different population than most of our data, so we have no compelling reason to exclude them from the analysis. Robust regression might be a good strategy, since it is a compromise between excluding these points entirely from the analysis and including all the data points and treating them all equally in OLS regression. The idea of robust regression is to weight the observations differently based on how well behaved these observations are. Roughly speaking, it is a form of weighted and reweighted least squares regression.

The rlm command in the MASS package implements several versions of robust regression. On this page, we will show \(M\)-estimation with Huber and Tukey (aka bisquare) weighting. These two are very standard. \(M\)-estimation means optimizing \[ \sum_{i=1}^n\rho(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta}) \to \min_{\boldsymbol\beta} \] with respect to the unknown \({\boldsymbol\beta}\), where the loss function \(\rho:\mathbb{R}\to\mathbb{R}\) is absolutely continuous, usually convex, with derivative \(\psi(x)=d\rho(x)/dx\) (the so-called influence function). The \(M\)-estimator of \({\boldsymbol\beta}\) based on the function \(\rho\) is the solution of the following equations \[ \sum_{i=1}^n {\bf X}_{i,\bullet}^{\top}\psi(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})={\boldsymbol 0}. \]

If we now define a weight function \(w(x)=\psi(x)/x\), the previous non-linear system of equations becomes the so-called estimating equation \[ \sum_{i=1}^n {\bf X}_{i,\bullet}^{\top}(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})w(Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta})={\boldsymbol 0}. \] But the weights depend on the residuals and the residuals on the weights. The equation is therefore solved by Iteratively Reweighted Least Squares (IRLS). If \(r_i=Y_i-{\bf X}_{i,\bullet}{\boldsymbol\beta}\), then the IRLS problem at iteration \(k\) is \[ \sum_{i=1}^n w(r_i^{(k-1)})r_i^2 \to \min_{\boldsymbol\beta}, \] where the superscript \((k)\) indicates the iteration number. The weights \(w(r_i^{(k-1)})\) are recomputed after each iteration so they can be used in the next iteration. In matrix form, the coefficient vector at iteration \(k\) is \[ \widehat{\boldsymbol\beta}^{(k)}=({\bf X}^{\top}{\bf W}^{(k-1)}{\bf X})^{-1}{\bf X}^{\top}{\bf W}^{(k-1)}{\bf Y}. \] The process continues until it converges.
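To make the iteration concrete, here is a minimal IRLS sketch in R. The function name irls_sketch, the weight-function argument, and the simulated data are illustrative assumptions, not the actual rlm implementation; in particular, rlm also re-estimates the residual scale at every iteration, which this toy version omits. The Huber weight function used in the example is the one defined in the next paragraph.

irls_sketch <- function(X, y, w_fun, tol = 1e-8, maxit = 50) {
  beta <- solve(crossprod(X), crossprod(X, y))                    # start from the OLS estimate
  for (it in seq_len(maxit)) {
    r <- drop(y - X %*% beta)                                     # residuals r_i at the current beta
    w <- w_fun(r)                                                 # weights w(r_i^(k-1))
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * y))   # one weighted least squares step
    if (max(abs(beta_new - beta)) < tol) return(drop(beta_new))   # stop once the update is tiny
    beta <- beta_new
  }
  drop(beta)
}

set.seed(1)
X <- cbind(1, rnorm(30))
y <- drop(X %*% c(2, 3)) + rnorm(30)
y[1] <- y[1] + 50                                                 # contaminate one observation
irls_sketch(X, y, function(r) pmin(1, 1.345 / abs(r)))            # Huber weights; estimates should stay near c(2, 3)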

In Huber weighting, observations with small residuals get a weight of 1, and the larger the residual, the smaller the weight. This is defined by the weight function \[ w(x)=\left\{ \begin{array}{cc} 1, & |x|\leq k,\\ k/|x|, & |x|>k. \end{array} \right. \] With Tukey's bisquare weighting, all cases with a non-zero residual get down-weighted at least a little: \[ w(x)=\left\{ \begin{array}{cc} [1-(x/c)^2]^2, & |x|\leq c,\\ 0, & |x|>c. \end{array} \right. \]
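For reference, the two weight functions can be written directly in R. The stand-alone definitions below are a sketch; the tuning constants k = 1.345 and c = 4.685 are the usual defaults, and with their default arguments MASS's psi.huber and psi.bisquare return exactly these weights \(\psi(x)/x\).

library(MASS)

w_huber <- function(x, k = 1.345) pmin(1, k / abs(x))                            # Huber weights
w_bisquare <- function(x, c = 4.685) ifelse(abs(x) <= c, (1 - (x / c)^2)^2, 0)   # Tukey bisquare weights

x <- seq(-6, 6, by = 0.1)
all.equal(w_huber(x), psi.huber(x))        # MASS returns psi(x)/x by default, i.e. these weights
all.equal(w_bisquare(x), psi.bisquare(x))
matplot(x, cbind(w_huber(x), w_bisquare(x)), type = "l", lty = 1, ylab = "weight")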

Crime data

For our data analysis below, we will use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), the percent of the population with a high school education or above (pcths), the percent of the population living under the poverty line (poverty), and the percent of the population that are single parents (single). The dataset has 51 observations. We are going to use poverty and single to predict crime.

cdata <- read.csv("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/crime.csv")
summary(cdata)
##       sid           state        crime            murder      
##  Min.   : 1.0   ak     : 1   Min.   :  82.0   Min.   : 1.600  
##  1st Qu.:13.5   al     : 1   1st Qu.: 326.5   1st Qu.: 3.900  
##  Median :26.0   ar     : 1   Median : 515.0   Median : 6.800  
##  Mean   :26.0   az     : 1   Mean   : 612.8   Mean   : 8.727  
##  3rd Qu.:38.5   ca     : 1   3rd Qu.: 773.0   3rd Qu.:10.350  
##  Max.   :51.0   co     : 1   Max.   :2922.0   Max.   :78.500  
##                 (Other):45                                    
##     pctmetro         pctwhite         pcths          poverty     
##  Min.   : 24.00   Min.   :31.80   Min.   :64.30   Min.   : 8.00  
##  1st Qu.: 49.55   1st Qu.:79.35   1st Qu.:73.50   1st Qu.:10.70  
##  Median : 69.80   Median :87.60   Median :76.70   Median :13.10  
##  Mean   : 67.39   Mean   :84.12   Mean   :76.22   Mean   :14.26  
##  3rd Qu.: 83.95   3rd Qu.:92.60   3rd Qu.:80.10   3rd Qu.:17.40  
##  Max.   :100.00   Max.   :98.50   Max.   :86.60   Max.   :26.40  
##                                                                  
##      single     
##  Min.   : 8.40  
##  1st Qu.:10.05  
##  Median :10.90  
##  Mean   :11.33  
##  3rd Qu.:12.05  
##  Max.   :22.10  

Using robust regression analysis

In most cases, we begin by running an OLS regression and doing some diagnostics. We will begin by running an OLS regression and looking at diagnostic plots examining residuals, fitted values, Cook's distance, and leverage.

summary(ols <- lm(crime ~ poverty + single, data = cdata))
## 
## Call:
## lm(formula = crime ~ poverty + single, data = cdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -811.14 -114.27  -22.44  121.86  689.82 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1368.189    187.205  -7.308 2.48e-09 ***
## poverty         6.787      8.989   0.755    0.454    
## single        166.373     19.423   8.566 3.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 243.6 on 48 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.695 
## F-statistic: 57.96 on 2 and 48 DF,  p-value: 1.578e-13
opar <- par(mfrow = c(3, 2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
plot(ols, which = c(4, 6))

[plot of chunk robust-diagnostic: residual, Cook's distance, and leverage diagnostics for the OLS fit]

From these plots, we can identify observations 9, 25, and 51 as possibly problematic to our model. We can look at these observations to see which states they represent.
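One way to list them is to index the data frame by row number (the command itself is not shown on the source page, so this line is a reconstruction):

cdata[c(9, 25, 51), 1:2]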
##    sid state
## 9    9    fl
## 25  25    ms
## 51  51    dc
DC, Florida, and Mississippi have either high leverage or large residuals. We can display the observations that have relatively large values of Cook's D. A conventional cut-off point is \(4/n\), where \(n\) is the number of observations in the data set. We will use this criterion to select the values to display.
d1 <- cooks.distance(ols)
r <- stdres(ols)
a <- cbind(cdata, d1, r)
a[d1 > 4/51, ]
##    sid state crime murder pctmetro pctwhite pcths poverty single        d1
## 1    1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3 0.1254750
## 9    9    fl  1206    8.9     93.0     83.5  74.4    17.8   10.6 0.1425891
## 25  25    ms   434   13.5     30.7     63.3  64.3    24.7   14.7 0.6138721
## 51  51    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1 2.6362519
##            r
## 1  -1.397418
## 9   2.902663
## 25 -3.562990
## 51  2.616447

We probably should drop DC to begin with, since it is not even a state. We include it in the analysis simply to show that it has large Cook's D and to demonstrate how it will be handled by rlm. Now we will look at the residuals. We will generate a new variable called rabs, which is the absolute value of the residuals (because the sign of the residual doesn't matter). We then print the ten observations with the highest absolute residual values.

rabs <- abs(r)
a <- cbind(cdata, d1, r, rabs)
asorted <- a[order(-rabs), ]
asorted[1:10, ]
##    sid state crime murder pctmetro pctwhite pcths poverty single
## 25  25    ms   434   13.5     30.7     63.3  64.3    24.7   14.7
## 9    9    fl  1206    8.9     93.0     83.5  74.4    17.8   10.6
## 51  51    dc  2922   78.5    100.0     31.8  73.1    26.4   22.1
## 46  46    vt   114    3.6     27.0     98.4  80.8    10.0   11.0
## 26  26    mt   178    3.0     24.0     92.6  81.0    14.9   10.8
## 21  21    me   126    1.6     35.7     98.5  78.8    10.7   10.6
## 1    1    ak   761    9.0     41.8     75.2  86.6     9.1   14.3
## 31  31    nj   627    5.3    100.0     80.8  76.7    10.9    9.6
## 14  14    il   960   11.4     84.0     81.0  76.2    13.6   11.5
## 20  20    md   998   12.7     92.8     68.9  78.4     9.7   12.0
##            d1         r     rabs
## 25 0.61387212 -3.562990 3.562990
## 9  0.14258909  2.902663 2.902663
## 51 2.63625193  2.616447 2.616447
## 46 0.04271548 -1.742409 1.742409
## 26 0.01675501 -1.460885 1.460885
## 21 0.02233128 -1.426741 1.426741
## 1  0.12547500 -1.397418 1.397418
## 31 0.02229184  1.354149 1.354149
## 14 0.01265689  1.338192 1.338192
## 20 0.03569623  1.287087 1.287087

Now let's run our first robust regression. Robust regression is done by iterated re-weighted least squares (IRLS). The command for running robust regression is rlm in the MASS package. There are several weighting functions that can be used for IRLS. We are going to use the Huber weights first in this example. We will then look at the final weights created by the IRLS process. This can be very useful.

summary(rr.huber <- rlm(crime ~ poverty + single, data = cdata))          
## 
## Call: rlm(formula = crime ~ poverty + single, data = cdata)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -846.09 -125.80  -16.49  119.15  679.94 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1423.0373   167.5899    -8.4912
## poverty         8.8677     8.0467     1.1020
## single        168.9858    17.3878     9.7186
## 
## Residual standard error: 181.8 on 48 degrees of freedom
hweights <- data.frame(state = cdata$state, resid = rr.huber$resid, weight = rr.huber$w)
hweights2 <- hweights[order(rr.huber$w), ]
hweights2[1:15, ]
##    state      resid    weight
## 25    ms -846.08536 0.2889618
## 9     fl  679.94327 0.3595480
## 46    vt -410.48310 0.5955740
## 51    dc  376.34468 0.6494131
## 26    mt -356.13760 0.6864625
## 21    me -337.09622 0.7252263
## 31    nj  331.11603 0.7383578
## 14    il  319.10036 0.7661169
## 1     ak -313.15532 0.7807432
## 20    md  307.19142 0.7958154
## 19    ma  291.20817 0.8395172
## 18    la -266.95752 0.9159411
## 2     al  105.40319 1.0000000
## 3     ar   30.53589 1.0000000
## 4     az  -43.25299 1.0000000

We can see that, roughly, as the absolute residual goes down, the weight goes up. In other words, cases with large residuals tend to be down-weighted. This output shows us that the observation for Mississippi will be down-weighted the most. Florida will also be substantially down-weighted. All observations not shown above have a weight of 1. In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to 1, the closer the results of the OLS and robust regressions.
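As a quick check (a small sketch reusing the hweights data frame built above), we can count how many of the 51 cases were down-weighted at all by the Huber fit:

sum(hweights$weight < 1)   # number of cases with weight strictly below 1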

Next, let's run the same model, but using the bisquare weighting function. Again, we can look at the weights.

rr.bisquare <- rlm(crime ~ poverty + single, data = cdata, psi = psi.bisquare)
summary(rr.bisquare)
## 
## Call: rlm(formula = crime ~ poverty + single, data = cdata, psi = psi.bisquare)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -905.59 -140.97  -14.98  114.65  668.38 
## 
## Coefficients:
##             Value      Std. Error t value   
## (Intercept) -1535.3338   164.5062    -9.3330
## poverty        11.6903     7.8987     1.4800
## single        175.9303    17.0678    10.3077
## 
## Residual standard error: 202.3 on 48 degrees of freedom
biweights <- data.frame(state = cdata$state, resid = rr.bisquare$resid, weight = rr.bisquare$w)
biweights2 <- biweights[order(rr.bisquare$w), ]
biweights2[1:15, ]
##    state     resid      weight
## 25    ms -905.5931 0.007652565
## 9     fl  668.3844 0.252870542
## 46    vt -402.8031 0.671495418
## 26    mt -360.8997 0.731136908
## 31    nj  345.9780 0.751347695
## 18    la -332.6527 0.768938330
## 21    me -328.6143 0.774103322
## 1     ak -325.8519 0.777662383
## 14    il  313.1466 0.793658594
## 20    md  308.7737 0.799065530
## 19    ma  297.6068 0.812596833
## 51    dc  260.6489 0.854441716
## 50    wy -234.1952 0.881660897
## 5     ca  201.4407 0.911713981
## 10    ga -186.5799 0.924033113

We can see that the weight given to Mississippi is dramatically lower using the bisquare weighting function than the Huber weighting function, and the parameter estimates from these two weighting methods differ. When comparing the results of a regular OLS regression and a robust regression, if the results are very different, you will most likely want to use the results from the robust regression. Large differences suggest that the model parameters are being highly influenced by outliers. Different functions have advantages and drawbacks. Huber weights can have difficulties with severe outliers, and bisquare weights can have difficulties converging or may yield multiple solutions.

As you can see, the results from the two analyses are fairly different, especially with respect to the coefficient of single and the constant (intercept). While normally we are not interested in the constant, if you had centered one or both of the predictor variables, the constant would be useful. On the other hand, you will notice that poverty is not statistically significant in either analysis, whereas single is significant in both analyses.
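To see the differences side by side, the fitted coefficient vectors can be bound into one table (a short sketch reusing the ols, rr.huber, and rr.bisquare objects fitted above):

round(cbind(OLS = coef(ols), Huber = coef(rr.huber), bisquare = coef(rr.bisquare)), 3)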

Things to consider

  • Robust regression does not address issues of heterogeneity of variance. This problem can be addressed by using functions in the sandwich package after the lm function (see the sketch after this list).
  • The examples shown here have presented R code for \(M\)-estimation. There are other estimation options available in rlm and in other R commands and packages: least trimmed squares using ltsReg in the robustbase package, and MM-estimation using rlm (also sketched below).
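A brief sketch of those options follows: heteroskedasticity-consistent standard errors for the OLS fit, a least trimmed squares fit, and an MM fit. All tuning choices are package defaults, and this assumes the sandwich, lmtest, and robustbase packages are installed.

library(sandwich)     # heteroskedasticity-consistent covariance estimators
library(lmtest)       # coeftest() for re-testing coefficients with a supplied vcov
library(robustbase)   # ltsReg() for least trimmed squares

coeftest(ols, vcov. = vcovHC(ols, type = "HC3"))                     # OLS with robust (HC) standard errors
summary(ltsReg(crime ~ poverty + single, data = cdata))              # least trimmed squares fit
summary(rlm(crime ~ poverty + single, data = cdata, method = "MM"))  # MM-estimation via rlm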

References

  • UCLA: IDRE (Institute for Digital Research and Education). Data Analysis Examples. From http://www.ats.ucla.edu/stat/dae/ (accessed January 31, 2014).
  • Li, G. (1985). Robust regression. In Exploring Data Tables, Trends, and Shapes, ed. D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Wiley.
  • Fox, J. (1997). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage Publications.


Source: https://www2.karlin.mff.cuni.cz/~pesta/NMFM404/robust.html
