Robust Linear Models

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std

Estimation

Load data:

In [2]:
data = sm.datasets.stackloss.load()
data.exog = sm.add_constant(data.exog)

Huber's T norm with the (default) median absolute deviation scaling

In [3]:
huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(hub_results.summary(yname='y',
            xname=['var_%d' % i for i in range(len(hub_results.params))]))
[-41.02649835   0.82938433   0.92606597  -0.12784672]
[ 9.79189854  0.11100521  0.30293016  0.12864961]
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                   21
Model:                            RLM   Df Residuals:                       17
Method:                          IRLS   Df Model:                            3
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Fri, 03 Nov 2017                                         
Time:                        04:26:44                                         
No. Iterations:                    19                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
var_0        -41.0265      9.792     -4.190      0.000     -60.218     -21.835
var_1          0.8294      0.111      7.472      0.000       0.612       1.047
var_2          0.9261      0.303      3.057      0.002       0.332       1.520
var_3         -0.1278      0.129     -0.994      0.320      -0.380       0.124
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore .

Huber's T norm with 'H2' covariance matrix

In [4]:
hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)
[-41.02649835   0.82938433   0.92606597  -0.12784672]
[ 9.08950419  0.11945975  0.32235497  0.11796313]

Andrew's Wave norm with Huber's Proposal 2 scaling and 'H3' covariance matrix

In [5]:
andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print('Parameters: ', andrew_results.params)
Parameters:  [-40.8817957    0.79276138   1.04857556  -0.13360865]

See help(sm.RLM.fit) for more options and module sm.robust.scale for scale options

Comparing OLS and RLM

Artificial data with outliers:

In [6]:
nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1-5)**2))
X = sm.add_constant(X)
sig = 0.3   # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig*1. * np.random.normal(size=nsample)
y2[[39,41,43,45,48]] -= 5   # add some outliers (10% of nsample)

Example 1: quadratic function with linear truth

Note that the quadratic term in OLS regression will capture outlier effects.

In [7]:
res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())
[ 4.95781208  0.546495   -0.01554405]
[ 0.46049445  0.07109413  0.00629073]
[  4.56921088   4.85312556   5.13186105   5.40541735   5.67379445
   5.93699236   6.19501109   6.44785062   6.69551096   6.93799211
   7.17529406   7.40741683   7.63436041   7.85612479   8.07270998
   8.28411598   8.49034279   8.69139041   8.88725884   9.07794808
   9.26345812   9.44378898   9.61894064   9.78891311   9.95370639
  10.11332048  10.26775538  10.41701109  10.5610876   10.69998493
  10.83370306  10.962242    11.08560175  11.20378231  11.31678368
  11.42460586  11.52724885  11.62471264  11.71699724  11.80410266
  11.88602888  11.96277591  12.03434375  12.10073239  12.16194185
  12.21797211  12.26882319  12.31449507  12.35498776  12.39030126]

Estimate RLM:

In [8]:
resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)
[ 4.89228535  0.53160449 -0.00491301]
[ 0.13490498  0.02082751  0.00184291]

Draw a plot to compare OLS estimates to the robust estimates:

In [9]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, 'o',label="data")
ax.plot(x1, y_true2, 'b-', label="True")
prstd, iv_l, iv_u = wls_prediction_std(res)
ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
ax.legend(loc="best")
Out[9]:
<matplotlib.legend.Legend at 0x7f1ef08c3470>

Example 2: linear function with linear truth

Fit a new OLS model using only the linear term and the constant:

In [10]:
X2 = X[:,[0,1]] 
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
[ 5.58433236  0.39105452]
[ 0.40432924  0.03483864]

Estimate RLM:

In [11]:
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
[ 5.05016817  0.48839714]
[ 0.11452887  0.00986827]

Draw a plot to compare OLS estimates to the robust estimates:

In [12]:
prstd, iv_l, iv_u = wls_prediction_std(res2)

fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x1, y2, 'o', label="data")
ax.plot(x1, y_true2, 'b-', label="True")
ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
legend = ax.legend(loc="best")