Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm
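
Because y is drawn from a random normal below, the numbers shown in the outputs will differ from run to run. To make a run repeatable, one could seed NumPy's generator first (the seed value here is arbitrary, not the one used for the outputs shown):

np.random.seed(12345)  # arbitrary seed; any fixed value makes the draws repeatable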

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
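
The data-generating process is y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2 + e, with e ~ N(0, 0.25**2). A quick shape check confirms the design matrix built above:

# X has one row per observation and four columns: const, x1, sin(x1), (x1-5)**2
print(X.shape)  # (50, 4)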

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.987
Method:                 Least Squares   F-statistic:                     1276.
Date:                Fri, 03 Nov 2017   Prob (F-statistic):           2.84e-44
Time:                        04:26:11   Log-Likelihood:                 7.5857
No. Observations:                  50   AIC:                            -7.171
Df Residuals:                      46   BIC:                            0.4766
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.8197      0.074     65.235      0.000       4.671       4.968
x1             0.5241      0.011     45.992      0.000       0.501       0.547
x2             0.4968      0.045     11.091      0.000       0.407       0.587
x3            -0.0218      0.001    -21.749      0.000      -0.024      -0.020
==============================================================================
Omnibus:                        1.580   Durbin-Watson:                   2.393
Prob(Omnibus):                  0.454   Jarque-Bera (JB):                1.275
Skew:                          -0.193   Prob(JB):                        0.529
Kurtosis:                       2.320   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
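
Besides the printed summary, the individual results are available as attributes on the results object, for example:

print(olsres.params)    # estimated coefficients
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared reported in the table above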

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.27571298   4.77199194   5.22862227   5.61852809   5.92440505
   6.14156341   6.27869854   6.3564623    6.40407      6.4545004
   6.5390774    6.68232361   6.89793135   7.18651304   7.53550061
   7.92121033   8.31273407   8.67701733   8.98429008   9.21295783
   9.35314957   9.40833956   9.3947765    9.3388139    9.27258033
   9.22870072   9.23493623   9.30962312   9.4586602    9.67453965
   9.93758053  10.21916232  10.48642717  10.70767836  10.857588
  10.92135829  10.8971528   10.79639815  10.6419067   10.46412848
  10.29614838  10.16825041  10.10294206  10.11125644  10.19093955
  10.32682062  10.49330459  10.65857735  10.78983268  10.85866222]
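
For the estimation sample, predict(X) is the same as the stored fitted values, so the following check should hold:

print(np.allclose(ypred, olsres.fittedvalues))  # True: in-sample predictions are the fitted values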

Create a new sample of explanatory variables, Xnew, then predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.83055142  10.67030994  10.39742092  10.056284     9.70534468
   9.40278487   9.19227795   9.09229593   9.09158692   9.15192998]
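
Point predictions alone carry no uncertainty information. On recent statsmodels versions, get_prediction also returns standard errors and confidence/prediction intervals for the new observations; a minimal sketch:

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, mean_ci_*, obs_ci_* columns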

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
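
One optional refinement: marking where the estimation sample ends makes the out-of-sample region easy to spot, e.g.:

ax.axvline(x1[-1], linestyle="--", color="k")  # boundary between in-sample and out-of-sample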

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we don't want any formula expansion magic from using **2
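
To see the difference, compare formula arithmetic with Python arithmetic inside I(): in a formula, x1 + x1 is just the term x1 (terms form a set), while I(x1 + x1) builds a single column holding the elementwise sum. A small sketch using patsy directly, where dmatrix builds the design matrix a formula would produce:

from patsy import dmatrix
print(dmatrix("x1 + I(x1 + x1)", data).design_info.column_names)
# ['Intercept', 'x1', 'I(x1 + x1)'] -- I(...) is one computed column, not two terms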

In [8]:
res.params
Out[8]:
Intercept           4.819665
x1                  0.524050
np.sin(x1)          0.496813
I((x1 - 5) ** 2)   -0.021758
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.830551
1    10.670310
2    10.397421
3    10.056284
4     9.705345
5     9.402785
6     9.192278
7     9.092296
8     9.091587
9     9.151930
dtype: float64
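
The formula-based predictions should agree with the manual Xnew construction above, since the stored formula applies the same transformations to the new data; a quick check, reusing names from the earlier cells:

print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))  # True: both paths build the same design matrix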