Linear Mixed Effects Models

Linear Mixed Effects models are used for regression analyses involving dependent data. Such data arise when working with longitudinal and other study designs in which multiple observations are made on each subject. Two specific mixed effects models are random intercepts models, where all responses in a single group are additively shifted by a value that is specific to the group, and random slopes models, where the values follow a mean trajectory that is linear in observed covariates, with both the slopes and intercept being specific to the group. The Statsmodels MixedLM implementation allows arbitrary random effects design matrices to be specified for the groups, so these and other types of random effects models can all be fit.

The Statsmodels LME framework currently supports post-estimation inference via Wald tests and confidence intervals on the coefficients, profile likelihood analysis, likelihood ratio testing, and AIC. Some limitations of the current implementation are that it does not support structure more complex on the residual errors (they are always homoscedastic), and it does not support crossed random effects. We hope to implement these features for the next release.

Examples

In [1]: import statsmodels.api as sm

In [2]: import statsmodels.formula.api as smf

In [3]: data = sm.datasets.get_rdataset("dietox", "geepack", cache=True).data
---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
/usr/lib/python3.6/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1317                 h.request(req.get_method(), req.selector, req.data, headers,
-> 1318                           encode_chunked=req.has_header('Transfer-encoding'))
   1319             except OSError as err: # timeout error

/usr/lib/python3.6/http/client.py in request(self, method, url, body, headers, encode_chunked)
   1238         """Send a complete request to the server."""
-> 1239         self._send_request(method, url, body, headers, encode_chunked)
   1240 

/usr/lib/python3.6/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
   1284             body = _encode(body, 'body')
-> 1285         self.endheaders(body, encode_chunked=encode_chunked)
   1286 

/usr/lib/python3.6/http/client.py in endheaders(self, message_body, encode_chunked)
   1233             raise CannotSendHeader()
-> 1234         self._send_output(message_body, encode_chunked=encode_chunked)
   1235 

/usr/lib/python3.6/http/client.py in _send_output(self, message_body, encode_chunked)
   1025         del self._buffer[:]
-> 1026         self.send(msg)
   1027 

/usr/lib/python3.6/http/client.py in send(self, data)
    963             if self.auto_open:
--> 964                 self.connect()
    965             else:

/usr/lib/python3.6/http/client.py in connect(self)
   1391 
-> 1392             super().connect()
   1393 

/usr/lib/python3.6/http/client.py in connect(self)
    935         self.sock = self._create_connection(
--> 936             (self.host,self.port), self.timeout, self.source_address)
    937         self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

/usr/lib/python3.6/socket.py in create_connection(address, timeout, source_address)
    703     err = None
--> 704     for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    705         af, socktype, proto, canonname, sa = res

/usr/lib/python3.6/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    744     addrlist = []
--> 745     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    746         af, socktype, proto, canonname, sa = res

gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
<ipython-input-3-dd9bcc886be5> in <module>()
----> 1 data = sm.datasets.get_rdataset("dietox", "geepack", cache=True).data

/build/statsmodels-pH_Txj/statsmodels-0.8.0/.pybuild/pythonX.Y_3.6/build/statsmodels/datasets/utils.py in get_rdataset(dataname, package, cache)
    288                      "master/doc/"+package+"/rst/")
    289     cache = _get_cache(cache)
--> 290     data, from_cache = _get_data(data_base_url, dataname, cache)
    291     data = read_csv(data, index_col=0)
    292     data = _maybe_reset_index(data)

/build/statsmodels-pH_Txj/statsmodels-0.8.0/.pybuild/pythonX.Y_3.6/build/statsmodels/datasets/utils.py in _get_data(base_url, dataname, cache, extension)
    219     url = base_url + (dataname + ".%s") % extension
    220     try:
--> 221         data, from_cache = _urlopen_cached(url, cache)
    222     except HTTPError as err:
    223         if '404' in str(err):

/build/statsmodels-pH_Txj/statsmodels-0.8.0/.pybuild/pythonX.Y_3.6/build/statsmodels/datasets/utils.py in _urlopen_cached(url, cache)
    210     # not using the cache or didn't find it in cache
    211     if not from_cache:
--> 212         data = urlopen(url).read()
    213         if cache is not None:  # then put it in the cache
    214             _cache_it(data, cache_path)

/usr/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    221     else:
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 
    225 def install_opener(opener):

/usr/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    524             req = meth(req)
    525 
--> 526         response = self._open(req, data)
    527 
    528         # post-process response

/usr/lib/python3.6/urllib/request.py in _open(self, req, data)
    542         protocol = req.type
    543         result = self._call_chain(self.handle_open, protocol, protocol +
--> 544                                   '_open', req)
    545         if result:
    546             return result

/usr/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    502         for handler in handlers:
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:
    506                 return result

/usr/lib/python3.6/urllib/request.py in https_open(self, req)
   1359         def https_open(self, req):
   1360             return self.do_open(http.client.HTTPSConnection, req,
-> 1361                 context=self._context, check_hostname=self._check_hostname)
   1362 
   1363         https_request = AbstractHTTPHandler.do_request_

/usr/lib/python3.6/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
   1318                           encode_chunked=req.has_header('Transfer-encoding'))
   1319             except OSError as err: # timeout error
-> 1320                 raise URLError(err)
   1321             r = h.getresponse()
   1322         except:

URLError: <urlopen error [Errno -2] Name or service not known>

In [4]: md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-9efdb46cb28d> in <module>()
----> 1 md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])

KeyError: 'Pig'

In [5]: mdf = md.fit()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-22d27fb7daa8> in <module>()
----> 1 mdf = md.fit()

NameError: name 'md' is not defined

In [6]: print(mdf.summary())
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-1bd4bb60f2e8> in <module>()
----> 1 print(mdf.summary())

NameError: name 'mdf' is not defined

Detailed examples can be found here

There some notebook examples on the Wiki: Wiki notebooks for MixedLM

Technical Documentation

The data are partitioned into disjoint groups. The probability model for group i is:

Y = X\beta + Z\gamma + \epsilon

where

  • n_i is the number of observations in group i
  • Y is a n_i dimensional response vector
  • X is a n_i * k_{fe} dimensional matrix of fixed effects coefficients
  • \beta is a k_{fe}-dimensional vector of fixed effects slopes
  • Z is a n_i * k_{re} dimensional matrix of random effects coefficients
  • \gamma is a k_{re}-dimensional random vector with mean 0 and covariance matrix \Psi; note that each group gets its own independent realization of gamma.
  • \epsilon is a n_i dimensional vector of i.i.d normal errors with mean 0 and variance \sigma^2; the \epsilon values are independent both within and between groups

Y, X and Z must be entirely observed. \beta, \Psi, and \sigma^2 are estimated using ML or REML estimation, and \gamma and \epsilon are random so define the probability model.

The mean structure is E[Y|X,Z] = X*\beta. If only the mean structure is of interest, GEE is a good alternative to mixed models.

Notation:

  • cov_{re} is the random effects covariance matrix (referred to above as \Psi) and scale is the (scalar) error variance. For a single group, the marginal covariance matrix of endog given exog is scale*I + Z * cov_{re} * Z, where Z is the design matrix for the random effects in one group.

Notes

1. Three different parameterizations are used here in different places. The regression slopes (usually called fe_{params}) are identical in all three parameterizations, but the variance parameters differ. The parameterizations are:

  • The natural parameterization in which cov(endog) = scale*I + Z * cov_{re} * Z, as described above. This is the main parameterization visible to the user.
  • The profile parameterization in which cov(endog) = I + Z * cov_{re1} * Z. This is the parameterization of the profile likelihood that is maximized to produce parameter estimates. (see Lindstrom and Bates for details). The natural cov_{re} is equal to the profile cov_{re1} times scale.
  • The square root parameterization in which we work with the Cholesky factor of cov_{re1} instead of cov_{re1} directly.

All three parameterizations can be packed by concatenating fe_{params} together with the lower triangle of the dependence structure. Note that when unpacking, it is important to either square or reflect the dependence structure depending on which parameterization is being used.

2. The situation where the random effects covariance matrix is singular is numerically challenging. Small changes in the covariance parameters may lead to large changes in the likelihood and derivatives.

3. The optimization strategy is to optionally perform a few EM steps, followed by optionally performing a few steepest descent steps, followed by conjugate gradient descent using one of the scipy gradient optimizers. The EM and steepest descent steps are used to get adequate starting values for the conjugate gradient optimization, which is much faster.

References

The primary reference for the implementation details is:

  • MJ Lindstrom, DM Bates (1988). Newton Raphson and EM algorithms for linear mixed effects models for repeated measures data. Journal of the American Statistical Association. Volume 83, Issue 404, pages 1014-1022.

See also this more recent document:

All the likelihood, gradient, and Hessian calculations closely follow Lindstrom and Bates.

The following two documents are written more from the perspective of users:

Module Reference

The model class is:

MixedLM(endog, exog, groups[, exog_re, …]) An object specifying a linear mixed effects model.

The result classe are:

MixedLMResults(model, params, cov_params) Class to contain results of fitting a linear mixed effects model.