Skip to main content

GLS estimator with learned correlation and variance structures (Python equivalent of R's nlme::gls)

Project description

python-gls

GLS with learned correlation and variance structures for Python.

The missing Python equivalent of R's nlme::gls(). Unlike statsmodels.GLS (which requires you to supply a pre-computed covariance matrix), python-gls estimates the correlation and variance structure from your data via maximum likelihood (ML) or restricted maximum likelihood (REML) — exactly like R's nlme::gls().

Why this library?

If you work with panel data, repeated measures, longitudinal studies, or clustered observations, your errors are probably correlated and possibly heteroscedastic. Ignoring this gives you wrong standard errors and misleading p-values.

R has had nlme::gls() for 25+ years. Python hasn't had an equivalent. Until now.

Feature statsmodels.GLS python-gls R nlme::gls
Estimate correlation from data No (manual Omega) Yes Yes
AR(1), compound symmetry, etc. No Yes (11 structures) Yes
Heteroscedastic variance models No Yes (6 functions) Yes
ML / REML estimation No Yes Yes
R-style formulas No Yes Yes

Installation

pip install python-gls

Or from source:

git clone https://github.com/brunoabrahao/python-gls.git
cd python-gls
pip install -e ".[dev]"

Quick Start

from python_gls import GLS
from python_gls.correlation import CorAR1
from python_gls.variance import VarIdent

result = GLS.from_formula(
    "response ~ treatment + time",
    data=df,
    correlation=CorAR1(),       # Learn AR(1) correlation
    variance=VarIdent("group"),  # Learn group-specific variances
    groups="subject",            # Define independent clusters
).fit()

print(result.summary())
print(f"Estimated AR(1) phi: {result.correlation_params[0]:.3f}")

Output:

==============================================================================
                      Generalized Least Squares Results
==============================================================================
Method:          REML                 Log-Likelihood:        -615.0544
No. Observations:500                  AIC:                   1240.1088
Df Model:        2                    BIC:                   1261.1818
Df Residuals:    497                  Sigma^2:                0.984576
Converged:       Yes                  Iterations:                    6
------------------------------------------------------------------------------
                           coef    std err          t      P>|t|     [0.025     0.975]
------------------------------------------------------------------------------
           Intercept     1.0368     0.1069     9.7013     0.0000     0.8268     1.2468
           treatment     0.6465     0.1428     4.5272     0.0000     0.3659     0.9271
                   x     1.9734     0.0323    61.0960     0.0000     1.9099     2.0368
==============================================================================
Correlation Structure: CorAR1
  Parameters: [0.61312872]

Correlation Structures

All correlation structures are in python_gls.correlation.

Temporal / Serial Correlation

Class R Equivalent Parameters Description
CorAR1(phi=None) corAR1() 1 First-order autoregressive. R[i,j] = phi^|i-j|
CorARMA(p, q) corARMA(p, q) p + q ARMA(p,q) autocorrelation
CorCAR1(phi=None) corCAR1() 1 Continuous-time AR(1) for irregular spacing
CorCompSymm(rho=None) corCompSymm() 1 Exchangeable / compound symmetry. All pairs equal rho
CorSymm(dim=None) corSymm() d(d-1)/2 General unstructured. Free correlation for every pair

Spatial Correlation

Class R Equivalent Parameters Description
CorExp(range_param, nugget=False) corExp() 1-2 Exponential: exp(-d/range)
CorGaus(range_param, nugget=False) corGaus() 1-2 Gaussian: exp(-(d/range)^2)
CorLin(range_param, nugget=False) corLin() 1-2 Linear: max(1 - d/range, 0)
CorRatio(range_param, nugget=False) corRatio() 1-2 Rational quadratic: 1/(1 + (d/range)^2)
CorSpher(range_param, nugget=False) corSpher() 1-2 Spherical: cubic polynomial, zero beyond range

All spatial structures accept an optional nugget=True parameter for a discontinuity at distance zero.

Usage

from python_gls.correlation import CorAR1, CorSymm, CorExp

# Serial: AR(1) with optional initial value
cor = CorAR1(phi=0.5)

# Unstructured: all pairs free
cor = CorSymm()  # dimension inferred from data

# Spatial: set coordinates per group
cor = CorExp(range_param=10.0, nugget=True)
cor.set_coordinates(group_id=0, coords=np.array([[0,0], [1,0], [0,1]]))

Variance Functions

All variance functions are in python_gls.variance.

Class R Equivalent Parameters Description
VarIdent(group_var) varIdent(form=~1|group) G-1 Different variance per group level
VarPower(covariate) varPower(form=~cov) 1 sd = |v|^delta
VarExp(covariate) varExp(form=~cov) 1 sd = exp(delta * v)
VarConstPower(covariate) varConstPower(form=~cov) 2 sd = (c + |v|^delta)
VarFixed(weights_var) varFixed(~cov) 0 Pre-specified weights (not estimated)
VarComb(*varfuncs) varComb(...) sum Product of multiple variance functions

Usage

from python_gls.variance import VarIdent, VarPower, VarComb

# Different variance for treatment vs. control
var = VarIdent("treatment_group")

# Variance increases with fitted values
var = VarPower("fitted_values")

# Combine: group-specific + covariate-dependent
var = VarComb(VarIdent("group"), VarPower("x"))

API Reference

GLS Class

Construction

# From formula (recommended)
model = GLS.from_formula(
    formula,          # R-style formula: "y ~ x1 + x2"
    data,             # pandas DataFrame
    correlation=None, # CorStruct instance
    variance=None,    # VarFunc instance
    groups=None,      # str: column name for groups
    method="REML",    # "ML" or "REML"
)

# From arrays
model = GLS(
    endog=y,          # response vector
    exog=X,           # design matrix (include intercept column)
    correlation=None,
    variance=None,
    groups=None,      # array of group labels
    method="REML",
)

Fitting

result = model.fit(
    maxiter=200,     # max optimization iterations
    tol=1e-8,        # convergence tolerance
    verbose=False,   # print optimization progress
)

GLSResults Class

Property / Method Type Description
params Series Estimated coefficients
bse Series Standard errors
tvalues Series t-statistics
pvalues Series Two-sided p-values
conf_int(alpha=0.05) DataFrame Confidence intervals
sigma2 float Estimated residual variance
loglik float Log-likelihood at convergence
aic float Akaike Information Criterion
bic float Bayesian Information Criterion
resid array Residuals (y - X*beta)
fittedvalues array Fitted values (X*beta)
correlation_params array Estimated correlation parameters
variance_params array Estimated variance parameters
cov_params_func() DataFrame Covariance matrix of beta
summary() str Formatted results table
converged bool Optimization convergence status
n_iter int Number of iterations
method str "ML" or "REML"

How It Works

The Statistical Model

GLS models the response as:

y = X*beta + epsilon, where Var(epsilon) = sigma^2 * Omega

The covariance matrix Omega is block-diagonal by group:

Omega_g = A_g^{1/2} R_g A_g^{1/2}

where:

  • R_g is the correlation matrix (from the correlation structure)
  • A_g is a diagonal matrix of variance weights (from the variance function)

Estimation

  1. OLS initial fit to get starting residuals
  2. Initialize correlation and variance parameters from residuals
  3. Optimize profile log-likelihood over correlation/variance parameters using L-BFGS-B. At each step, beta and sigma^2 are profiled out analytically.
  4. Compute final GLS estimates at the converged parameters:
    • beta = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
    • Cov(beta) = sigma^2 (X' Omega^{-1} X)^{-1}

Key Design Decisions

Spherical parametrization for CorSymm: The unstructured correlation matrix is parametrized via angles that map to a Cholesky factor, guaranteeing positive-definiteness without constrained optimization. Based on Pinheiro & Bates (1996).

Block-diagonal inversion: Omega is inverted per-group (O(n*m^3)) rather than as a full matrix (O(N^3)), where n = number of groups and m = group size.

REML: Restricted maximum likelihood integrates out the fixed effects from the likelihood, giving unbiased variance estimates. This is the default, matching R's nlme::gls().

Formula Syntax

Powered by formulaic, supporting:

# Simple linear
"y ~ x1 + x2"

# Categorical variables
"y ~ C(treatment)"

# Interactions
"y ~ x1 * x2"          # x1 + x2 + x1:x2
"y ~ x1 : x2"          # just the interaction

# Transformations
"y ~ np.log(x1) + x2"

# Remove intercept
"y ~ x1 + x2 - 1"

ML vs. REML

ML REML
Variance estimate Biased (divides by N) Unbiased (divides by N-k)
Default in R's gls No Yes
Default here No Yes
Use for model comparison AIC/BIC of nested & non-nested models Only models with same fixed effects
method= "ML" "REML"

Translating from R

R code → Python equivalent

# R
library(nlme)
m <- gls(y ~ x1 + x2,
         data = df,
         correlation = corAR1(form = ~1|subject),
         weights = varIdent(form = ~1|group),
         method = "REML")
summary(m)
intervals(m)
# Python
from python_gls import GLS
from python_gls.correlation import CorAR1
from python_gls.variance import VarIdent

r = GLS.from_formula(
    "y ~ x1 + x2",
    data=df,
    correlation=CorAR1(),
    variance=VarIdent("group"),
    groups="subject",
    method="REML",
).fit()

print(r.summary())
print(r.conf_int())

Parameter name mapping

R Python Notes
corAR1(form=~1|subject) CorAR1(), groups="subject" Groups specified at model level
corCompSymm(form=~1|id) CorCompSymm(), groups="id"
corSymm(form=~1|id) CorSymm(), groups="id"
corExp(form=~x+y|id) CorExp(); cor.set_coordinates(...) Coordinates set per group
varIdent(form=~1|group) VarIdent("group") Group variable as string
varPower(form=~fitted) VarPower("fitted") Covariate name as string
method="REML" method="REML" Same

Dependencies

  • numpy >= 1.24
  • scipy >= 1.10
  • pandas >= 2.0
  • formulaic >= 1.0

Testing

pip install -e ".[dev]"
pytest

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

python_gls-0.1.0.tar.gz (42.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

python_gls-0.1.0-py3-none-any.whl (34.8 kB view details)

Uploaded Python 3

File details

Details for the file python_gls-0.1.0.tar.gz.

File metadata

  • Download URL: python_gls-0.1.0.tar.gz
  • Upload date:
  • Size: 42.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for python_gls-0.1.0.tar.gz
Algorithm Hash digest
SHA256 6e5a4d8d3d6179c846283593ca06655c165aee569c83d23962bca3a9988636ba
MD5 6f22f73a11fb90ccd7e9933b57e715c9
BLAKE2b-256 83cad82bebe175bf2e4d3b44760ebd864b02ca8514e44287c00a09778777cfbf

See more details on using hashes here.

File details

Details for the file python_gls-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: python_gls-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 34.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for python_gls-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5203fffcf9ab539b655ab586fc40b4d9fa961460c8d4cc1f187c78f8dd82709b
MD5 a34f33ab7c2b7e17eeff569bf809d4b1
BLAKE2b-256 08810948e0552c3aeaf3daef2fb94c46a8e73eb90af78429308ef0a9ba0ed81c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page