Aurora-GLM
Aurora-GLM is a modular, extensible, and high-performance Python framework for statistical modeling, focusing on Generalized Linear Models (GLM), Generalized Additive Models (GAM), and Generalized Additive Mixed Models (GAMM).
⚡ Up to 141× faster than NumPy with PyTorch CUDA on GPU
✅ Validated against R and statsmodels (max diff < 1e-11)
🎯 Modular design - easy to extend with custom distributions and links
🚀 Multi-backend - NumPy, PyTorch, JAX with transparent API
Project Identity
- Package name: aurora-glm
- Python import: import aurora
- Repository: github.com/Matcraft94/Aurora-GLM
- Author: Lucy E. Arias (@Matcraft94)
- Version: 0.7.0
- Status: Phase 5 (85% complete) - Extended distributions, temporal covariance, GPU acceleration
- Python: 3.10+
- Tagline: Illuminating complex data with modern generalized linear modeling tools
Vision and Goals
Aurora-GLM aims to be:
- Scientifically rigorous: Correct implementations validated against R (mgcv, lme4) and statsmodels
- High performance: GPU-accelerated with multi-backend support (NumPy, PyTorch, JAX)
- Extensible: Users can add custom distributions, link functions, and algorithms
- Modular and functional: Clean design favoring composition over complex inheritance
- Production-ready: Comprehensive testing, documentation, and real-world validation
Use Cases
- Academic research: Ecology, epidemiology, social sciences
- Pharmaceutical industry: Clinical trials, longitudinal analysis
- Financial analysis: Credit scoring, risk modeling, insurance claims
- Machine learning: Statistical foundations with modern tools and GPU acceleration
Performance Highlights
GPU Acceleration
PyTorch CUDA provides exceptional speedups for larger problems:
| Problem Size | NumPy CPU | PyTorch CUDA | Speedup |
|---|---|---|---|
| Gaussian n=1,000 | 42ms | 5ms | 9× |
| Gaussian n=5,000 | 206ms | 5ms | 39× |
| Gaussian n=50,000 | 2.1s | 18ms | 116× |
| Poisson n=50,000 | 4.3s | 30ms | 141× |
Benchmarked on NVIDIA RTX 5070 Ti. See PERFORMANCE.md for details.
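If you want to reproduce this kind of comparison on your own hardware, a minimal timing sketch might look like the following (illustrative only; it assumes a CUDA-capable PyTorch install and uses fit_glm as shown in the Quick Start below):
import time
import numpy as np
import torch
from aurora.models.glm import fit_glm

n, p = 50_000, 10
X_np = np.random.randn(n, p)
y_np = np.random.poisson(np.exp(0.5 * X_np[:, 0]))

t0 = time.perf_counter()
fit_glm(X_np, y_np, family='poisson')
cpu_s = time.perf_counter() - t0

X_gpu = torch.as_tensor(X_np, device='cuda')
y_gpu = torch.as_tensor(y_np, dtype=torch.float64, device='cuda')
fit_glm(X_gpu, y_gpu, family='poisson')  # warm-up run (caching, kernel launch overhead)
torch.cuda.synchronize()
t0 = time.perf_counter()
fit_glm(X_gpu, y_gpu, family='poisson')
torch.cuda.synchronize()                 # wait for GPU work to finish before stopping the clock
gpu_s = time.perf_counter() - t0
print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s  speedup: {cpu_s / gpu_s:.0f}x")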
Accuracy Validation
Aurora-GLM achieves excellent numerical agreement with reference implementations:
| Comparison | Max Coefficient Difference |
|---|---|
| vs R glm() | < 1e-11 |
| vs statsmodels (Gaussian) | < 1e-11 |
| vs statsmodels (Poisson) | < 1e-10 |
| vs statsmodels (Binomial) | < 1e-9 |
| PyTorch vs NumPy | < 4e-7 |
| JAX vs NumPy | < 1e-15 |
All backends produce consistent, validated results.
Sparse Matrix Performance
For large GAM/GAMM problems, sparse matrices provide significant benefits:
| Problem Size | Dense | Sparse | Speedup | Memory Reduction |
|---|---|---|---|---|
| n=1,000, k=30 | 0.79s | 0.15s | 5.5× | 6-8× |
| n=2,000, k=50 | 2.72s | 0.37s | 7.4× | 6-8× |
| n=5,000, k=50 | 8.04s | 1.51s | 5.3× | 6-8× |
Key Features
Statistical Models
- GLM: 10 distribution families (Gaussian, Poisson, Binomial, Gamma, Beta, Inverse Gaussian, Negative Binomial, Student-t, Tweedie, Quasi-families)
- GAM: B-splines, natural cubic splines, thin plate splines with GCV/REML smoothing
- GAMM: Random effects, temporal covariance (AR1, compound symmetry, Toeplitz), PQL for non-Gaussian
Performance Features
- GPU acceleration: Up to 141× speedup with PyTorch CUDA
- Multi-backend: NumPy, PyTorch, JAX with transparent API
- Sparse matrices: 5-8× speedup and 6-8× memory reduction for large GAM/GAMM
- Optimized algorithms: IRLS, Newton-Raphson, L-BFGS with numerical stability (a minimal IRLS sketch follows this list)
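For readers unfamiliar with IRLS, here is a textbook sketch of the update for a Poisson GLM with log link (an illustration of the general algorithm, not Aurora's internal implementation):
import numpy as np

def irls_poisson(X, y, max_iter=25, tol=1e-10):
    # IRLS: repeatedly solve a weighted least-squares problem on a working response.
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)            # inverse log link
        W = mu                      # Poisson weights: (dmu/deta)^2 / V(mu) = mu
        z = eta + (y - mu) / mu     # working response
        XtW = X.T * W               # scales each observation by its weight
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta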
Scientific Rigor
- Validated: Against R (glm, mgcv, lme4) and statsmodels
- Comprehensive tests: 494 tests with extensive coverage across all model families
- R-style output: Summary tables, diagnostic plots, confidence intervals
- Formula syntax: R/mgcv-compatible formulas like y ~ s(x1, k=10) + s(x2) + (1|group)
Extensibility
- Custom distributions: Easy to add new distribution families
- Custom link functions: Flexible link function framework
- Custom algorithms: Pluggable optimization backends
- Multi-backend pattern: Transparent support for NumPy/PyTorch/JAX
Installation
From Source (Development)
# Clone repository
git clone https://github.com/Matcraft94/Aurora-GLM.git
cd Aurora-GLM
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install in development mode
pip install -e .
# Optional: Install PyTorch for GPU acceleration
pip install torch
# Optional: Install JAX for additional backend
pip install jax jaxlib
Quick Start
Basic GLM Example
import numpy as np
from aurora.models.glm import fit_glm
# Generate Poisson count data
np.random.seed(42)
X = np.random.randn(200, 2)
y = np.random.poisson(np.exp(X[:, 0] * 0.5 - 0.3))
# Fit Poisson GLM with log link
result = fit_glm(X, y, family='poisson', link='log')
# Print R-style summary with coefficients, std errors, p-values
print(result.summary())
# Make predictions
X_new = np.random.randn(10, 2)
predictions = result.predict(X_new, type='response')
# Generate diagnostic plots
result.plot_diagnostics()
Output:
================================================================================
Generalized Linear Model Results
================================================================================
Family: Poisson Link Function: Log
No. Observations: 200 Df Residuals: 197
Df Model: 2 Pseudo R-squared: 0.123
Converged: Yes No. Iterations: 5
================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
intercept -0.2987 0.0712 -4.197 0.000 -0.4382 -0.1593 ***
X0 0.5124 0.0718 7.137 0.000 0.3717 0.6531 ***
X1 -0.0234 0.0706 -0.331 0.741 -0.1617 0.1150
================================================================================
Deviance: 212.54 Null Deviance: 240.32
AIC: 465.43 BIC: 475.52
================================================================================
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Logistic Regression for Classification
# Generate binary classification data
X = np.random.randn(300, 2)
probabilities = 1 / (1 + np.exp(-(X @ np.array([1.2, -0.9]) + 0.3)))
y = np.random.binomial(1, probabilities)
# Fit logistic regression
result = fit_glm(X, y, family='binomial', link='logit')
# View summary
print(result.summary())
# Predict probabilities
y_prob = result.predict(X, type='response')
# Classification metrics
from aurora.validation.metrics import accuracy_score, concordance_index
y_pred = (y_prob >= 0.5).astype(int)
print(f"Accuracy: {accuracy_score(y, y_pred):.3f}")
print(f"C-index (AUROC): {concordance_index(y, y_prob):.3f}")
GPU Acceleration
import torch
# PyTorch tensors work transparently - automatically uses GPU if available
X_torch = torch.randn(100, 2).cuda()
y_torch = torch.poisson(torch.exp(X_torch[:, 0] * 0.5)).cuda()
# Same API, GPU backend - up to 141× faster!
result = fit_glm(X_torch, y_torch, family='poisson')
print(result.summary())
Multi-Backend Support
# Same code works with NumPy, PyTorch, and JAX
import torch
# PyTorch tensors work transparently
X_torch = torch.randn(100, 2)
y_torch = torch.poisson(torch.exp(X_torch[:, 0] * 0.5))
# Same API, PyTorch backend
result = fit_glm(X_torch, y_torch, family='poisson')
print(result.summary())
Cross-Validation
from aurora.validation.cross_val import cross_val_score, KFold
# Evaluate model with 5-fold cross-validation
scores = cross_val_score(
    X, y,
    family='poisson',
    cv=KFold(n_splits=5),
    metrics=['deviance', 'pseudo_r2']
)
print(f"Mean deviance: {scores['deviance'].mean():.2f}")
print(f"Mean pseudo R²: {scores['pseudo_r2'].mean():.3f}")
Case Studies and Examples
Aurora-GLM includes 17 comprehensive case studies covering diverse domains and real-world applications:
Available Case Studies
Finance & Insurance (5 studies):
- Insurance pricing (Gamma GLM)
- Medical insurance costs prediction
- Customer churn prediction (Binomial GLM)
- Motor claims frequency (Poisson/Negative Binomial)
- E-commerce conversion optimization (Binomial GAM)
Environmental & Ecological (4 studies):
- Air quality forecasting (Gaussian GAM)
- Species distribution modeling (Binomial GAM)
- Bike sharing demand prediction (GAM)
- Traffic accident severity (GAMM)
Healthcare & Medicine (5 studies):
- Sleep study analysis (Random effects GAMM)
- Clinical trials (Gaussian GLM)
- Longitudinal clinical data (GAMM with AR1)
- Psychometric measurement (GAMM with crossed effects)
- Survival analysis (Cox models)
Education (1 study):
- Multilevel educational data (Nested GAMM)
Business & Retail (2 studies):
- Health inspection scoring (Beta GLM)
Each case study is a self-contained Jupyter notebook with:
- Problem statement and research questions
- Data exploration and visualization
- Model selection justification
- Complete model fitting workflow
- Results interpretation and diagnostics
- Key findings and business implications
Getting started with examples:
# Navigate to examples directory
cd examples/06_case_studies
# Start Jupyter and open any notebook
jupyter notebook
# Recommended learning path:
# 1. 01_insurance_pricing.ipynb (Basic GLM)
# 2. 02_air_quality_gam.ipynb (GAM with smooths)
# 3. 04_sleep_study_gamm.ipynb (Random effects)
# 4. 06_clinical_trial_longitudinal.ipynb (Temporal covariance)
See examples/README.md for complete case study guide.
Recent Improvements (v0.7.0)
Code Quality Enhancements
All 5 TODO items completed and fully tested:
- Formula Parser Validation - prevents silent errors in R-style formulas:
  - Parenthesized terms must contain | for random effects; empty parentheses are rejected
  - Clear error messages for invalid syntax
- Sparse EDF Computation - improved GAM smoothing parameter selection:
  - Hybrid exact/approximate approach: exact for small bases (n_basis ≤ 100), fast approximation for larger ones
  - More accurate GCV scores
- GAMM Log-Likelihood & Model Comparison - enables AIC/BIC selection:
  - Proper log-likelihood computation for Gaussian GAMM
  - AIC and BIC metrics for model comparison, critical for statistical inference
- Memory-Mapped Streaming - handles datasets larger than RAM (see the sketch after this list):
  - True streaming for .npy files via memory mapping; only the requested chunks are loaded into memory
  - Maintains full backward compatibility
- Data Loading Efficiency - production-ready ChunkedDataLoader:
  - Shuffling works correctly with streaming, keeping X and y aligned
  - Fallback support for other formats (.npz, .csv)
All implementations manually tested and verified for correctness.
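As a rough sketch of what memory-mapped streaming looks like in practice (illustrative only; this is plain NumPy, not the actual ChunkedDataLoader API, and the file names are hypothetical):
import numpy as np

X = np.load("features.npy", mmap_mode="r")  # header read only; data stays on disk
y = np.load("targets.npy", mmap_mode="r")
chunk_size = 10_000
for start in range(0, X.shape[0], chunk_size):
    stop = start + chunk_size
    X_chunk = np.asarray(X[start:stop])  # only this slice is materialized in RAM
    y_chunk = np.asarray(y[start:stop])  # same index range keeps X and y aligned
    # ... fit or accumulate statistics on the chunk ...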
GAM API (Phase 3 - AVAILABLE NOW!)
Basic Univariate GAM
from aurora.models.gam import fit_gam
import numpy as np
# Generate noisy data with non-linear relationship
np.random.seed(42)
x = np.linspace(0, 1, 100)
y_true = np.sin(2 * np.pi * x)
y = y_true + 0.1 * np.random.randn(100)
# Fit GAM with automatic smoothing parameter selection (GCV)
result = fit_gam(x, y, n_basis=12, basis_type='bspline')
# Print model summary
print(result.summary())
# Make predictions at new points
x_new = np.linspace(0, 1, 200)
y_pred = result.predict(x_new)
Output:
============================================================
Generalized Additive Model (GAM) - Fitted Summary
============================================================
Model Information:
Basis type: BSplineBasis
Number of basis: 12
Observations: 100
Smoothing:
Lambda: 8.806605e-02
Effective DoF: 9.69
GCV score: 3.662963e-02
Fit Statistics:
Residual sum sq: 2.9872
R-squared: 0.9421
Residual std: 0.1727
Residuals:
Min: -0.7174
Q1: -0.0823
Median: 0.0123
Q3: 0.0957
Max: 0.6197
============================================================
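For reference, the GCV score reported above follows the standard definition GCV(λ) = n · RSS / (n − EDF)²: plugging in n = 100, RSS = 2.9872, and EDF = 9.69 from the summary gives 100 × 2.9872 / (100 − 9.69)² ≈ 3.663e-02, matching the printed value.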
Multivariate Additive GAM
from aurora.models.gam import fit_additive_gam, SmoothTerm, ParametricTerm
import numpy as np
# Generate data with multiple nonlinear relationships
np.random.seed(42)
n = 200
X = np.random.randn(n, 3)
y = (np.sin(2 * X[:, 0]) +      # Nonlinear effect of x1
     np.cos(X[:, 1]) +          # Nonlinear effect of x2
     2 * X[:, 2] +              # Linear effect of x3
     0.1 * np.random.randn(n))
# Fit additive GAM: y = intercept + f1(x1) + f2(x2) + β*x3
result = fit_additive_gam(
    X, y,
    smooth_terms=[
        SmoothTerm(variable=0, n_basis=12, basis_type='bspline'),
        SmoothTerm(variable=1, n_basis=12, basis_type='cubic')
    ],
    parametric_terms=[
        ParametricTerm(variable=2)  # Linear term
    ]
)
# Print comprehensive summary
print(result.summary())
# Make predictions
X_new = np.random.randn(50, 3)
y_pred = result.predict(X_new)
Formula-Based API (R-style)
from aurora.models.gam import fit_gam_formula
import pandas as pd
# R-style formula with smooth terms
result = fit_gam_formula(
    formula="y ~ s(x1, k=10) + s(x2, bs='cubic') + x3",
    data=df,
    method='REML'  # or 'GCV'
)
# Tensor product interactions
result = fit_gam_formula(
    formula="y ~ te(x1, x2) + s(x3)",
    data=df
)
# Print comprehensive summary
print(result.summary())
# Visualize all smooth terms
from aurora.models.gam import plot_all_smooths
plot_all_smooths(result)
Visualizing Smooth Terms
from aurora.models.gam import plot_smooth, plot_all_smooths
# Plot single smooth term with confidence bands
fig = plot_smooth(
    result,
    term='s(0)',  # or term=0 for first variable
    confidence_level=0.95,
    partial_residuals=True
)
# Plot all smooth terms in a grid
fig = plot_all_smooths(
    result,
    ncols=2,
    confidence_level=0.95
)
GAMM API (Phase 4 - AVAILABLE NOW!)
Basic Random Intercept Model
from aurora.models.gamm import fit_gamm, RandomEffect
import numpy as np
# Generate longitudinal data
np.random.seed(42)
n_subjects, n_per_subject = 10, 15
n = n_subjects * n_per_subject
subject_id = np.repeat(np.arange(n_subjects), n_per_subject)
time = np.tile(np.arange(n_per_subject), n_subjects)
# Design matrix
X = np.column_stack([np.ones(n), time])
# Random intercepts (subject-specific baseline)
b_subj = np.random.randn(n_subjects) * 0.8
y = 2.0 + 0.5 * time + b_subj[subject_id] + np.random.randn(n) * 0.3
# Fit GAMM with random intercept
re = RandomEffect(grouping='subject')
result = fit_gamm(
    y=y,
    X=X,
    random_effects=[re],
    groups_data={'subject': subject_id},
    covariance='identity'
)
# View results
print(f"Fixed effects (β): {result.beta_parametric}")
print(f"Variance components (Ψ): {result.variance_components}")
print(f"Residual variance (σ²): {result.residual_variance}")
print(f"AIC: {result.aic:.2f}, BIC: {result.bic:.2f}")
Random Intercept + Slope Model
# Random intercept + slope on 'time' (variable index 1)
re_slope = RandomEffect(
    grouping='subject',
    variables=(1,),         # Random slope for 'time'
    include_intercept=True
)
# Fit model with unstructured covariance (allows correlation)
result = fit_gamm(
    y=y,
    X=X,
    random_effects=[re_slope],
    groups_data={'subject': subject_id},
    covariance='unstructured'
)
# Extract variance-covariance matrix
psi = result.variance_components
print(f"Var(intercept): {psi[0, 0]:.3f}")
print(f"Cov(intercept, slope): {psi[0, 1]:.3f}")
print(f"Var(slope): {psi[1, 1]:.3f}")
# Correlation between random intercept and slope
corr = psi[0, 1] / np.sqrt(psi[0, 0] * psi[1, 1])
print(f"Correlation: {corr:.3f}")
Temporal Correlation with AR1
For longitudinal data with temporal autocorrelation:
from aurora.models.gamm import fit_gamm, RandomEffect
# AR1 covariance for temporally correlated observations
re_ar1 = RandomEffect(grouping='subject', covariance='ar1')
result = fit_gamm(
    y=y,
    X=X,
    random_effects=[re_ar1],
    groups_data={'subject': subject_id}
)
# Extract AR1 parameters
params = result.covariance_params[0]
sigma2 = np.exp(params[0])   # Variance σ²
rho = np.tanh(params[1])     # Autocorrelation ρ
print(f"AR1: σ² = {sigma2:.3f}, ρ = {rho:.3f}")
# Interpretation: observations at lag k have correlation ρ^k
# Example: ρ = 0.7 means adjacent observations have correlation 0.7,
# observations 2 time units apart have correlation 0.49, etc.
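The decay pattern is easy to see by building the implied correlation matrix directly (a standalone NumPy illustration of the ρ^k rule above, independent of Aurora):
import numpy as np

rho = 0.7
t = np.arange(5)                             # five equally spaced time points
R = rho ** np.abs(t[:, None] - t[None, :])   # AR1: corr(t_i, t_j) = rho^|i-j|
print(np.round(R[0], 3))                     # [1.0, 0.7, 0.49, 0.343, 0.24]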
Compound Symmetry (Exchangeable Correlation)
For clustered data with equal within-cluster correlation:
# Compound symmetry: all pairs equally correlated
re_cs = RandomEffect(grouping='cluster', covariance='compound_symmetry')
result = fit_gamm(
    y=y,
    X=X,
    random_effects=[re_cs],
    groups_data={'cluster': cluster_id}
)
# Extract variance parameter
params = result.covariance_params[0]
sigma2 = np.exp(params[0])
# For compound symmetry, all within-cluster pairs have the same correlation
psi = result.variance_components[0]
print(f"Estimated variance: σ² = {sigma2:.3f}")
print("Use compound symmetry when cluster members are exchangeable")
print("(e.g., students in same school, patients in same hospital)")
Sparse Matrix Support for Large-Scale Models
For large datasets or models with many basis functions:
# Sparse GAMM (10-100× faster for large problems)
result = fit_gamm(
    formula="y ~ s(x, k=50) + (1 | subject)",
    data={"y": y, "x": x, "subject": subject_id},
    use_sparse=True  # Enable sparse matrices
)
# Benefits of sparse matrices:
# - Memory: 6-8ร reduction for typical problems
# - Speed: 10-100× faster for n > 1000
# - Enables models that don't fit in memory with dense matrices
# When to use sparse:
# - Large datasets (n > 500, k > 20)
# - B-spline basis functions (naturally sparse)
# - Memory-constrained environments
# - Multiple smooth terms
Non-Gaussian GAMM (Poisson/Binomial with PQL)
from aurora.models.gam import SmoothTerm
from aurora.models.gamm import fit_pql_smooth
# Poisson GAMM for count data
result = fit_pql_smooth(
    y=counts,
    X=X,
    groups=subject_id,
    family='poisson',
    smooth_terms=[SmoothTerm(variable=0, n_basis=10)],
    max_iter=50
)
# Binomial GAMM for binary outcomes
result = fit_pql_smooth(
    y=binary_outcome,
    X=X,
    groups=subject_id,
    family='binomial',
    smooth_terms=[SmoothTerm(variable=0, n_basis=8)],
    max_iter=100
)
print(f"Fixed effects: {result.beta}")
print(f"Random effects variance: {result.variance_components}")
Current Implementation Status
Phase 1: Core Numerical Foundation - COMPLETED ✅
Backend Infrastructure:
- ✅ Backend abstraction layer (JAX, PyTorch)
- ✅ Type system with comprehensive Protocols
- ✅ Array namespace utilities for transparent NumPy/PyTorch/JAX compatibility
Optimization Algorithms:
- ✅ Newton-Raphson with automatic Hessian
- ✅ IRLS (Iteratively Reweighted Least Squares) with sparse matrix support
- ✅ L-BFGS with strong Wolfe line search
- ✅ Autodiff module (gradient, hessian, jacobian) for all backends
Distribution Families (10/10):
- ✅ Gaussian, Poisson, Binomial, Gamma
- ✅ Beta, Inverse Gaussian, Negative Binomial
- ✅ Student-t, Tweedie, Quasi-families
Link Functions (6/6):
- ✅ Identity, Log, Logit, Inverse, CLogLog, Probit
Phase 2: Basic GLM - COMPLETED ✅
- ✅ IRLS-based fit_glm() with multi-backend support
- ✅ R-style GLMResult.summary() with p-values, significance codes
- ✅ Diagnostic plots (residuals, Q-Q, scale-location, leverage)
- ✅ Confidence intervals, hypothesis tests, influence measures
- ✅ Cross-validation and comprehensive metrics
- ✅ Validation against statsmodels and R glm()
Phase 3: GAM (Splines and Smoothing) - COMPLETED ✅
- ✅ B-spline, natural cubic spline, thin plate spline bases
- ✅ GCV and REML smoothing parameter selection
- ✅ R-style formula parser (y ~ s(x1) + s(x2) + x3)
- ✅ Tensor product smooths (te(x1, x2))
- ✅ Visualization with confidence bands
- ✅ Sparse matrix support for large problems
Phase 4: GAMM (Random Effects) - COMPLETED ✅
- ✅ Random intercepts and slopes
- ✅ Nested and crossed random effects
- ✅ Multiple covariance structures (identity, unstructured, diagonal)
- ✅ AR1 and compound symmetry for temporal data
- ✅ Toeplitz covariance structure
- ✅ PQL estimation for non-Gaussian families
- ✅ Laplace approximation
- ✅ Sparse matrix optimization
Phase 5: Extended Features - IN PROGRESS 🚧 (85%)
Implemented:
- ✅ Extended distributions (Beta, Inverse Gaussian, Negative Binomial, Student-t, Tweedie)
- ✅ Temporal covariance structures (AR1, compound symmetry, Toeplitz)
- ✅ Sparse matrix support (6-8× memory reduction, 10-100× speedup)
- ✅ GPU acceleration benchmarks (up to 141× speedup)
- ✅ Multi-backend accuracy validation
- ✅ Comprehensive benchmark suite
Remaining:
- PyPI package publication
- Documentation website
- Additional spatial covariance structures
- Performance optimizations for massive datasets
Complete Feature Set
Distribution Families (10/10)
- ✅ Gaussian (Normal) - continuous data
- ✅ Poisson - count data
- ✅ Binomial - binary/proportions
- ✅ Gamma - positive continuous, right-skewed
- ✅ Beta - proportions in (0,1)
- ✅ Inverse Gaussian - positive durations (Wald distribution)
- ✅ Negative Binomial - overdispersed counts (NB2 parameterization)
- ✅ Student-t - heavy-tailed, robust regression
- ✅ Tweedie - compound Poisson-Gamma (insurance/actuarial)
- ✅ Quasi-families - Quasi-Poisson, Quasi-Binomial
Link Functions (6/6)
- ✅ Identity: g(μ) = μ
- ✅ Log: g(μ) = log(μ)
- ✅ Logit: g(μ) = log(μ/(1-μ)) (numerical round-trip check after this list)
- ✅ Probit: g(μ) = Φ⁻¹(μ)
- ✅ Inverse: g(μ) = 1/μ
- ✅ CLogLog: g(μ) = log(-log(1-μ))
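As a quick numerical check of the logit definition above (plain NumPy, independent of Aurora):
import numpy as np

mu = np.array([0.2, 0.5, 0.8])
eta = np.log(mu / (1 - mu))        # g(μ) = log(μ/(1-μ))
mu_back = 1 / (1 + np.exp(-eta))   # g⁻¹(η) = 1/(1+exp(-η))
assert np.allclose(mu_back, mu)    # the round trip recovers μ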
GAM Features
- ✅ B-spline basis: Cox-de Boor recursion, local support, partition of unity (minimal sketch after this list)
- ✅ Natural cubic splines: Truncated power basis with analytical penalties
- ✅ Thin plate splines: Multidimensional smoothing with radial basis functions
- ✅ Tensor products: te(x1, x2) for multidimensional interactions
- ✅ GCV smoothing: Generalized Cross-Validation for automatic λ selection
- ✅ REML smoothing: Restricted Maximum Likelihood for better multi-term selection
- ✅ Sparse matrices: 5-8× speedup and 6-8× memory reduction
- ✅ Formula parser: R/mgcv-compatible syntax
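To make the Cox-de Boor recursion concrete, here is a minimal recursive sketch (illustrative only; Aurora's actual basis construction is vectorized and handles boundary knots and multiplicities):
import numpy as np

def bspline_basis(x, knots, i, k):
    # Value of the i-th B-spline of order k (degree k-1) at points x.
    if k == 1:
        return np.where((knots[i] <= x) & (x < knots[i + 1]), 1.0, 0.0)
    left = right = 0.0
    if knots[i + k - 1] > knots[i]:  # skip zero-width spans to avoid division by zero
        left = (x - knots[i]) / (knots[i + k - 1] - knots[i]) * bspline_basis(x, knots, i, k - 1)
    if knots[i + k] > knots[i + 1]:
        right = (knots[i + k] - x) / (knots[i + k] - knots[i + 1]) * bspline_basis(x, knots, i, k - 1)
    return left + right
Summing these basis functions over all valid i gives 1 everywhere inside the knot range, the partition-of-unity property noted above.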
GAMM Features
- ✅ Random effects: Intercepts and slopes with flexible specification
- ✅ Nested/crossed effects: Complex hierarchies fully supported
- ✅ Covariance structures:
  - Identity (diagonal)
  - Unstructured (full variance-covariance)
  - Diagonal (heterogeneous variances)
  - AR1 (temporal autocorrelation with exponential decay)
  - Compound symmetry (exchangeable correlation)
  - Toeplitz (banded temporal correlations)
- ✅ PQL estimation: Penalized Quasi-Likelihood for non-Gaussian families
- ✅ Laplace approximation: Alternative estimation method
- ✅ Sparse matrices: Memory-efficient for large-scale models
- ✅ Formula syntax: lme4-style (1 + x | group) supported
Inference & Diagnostics
- ✅ Standard errors: Wald approximation with delta method
- ✅ P-values: Z-tests and likelihood ratio tests
- ✅ Confidence intervals: For coefficients and predictions
- ✅ Residuals: Response, Pearson, deviance, working, studentized
- ✅ Influence measures: Leverage, Cook's distance, DFBETAs
- ✅ Hypothesis tests: Wald tests for single and multi-constraint hypotheses
- ✅ Model comparison: AIC, BIC, pseudo R², deviance
Validation
- ✅ Cross-validation: KFold, StratifiedKFold with aggregated results
- ✅ Metrics: MSE, MAE, RMSE, accuracy, log-loss, Brier score, C-index
- ✅ Diagnostic plots: 4-panel residual plots (residuals, Q-Q, scale-location, leverage)
- ✅ Q-Q plots: Normal probability plots for residuals
- ✅ Caterpillar plots: Random effects visualization with confidence intervals
Benchmarking and Validation
Run comprehensive benchmarks comparing Aurora with statsmodels and R:
# Comprehensive benchmarks (accuracy + GPU performance + R comparison)
cd /tmp && PYTHONPATH=/path/to/Aurora-GLM python /path/to/Aurora-GLM/benchmarks/comprehensive_benchmarks.py
# Quick benchmarks (CI-friendly)
cd /tmp && PYTHONPATH=/path/to/Aurora-GLM python /path/to/Aurora-GLM/benchmarks/comprehensive_benchmarks.py --quick
# Performance benchmarks only
PYTHONPATH=. python benchmarks/performance_benchmarks.py
# GLM validation against statsmodels
PYTHONPATH=. python benchmarks/run_glm_checks.py --replicates 3
Note: Run from /tmp or another directory without an renv project to avoid R library path conflicts when using rpy2.
Results saved to benchmarks/results/. See PERFORMANCE.md for detailed analysis.
External Validation
Latest validation results:
- vs R glm(): max |Δcoef| < 1e-11
- vs statsmodels: max |Δcoef| < 2e-6
- Multi-backend: JAX vs NumPy < 1e-15
- PyTorch vs NumPy < 4e-7
All backends produce consistent, validated results suitable for scientific research.
Development
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=aurora --cov-report=html
# Run specific module
pytest tests/test_models/test_glm_fitting.py
# Run specific test
pytest tests/test_distributions/test_links.py::test_identity_link_roundtrip
# Verbose output
pytest -v
Test Status: 494 tests collected (comprehensive coverage across GLM, GAM, GAMM, Bayesian inference, count models, and smoothing methods)
Code Quality
# Format code
ruff format aurora/ tests/
# Lint code
ruff check aurora/ tests/
# Type checking
mypy aurora/
# Run all quality checks before committing
ruff format aurora/ tests/ && ruff check aurora/ tests/ && mypy aurora/
Development Workflow
# Create feature branch
git checkout -b feature/implement-new-distribution
# Make changes and test
pytest
pytest --cov=aurora --cov-report=html
# Format and lint
ruff format aurora/ tests/
ruff check aurora/ tests/
mypy aurora/
# Commit with descriptive message
git commit -m "feat(distributions): add Exponential family"
# Push and create PR
git push origin feature/implement-new-distribution
Design Principles
Array Namespace Pattern
Aurora-GLM uses a namespace abstraction to support multiple array libraries transparently:
from aurora.distributions._utils import namespace, as_namespace_array
def my_function(x, y):
    # Automatically detect NumPy, PyTorch, or JAX
    xp = namespace(x, y)
    # Convert to appropriate array type
    x_arr = as_namespace_array(x, xp, like=y)
    # Use namespace-specific operations
    return xp.sum(x_arr * xp.exp(y))
This pattern ensures:
- Zero code changes between backends
- Automatic GPU support with PyTorch/JAX
- Consistent results across all backends (validated < 1e-15)
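A quick usage sketch of the pattern (assuming my_function as defined above; the argument arrays decide which library executes):
import numpy as np
import torch

print(my_function(np.array([1.0, 2.0]), np.array([0.0, 1.0])))          # NumPy arrays in, NumPy result out
print(my_function(torch.tensor([1.0, 2.0]), torch.tensor([0.0, 1.0])))  # PyTorch tensors in, tensor out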
Extensibility
Users can create custom distributions and link functions:
from aurora.distributions.base import Family, LinkFunction
from aurora.distributions._utils import namespace, ensure_positive
class MyDistribution(Family):
    def log_likelihood(self, y, mu, **params):
        xp = namespace(y, mu)
        mu = ensure_positive(mu, xp)
        # Implementation using namespace pattern
        ...

    def deviance(self, y, mu, **params):
        ...

    def variance(self, mu, **params):
        # V(μ) = variance function
        ...

    def initialize(self, y):
        # Return initial μ estimates
        ...

    @property
    def default_link(self):
        return MyLink()

class MyLink(LinkFunction):
    def link(self, mu):
        # η = g(μ)
        ...

    def inverse(self, eta):
        # μ = g⁻¹(η)
        ...

    def derivative(self, mu):
        # dη/dμ = g'(μ)
        ...
When to Use Aurora-GLM
Choose Aurora-GLM when you need:
- GPU acceleration - Up to 141× speedup for large datasets
- GAM/GAMM capabilities - statsmodels doesn't support these
- Multi-backend flexibility - PyTorch/JAX for GPU or autodiff
- Advanced covariance structures - AR1, compound symmetry, Toeplitz for temporal/spatial data
- Sparse matrix optimization - Handle models that don't fit in memory
- Consistent API - GLM → GAM → GAMM with same interface
- Research/teaching - Readable Python implementation with R-style output
- Extensibility - Easy to add custom distributions and link functions
Choose statsmodels when you need:
- Pure CPU GLM with maximum single-threaded speed
- Extensive diagnostics and inference tools
- Time series models (ARIMA, VAR, state space models)
- Mature ecosystem with extensive documentation
Project Structure
aurora/
├── core/
│   ├── backends/        # Multi-backend support (JAX, PyTorch)
│   │   ├── _protocol.py # Backend Protocol interface (internal)
│   │   ├── _registry.py # Backend registration (internal)
│   │   └── operations.py # Backend-agnostic numerical operations
│   ├── autodiff/        # Automatic differentiation (gradient, hessian, jacobian)
│   ├── optimization/    # IRLS, Newton-Raphson, L-BFGS
│   └── linalg/          # Linear algebra primitives
├── distributions/
│   ├── families/        # 10 distribution families
│   ├── links/           # 6 link functions
│   └── _utils.py        # namespace(), as_namespace_array(), ensure_positive()
├── models/
│   ├── glm/             # fit_glm(), predictions, diagnostics
│   ├── gam/             # fit_gam(), formula parser, smooths
│   └── gamm/            # fit_gamm(), random effects, PQL, covariance structures
├── smoothing/
│   ├── splines/         # B-splines, cubic, thin plate
│   ├── penalties/       # Difference penalties, ridge
│   └── selection/       # GCV, REML smoothing selection
├── inference/
│   ├── hypothesis/      # Wald tests, likelihood ratio tests
│   ├── intervals/       # Confidence intervals
│   └── diagnostics/     # Residuals, influence measures
└── validation/
    ├── metrics/         # MSE, MAE, accuracy, C-index
    └── cross_val/       # KFold, cross_val_score
Roadmap
Completed ✅
- Phase 1: Core infrastructure (backends, types, optimization)
- Phase 2: Full GLM implementation with IRLS, diagnostics, inference
- Phase 3: GAM with B-splines, natural cubic, thin plate, GCV/REML
- Phase 4: GAMM with random effects, PQL for non-Gaussian families
- Phase 5 (current): Extended distributions, temporal covariance, sparse matrices
- 494 tests with comprehensive coverage (5,731 new lines of test code)
- Validation against statsmodels and R
- GPU acceleration benchmarks (up to 141× speedup)
- Multi-backend accuracy validation
In Progress 🚧 (Phase 5 - 85% complete)
- PyPI package publication
- Documentation with interactive examples
- Research paper publication
- Additional spatial covariance structures (exponential, Matérn)
Recently Completed
- Comprehensive benchmark suite (1,591 lines): Multi-backend accuracy validation and GPU performance measurement
- Bayesian GLM inference tests (1,060 lines): Prior specification, posterior sampling, convergence diagnostics
- Zero-inflated count model tests (956 lines): ZIP and ZINB with excess zero validation
- Hurdle count model tests (640 lines): Two-stage modeling of structural zeros
- Smoothing method tests (1,083 lines): LOESS and P-spline implementation validation
- Integration tests (401 lines): End-to-end workflows for count models
Total test coverage expansion: 5,731 lines across benchmarking, Bayesian inference, count models, and smoothing methods.
References and Mathematical Foundations
Aurora-GLM is built on rigorous statistical foundations with comprehensive mathematical documentation. For detailed mathematical formulations, proofs, and derivations, see REFERENCES.md.
Core Statistical Theory
Generalized Linear Models (GLM):
- McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC.
- Nelder, J. A., & Wedderburn, R. W. M. (1972). "Generalized linear models." JRSS: Series A, 135(3), 370-384.
Generalized Additive Models (GAM):
- Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall/CRC.
- Wood, S. N. (2017). Generalized Additive Models: An Introduction with R (2nd ed.). CRC Press.
- Wood, S. N. (2011). "Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models." JRSS: Series B, 73(1), 3-36.
Generalized Additive Mixed Models (GAMM):
- Breslow, N. E., & Clayton, D. G. (1993). "Approximate inference in generalized linear mixed models." JASA, 88(421), 9-25.
- Lin, X., & Breslow, N. E. (1996). "Bias correction in generalized linear mixed models with multiple components of dispersion." JASA, 91(435), 1007-1016.
- Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). "Fitting linear mixed-effects models using lme4." Journal of Statistical Software, 67(1), 1-48.
Smoothing and Splines:
- de Boor, C. (2001). A Practical Guide to Splines (Revised ed.). Springer.
- Eilers, P. H. C., & Marx, B. D. (1996). "Flexible smoothing with B-splines and penalties." Statistical Science, 11(2), 89-121.
- Craven, P., & Wahba, G. (1978). "Smoothing noisy data with spline functions." Numerische Mathematik, 31(4), 377-403.
Numerical Methods
Optimization:
- Green, P. J. (1984). "Iteratively reweighted least squares for maximum likelihood estimation." JRSS: Series B, 46(2), 149-192.
- Liu, D. C., & Nocedal, J. (1989). "On the limited memory BFGS method for large scale optimization." Mathematical Programming, 45(1-3), 503-528.
Numerical Stability:
- Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
- Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM.
Reference Implementations (Validation)
- R glm(): Base stats package (GLM reference)
- R mgcv: GAM/GAMM by Simon Wood (gold standard for smoothing)
- R lme4: Mixed models by Bates et al. (GLMM reference)
- statsmodels: Python statistical modeling library
- scikit-learn: API design patterns and conventions
Technical Resources
- JAX: Composable transformations - https://jax.readthedocs.io
- PyTorch: Automatic differentiation - https://pytorch.org/docs
- Array API Standard: Cross-library compatibility - https://data-apis.org/array-api
For a comprehensive list of mathematical foundations, algorithms, and validation references, including detailed equations and derivations, please see REFERENCES.md.
Contributing
Contributions are welcome! This project is in active development with many opportunities to help:
Areas for Contribution
- Additional distributions - Zero-inflated, hurdle models
- Performance optimizations - Further GPU optimizations, distributed computing
- Documentation improvements - Tutorials, case studies, API reference
- Real-world case studies - Applications in ecology, epidemiology, finance
- Bug reports and fixes - Help improve stability and reliability
Contribution Guidelines
Please ensure:
- Type hints on all public functions
- Tests covering both NumPy and PyTorch backends
- Docstrings in NumPy/Google format with examples
- Code formatting with ruff format
Citation
If you use Aurora-GLM in your research, please cite it using the information in our CITATION.cff file, or use the following BibTeX entry:
@software{aurora_glm2025,
  title   = {Aurora-GLM: Generalized Linear and Additive Models},
  author  = {Arias, Lucy Eduardo},
  year    = {2025},
  version = {0.7.0},
  url     = {https://github.com/Matcraft94/Aurora-GLM},
  license = {MIT}
}
GitHub users can use the "Cite this repository" button in the right sidebar to automatically generate citations in various formats.
License
MIT License - see LICENSE file
Acknowledgments
Aurora-GLM draws inspiration from:
- R's mgcv package by Simon Wood
- R's lme4 package by Bates et al.
- Python's statsmodels library
- The JAX ecosystem for modern array programming
Special thanks to the open-source community for providing excellent tools and libraries.
Status: Phase 5 (85% complete) - Extended features, temporal covariance, GPU acceleration
Version: 0.7.0
Tests: 494 tests collected (5,731 new lines added: benchmarks, Bayesian, count models, smoothing)
Python: 3.10+
GPU: Up to 141× speedup with PyTorch CUDA
Accuracy: Validated against R and statsmodels (< 1e-11)
Maintained by: Lucy E. Arias (@Matcraft94)
Illuminating complex data with modern generalized linear and additive modeling tools.