Skip to main content

High-performance statistical operations for Polars DataFrames

Project description

causers

Build Status PyPI Version Python Versions License: MIT Coverage: 100% Documentation Status

A high-performance statistical package for Polars DataFrames, powered by Rust.

๐Ÿš€ Overview

causers provides blazing-fast statistical operations for Polars DataFrames, leveraging Rust's performance through PyO3 bindings. Designed for data scientists and analysts who need production-grade performance without sacrificing ease of use.

โœจ Key Features

  • ๐ŸŽ๏ธ High Performance: Linear regression on 1M rows in ~250ms with HC3 standard errors
  • ๐Ÿ“Š Multiple Regression: Support for multiple covariates with matrix-based OLS
  • ๐Ÿ”ฎ Logistic Regression: Binary outcome regression with Newton-Raphson MLE
  • ๐Ÿ“ˆ Robust Standard Errors: HC3 heteroskedasticity-consistent standard errors included
  • ๐ŸŽฏ Flexible Models: Optional intercept for fully saturated models
  • ๐Ÿข Clustered Standard Errors: Cluster-robust SE for panel/grouped data
  • ๐Ÿ”„ Bootstrap Methods: Wild cluster bootstrap (linear) and score bootstrap (logistic)
  • ๐Ÿงช Synthetic DID: Synthetic Difference-in-Differences for causal inference with panel data
  • ๐ŸŽฏ Synthetic Control: Classic SC with 4 method variants (traditional, penalized, robust, augmented)
  • ๐Ÿ”ง Native Polars Integration: Zero-copy operations on Polars DataFrames
  • ๐Ÿฆ€ Rust-Powered: Core computations in Rust for maximum throughput
  • ๐Ÿ Pythonic API: Clean, intuitive interface with full type hints
  • ๐Ÿ›ก๏ธ Production Ready: Comprehensive test coverage, security rating B+
  • ๐ŸŒ Cross-Platform: Works on Linux, macOS (Intel/ARM), and Windows

๐Ÿ“ฆ Installation

From PyPI (Recommended)

pip install causers

From Source (Development)

# Prerequisites: Python 3.8+ and Rust 1.70+
git clone https://github.com/causers/causers.git
cd causers

# Install build dependencies
pip install maturin polars numpy

# Build and install in development mode
maturin develop --release

๐ŸŽฏ Quick Start

Single Covariate Regression

import polars as pl
import causers

# Create a sample DataFrame
df = pl.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 5.0, 8.0, 10.0]
})

# Perform linear regression
result = causers.linear_regression(df, x_cols="x", y_col="y")

print(f"Slope: {result.slope:.4f}")
print(f"Intercept: {result.intercept:.4f}")
print(f"R-squared: {result.r_squared:.4f}")
print(f"Sample size: {result.n_samples}")

Output:

Slope: 2.0000
Intercept: -0.0000
R-squared: 0.9459
Sample size: 5

Multiple Covariate Regression

# Multiple regression with two predictors
df_multi = pl.DataFrame({
    "size": [1000, 1500, 1200, 1800, 2200],
    "age": [5, 10, 3, 15, 7],
    "price": [200000, 280000, 245000, 350000, 430000]
})

# Predict price from size and age
result = causers.linear_regression(df_multi, x_cols=["size", "age"], y_col="price")

print(f"Coefficients: {result.coefficients}")
print(f"Intercept: {result.intercept:.2f}")
print(f"R-squared: {result.r_squared:.4f}")

Accessing Standard Errors

import polars as pl
import causers

# Regression with noisy data
df = pl.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.3, 13.9, 16.2, 17.8, 20.1]
})

result = causers.linear_regression(df, "x", "y")

# Access HC3 robust standard errors
print(f"Coefficient: {result.coefficients[0]:.4f} ยฑ {result.standard_errors[0]:.4f}")
print(f"Intercept: {result.intercept:.4f} ยฑ {result.intercept_se:.4f}")

Output:

Coefficient: 1.9879 ยฑ 0.0294
Intercept: 0.1333 ยฑ 0.1828

Note: Standard errors use the HC3 estimator (MacKinnon & White, 1985), which provides heteroskedasticity-consistent inference even when error variance is not constant.

Logistic Regression

import polars as pl
import causers

# Binary outcome data
df = pl.DataFrame({
    "x": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
    "y": [0, 0, 0, 1, 0, 1, 1, 1]
})

result = causers.logistic_regression(df, x_cols="x", y_col="y")

print(f"Coefficient (log-odds): {result.coefficients[0]:.4f}")
print(f"Standard Error: {result.standard_errors[0]:.4f}")
print(f"Converged: {result.converged} in {result.iterations} iterations")
print(f"Pseudo Rยฒ: {result.pseudo_r_squared:.4f}")

Tip: Coefficients are on the log-odds scale. Use math.exp(coefficient) to convert to odds ratios.

Clustered Standard Errors

When observations are clustered (e.g., students within schools, employees within firms), use cluster-robust standard errors:

# Panel data with firm-level clustering
df = pl.DataFrame({
    "treatment": [0, 0, 1, 1, 0, 0, 1, 1],
    "outcome": [5, 6, 12, 14, 4, 7, 11, 15],
    "firm_id": [1, 1, 1, 1, 2, 2, 2, 2]
})

result = causers.linear_regression(
    df,
    x_cols="treatment",
    y_col="outcome",
    cluster="firm_id"
)

print(f"Treatment effect: {result.coefficients[0]:.4f}")
print(f"Clustered SE: {result.standard_errors[0]:.4f}")
print(f"Number of clusters: {result.n_clusters}")

Tip: When you have fewer than 42 clusters, use bootstrap=True for more reliable inference:

result = causers.linear_regression(
    df, "treatment", "outcome",
    cluster="firm_id", bootstrap=True, seed=42
)

Bootstrap Weight Methods

When using wild cluster bootstrap with very few clusters (G < 10), the Webb six-point distribution can provide improved small-sample properties:

# Wild cluster bootstrap with Webb weights (recommended for very few clusters)
result = causers.linear_regression(
    df,
    x_cols="treatment",
    y_col="outcome",
    cluster="firm_id",
    bootstrap=True,
    bootstrap_method="webb",  # Use Webb six-point distribution
    seed=42
)

print(f"Treatment effect: {result.coefficients[0]:.4f}")
print(f"Bootstrap SE: {result.standard_errors[0]:.4f}")
print(f"SE type: {result.cluster_se_type}")  # "bootstrap_webb"

When to use each method:

  • Rademacher (default): Standard choice, works well with moderate to many clusters
  • Webb: Recommended when you have very few clusters (G < 10). Uses a six-point distribution that better approximates the normal distribution with small samples.

Reference: MacKinnon, J. G., & Webb, M. D. (2018). "The wild bootstrap for few (treated) clusters." The Econometrics Journal, 21(2), 114-135.

Regression Without Intercept

# Force regression through origin
result = causers.linear_regression(
    df,
    x_cols="x",
    y_col="y",
    include_intercept=False
)

print(f"Coefficient: {result.coefficients[0]:.4f}")
print(f"Intercept: {result.intercept}")  # None
print(f"Standard Error: {result.standard_errors[0]:.4f}")
print(f"Intercept SE: {result.intercept_se}")  # None

Synthetic Difference-in-Differences

Estimate causal effects from panel data using Synthetic DID (Arkhangelsky et al., 2021):

import polars as pl
import causers

# Panel data: units observed over time, with treatment in post-period
df = pl.DataFrame({
    "unit": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "time": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
    "outcome": [1.0, 2.0, 3.0, 9.0,   # Unit 0: treated (effect = 5)
                1.0, 2.0, 3.0, 4.0,   # Unit 1: control
                1.5, 2.5, 3.5, 4.5,   # Unit 2: control
                0.5, 1.5, 2.5, 3.5],  # Unit 3: control
    "treated": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
})

result = causers.synthetic_did(
    df,
    unit_col="unit",
    time_col="time",
    outcome_col="outcome",
    treatment_col="treated",
    bootstrap_iterations=200,
    seed=42
)

print(f"ATT: {result.att:.4f} ยฑ {result.standard_error:.4f}")
print(f"Panel: {result.n_units_control} control, {result.n_units_treated} treated")
print(f"Periods: {result.n_periods_pre} pre, {result.n_periods_post} post")
print(f"Pre-treatment fit (RMSE): {result.pre_treatment_fit:.4f}")

Output:

ATT: 5.0000 ยฑ 0.0000
Panel: 3 control, 1 treated
Periods: 3 pre, 1 post
Pre-treatment fit (RMSE): 0.0000

Note: The Frank-Wolfe algorithm solves for optimal unit and time weights constrained to the unit simplex. Placebo bootstrap is used by default for standard error estimation.

Synthetic Control (Single Treated Unit)

Estimate causal effects for a single treated unit using Synthetic Control (Abadie et al., 2010):

import polars as pl
import causers

# Panel data: 1 treated unit (unit 1) with controls
df = pl.DataFrame({
    "state": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "year": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "outcome": [10.0, 12.0, 14.0, 25.0,   # State 1: treated in year 4
                9.0, 11.0, 13.0, 15.0,    # State 2: control
                11.0, 13.0, 15.0, 17.0],  # State 3: control
    "treated": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
})

result = causers.synthetic_control(
    df,
    unit_col="state",
    time_col="year",
    outcome_col="outcome",
    treatment_col="treated",
    method="traditional",  # or "penalized", "robust", "augmented"
    seed=42
)

print(f"ATT: {result.att:.4f} ยฑ {result.standard_error:.4f}")
print(f"Method: {result.method}")
print(f"Control weights: {result.unit_weights}")
print(f"Pre-treatment RMSE: {result.pre_treatment_rmse:.4f}")

Note: Synthetic Control requires exactly ONE treated unit. For multiple treated units, use synthetic_did() instead. The 4 method variants offer different trade-offs between bias and variance.

๐Ÿ“Š Performance Benchmarks

Benchmarked on Apple M1 Pro with 16GB RAM:

Dataset Size causers NumPy+pandas Speedup
1,000 rows 0.8ms 2.1ms 2.6x
100,000 rows 4.2ms 15.3ms 3.6x
1,000,000 rows 45ms 142ms 3.2x
5,000,000 rows 210ms 723ms 3.4x

Performance may vary based on hardware. All benchmarks use the same OLS algorithm.

๐Ÿ“– API Documentation

linear_regression(df, x_cols, y_col, ...)

Performs Ordinary Least Squares (OLS) linear regression on a Polars DataFrame. Supports both single and multiple covariate regression.

Parameters:

  • df (pl.DataFrame): Input DataFrame containing the data
  • x_cols (str | List[str]): Name(s) of independent variable column(s)
    • Single covariate: "feature" (backward compatible)
    • Multiple covariates: ["size", "age", "bedrooms"]
  • y_col (str): Name of the dependent variable column
  • include_intercept (bool, optional): Whether to include intercept term. Default: True
    • True: Standard regression y = ฮฒโ‚€ + ฮฒโ‚xโ‚ + ฮฒโ‚‚xโ‚‚ + ...
    • False: Regression through origin y = ฮฒโ‚xโ‚ + ฮฒโ‚‚xโ‚‚ + ...

Returns:

  • LinearRegressionResult: Object with the following attributes:
    • coefficients (List[float]): Regression coefficients for each predictor
    • standard_errors (List[float]): HC3 robust standard errors for each coefficient
    • slope (float | None): Single coefficient (backward compatibility, single covariate only)
    • intercept (float | None): Y-intercept (None if include_intercept=False)
    • intercept_se (float | None): HC3 standard error for intercept (None if include_intercept=False)
    • r_squared (float): Coefficient of determination (Rยฒ)
    • n_samples (int): Number of data points used

Raises:

  • ValueError: If columns don't exist, have mismatched lengths, contain invalid data, or if any observation has extreme leverage (โ‰ฅ0.99)
  • TypeError: If columns contain non-numeric data

Examples:

Single covariate:

import polars as pl
import causers

df = pl.read_csv("data.csv")
result = causers.linear_regression(df, x_cols="feature", y_col="target")
print(f"y = {result.slope}x + {result.intercept}")

Multiple covariates:

# Multiple regression
result = causers.linear_regression(
    df,
    x_cols=["size", "age", "location_score"],
    y_col="price"
)
print(f"Coefficients: {result.coefficients}")
print(f"Intercept: {result.intercept}")
print(f"Rยฒ = {result.r_squared:.4f}")

No intercept:

# Regression through origin
result = causers.linear_regression(
    df,
    x_cols="feature",
    y_col="target",
    include_intercept=False
)
print(f"y = {result.coefficients[0]}x (no intercept)")

logistic_regression(df, x_cols, y_col, ...)

Performs logistic regression on binary outcomes using Maximum Likelihood Estimation with Newton-Raphson optimization.

Parameters:

  • df (pl.DataFrame): Input DataFrame containing the data
  • x_cols (str | List[str]): Name(s) of independent variable column(s)
  • y_col (str): Name of the binary outcome column (must contain only 0 and 1)
  • include_intercept (bool, optional): Whether to include intercept term. Default: True
  • cluster (str, optional): Column for cluster identifiers. Default: None
  • bootstrap (bool, optional): Enable score bootstrap for clustered SE. Default: False
  • bootstrap_iterations (int, optional): Number of bootstrap replications. Default: 1000
  • seed (int, optional): Random seed for reproducibility. Default: None
  • bootstrap_method (str, optional): Weight distribution: "rademacher" (default) or "webb". Webb recommended for G < 10 clusters.

Returns:

  • LogisticRegressionResult: Object with the following attributes:
    • coefficients (List[float]): Coefficient estimates (log-odds scale)
    • standard_errors (List[float]): Robust standard errors for each coefficient
    • intercept (float | None): Intercept term
    • intercept_se (float | None): Standard error for intercept
    • n_samples (int): Number of observations used
    • converged (bool): Whether the optimizer converged
    • iterations (int): Number of iterations used
    • log_likelihood (float): Log-likelihood at MLE
    • pseudo_r_squared (float): McFadden's pseudo Rยฒ
    • n_clusters (int | None): Number of clusters (if clustered)
    • cluster_se_type (str | None): Type of SE: "analytical", "bootstrap_rademacher", or "bootstrap_webb"

Raises:

  • ValueError: If y contains values other than 0/1, perfect separation is detected, or convergence fails
  • TypeError: If columns contain non-numeric data

synthetic_did(df, unit_col, time_col, outcome_col, treatment_col, ...)

Performs Synthetic Difference-in-Differences (SDID) estimation on balanced panel data (Arkhangelsky et al., 2021).

Parameters:

  • df (pl.DataFrame): Balanced panel data in long format
  • unit_col (str): Column name for unit identifiers
  • time_col (str): Column name for time period identifiers
  • outcome_col (str): Column name for the outcome variable
  • treatment_col (str): Column name for treatment indicator (0/1)
  • bootstrap_iterations (int, optional): Number of placebo bootstrap iterations. Default: 200
  • seed (int, optional): Random seed for reproducibility. Default: None

Returns:

  • SyntheticDIDResult: Object with the following attributes:
    • att (float): Average Treatment Effect on the Treated
    • standard_error (float): Placebo bootstrap standard error
    • unit_weights (List[float]): Weights for control units (sum to 1.0)
    • time_weights (List[float]): Weights for pre-treatment periods (sum to 1.0)
    • n_units_control (int): Number of control units
    • n_units_treated (int): Number of treated units
    • n_periods_pre (int): Number of pre-treatment periods
    • n_periods_post (int): Number of post-treatment periods
    • solver_converged (bool): Whether Frank-Wolfe solver converged
    • solver_iterations (tuple): Iterations used for (unit, time) weight optimization
    • pre_treatment_fit (float): RMSE of synthetic control in pre-treatment period
    • bootstrap_iterations_used (int): Number of bootstrap iterations used

Raises:

  • ValueError: If panel is unbalanced, < 2 control units, < 2 pre-periods, no treated units, or treatment values not 0/1

Example:

import polars as pl
import causers

# Estimate treatment effect with SDID
result = causers.synthetic_did(
    panel_data,
    unit_col="state",
    time_col="year",
    outcome_col="gdp_growth",
    treatment_col="policy_adopted",
    bootstrap_iterations=500,
    seed=42
)

print(f"Estimated ATT: {result.att:.4f}")
print(f"95% CI: [{result.att - 1.96*result.standard_error:.4f}, "
      f"{result.att + 1.96*result.standard_error:.4f}]")

For full API documentation, see docs/api-reference.md.

๐Ÿ—๏ธ Architecture

graph TD
    A[Python API] --> B[PyO3 Bridge]
    B --> C[Rust Core]
    C --> D[Statistical Engine]
    
    E[Polars DataFrame] --> B
    D --> F[Results]
    F --> A

Project Structure

causers/
โ”œโ”€โ”€ src/                    # Rust source code
โ”‚   โ”œโ”€โ”€ lib.rs             # PyO3 bindings and module definition
โ”‚   โ””โ”€โ”€ stats.rs           # Statistical computation implementations
โ”œโ”€โ”€ python/                # Python package
โ”‚   โ””โ”€โ”€ causers/
โ”‚       โ””โ”€โ”€ __init__.py    # Python API and type definitions
โ”œโ”€โ”€ tests/                 # Comprehensive test suite
โ”‚   โ”œโ”€โ”€ test_basic.py
โ”‚   โ”œโ”€โ”€ test_edge_cases.py
โ”‚   โ”œโ”€โ”€ test_performance.py
โ”‚   โ””โ”€โ”€ test_integration.py
โ”œโ”€โ”€ Cargo.toml             # Rust dependencies
โ”œโ”€โ”€ pyproject.toml         # Python package configuration
โ””โ”€โ”€ README.md              # This file

๐Ÿ› ๏ธ Development

Prerequisites

  • Python 3.8 or higher
  • Rust 1.70 or higher
  • Polars 0.19 or higher

Building from Source

# Clone the repository
git clone https://github.com/causers/causers.git
cd causers

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Build the Rust extension
maturin develop --release

Running Tests

# Run all tests with coverage
pytest tests/ --cov=causers --cov-report=html

# Run specific test categories
pytest tests/test_performance.py -v  # Performance benchmarks
pytest tests/test_edge_cases.py -v   # Edge case handling

# Run Rust tests
cargo test

Code Quality

# Format Python code
black python/ tests/

# Lint Python code
ruff check python/ tests/

# Type check
mypy python/

# Format Rust code
cargo fmt

# Lint Rust code
cargo clippy

๐Ÿค Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on:

  • Code style and conventions
  • Testing requirements
  • Pull request process
  • Development workflow

๐Ÿ“‹ Roadmap

v0.1.0

  • โœ… Linear regression with OLS
  • โœ… Multiple covariate support
  • โœ… Optional intercept models
  • โœ… 100% test coverage
  • โœ… Performance validation

v0.2.0

  • โœ… HC3 robust standard errors
  • โœ… statsmodels-validated accuracy (rtol=1e-6)
  • โœ… Extreme leverage detection and error handling
  • โœ… Comprehensive test coverage with edge cases

v0.3.0

  • โœ… Logistic regression with Newton-Raphson MLE
  • โœ… Score bootstrap for logistic regression (Kline & Santos, 2012)
  • โœ… Clustered standard errors (analytical)
  • โœ… Wild cluster bootstrap for small cluster counts
  • โœ… Cluster balance warning (>50% imbalance detection)
  • โœ… Configurable bootstrap iterations and seed
  • โœ… Small-cluster warning (G < 42)
  • โœ… statsmodels/wildboottest-validated accuracy
  • โœ… Webb weights for bootstrap (bootstrap_method="webb")
  • โœ… Method-specific cluster_se_type values ("bootstrap_rademacher", "bootstrap_webb")

v0.4.0

  • โœ… Synthetic Difference-in-Differences (Arkhangelsky et al., 2021)
  • โœ… Frank-Wolfe algorithm for simplex-constrained weight optimization
  • โœ… Placebo bootstrap for standard error estimation
  • โœ… azcausal-validated accuracy (ATT rtol=1e-6, SE rtol=0.5)
  • โœ… Performance: 1000ร—100 panel < 1 second

v0.5.0 (Current)

  • โœ… Synthetic Control (Abadie et al., 2010, 2015)
  • โœ… Four method variants: traditional, penalized, robust, augmented
  • โœ… In-space placebo standard errors
  • โœ… Auto-lambda selection via LOOCV (penalized method)
  • โœ… Augmented SC with ridge bias correction (Ben-Michael et al., 2021)
  • โœ… pysyncon parity testing

๐Ÿ”’ Security

  • Memory Safety: Zero unsafe Rust code (except required PyO3 interfaces)
  • Input Validation: Comprehensive validation of all inputs
  • No Telemetry: No data collection or external network calls
  • Security Rating: B+ (See security assessment)

๐Ÿ“œ License

MIT License - see LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Polars for the excellent DataFrame library
  • PyO3 for seamless Python-Rust integration
  • maturin for simplified packaging

๐Ÿ“š Resources

๐Ÿ› Found a Bug?

Please open an issue with:

  • Minimal reproducible example
  • Expected vs actual behavior
  • Environment details (OS, Python version, etc.)

๐Ÿ“Š Status

  • Build: โœ… Passing
  • Tests: โœ… 193/193 passing
  • Coverage: โœ… 100%
  • Performance: โœ… <100ms for 1M rows
  • Security: โœ… B+ rating
  • Platforms: โœ… Linux, macOS, Windows

Made with โค๏ธ and ๐Ÿฆ€ by the causers team

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

causers-0.5.1.tar.gz (170.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

causers-0.5.1-cp38-abi3-manylinux_2_34_x86_64.whl (370.1 kB view details)

Uploaded CPython 3.8+manylinux: glibc 2.34+ x86-64

File details

Details for the file causers-0.5.1.tar.gz.

File metadata

  • Download URL: causers-0.5.1.tar.gz
  • Upload date:
  • Size: 170.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.10.2

File hashes

Hashes for causers-0.5.1.tar.gz
Algorithm Hash digest
SHA256 2d13d9a4e6ad53eb4c43b10a0c13b3e9ba793b74d482f404b83b3e2800134a33
MD5 40a7a1197e0ac487e5451cdfebc5de67
BLAKE2b-256 2899e97d855112c57c61dd84cf96472654066cdbc9399fb4f5e45a860f732d7c

See more details on using hashes here.

File details

Details for the file causers-0.5.1-cp38-abi3-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for causers-0.5.1-cp38-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 a498fb1186362715c72910036568a9f0bf847d1070e75f005b118a07ad8b71d0
MD5 c83d7b14caf321083855eb34ff730edd
BLAKE2b-256 7556fc19853bfb69fee64b621ab4bb8a7ac7d2613cc8a35eea25be869f3360d9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page