High-performance statistical operations for Polars DataFrames
Project description
causers
A high-performance statistical package for Polars DataFrames, powered by Rust.
๐ Overview
causers provides blazing-fast statistical operations for Polars DataFrames, leveraging Rust's performance through PyO3 bindings. Designed for data scientists and analysts who need production-grade performance without sacrificing ease of use.
โจ Key Features
- ๐๏ธ High Performance: Linear regression on 1M rows in ~250ms with HC3 standard errors
- ๐ Multiple Regression: Support for multiple covariates with matrix-based OLS
- ๐ฎ Logistic Regression: Binary outcome regression with Newton-Raphson MLE
- ๐ Robust Standard Errors: HC3 heteroskedasticity-consistent standard errors included
- ๐ฏ Flexible Models: Optional intercept for fully saturated models
- ๐ข Clustered Standard Errors: Cluster-robust SE for panel/grouped data
- ๐ Bootstrap Methods: Wild cluster bootstrap (linear) and score bootstrap (logistic)
- ๐งช Synthetic DID: Synthetic Difference-in-Differences for causal inference with panel data
- ๐ฏ Synthetic Control: Classic SC with 4 method variants (traditional, penalized, robust, augmented)
- ๐ง Native Polars Integration: Zero-copy operations on Polars DataFrames
- ๐ฆ Rust-Powered: Core computations in Rust for maximum throughput
- ๐ Pythonic API: Clean, intuitive interface with full type hints
- ๐ก๏ธ Production Ready: Comprehensive test coverage, security rating B+
- ๐ Cross-Platform: Works on Linux, macOS (Intel/ARM), and Windows
๐ฆ Installation
From PyPI (Recommended)
pip install causers
From Source (Development)
# Prerequisites: Python 3.8+ and Rust 1.70+
git clone https://github.com/causers/causers.git
cd causers
# Install build dependencies
pip install maturin polars numpy
# Build and install in development mode
maturin develop --release
๐ฏ Quick Start
Single Covariate Regression
import polars as pl
import causers
# Create a sample DataFrame
df = pl.DataFrame({
"x": [1.0, 2.0, 3.0, 4.0, 5.0],
"y": [2.0, 4.0, 5.0, 8.0, 10.0]
})
# Perform linear regression
result = causers.linear_regression(df, x_cols="x", y_col="y")
print(f"Slope: {result.slope:.4f}")
print(f"Intercept: {result.intercept:.4f}")
print(f"R-squared: {result.r_squared:.4f}")
print(f"Sample size: {result.n_samples}")
Output:
Slope: 2.0000
Intercept: -0.0000
R-squared: 0.9459
Sample size: 5
Multiple Covariate Regression
# Multiple regression with two predictors
df_multi = pl.DataFrame({
"size": [1000, 1500, 1200, 1800, 2200],
"age": [5, 10, 3, 15, 7],
"price": [200000, 280000, 245000, 350000, 430000]
})
# Predict price from size and age
result = causers.linear_regression(df_multi, x_cols=["size", "age"], y_col="price")
print(f"Coefficients: {result.coefficients}")
print(f"Intercept: {result.intercept:.2f}")
print(f"R-squared: {result.r_squared:.4f}")
Accessing Standard Errors
import polars as pl
import causers
# Regression with noisy data
df = pl.DataFrame({
"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
"y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.3, 13.9, 16.2, 17.8, 20.1]
})
result = causers.linear_regression(df, "x", "y")
# Access HC3 robust standard errors
print(f"Coefficient: {result.coefficients[0]:.4f} ยฑ {result.standard_errors[0]:.4f}")
print(f"Intercept: {result.intercept:.4f} ยฑ {result.intercept_se:.4f}")
Output:
Coefficient: 1.9879 ยฑ 0.0294
Intercept: 0.1333 ยฑ 0.1828
Note: Standard errors use the HC3 estimator (MacKinnon & White, 1985), which provides heteroskedasticity-consistent inference even when error variance is not constant.
Logistic Regression
import polars as pl
import causers
# Binary outcome data
df = pl.DataFrame({
"x": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
"y": [0, 0, 0, 1, 0, 1, 1, 1]
})
result = causers.logistic_regression(df, x_cols="x", y_col="y")
print(f"Coefficient (log-odds): {result.coefficients[0]:.4f}")
print(f"Standard Error: {result.standard_errors[0]:.4f}")
print(f"Converged: {result.converged} in {result.iterations} iterations")
print(f"Pseudo Rยฒ: {result.pseudo_r_squared:.4f}")
Tip: Coefficients are on the log-odds scale. Use
math.exp(coefficient)to convert to odds ratios.
Clustered Standard Errors
When observations are clustered (e.g., students within schools, employees within firms), use cluster-robust standard errors:
# Panel data with firm-level clustering
df = pl.DataFrame({
"treatment": [0, 0, 1, 1, 0, 0, 1, 1],
"outcome": [5, 6, 12, 14, 4, 7, 11, 15],
"firm_id": [1, 1, 1, 1, 2, 2, 2, 2]
})
result = causers.linear_regression(
df,
x_cols="treatment",
y_col="outcome",
cluster="firm_id"
)
print(f"Treatment effect: {result.coefficients[0]:.4f}")
print(f"Clustered SE: {result.standard_errors[0]:.4f}")
print(f"Number of clusters: {result.n_clusters}")
Tip: When you have fewer than 42 clusters, use
bootstrap=Truefor more reliable inference:result = causers.linear_regression( df, "treatment", "outcome", cluster="firm_id", bootstrap=True, seed=42 )
Bootstrap Weight Methods
When using wild cluster bootstrap with very few clusters (G < 10), the Webb six-point distribution can provide improved small-sample properties:
# Wild cluster bootstrap with Webb weights (recommended for very few clusters)
result = causers.linear_regression(
df,
x_cols="treatment",
y_col="outcome",
cluster="firm_id",
bootstrap=True,
bootstrap_method="webb", # Use Webb six-point distribution
seed=42
)
print(f"Treatment effect: {result.coefficients[0]:.4f}")
print(f"Bootstrap SE: {result.standard_errors[0]:.4f}")
print(f"SE type: {result.cluster_se_type}") # "bootstrap_webb"
When to use each method:
- Rademacher (default): Standard choice, works well with moderate to many clusters
- Webb: Recommended when you have very few clusters (G < 10). Uses a six-point distribution that better approximates the normal distribution with small samples.
Reference: MacKinnon, J. G., & Webb, M. D. (2018). "The wild bootstrap for few (treated) clusters." The Econometrics Journal, 21(2), 114-135.
Regression Without Intercept
# Force regression through origin
result = causers.linear_regression(
df,
x_cols="x",
y_col="y",
include_intercept=False
)
print(f"Coefficient: {result.coefficients[0]:.4f}")
print(f"Intercept: {result.intercept}") # None
print(f"Standard Error: {result.standard_errors[0]:.4f}")
print(f"Intercept SE: {result.intercept_se}") # None
Synthetic Difference-in-Differences
Estimate causal effects from panel data using Synthetic DID (Arkhangelsky et al., 2021):
import polars as pl
import causers
# Panel data: units observed over time, with treatment in post-period
df = pl.DataFrame({
"unit": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
"time": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"outcome": [1.0, 2.0, 3.0, 9.0, # Unit 0: treated (effect = 5)
1.0, 2.0, 3.0, 4.0, # Unit 1: control
1.5, 2.5, 3.5, 4.5, # Unit 2: control
0.5, 1.5, 2.5, 3.5], # Unit 3: control
"treated": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
})
result = causers.synthetic_did(
df,
unit_col="unit",
time_col="time",
outcome_col="outcome",
treatment_col="treated",
bootstrap_iterations=200,
seed=42
)
print(f"ATT: {result.att:.4f} ยฑ {result.standard_error:.4f}")
print(f"Panel: {result.n_units_control} control, {result.n_units_treated} treated")
print(f"Periods: {result.n_periods_pre} pre, {result.n_periods_post} post")
print(f"Pre-treatment fit (RMSE): {result.pre_treatment_fit:.4f}")
Output:
ATT: 5.0000 ยฑ 0.0000
Panel: 3 control, 1 treated
Periods: 3 pre, 1 post
Pre-treatment fit (RMSE): 0.0000
Note: The Frank-Wolfe algorithm solves for optimal unit and time weights constrained to the unit simplex. Placebo bootstrap is used by default for standard error estimation.
Synthetic Control (Single Treated Unit)
Estimate causal effects for a single treated unit using Synthetic Control (Abadie et al., 2010):
import polars as pl
import causers
# Panel data: 1 treated unit (unit 1) with controls
df = pl.DataFrame({
"state": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
"year": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
"outcome": [10.0, 12.0, 14.0, 25.0, # State 1: treated in year 4
9.0, 11.0, 13.0, 15.0, # State 2: control
11.0, 13.0, 15.0, 17.0], # State 3: control
"treated": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
})
result = causers.synthetic_control(
df,
unit_col="state",
time_col="year",
outcome_col="outcome",
treatment_col="treated",
method="traditional", # or "penalized", "robust", "augmented"
seed=42
)
print(f"ATT: {result.att:.4f} ยฑ {result.standard_error:.4f}")
print(f"Method: {result.method}")
print(f"Control weights: {result.unit_weights}")
print(f"Pre-treatment RMSE: {result.pre_treatment_rmse:.4f}")
Note: Synthetic Control requires exactly ONE treated unit. For multiple treated units, use
synthetic_did()instead. The 4 method variants offer different trade-offs between bias and variance.
๐ Performance Benchmarks
Benchmarked on Apple M1 Pro with 16GB RAM:
| Dataset Size | causers | NumPy+pandas | Speedup |
|---|---|---|---|
| 1,000 rows | 0.8ms | 2.1ms | 2.6x |
| 100,000 rows | 4.2ms | 15.3ms | 3.6x |
| 1,000,000 rows | 45ms | 142ms | 3.2x |
| 5,000,000 rows | 210ms | 723ms | 3.4x |
Performance may vary based on hardware. All benchmarks use the same OLS algorithm.
๐ API Documentation
linear_regression(df, x_cols, y_col, ...)
Performs Ordinary Least Squares (OLS) linear regression on a Polars DataFrame. Supports both single and multiple covariate regression.
Parameters:
df(pl.DataFrame): Input DataFrame containing the datax_cols(str | List[str]): Name(s) of independent variable column(s)- Single covariate:
"feature"(backward compatible) - Multiple covariates:
["size", "age", "bedrooms"]
- Single covariate:
y_col(str): Name of the dependent variable columninclude_intercept(bool, optional): Whether to include intercept term. Default:TrueTrue: Standard regression y = ฮฒโ + ฮฒโxโ + ฮฒโxโ + ...False: Regression through origin y = ฮฒโxโ + ฮฒโxโ + ...
Returns:
LinearRegressionResult: Object with the following attributes:coefficients(List[float]): Regression coefficients for each predictorstandard_errors(List[float]): HC3 robust standard errors for each coefficientslope(float | None): Single coefficient (backward compatibility, single covariate only)intercept(float | None): Y-intercept (None ifinclude_intercept=False)intercept_se(float | None): HC3 standard error for intercept (None ifinclude_intercept=False)r_squared(float): Coefficient of determination (Rยฒ)n_samples(int): Number of data points used
Raises:
ValueError: If columns don't exist, have mismatched lengths, contain invalid data, or if any observation has extreme leverage (โฅ0.99)TypeError: If columns contain non-numeric data
Examples:
Single covariate:
import polars as pl
import causers
df = pl.read_csv("data.csv")
result = causers.linear_regression(df, x_cols="feature", y_col="target")
print(f"y = {result.slope}x + {result.intercept}")
Multiple covariates:
# Multiple regression
result = causers.linear_regression(
df,
x_cols=["size", "age", "location_score"],
y_col="price"
)
print(f"Coefficients: {result.coefficients}")
print(f"Intercept: {result.intercept}")
print(f"Rยฒ = {result.r_squared:.4f}")
No intercept:
# Regression through origin
result = causers.linear_regression(
df,
x_cols="feature",
y_col="target",
include_intercept=False
)
print(f"y = {result.coefficients[0]}x (no intercept)")
logistic_regression(df, x_cols, y_col, ...)
Performs logistic regression on binary outcomes using Maximum Likelihood Estimation with Newton-Raphson optimization.
Parameters:
df(pl.DataFrame): Input DataFrame containing the datax_cols(str | List[str]): Name(s) of independent variable column(s)y_col(str): Name of the binary outcome column (must contain only 0 and 1)include_intercept(bool, optional): Whether to include intercept term. Default:Truecluster(str, optional): Column for cluster identifiers. Default:Nonebootstrap(bool, optional): Enable score bootstrap for clustered SE. Default:Falsebootstrap_iterations(int, optional): Number of bootstrap replications. Default:1000seed(int, optional): Random seed for reproducibility. Default:Nonebootstrap_method(str, optional): Weight distribution:"rademacher"(default) or"webb". Webb recommended for G < 10 clusters.
Returns:
LogisticRegressionResult: Object with the following attributes:coefficients(List[float]): Coefficient estimates (log-odds scale)standard_errors(List[float]): Robust standard errors for each coefficientintercept(float | None): Intercept termintercept_se(float | None): Standard error for interceptn_samples(int): Number of observations usedconverged(bool): Whether the optimizer convergediterations(int): Number of iterations usedlog_likelihood(float): Log-likelihood at MLEpseudo_r_squared(float): McFadden's pseudo Rยฒn_clusters(int | None): Number of clusters (if clustered)cluster_se_type(str | None): Type of SE:"analytical","bootstrap_rademacher", or"bootstrap_webb"
Raises:
ValueError: If y contains values other than 0/1, perfect separation is detected, or convergence failsTypeError: If columns contain non-numeric data
synthetic_did(df, unit_col, time_col, outcome_col, treatment_col, ...)
Performs Synthetic Difference-in-Differences (SDID) estimation on balanced panel data (Arkhangelsky et al., 2021).
Parameters:
df(pl.DataFrame): Balanced panel data in long formatunit_col(str): Column name for unit identifierstime_col(str): Column name for time period identifiersoutcome_col(str): Column name for the outcome variabletreatment_col(str): Column name for treatment indicator (0/1)bootstrap_iterations(int, optional): Number of placebo bootstrap iterations. Default:200seed(int, optional): Random seed for reproducibility. Default:None
Returns:
SyntheticDIDResult: Object with the following attributes:att(float): Average Treatment Effect on the Treatedstandard_error(float): Placebo bootstrap standard errorunit_weights(List[float]): Weights for control units (sum to 1.0)time_weights(List[float]): Weights for pre-treatment periods (sum to 1.0)n_units_control(int): Number of control unitsn_units_treated(int): Number of treated unitsn_periods_pre(int): Number of pre-treatment periodsn_periods_post(int): Number of post-treatment periodssolver_converged(bool): Whether Frank-Wolfe solver convergedsolver_iterations(tuple): Iterations used for (unit, time) weight optimizationpre_treatment_fit(float): RMSE of synthetic control in pre-treatment periodbootstrap_iterations_used(int): Number of bootstrap iterations used
Raises:
ValueError: If panel is unbalanced, < 2 control units, < 2 pre-periods, no treated units, or treatment values not 0/1
Example:
import polars as pl
import causers
# Estimate treatment effect with SDID
result = causers.synthetic_did(
panel_data,
unit_col="state",
time_col="year",
outcome_col="gdp_growth",
treatment_col="policy_adopted",
bootstrap_iterations=500,
seed=42
)
print(f"Estimated ATT: {result.att:.4f}")
print(f"95% CI: [{result.att - 1.96*result.standard_error:.4f}, "
f"{result.att + 1.96*result.standard_error:.4f}]")
For full API documentation, see docs/api-reference.md.
๐๏ธ Architecture
graph TD
A[Python API] --> B[PyO3 Bridge]
B --> C[Rust Core]
C --> D[Statistical Engine]
E[Polars DataFrame] --> B
D --> F[Results]
F --> A
Project Structure
causers/
โโโ src/ # Rust source code
โ โโโ lib.rs # PyO3 bindings and module definition
โ โโโ stats.rs # Statistical computation implementations
โโโ python/ # Python package
โ โโโ causers/
โ โโโ __init__.py # Python API and type definitions
โโโ tests/ # Comprehensive test suite
โ โโโ test_basic.py
โ โโโ test_edge_cases.py
โ โโโ test_performance.py
โ โโโ test_integration.py
โโโ Cargo.toml # Rust dependencies
โโโ pyproject.toml # Python package configuration
โโโ README.md # This file
๐ ๏ธ Development
Prerequisites
- Python 3.8 or higher
- Rust 1.70 or higher
- Polars 0.19 or higher
Building from Source
# Clone the repository
git clone https://github.com/causers/causers.git
cd causers
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -e ".[dev]"
# Build the Rust extension
maturin develop --release
Running Tests
# Run all tests with coverage
pytest tests/ --cov=causers --cov-report=html
# Run specific test categories
pytest tests/test_performance.py -v # Performance benchmarks
pytest tests/test_edge_cases.py -v # Edge case handling
# Run Rust tests
cargo test
Code Quality
# Format Python code
black python/ tests/
# Lint Python code
ruff check python/ tests/
# Type check
mypy python/
# Format Rust code
cargo fmt
# Lint Rust code
cargo clippy
๐ค Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines on:
- Code style and conventions
- Testing requirements
- Pull request process
- Development workflow
๐ Roadmap
v0.1.0
- โ Linear regression with OLS
- โ Multiple covariate support
- โ Optional intercept models
- โ 100% test coverage
- โ Performance validation
v0.2.0
- โ HC3 robust standard errors
- โ statsmodels-validated accuracy (rtol=1e-6)
- โ Extreme leverage detection and error handling
- โ Comprehensive test coverage with edge cases
v0.3.0
- โ Logistic regression with Newton-Raphson MLE
- โ Score bootstrap for logistic regression (Kline & Santos, 2012)
- โ Clustered standard errors (analytical)
- โ Wild cluster bootstrap for small cluster counts
- โ Cluster balance warning (>50% imbalance detection)
- โ Configurable bootstrap iterations and seed
- โ Small-cluster warning (G < 42)
- โ statsmodels/wildboottest-validated accuracy
- โ
Webb weights for bootstrap (
bootstrap_method="webb") - โ
Method-specific
cluster_se_typevalues ("bootstrap_rademacher","bootstrap_webb")
v0.4.0
- โ Synthetic Difference-in-Differences (Arkhangelsky et al., 2021)
- โ Frank-Wolfe algorithm for simplex-constrained weight optimization
- โ Placebo bootstrap for standard error estimation
- โ azcausal-validated accuracy (ATT rtol=1e-6, SE rtol=0.5)
- โ Performance: 1000ร100 panel < 1 second
v0.5.0 (Current)
- โ Synthetic Control (Abadie et al., 2010, 2015)
- โ Four method variants: traditional, penalized, robust, augmented
- โ In-space placebo standard errors
- โ Auto-lambda selection via LOOCV (penalized method)
- โ Augmented SC with ridge bias correction (Ben-Michael et al., 2021)
- โ pysyncon parity testing
๐ Security
- Memory Safety: Zero unsafe Rust code (except required PyO3 interfaces)
- Input Validation: Comprehensive validation of all inputs
- No Telemetry: No data collection or external network calls
- Security Rating: B+ (See security assessment)
๐ License
MIT License - see LICENSE file for details.
๐ Acknowledgments
- Polars for the excellent DataFrame library
- PyO3 for seamless Python-Rust integration
- maturin for simplified packaging
๐ Resources
๐ Found a Bug?
Please open an issue with:
- Minimal reproducible example
- Expected vs actual behavior
- Environment details (OS, Python version, etc.)
๐ Status
- Build: โ Passing
- Tests: โ 193/193 passing
- Coverage: โ 100%
- Performance: โ <100ms for 1M rows
- Security: โ B+ rating
- Platforms: โ Linux, macOS, Windows
Made with โค๏ธ and ๐ฆ by the causers team
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file causers-0.5.1.tar.gz.
File metadata
- Download URL: causers-0.5.1.tar.gz
- Upload date:
- Size: 170.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2d13d9a4e6ad53eb4c43b10a0c13b3e9ba793b74d482f404b83b3e2800134a33
|
|
| MD5 |
40a7a1197e0ac487e5451cdfebc5de67
|
|
| BLAKE2b-256 |
2899e97d855112c57c61dd84cf96472654066cdbc9399fb4f5e45a860f732d7c
|
File details
Details for the file causers-0.5.1-cp38-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: causers-0.5.1-cp38-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 370.1 kB
- Tags: CPython 3.8+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a498fb1186362715c72910036568a9f0bf847d1070e75f005b118a07ad8b71d0
|
|
| MD5 |
c83d7b14caf321083855eb34ff730edd
|
|
| BLAKE2b-256 |
7556fc19853bfb69fee64b621ab4bb8a7ac7d2613cc8a35eea25be869f3360d9
|