Compressed regression estimators for Ibis-compatible database backends

duckreg: very fast out-of-memory regressions on SQL backends

A Python package to run stratified/saturated regressions out-of-memory through Ibis. DuckDB remains the default local backend for backwards compatibility, and the same estimators can run against any Ibis SQL backend that supports the generated aggregation queries. R users, check out Grant McDermott's port of this package.

The package provides a simple interface to run regressions on very large datasets that do not fit in memory by reducing the data inside the database to a set of summary statistics and then running weighted least squares with frequency weights. Robust standard errors are computed from sufficient statistics, while clustered standard errors are computed using the cluster bootstrap. Methodological details and benchmarks are provided in this paper. See examples in notebooks/introduction.ipynb.
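The compression idea described above can be illustrated with a small NumPy sketch. It is not duckreg's implementation: the data are simulated, all variable names are made up for illustration, and the grouping is done in memory rather than in a database. It shows that weighted least squares on one row per unique regressor cell reproduces the full-data OLS coefficients, and that an HC0 robust covariance can be rebuilt from per-cell counts, sums, and sums of squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated data (illustrative only): discrete regressors, so many rows share a cell
D = rng.integers(0, 2, size=n)
X = rng.integers(0, 5, size=n)
Y = 1.0 + 2.0 * D + 0.5 * X + rng.normal(size=n)

# Benchmark: OLS and HC0 robust covariance on all n rows
Z = np.column_stack([np.ones(n), D, X])
beta_full = np.linalg.solve(Z.T @ Z, Z.T @ Y)
resid = Y - Z @ beta_full
bread = np.linalg.inv(Z.T @ Z)
vcov_full = bread @ (Z.T @ (resid[:, None] ** 2 * Z)) @ bread

# Compression: one row per unique (D, X) cell with n, sum(Y) and sum(Y^2).
# In SQL this would be a GROUP BY over the regressors.
cells, inverse = np.unique(np.column_stack([D, X]), axis=0, return_inverse=True)
inverse = inverse.ravel()
counts = np.bincount(inverse)
sum_y = np.bincount(inverse, weights=Y)
sum_y2 = np.bincount(inverse, weights=Y**2)

# Weighted least squares on cell means with frequency weights
Zc = np.column_stack([np.ones(len(cells)), cells])
mean_y = sum_y / counts
gram = Zc.T @ (counts[:, None] * Zc)
beta_compressed = np.linalg.solve(gram, Zc.T @ (counts * mean_y))

# Per-cell residual sum of squares from the sufficient statistics alone
fitted = Zc @ beta_compressed
rss = sum_y2 - 2 * fitted * sum_y + counts * fitted**2
bread_c = np.linalg.inv(gram)
vcov_compressed = bread_c @ (Zc.T @ (rss[:, None] * Zc)) @ bread_c

print(len(cells), "compressed rows instead of", n)
print(np.allclose(beta_full, beta_compressed), np.allclose(vcov_full, vcov_compressed))
```

The compressed table has at most one row per distinct combination of regressor values, so its size depends on the number of cells rather than the number of observations. Continuous regressors with many distinct values would compress far less.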

  • install
pip install duckreg
  • dev install (preferably in a venv) with
uv pip install -e '.[test]'

or install from git with uv pip install git+https://github.com/py-econometrics/duckreg.git.

By default, legacy DuckDB paths still work:

from duckreg import DuckRegression

model = DuckRegression(
    db_name="large_dataset.db",
    table_name="data",
    formula="Y ~ D + X",
    cluster_col="cluster_id",
    seed=42,
)
model.fit()

For a remote database, create an Ibis backend and pass it through connection. For example, with Databricks:

import ibis
from duckreg import DuckRegression

con = ibis.databricks.connect(
    server_hostname="...",
    http_path="...",
    access_token="...",
    catalog="main",
    schema="analytics",
)

model = DuckRegression(
    db_name=None,
    connection=con,
    table_name="large_experiment_table",
    formula="Y ~ D + X",
    cluster_col="cluster_id",
    seed=42,
    n_bootstraps=0,
)
model.fit()
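Both examples pass cluster_col and seed, and the description above states that clustered standard errors come from the cluster bootstrap. The following is a minimal NumPy sketch of that general technique on uncompressed, simulated data. The function name cluster_bootstrap_se and its arguments are hypothetical, and this is not duckreg's implementation.

```python
import numpy as np

def cluster_bootstrap_se(y, Z, cluster_ids, n_boot=200, seed=42):
    """Cluster (block) bootstrap standard errors for OLS coefficients."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(cluster_ids)
    # Row indices belonging to each cluster, computed once
    rows = [np.flatnonzero(cluster_ids == c) for c in clusters]
    draws = np.empty((n_boot, Z.shape[1]))
    for b in range(n_boot):
        # Resample whole clusters with replacement, keeping all their rows
        picked = rng.integers(0, len(clusters), size=len(clusters))
        idx = np.concatenate([rows[j] for j in picked])
        draws[b] = np.linalg.lstsq(Z[idx], y[idx], rcond=None)[0]
    # Standard deviation of the bootstrap coefficients
    return draws.std(axis=0, ddof=1)

# Simulated data with a shared shock inside each cluster (illustrative only)
rng = np.random.default_rng(0)
n_clusters, per_cluster = 40, 25
cluster_id = np.repeat(np.arange(n_clusters), per_cluster)
D = rng.integers(0, 2, size=cluster_id.size)
shock = rng.normal(size=n_clusters)[cluster_id]
Y = 1.0 + 2.0 * D + shock + rng.normal(size=cluster_id.size)
Z = np.column_stack([np.ones(cluster_id.size), D])

se = cluster_bootstrap_se(Y, Z, cluster_id)
print(se)
```

Fixing the seed makes the resampled clusters, and therefore the reported standard errors, reproducible across runs.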

Currently supports the following regression specifications:

  1. DuckRegression: general linear regression, which compresses the data to y averages stratified by all unique values of the x variables
  2. DuckMundlak: one- or two-way Mundlak regression, which compresses the data to the following RHS and avoids the need to incorporate unit (and time) FEs

$$ y \sim 1, w, \bar{w}_{i, .}, \bar{w}_{., t} $$

  3. DuckDoubleDemeaning: double-demeaning regression, which compresses the data to y averages by all values of $w$ after demeaning. This also eliminates unit and time FEs

$$ y \sim (w_{it} - \bar{w}_{i, .} - \bar{w}_{., t} + \bar{w}_{., .}) $$

  4. DuckMundlakEventStudy: two-way Mundlak with dynamic treatment effects. This incorporates treatment-cohort FEs ($\psi_i$), time-period FEs ($\gamma_t$) and dynamic treatment effects $\tau_k$ given by cohort × time interactions.

$$ y \sim \psi_i + \gamma_t + \sum_{k=1}^{T} \tau_{k} D_i 1(t = k) $$

All the above regressions are run in compressed fashion through the configured Ibis backend. Formula-level fixed effects are not part of DuckRegression; use the panel-specific DuckMundlak or DuckDoubleDemeaning estimators for fixed-effect style designs.
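To see why the Mundlak and double-demeaning specifications listed above can stand in for explicit fixed effects, here is a small NumPy check on a simulated balanced panel. It is a sketch under stated assumptions (balanced panel, single regressor w, made-up variable names) and does not use duckreg itself. In this setting the coefficient on w from a two-way fixed-effects regression with explicit dummies matches the coefficient from both alternatives.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 50, 8  # balanced panel: every unit is observed in every period
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)

# Simulated regressor and outcome with unit and time effects (illustrative only)
w = rng.normal(size=N * T) + 0.5 * rng.normal(size=N)[unit] + 0.3 * rng.normal(size=T)[time]
y = 2.0 * w + rng.normal(size=N)[unit] + rng.normal(size=T)[time] + rng.normal(size=N * T)

def ols(Z, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(Z, y, rcond=None)[0]

# Benchmark: two-way fixed effects with explicit unit and time dummies
unit_d = (unit[:, None] == np.arange(N)).astype(float)
time_d = (time[:, None] == np.arange(1, T)).astype(float)  # drop one period
beta_twfe = ols(np.column_stack([w, unit_d, time_d]), y)[0]

# Mundlak: y ~ 1, w, unit mean of w, time mean of w
w_bar_i = (np.bincount(unit, weights=w) / T)[unit]
w_bar_t = (np.bincount(time, weights=w) / N)[time]
beta_mundlak = ols(np.column_stack([np.ones(N * T), w, w_bar_i, w_bar_t]), y)[1]

# Double demeaning: y ~ (w - unit mean - time mean + grand mean)
w_dd = w - w_bar_i - w_bar_t + w.mean()
beta_dd = ols(w_dd[:, None], y)[0]

print(beta_twfe, beta_mundlak, beta_dd)
```

The Mundlak design needs only four columns and the double-demeaned design only one, whereas the dummy-variable design grows with the number of units and periods. That smaller design is what makes these specifications amenable to compression.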

Please cite the following paper if you use duckreg in your research:

@misc{lal2024largescalelongitudinalexperiments,
      title={Large Scale Longitudinal Experiments: Estimation and Inference}, 
      author={Apoorva Lal and Alexander Fischer and Matthew Wardrop},
      year={2024},
      eprint={2410.09952},
      archivePrefix={arXiv},
      primaryClass={econ.EM},
      url={https://arxiv.org/abs/2410.09952}, 
}
