
A package for Regression in compressed representation powered by DuckDB


duckreg : very fast out-of-memory regressions with duckdb

Python package to run stratified/saturated regressions out-of-memory with duckdb. R users, check out Grant McDermott's port of this package.

The package is a wrapper around the duckdb package and provides a simple interface for running regressions on very large datasets that do not fit in memory. It reduces the data to a set of summary statistics and then runs weighted least squares with frequency weights. Robust standard errors are computed from sufficient statistics, while clustered standard errors are computed using the cluster bootstrap. Methodological details and benchmarks are provided in the paper cited below (arXiv:2410.09952). See examples in notebooks/introduction.ipynb.
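The compression idea can be illustrated without duckdb. The following is a minimal NumPy sketch, not duckreg's implementation: it assumes discrete regressors, groups rows by unique x values, keeps only the count, sum of y, and sum of y² per group, and then checks that frequency-weighted least squares on the group means reproduces both the full-data OLS coefficients and the HC0 robust standard errors. All variable names and the simulated data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated data with discrete regressors, so many rows share the same x values.
d = rng.integers(0, 2, size=n)            # binary treatment
f = rng.integers(0, 3, size=n)            # covariate coded 0, 1, 2
y = 1.0 + 2.0 * d + 0.5 * f + rng.normal(scale=1.0 + d, size=n)
X = np.column_stack([np.ones(n), d, f])

# Compression: one row per unique x, carrying count, sum(y), and sum(y^2).
Xg, inv = np.unique(X, axis=0, return_inverse=True)
inv = inv.reshape(-1)                     # flatten for compatibility across NumPy versions
n_g = np.bincount(inv)
sum_y = np.bincount(inv, weights=y)
sum_y2 = np.bincount(inv, weights=y**2)
ybar = sum_y / n_g

# Weighted least squares on group means with frequency weights n_g.
XtWX = Xg.T @ (Xg * n_g[:, None])
beta_c = np.linalg.solve(XtWX, Xg.T @ (n_g * ybar))

# HC0 standard errors from sufficient statistics: the within-group sum of
# squared residuals is sum(y^2) - 2 * yhat * sum(y) + n * yhat^2.
yhat = Xg @ beta_c
rss_g = sum_y2 - 2.0 * yhat * sum_y + n_g * yhat**2
bread = np.linalg.inv(XtWX)
meat = Xg.T @ (Xg * rss_g[:, None])
se_c = np.sqrt(np.diag(bread @ meat @ bread))

# Uncompressed OLS with HC0 standard errors, for comparison.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_full
bread_full = np.linalg.inv(X.T @ X)
meat_full = X.T @ (X * resid[:, None] ** 2)
se_full = np.sqrt(np.diag(bread_full @ meat_full @ bread_full))

print(f"{n} rows compressed to {len(n_g)} groups")
print("coefficients match:", np.allclose(beta_c, beta_full))
print("HC0 standard errors match:", np.allclose(se_c, se_full))
```

In this sketch the regression only ever touches one row per unique x value, which is why the approach scales to data that do not fit in memory once the grouping is pushed down to a database engine.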

  • install
pip install duckreg
  • dev install (preferably in a venv) with
(uv) pip install git+https://github.com/apoorvalal/duckreg.git

or git clone this repository and install in editable mode.


Currently supports the following regression specifications:

  1. DuckRegression: general linear regression, which compresses the data to y averages stratified by all unique values of the x variables
  2. DuckMundlak: one- or two-way Mundlak regression, which compresses the data to the following right-hand side and avoids the need to include unit (and time) FEs

$$ y \sim 1, w, \bar{w}_{i, .}, \bar{w}_{., t} $$

  3. DuckDoubleDemeaning: double-demeaning regression, which compresses the data to y averages by all values of $w$ after demeaning. This also eliminates unit and time FEs

$$ y \sim (w_{it} - \bar{w}_{i, .} - \bar{w}_{., t} + \bar{w}_{., .}) $$

  4. DuckMundlakEventStudy: two-way Mundlak with dynamic treatment effects. This incorporates treatment-cohort FEs ($\psi_i$), time-period FEs ($\gamma_t$), and dynamic treatment effects $\tau_k$ given by cohort × time interactions.

$$ y \sim \psi_i + \gamma_t + \sum_{k=1}^{T} \tau_{k} D_i 1(t = k) $$
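To see why the Mundlak and double-demeaning specifications can stand in for unit and time fixed effects, the sketch below checks the algebra on a small simulated balanced panel. It is a plain NumPy illustration under stated assumptions (balanced panel, a single regressor `w`, no compression step), not duckreg's code, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 50, 8                              # balanced panel: N units, T periods
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)

alpha = rng.normal(size=N)                # unit effects
gamma = rng.normal(size=T)                # time effects
# Regressor correlated with both sets of effects, so pooled OLS would be biased.
w = rng.normal(size=N * T) + 0.5 * alpha[unit] + 0.3 * gamma[time]
y = 1.5 * w + alpha[unit] + gamma[time] + rng.normal(size=N * T)


def ols(X, y):
    """Least-squares coefficients for y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Benchmark: two-way fixed effects with explicit dummies
# (all unit dummies, one time dummy dropped to avoid collinearity).
unit_dummies = (unit[:, None] == np.arange(N)).astype(float)
time_dummies = (time[:, None] == np.arange(1, T)).astype(float)
b_twfe = ols(np.column_stack([w, unit_dummies, time_dummies]), y)[0]

# Unit means, time means, and the grand mean of w.
w_i = (np.bincount(unit, weights=w) / T)[unit]
w_t = (np.bincount(time, weights=w) / N)[time]
w_bar = w.mean()

# Two-way Mundlak: y ~ 1, w, unit mean of w, time mean of w.
b_mundlak = ols(np.column_stack([np.ones(N * T), w, w_i, w_t]), y)[1]

# Double demeaning: y ~ (w - unit mean - time mean + grand mean).
w_dd = w - w_i - w_t + w_bar
b_dd = ols(w_dd[:, None], y)[0]

print("TWFE:           ", b_twfe)
print("Mundlak:        ", b_mundlak)
print("Double demeaned:", b_dd)
```

On a balanced panel the three slope estimates coincide up to floating-point error, while the Mundlak and double-demeaned designs need only a handful of columns instead of N + T dummies.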

All the above regressions are run in compressed fashion with duckdb. Formula-level fixed effects are not part of DuckRegression; use the panel-specific DuckMundlak or DuckDoubleDemeaning estimators for fixed-effect style designs.
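As an illustration of what running an event study "in compressed fashion" can look like, the sketch below simulates one treated and one never-treated cohort, compresses the panel to cohort × time cells (count and mean of y), and fits the saturated regression on the cells with frequency weights. It is a NumPy sketch under assumptions that are not taken from duckreg: a single treated cohort, the interaction for the last pre-treatment period omitted as the reference, and hypothetical variable names throughout.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, t0 = 2_000, 6, 3                    # units, periods, first treated period
ref = t0 - 1                              # omitted reference period (an assumption)

D_i = rng.integers(0, 2, size=N)          # 1 = treated cohort, 0 = never treated
D = np.repeat(D_i, T)
time = np.tile(np.arange(T), N)
true_tau = np.where(np.arange(T) >= t0, 1.0 + 0.5 * (np.arange(T) - t0), 0.0)
y = 0.3 * time + 0.8 * D + D * true_tau[time] + rng.normal(size=N * T)

# Compression: one row per cohort x time cell with its count and mean outcome.
cell = D * T + time
n_c = np.bincount(cell, minlength=2 * T)
ybar_c = np.bincount(cell, weights=y, minlength=2 * T) / n_c
D_c = np.repeat([0, 1], T)                # cohort of each cell
t_c = np.tile(np.arange(T), 2)            # period of each cell

# Design on the cells: intercept, cohort dummy, time dummies (period 0 dropped),
# and cohort x time interactions for every period except the reference.
time_dummies = (t_c[:, None] == np.arange(1, T)).astype(float)
ks = [k for k in range(T) if k != ref]
interactions = ((t_c[:, None] == np.array(ks)) & (D_c[:, None] == 1)).astype(float)
Xc = np.column_stack([np.ones(2 * T), D_c, time_dummies, interactions])

# Frequency-weighted least squares on the 2T cells.
sw = np.sqrt(n_c)
coef = np.linalg.lstsq(Xc * sw[:, None], ybar_c * sw, rcond=None)[0]
tau_hat = dict(zip(ks, coef[-len(ks):]))

# Cross-check: in a saturated model each tau_k is a difference-in-differences
# of cell means relative to the reference period.
means = ybar_c.reshape(2, T)
gap = means[1] - means[0]
tau_did = {k: gap[k] - gap[ref] for k in ks}

for k in ks:
    print(k, round(tau_hat[k], 3), round(tau_did[k], 3))
```

The regression here runs on 2T rows regardless of how many units are in the panel, and the pre-period estimates serve as a placebo check on parallel trends.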

Please cite the following paper if you use duckreg in your research:

@misc{lal2024largescalelongitudinalexperiments,
      title={Large Scale Longitudinal Experiments: Estimation and Inference}, 
      author={Apoorva Lal and Alexander Fischer and Matthew Wardrop},
      year={2024},
      eprint={2410.09952},
      archivePrefix={arXiv},
      primaryClass={econ.EM},
      url={https://arxiv.org/abs/2410.09952}, 
}
