Reproducible regression workflow: loaders → dependency tracking → codegen → execution.

Regression Monkey

Regression Monkey is a reproducible regression workflow for empirical research. It connects structured data loading, dependency-aware refreshing, templated code generation, batch execution, and a Textual-based TUI for curating final tables. The goal is to replace ad-hoc notebooks with a traceable, automation-friendly stack. (For an introduction in Chinese, see README_zh.md.)

Highlights

  • Deterministic data refresh – DataLoader modules produce single artifacts; DataManager tracks ArcticDB/PKL/DataLoader sources, semantic hashes, and dependency propagation.
  • Task-centric modeling – StandardRegTask captures Y/X/control/fixed-effect specs, cluster options, incremental controls, and classification filters; tasks serialize cleanly and carry fingerprints for auditing.
  • Code generation and execution – CodeGenerator renders Jinja2 templates (currently R) with dependency injection; CodeExecutor orchestrates task trees via rpy2, wiring datasets and capturing normalized results (including stepwise regressions).
  • Table editing TUI – Textual UI lets you search tasks, attach columns (including stepwise variants), reorder/rename columns, and export reproducibility bundles (main.R, datasets).
  • International-ready messaging – All runtime prompts, logs, and TUI notifications are in English for cross-team collaboration.
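
The "deterministic data refresh" idea above can be illustrated with a small standalone sketch. It does not use Regression Monkey's API; semantic_hash, stale_artifacts, and the deps mapping are hypothetical names chosen for illustration. The assumption is that a semantic hash ignores row and key order, and that a change in one artifact marks everything downstream of it as stale.

```python
import hashlib
import json


def semantic_hash(rows):
    """Hash a list of dict records so that row order and key order do not matter."""
    canonical = sorted(json.dumps(r, sort_keys=True, default=str) for r in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def stale_artifacts(deps, changed):
    """Return every artifact that transitively depends on a changed one.

    deps maps an artifact name to the list of artifacts it depends on.
    """
    stale = set(changed)
    grew = True
    while grew:
        grew = False
        for name, parents in deps.items():
            if name not in stale and stale.intersection(parents):
                stale.add(name)
                grew = True
    return stale


# Hypothetical dependency graph: panel is built from users and firms.
deps = {"users": [], "firms": [], "panel": ["users", "firms"], "tables": ["panel"]}

old = semantic_hash([{"firm_id": 1, "ts": "2020"}, {"firm_id": 2, "ts": "2021"}])
reordered = semantic_hash([{"ts": "2021", "firm_id": 2}, {"firm_id": 1, "ts": "2020"}])
edited = semantic_hash([{"firm_id": 1, "ts": "2020"}, {"firm_id": 3, "ts": "2021"}])

needs_refresh = stale_artifacts(deps, {"users"})
```

Here a reordered copy of the data hashes identically, so it would not trigger a refresh, while an edited value changes the hash and propagates to panel and tables but not to firms.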

Components at a Glance

| Component | Purpose |
| --- | --- |
| DataLoader | Minimal class for defining clean_data() → DataFrame/PKL/Arctic output with declared dependencies. |
| DataManager | Orchestrates multi-source loading (Arctic ↔ DataLoader ↔ PKL), semantic fingerprinting, cost-aware refresh decisions, and caching. |
| StandardRegTask | Declarative regression spec with serialization, subset filters, incremental controls, and acceptance tests. |
| CodeGenerator | Jinja2 macro toolkit that emits R code (OLS/FE/RE, stepwise, etc.) and dependency stubs. |
| CodeExecutor | rpy2-based runner that feeds datasets, executes generated code, captures python_output, and records stepwise metadata. |
| Planner | Builds task trees (sections/nodes) and coordinates downstream rendering/execution. |
| tui/* | Textual UI for browsing tasks, selecting columns, editing tables, and exporting reproducibility bundles. |
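
To make "serialization" and "fingerprints" concrete, here is a minimal sketch of a declarative spec that round-trips through JSON and derives a short fingerprint from its canonical form. RegSpec and its fields are hypothetical and only loosely modeled on the StandardRegTask arguments shown in the Quick Start; the real class may serialize and hash differently.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RegSpec:
    """Hypothetical stand-in for a declarative regression spec."""

    name: str
    dataset: str
    y: str
    X: tuple
    controls: tuple = ()
    model: str = "OLS"

    def to_json(self):
        # sort_keys gives a canonical text form, so equal specs hash equally.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        raw = json.loads(text)
        # JSON has no tuples; restore them so equality with the original holds.
        raw["X"] = tuple(raw["X"])
        raw["controls"] = tuple(raw["controls"])
        return cls(**raw)

    def fingerprint(self):
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:12]


spec = RegSpec(name="baseline", dataset="users", y="y",
               X=("treatment",), controls=("size", "age"))
restored = RegSpec.from_json(spec.to_json())
variant = replace(spec, model="FE")
```

A spec that survives a save/load cycle keeps its fingerprint, while any change to the specification (here the model) yields a different one, which is what makes the fingerprint usable for auditing.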

Installation

Requires Python 3.14+.

pip install regression_monkey

For development extras (testing, linting, packaging):

pip install "regression_monkey[dev]"

External Requirements

  • R runtime if you plan to execute generated R code.
  • rpy2 is installed automatically on non-Windows platforms (you can install it manually on Windows if R is available).
  • ArcticDB requires system dependencies compatible with LMDB.

Quick Start

1. Define a DataLoader

# data_loader/users.py
from reg_monkey.data_loader import DataLoader
import pandas as pd

class UsersLoader(DataLoader):
    output_pkl_name = "users.pkl"

    def clean_data(self):
        df = pd.read_csv("source_data/users_raw.csv")
        df = df.dropna(subset=["firm_id"]).rename(columns={"signup_time": "ts"})
        self.df = df
        return df

2. Refresh/load datasets

from reg_monkey.data_manager import DataManager

dm = DataManager(target_symbols=["users"], project_root=".")
df_users = dm.get("users")  # hits Arctic/Pickle/DataLoader according to priority
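
The comment above says the lookup goes through several sources "according to priority". The sketch below shows one way such a fallback chain could work. It is not DataManager's implementation: first_available, the dictionaries standing in for ArcticDB and pickle caches, and the priority order itself are all assumptions made for illustration.

```python
def first_available(symbol, sources):
    """Try each (label, loader) pair in priority order and return the first hit."""
    for label, loader in sources:
        try:
            value = loader(symbol)
        except KeyError:
            continue  # this source does not hold the symbol; try the next one
        return label, value
    raise LookupError(f"no source could provide {symbol!r}")


arctic = {}                             # stand-in for an ArcticDB library (empty here)
pickles = {"users": [{"firm_id": 1}]}   # stand-in for cached .pkl files


def rebuild(symbol):
    """Stand-in for running a DataLoader's clean_data(); always succeeds."""
    return [{"firm_id": 0, "source": symbol}]


# Assumed priority: Arctic first, then pickle cache, then a full rebuild.
sources = [
    ("arctic", arctic.__getitem__),
    ("pkl", pickles.__getitem__),
    ("loader", rebuild),
]

label, data = first_available("users", sources)
fallback_label, _ = first_available("firms", sources)
```

In this toy setup "users" is served from the pickle cache, and "firms", which no cache holds, falls through to the rebuild step.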

3. Describe a regression task

from reg_monkey.task_obj import StandardRegTask
from reg_monkey.code_generator import CodeGenerator

task = StandardRegTask(
    name="baseline",
    dataset="users",
    y="y",
    X=["treatment"],
    controls=["size","age"],
    category_controls=["industry","year"],
    model="OLS",
    incremental_controls=True,
)

cg = CodeGenerator(task)
segments = cg.assembly(internal_output=True)
print(segments["combined"])  # rendered R script
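
One plausible reading of incremental_controls=True is that controls are added one at a time, producing a sequence of nested specifications. The sketch below builds R-style formula strings for that sequence. It is an assumption about the intent, not the code CodeGenerator actually emits; incremental_formulas is a hypothetical helper, and fixed effects are left out to keep the example small.

```python
def incremental_formulas(y, X, controls):
    """Build nested R-style formulas, adding one control per step."""
    base = list(X)
    formulas = [f"{y} ~ {' + '.join(base)}"]
    for i in range(1, len(controls) + 1):
        rhs = base + list(controls[:i])
        formulas.append(f"{y} ~ {' + '.join(rhs)}")
    return formulas


steps = incremental_formulas("y", ["treatment"], ["size", "age"])
for formula in steps:
    print(formula)
```

With the task defined above, this yields three specifications: the baseline with only the treatment variable, then one adding size, then one adding both size and age.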

4. Execute and inspect results

from reg_monkey.code_executor import CodeExecutor

executor = CodeExecutor(plan=None, datasets={"users": df_users})
executor.run_single_task(task, segments["combined"])  # custom helper you implement
print(task.exec_result["forward_res"]["coefficients"].head())

5. Launch the TUI

from reg_monkey.tui import run_app

run_app(code_executor=executor, config_path="output_mapping.json")

Use the TUI (Table List → Table Editor → Result Browser) to add columns, toggle stepwise results, rename labels, and export reproducibility bundles (main.R + datasets + metadata).

Reproducibility Exports

ExportService bundles:

  • main.R with dependency installation, dataset loading, preparation sections, and regression execution (deduplicated by code hash).
  • Feather/CSV datasets referenced in tables.
  • Stepwise columns honoring user selections (enable columns via TUI and choose steps in the modal).
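
The "deduplicated by code hash" step can be sketched as follows. This is a generic illustration, not ExportService's code: dedupe_sections is a hypothetical helper, and it assumes that two sections count as duplicates when their text is identical after stripping surrounding whitespace.

```python
import hashlib


def dedupe_sections(sections):
    """Keep the first occurrence of each code section, preserving order."""
    seen = set()
    unique = []
    for code in sections:
        body = code.strip()
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique


sections = [
    "library(fixest)",
    "df <- read.csv('users.csv')",
    "library(fixest)\n",              # same code with a trailing newline
    "m1 <- lm(y ~ treatment, data = df)",
]
main_r = "\n\n".join(dedupe_sections(sections))
```

Several tables often share the same preparation code, so hashing each section before writing main.R keeps the exported script from repeating it.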

Project Layout

src/
  reg_monkey/
    data_loader.py
    data_manager.py
    task_obj.py
    code_generator.py
    code_executor.py
    planner.py
    export_service.py
    tui/
    r_template.jinja
  prd/        # design docs (Chinese allowed)
  bk/         # backups / historical references

Development Tips

  • Run pytest for unit tests; TUI flows are best verified manually.
  • Use ruff + black for lint/format.
  • When touching the TUI, ensure output_mapping.json remains backward compatible (columns carry controls, parent_task_id, etc.).
  • All user-facing text must remain in English.

Contributing

Pull requests are welcome. Please include:

  1. A clear description of the change.
  2. Tests or manual verification steps for regression-critical paths.
  3. Documentation updates if behavior changes.

License

MIT
