
Reproducible regression workflow: loaders → dependency tracking → codegen → execution.


Regression Monkey

Regression Monkey is a reproducible regression workflow for empirical research. It connects structured data loading, dependency-aware refreshing, templated code generation, batch execution, and a Textual-based TUI for curating final tables. The goal is to replace ad-hoc notebooks with a traceable, automation-friendly stack. (For a Chinese introduction, see README_zh.md.)

Highlights

  • Deterministic data refresh – DataLoader modules produce single artifacts; DataManager tracks ArcticDB/PKL/DataLoader sources, semantic hashes, and dependency propagation.
  • Task-centric modeling – StandardRegTask captures Y/X/control/fixed-effect specs, cluster options, incremental controls, and classification filters; tasks serialize cleanly and carry fingerprints for auditing.
  • Code generation and execution – CodeGenerator renders Jinja2 templates (currently R) with dependency injection; CodeExecutor orchestrates task trees via rpy2, wiring datasets and capturing normalized results (including stepwise regressions).
  • Table editing TUI – Textual UI lets you search tasks, attach columns (including stepwise variants), reorder/rename columns, and export reproducibility bundles (main.R, datasets).
  • International-ready messaging – All runtime prompts, logs, and TUI notifications are in English for cross-team collaboration.
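
To make the "semantic hashes and dependency propagation" idea above concrete, here is a minimal sketch using only the standard library. It is not DataManager's implementation: the function names `semantic_hash` and `stale_symbols`, the row-of-dicts data shape, and the dependency-map layout are all assumptions made for illustration.

```python
import hashlib
import json


def semantic_hash(rows):
    """Hash a list of row dicts so that row order and key order do not matter.

    Hypothetical helper: the real package may hash DataFrames differently.
    """
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def stale_symbols(changed, dependencies):
    """Return every symbol that needs a refresh once `changed` symbols change.

    `dependencies` maps each symbol to the list of symbols it depends on
    (an assumed layout, not the package's own format).
    """
    stale = set(changed)
    grew = True
    while grew:  # repeat until no new downstream symbol is marked stale
        grew = False
        for symbol, deps in dependencies.items():
            if symbol not in stale and stale.intersection(deps):
                stale.add(symbol)
                grew = True
    return stale


deps = {"users": [], "orders": ["users"], "panel": ["orders"], "macro": []}
print(sorted(stale_symbols({"users"}, deps)))
```

Hashing a canonical form means that reordering rows or columns does not trigger a refresh, while any change in values does; the propagation loop then marks everything downstream of a changed source.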

Components at a Glance

| Component | Purpose |
| --- | --- |
| DataLoader | Minimal class for defining clean_data() → DataFrame/PKL/Arctic output with declared dependencies. |
| DataManager | Orchestrates multi-source loading (Arctic ↔ DataLoader ↔ PKL), semantic fingerprinting, cost-aware refresh decisions, and caching. |
| StandardRegTask | Declarative regression spec with serialization, subset filters, incremental controls, and acceptance tests. |
| CodeGenerator | Jinja2 macro toolkit that emits R code (OLS/FE/RE, stepwise, etc.) and dependency stubs. |
| CodeExecutor | rpy2-based runner that feeds datasets, executes generated code, captures python_output, and records stepwise metadata. |
| Planner | Builds task trees (sections/nodes) and coordinates downstream rendering/execution. |
| tui/* | Textual UI for browsing tasks, selecting columns, editing tables, and exporting reproducibility bundles. |

Installation

Requires Python 3.14+.

pip install regression_monkey

For development extras (testing, linting, packaging):

pip install "regression_monkey[dev]"

External Requirements

  • R runtime if you plan to execute generated R code.
  • rpy2 is installed automatically on non-Windows platforms (you can install it manually on Windows if R is available).
  • ArcticDB requires system dependencies compatible with LMDB.
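
Before executing generated R code, it can help to confirm that the external pieces listed above are present. The following is a small sketch using only the standard library; `check_r_environment` is a hypothetical helper and not part of regression_monkey, and looking up `Rscript` on the PATH is only one way to detect an R installation.

```python
import importlib.util
import shutil


def check_r_environment():
    """Report whether an R executable and the rpy2 package can be located."""
    return {
        # Rscript ships with standard R installations; PATH lookup only.
        "Rscript": shutil.which("Rscript") is not None,
        # find_spec returns None when the package is not importable.
        "rpy2": importlib.util.find_spec("rpy2") is not None,
    }


if __name__ == "__main__":
    for name, found in check_r_environment().items():
        print(f"{name}: {'found' if found else 'missing'}")
```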

Quick Start

1. Define a DataLoader

# data_loader/users.py
from reg_monkey.data_loader import DataLoader
import pandas as pd

class UsersLoader(DataLoader):
    output_pkl_name = "users.pkl"

    def clean_data(self):
        df = pd.read_csv("source_data/users_raw.csv")
        df = df.dropna(subset=["firm_id"]).rename(columns={"signup_time": "ts"})
        self.df = df
        return df

2. Refresh/load datasets

from reg_monkey.data_manager import DataManager

dm = DataManager(target_symbols=["users"], project_root=".")
df_users = dm.get("users")  # hits Arctic/Pickle/DataLoader according to priority
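
The comment on `dm.get` mentions a source priority. As a rough illustration of what a priority-ordered lookup can look like, here is a standalone sketch. The order shown (Arctic, then pickle, then DataLoader), the `load_by_priority` name, and the dict-backed stand-in sources are assumptions; DataManager's actual ordering and its cost-aware refresh logic are not described in this README.

```python
def load_by_priority(symbol, sources):
    """Try each (name, loader) pair in order and return the first hit.

    A loader returns the dataset, or None when it does not hold `symbol`.
    """
    for name, loader in sources:
        data = loader(symbol)
        if data is not None:
            return name, data
    raise KeyError(f"no source could provide {symbol!r}")


# Stand-in sources for illustration only.
arctic_store = {}                              # empty: simulates a cache miss
pickle_store = {"users": [{"firm_id": 1}]}     # simulates a cached pickle


def run_loader(symbol):
    """Simulates re-running a DataLoader from raw files (most expensive)."""
    return [{"firm_id": 1, "fresh": True}]


source, data = load_by_priority(
    "users",
    [("arctic", arctic_store.get), ("pickle", pickle_store.get), ("loader", run_loader)],
)
print(source, data)
```

Putting the cheapest sources first means the expensive DataLoader run happens only when no cached artifact exists.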

3. Describe a regression task

from reg_monkey.task_obj import StandardRegTask
from reg_monkey.code_generator import CodeGenerator

task = StandardRegTask(
    name="baseline",
    dataset="users",
    y="y",
    X=["treatment"],
    controls=["size", "age"],
    category_controls=["industry", "year"],
    model="OLS",
    incremental_controls=True,
)

cg = CodeGenerator(task)
segments = cg.assembly(internal_output=True)
print(segments["combined"])  # rendered R script
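
One plausible reading of `incremental_controls=True` is that controls are added one at a time, producing a sequence of nested specifications. The sketch below shows that idea as plain R formula strings. It is an assumption about the semantics, not the output of CodeGenerator: `incremental_formulas` is a hypothetical helper, and writing fixed effects as `factor(...)` terms is a base-R convention chosen for illustration.

```python
def incremental_formulas(y, X, controls, category_controls=()):
    """Build nested R formulas, adding one control per step.

    Step 0 has no controls; step k includes the first k controls.
    Categorical controls enter every step as factor() terms.
    """
    fixed = [f"factor({c})" for c in category_controls]
    formulas = []
    for k in range(len(controls) + 1):
        terms = list(X) + list(controls[:k]) + fixed
        formulas.append(f"{y} ~ " + " + ".join(terms))
    return formulas


for formula in incremental_formulas(
    "y", ["treatment"], ["size", "age"], ["industry", "year"]
):
    print(formula)
```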

4. Execute and inspect results

from reg_monkey.code_executor import CodeExecutor

executor = CodeExecutor(plan=None, datasets={"users": df_users})
executor.run_single_task(task, segments["combined"])  # run_single_task is a custom helper you implement yourself
print(task.exec_result["forward_res"]["coefficients"].head())

5. Launch the TUI

from reg_monkey.tui import run_app

run_app(code_executor=executor, config_path="output_mapping.json")

Use the TUI (Table List → Table Editor → Result Browser) to add columns, toggle stepwise results, rename labels, and export reproducibility bundles (main.R + datasets + metadata).

Reproducibility Exports

ExportService bundles:

  • main.R with dependency installation, dataset loading, preparation sections, and regression execution (deduplicated by code hash).
  • Feather/CSV datasets referenced in tables.
  • Stepwise columns honoring user selections (enable columns via TUI and choose steps in the modal).

Project Layout

src/
  reg_monkey/
    data_loader.py
    data_manager.py
    task_obj.py
    code_generator.py
    code_executor.py
    planner.py
    export_service.py
    tui/
    r_template.jinja
  prd/        # design docs (Chinese allowed)
  bk/         # backups / historical references

Development Tips

  • Run pytest for unit tests; TUI flows are best verified manually.
  • Use ruff + black for lint/format.
  • When touching the TUI, ensure output_mapping.json remains backward compatible (columns carry controls, parent_task_id, etc.).
  • All user-facing text must remain in English.
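
One way to keep output_mapping.json backward compatible is to fill in missing keys when loading older files, rather than requiring every file to be rewritten. The sketch below assumes a top-level "columns" list whose entries may lack "controls" or "parent_task_id"; that layout, the "task_id" key in the example, and the `normalize_mapping` name are hypothetical, since the actual schema is not given here.

```python
import json


def _column_defaults():
    """Fresh defaults per column, so no mutable list is shared between columns."""
    return {"controls": [], "parent_task_id": None}


def normalize_mapping(raw):
    """Fill in keys that older files may lack, without dropping unknown keys."""
    mapping = dict(raw)
    mapping["columns"] = [
        {**_column_defaults(), **column} for column in raw.get("columns", [])
    ]
    return mapping


old_file = json.loads('{"columns": [{"task_id": "baseline"}]}')
print(json.dumps(normalize_mapping(old_file), sort_keys=True))
```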

Contributing

Pull requests are welcome. Please include:

  1. A clear description of the change.
  2. Tests or manual verification steps for regression-critical paths.
  3. Documentation updates if behavior changes.

License

MIT
