DataForge: CLI-first data-quality detection and reversible repair for tabular data.

Maintainers

pranesh15

Project description

DataForge

DataForge is a CLI-first data-quality repair toolkit for tabular data. It detects common CSV issues, proposes deterministic repairs, checks proposed changes through safety and verification gates, and records applied changes in a reversible transaction log.

The final public product name is DataForge. The PyPI/TestPyPI distribution family is dataforge_07* because the unqualified dataforge project name is occupied by unrelated packages. Installing dataforge_07 still provides the dataforge import namespace and dataforge CLI. dataforge15 is only a temporary staging alias retained for local compatibility.

The current repository is an alpha implementation. It also contains the OpenEnv-compatible training environment, the SFT warmup workflow, a local MCP server package, and playground/demo sources. Warehouse integrations and production model-quality claims remain future work.

Before any public release, review THREAT_MODEL.md and docs/docs/release.md. They define the security, supply-chain, and evidence gates that separate the current alpha from the full original DataForge vision.

Current Status

Shipped in the current worktree:

dataforge profile, dataforge repair, dataforge revert, dataforge watch, dataforge audit, and dataforge bench
Three detector families: type_mismatch, decimal_shift, fd_violation
Reviewable schema inference in profile --json, including inferred column types, domains, regex candidates, uniqueness, and FD candidates
Pending constraint review artifacts via profile --constraints-out, which can feed repair only after individual candidates are marked accepted
Matching deterministic repairers wired through SafetyFilter -> SMTVerifier
Backend-neutral PatchPlan and TableStore contracts for CSV, DuckDB, and dry-run-only cloud warehouse boundaries
Reversible hash-chained transaction journals with immutable source snapshots
Public backend repair engine at dataforge.engine.repair
Real-world benchmark harness for Hospital, Flights, and Beers
OpenEnv-compatible HTTP environment with eight typed actions, including read-only ROOT_CAUSE
Causal root-cause analyzer for cascading data-quality errors
Standalone dataforge-mcp package exposing DataForge tools over MCP
Week 9 SFT oracle trajectory workflow, readiness gate, Kaggle notebook, and release verifier
Separate Gradio model-demo Space source for the published 0.5B SFT smoke checkpoint
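
To make the fd_violation detector family concrete, here is a minimal sketch of functional-dependency checking. It is not DataForge's implementation: the helper name find_fd_violations and the sample columns zip and state are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def find_fd_violations(rows, determinant, dependent):
    """Return {determinant value: sorted dependent values} for violating keys.

    Hypothetical helper: the dependency determinant -> dependent holds only
    if every determinant value maps to exactly one dependent value.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    # Keep only determinant values that map to more than one dependent value.
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


rows = [
    {"zip": "35233", "state": "AL"},
    {"zip": "35233", "state": "AK"},  # conflicts with the row above
    {"zip": "36302", "state": "AL"},
]
violations = find_fd_violations(rows, "zip", "state")
print(violations)
```

A detector of this shape only reports that a group of rows is inconsistent; choosing which value is correct is a separate repair decision.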

Not shipped yet:

published dataforge_07, dataforge_07_mcp, dataforge_07_evals, dataforge_07_dbt, and dataforge_07_agent_patterns packages
committed production verification for the Cloudflare Workers playground
warehouse-native or external adapter packages
credentialed Snowflake, BigQuery, or Databricks apply/revert conformance
design-partner, pilot-user, or customer validation evidence
a production-quality trained model family
autonomous repair in the playground or model demo

Quickstart

python -m pip install -e ".[dev]"
dataforge profile fixtures/hospital_10rows.csv --schema fixtures/hospital_schema.yaml
dataforge profile fixtures/hospital_10rows.csv --constraints-out constraints.json
dataforge constraints review constraints.json
dataforge repair fixtures/hospital_10rows.csv --schema fixtures/hospital_schema.yaml --dry-run
dataforge repair fixtures/hospital_10rows.csv --constraints constraints.json --dry-run
dataforge watch fixtures/hospital_10rows.csv --schema fixtures/hospital_schema.yaml --once --json
dataforge bench --methods random,heuristic --datasets hospital,flights,beers --seeds 3 --seed-list 0,1,2

dataforge15 remains a temporary staging compatibility alias, but public docs and release evidence must use dataforge_07 for PyPI distribution identity and dataforge for the installed CLI/import identity.

To apply repairs, use --apply. Applied repairs write a transaction journal and source snapshot before mutating the CSV, so they can be reverted:

dataforge repair path/to/file.csv --schema path/to/schema.yaml --apply
dataforge audit <txn-id>
dataforge revert <txn-id>
dataforge revert <txn-id> --search-root path/to --json
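
The snapshot-before-mutate idea behind --apply and revert can be sketched as follows. This is an illustration under stated assumptions, not the DataForge journal format: the function names, the journal directory layout, and the .snapshot suffix are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def apply_with_snapshot(csv_path: Path, new_text: str, journal_dir: Path) -> Path:
    """Copy the source file aside, then overwrite it. Returns the snapshot path."""
    journal_dir.mkdir(parents=True, exist_ok=True)
    snapshot = journal_dir / (csv_path.name + ".snapshot")
    shutil.copy2(csv_path, snapshot)  # snapshot is written before any mutation
    csv_path.write_text(new_text, encoding="utf-8")
    return snapshot


def revert_from_snapshot(csv_path: Path, snapshot: Path) -> None:
    """Restore the original bytes from the snapshot."""
    shutil.copy2(snapshot, csv_path)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.csv"
    source.write_text("id,amount\n1,1000\n", encoding="utf-8")
    original = source.read_bytes()

    snap = apply_with_snapshot(source, "id,amount\n1,10.00\n", Path(tmp) / "journal")
    changed = source.read_bytes() != original

    revert_from_snapshot(source, snap)
    restored = source.read_bytes() == original
```

Writing the snapshot first means a failure during the mutation still leaves a complete copy of the original to restore from.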

Warehouse targets use warehouse:// URIs and always emit a patch_plan_v1 contract before any mutation. DuckDB is the local conformance backend; cloud warehouse adapters are dry-run-only boundaries until credentialed apply, audit, and rollback suites are enabled:

dataforge repair "warehouse://duckdb?database=dev.duckdb&relation=main.model&row_id=id" --dry-run --json
dataforge repair "warehouse://snowflake?relation=PUBLIC.MODEL&row_id=ID" --dry-run --json

DuckDB --apply requires a stable row identity, records the patch plan in the transaction journal, and can be reverted through the same audit and revert commands. Snowflake, BigQuery, and Databricks apply are intentionally refused until their conformance gates prove reversible transactions.
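
The warehouse:// targets shown above are ordinary URIs, so their parts can be read with the Python standard library. The sketch below is an assumption about how such a URI could be decomposed, not the parser DataForge uses; parse_warehouse_uri and the backend key are hypothetical names.

```python
from urllib.parse import parse_qs, urlsplit


def parse_warehouse_uri(uri: str) -> dict:
    """Split a warehouse:// URI into a backend name plus its query parameters."""
    parts = urlsplit(uri)
    if parts.scheme != "warehouse":
        raise ValueError(f"expected a warehouse:// URI, got {uri!r}")
    # parse_qs returns lists; keep the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"backend": parts.netloc, **params}


target = parse_warehouse_uri(
    "warehouse://duckdb?database=dev.duckdb&relation=main.model&row_id=id"
)
# A stable row identity is required before apply; here it is the row_id parameter.
has_row_id = "row_id" in target
print(target)
```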

New transaction logs are local tamper-evident hash chains. dataforge audit verifies the chain head, event order, replayability, and revert prerequisites; legacy v1 logs remain replayable but are reported as unverified because they do not contain event hashes.
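
A tamper-evident hash chain of the kind described above can be sketched in a few lines. The event fields, the all-zero genesis value, and the use of SHA-256 over canonical JSON are assumptions made for this example; the real journal schema is not specified in this README.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous event hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(log: list, payload: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev, "hash": event_hash(prev, payload)})


def verify_chain(log: list) -> bool:
    """Recompute every hash in order; any edit to an earlier event breaks the chain."""
    prev = GENESIS
    for event in log:
        if event["prev"] != prev or event["hash"] != event_hash(prev, event["payload"]):
            return False
        prev = event["hash"]
    return True


log = []
append_event(log, {"type": "snapshot", "file": "data.csv"})
append_event(log, {"type": "apply", "row": 3, "column": "amount"})
intact = verify_chain(log)

log[0]["payload"]["file"] = "other.csv"  # simulate tampering with an early event
tampered_detected = not verify_chain(log)
```

Because each hash covers the previous one, a log without event hashes (such as the legacy v1 logs mentioned above) can still be replayed but cannot be checked this way.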

Week 9 SFT Warmup

The current SFT workflow builds split-safe expert_v1 trajectory records from dirty/clean CSV diffs. Exact repairs in the primary dataset are labeled oracle_from_clean_diff, not inferred from Groq, Cerebras, or Gemini teacher guesses. Clean train chunks are retained as finish examples so the model learns when no repair is justified.

$env:HF_TOKEN="..."
.\.venv\Scripts\python.exe scripts\data\build_oracle_sft_trajectories.py
.\.venv\Scripts\python.exe scripts\data\validate_sft_readiness.py

This writes local ignored JSONL at data/sft_traj/expert_v1.jsonl and an auditable row split at data/sft_traj/split_manifest.json. Push the dataset bundle only after the readiness gate passes:

$env:HF_TOKEN="..."
.\.venv\Scripts\python.exe scripts\data\build_oracle_sft_trajectories.py --push-to-hub --hf-dataset-repo Praneshrajan15/dataforge-sft-trajectories
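
The oracle labeling described in this section, where exact repairs come from dirty/clean CSV diffs rather than teacher guesses, can be sketched as a cell-wise comparison. The record shape (row, column, old, new, source) and the function name diff_cells are hypothetical; only the oracle_from_clean_diff label comes from this README.

```python
import csv
import io


def diff_cells(dirty_csv: str, clean_csv: str) -> list:
    """Return one repair record per cell whose dirty value differs from the clean value."""
    dirty = list(csv.DictReader(io.StringIO(dirty_csv)))
    clean = list(csv.DictReader(io.StringIO(clean_csv)))
    if len(dirty) != len(clean):
        raise ValueError("dirty and clean tables must have the same number of rows")
    repairs = []
    for row_index, (dirty_row, clean_row) in enumerate(zip(dirty, clean)):
        for column, clean_value in clean_row.items():
            if dirty_row.get(column) != clean_value:
                repairs.append({
                    "row": row_index,
                    "column": column,
                    "old": dirty_row.get(column),
                    "new": clean_value,
                    "source": "oracle_from_clean_diff",
                })
    return repairs


dirty = "id,state\n1,AL\n2,AK\n"
clean = "id,state\n1,AL\n2,AL\n"
repairs = diff_cells(dirty, clean)
no_repairs = diff_cells(clean, clean)  # a clean chunk yields no repair records
print(repairs)
```

An empty result for a clean chunk corresponds to the finish examples mentioned above, where no repair is justified.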

The current public smoke checkpoint is Praneshrajan15/DataForge-0.5B-SFT, with trajectories at Praneshrajan15/dataforge-sft-trajectories. It proves the dataset, Kaggle training, merge, evaluation, and Hub upload path; it is not a production model-quality claim. Verify release artifacts before citing them:

.\.venv\Scripts\python.exe scripts\model\verify_sft_release.py --output eval\results\sft_release_v0_smoke.json
.\.venv\Scripts\python.exe scripts\model\verify_sft_release.py --min-dataset-records 272 --require-sha-metrics --output eval\results\sft_release_contract_v2_20260515.json

Week 12 GRPO Path

The repository now contains a gated GRPO post-training path for free-tier experiments:

training/configs/grpo_05b.yaml targets DataForge-0.5B-SFT -> DataForge-0.5B-GRPO.
training/configs/grpo_15b.yaml requires a verified DataForge-1.5B-SFT prerequisite before attempting DataForge-1.5B-GRPO.
training/rewards/dataforge_reward.py scores completions locally through the repair_contract_v1 exact-repair contract.
training/kaggle/grpo_kaggle.ipynb blocks Hub upload unless GRPO beats SFT by at least 3 absolute F1 points on DataForge-Bench-light-verified.
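
The upload gate in the last item can be expressed as a small predicate. This sketch assumes F1 is reported on a 0 to 1 scale, so that "3 absolute F1 points" means a difference of 0.03; the function name and signature are hypothetical and are not taken from the notebook.

```python
def grpo_upload_allowed(sft_f1: float, grpo_f1: float, min_gain_points: float = 3.0) -> bool:
    """Return True only if GRPO beats SFT by at least min_gain_points absolute F1 points.

    Assumes both scores are on a 0-1 scale; one point equals 0.01.
    """
    # Round to avoid spurious failures from floating-point noise at the boundary.
    gain_points = round((grpo_f1 - sft_f1) * 100.0, 6)
    return gain_points >= min_gain_points


allowed = grpo_upload_allowed(sft_f1=0.42, grpo_f1=0.46)   # gain of 4 points
blocked = grpo_upload_allowed(sft_f1=0.42, grpo_f1=0.43)   # gain of 1 point
```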

No GRPO checkpoint is described as a quality milestone in this README until scripts/model/verify_grpo_release.py produces committed verification evidence. Refresh benchmark tables only from generated JSON, and only after GRPO eval evidence exists:

.\.venv\Scripts\python.exe scripts\bench\refresh_benchmark_table.py --skip-agent-run --trained-model-json eval\results\grpo_model_comparison.json

MCP Server

The nested dataforge-mcp/ source directory builds the standalone dataforge_07_mcp distribution. It is not published yet, so install it from source while release ownership is pending:

cd dataforge-mcp
python -m pip install -e ".[dev]"
dataforge-mcp serve

Tools: dataforge_profile, dataforge_detect_errors, dataforge_verify_fix, dataforge_apply_repairs, and dataforge_revert. The default transport is stdio. MCP reads and writes are sandboxed to configured allowed roots; dry-run works by default, while apply requires --enable-apply. Streamable HTTP is available for local experiments.
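
Restricting reads and writes to allowed roots usually comes down to resolving a path and checking that it stays under one of the configured directories. The sketch below illustrates that check with pathlib; it is an assumption about the general technique, not the dataforge-mcp code, and is_within_allowed_roots is a hypothetical name.

```python
import tempfile
from pathlib import Path


def is_within_allowed_roots(candidate: str, allowed_roots: list) -> bool:
    """Return True if the resolved candidate path lies under any allowed root."""
    resolved = Path(candidate).resolve()  # collapses ".." and follows symlinks
    for root in allowed_roots:
        if resolved.is_relative_to(Path(root).resolve()):
            return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    root.mkdir()
    inside = is_within_allowed_roots(str(root / "file.csv"), [str(root)])
    # A ".." segment that climbs out of the root must be rejected.
    escaped = is_within_allowed_roots(str(root / ".." / "secret.csv"), [str(root)])
```

Resolving before comparing matters: a plain string-prefix check would accept paths that use ".." or symlinks to leave the sandbox.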

The monorepo packages/ directory contains the side-package release sources for dataforge_07_evals, dataforge_07_dbt, and dataforge_07_agent_patterns.

Playground And Model Demo

playground/api/ is the API backend for the CSV playground. Public Space deployments use dataforge-playground.
playground/web/ is the static browser UI deployed through Cloudflare Workers Static Assets. Its primary workflow is POST /api/analyze: upload a CSV, review categorical risk and pending inferred constraints, inspect verified dry-run repairs and non-repairs, then export a receipt with the local CLI apply/audit/revert command shape.
The current verified public playground URL is https://dataforge.praneshrajan15.workers.dev/playground, backed by https://Praneshrajan15-dataforge-playground.hf.space.
That Workers URL is the release URL and the intended production playground surface for the full original vision.
playground-model/ is a separate Gradio Space demo for the published DataForge-0.5B-SFT smoke checkpoint. It accepts small CSV snippets and is intentionally limited to demo use.

The playground does not persist uploaded files, does not use browser storage, does not mutate data in the hosted flow, and does not call an LLM unless a backend provider key is explicitly configured.

Benchmark Results

Generated from eval/results/agent_comparison.json (schema dataforge_benchmark_run_v2, seeds 0, 1, 2, git dbd1bed0a03c, dirty true).

Method	Precision	Recall	F1	Avg Steps	Quota Units	GPU Hours
heuristic	0.3167	0.3025	0.2772	374.33	0.0000	0.0000
random	0.0038	0.0003	0.0005	150.33	0.0000	0.0000

See BENCHMARK_REPORT.md for per-dataset tables, error bars, and citation-only SOTA rows.

Dataset bytes are pinned to BigDaMa/raha revision 7be1334b8c7bbdac3f47ef514fb3e1e8c5fc181c for hospital, flights, beers; dirty/clean SHA-256s are recorded in the JSON metadata.
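
In the table above, the F1 column is not the harmonic mean of the Precision and Recall columns. That is what one would expect if each metric is averaged over runs separately, although this README does not state the aggregation, so treat that as an assumption. The sketch below uses made-up cell sets to show how the mean of per-run F1 differs from the F1 of the mean precision and recall.

```python
def precision_recall_f1(predicted: set, truth: set) -> tuple:
    """Cell-level precision, recall, and F1 for one run."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Two illustrative runs as (predicted repairs, true errors); the cells are invented.
runs = [
    ({"a", "b"}, {"a", "b", "c", "d"}),
    ({"a", "b", "c", "d"}, {"a", "e"}),
]
scores = [precision_recall_f1(predicted, truth) for predicted, truth in runs]

mean_precision = sum(score[0] for score in scores) / len(scores)
mean_recall = sum(score[1] for score in scores) / len(scores)
mean_f1 = sum(score[2] for score in scores) / len(scores)
f1_of_means = 2 * mean_precision * mean_recall / (mean_precision + mean_recall)
print(round(mean_f1, 4), round(f1_of_means, 4))
```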

Local Setup

make setup
make lint
make type
make test
make backend-gate
make release-gate

Verification works on Linux, macOS, and Windows; on Windows, Git Bash must be available for the GNU Make recipes. Python support is >=3.11,<3.13.

profile --constraints-out writes a strict constraint_review_v1 JSON artifact. Every inferred candidate starts as pending; repair ignores pending and rejected candidates. In v1, only accepted column_type, domain_bound, and functional_dependency candidates affect repair. Accepted regex and uniqueness candidates remain review evidence until verifier support is added. Use dataforge constraints review constraints.json for the Textual review UI, or use deterministic CI flags such as --accept cnd-... --no-tui --json.
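
The acceptance rule above can be read as a filter over the review artifact. The JSON field names used here (candidates, id, kind, status) and the candidate IDs are hypothetical; the actual constraint_review_v1 schema is not shown in this README, so treat this as a sketch of the rule rather than of the file format.

```python
import json

# Candidate kinds that affect repair in v1, per the paragraph above.
REPAIR_KINDS = {"column_type", "domain_bound", "functional_dependency"}


def active_constraints(artifact: dict) -> list:
    """Keep only accepted candidates whose kind is supported by repair."""
    return [
        candidate
        for candidate in artifact["candidates"]
        if candidate["status"] == "accepted" and candidate["kind"] in REPAIR_KINDS
    ]


artifact = json.loads("""
{
  "candidates": [
    {"id": "cnd-1", "kind": "column_type", "status": "accepted"},
    {"id": "cnd-2", "kind": "domain_bound", "status": "pending"},
    {"id": "cnd-3", "kind": "regex", "status": "accepted"},
    {"id": "cnd-4", "kind": "functional_dependency", "status": "rejected"}
  ]
}
""")
active_ids = [candidate["id"] for candidate in active_constraints(artifact)]
print(active_ids)
```

Here cnd-3 is accepted but still excluded, matching the note that accepted regex and uniqueness candidates remain review evidence only.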

make backend-gate is the release-quality backend check: lint, format, strict mypy, root tests, MCP tests, README truth, benchmark truth, OpenAPI snapshot drift, secret scan, dependency audit availability, SBOM generation availability, and package build availability for both dataforge_07 and dataforge_07_mcp. The gate covers the core dataforge_07 distribution and release surfaces; the historical data_quality_env namespace remains source-tree regression coverage, not part of the dataforge wheel or source distribution.

Before release, run scripts/ci/backend_gate.py --require-optional so dependency audit, SBOM generation, and package builds are hard failures rather than availability checks.

Release doctor scopes:

dataforge release doctor --core --json
dataforge release doctor --maintainer-deploy --json
dataforge release gate --json
dataforge release full-vision --json

--core is the default OSS release check. --maintainer-deploy additionally checks maintainer-specific Hugging Face, Kaggle OAuth plus clean-config Kaggle CLI execution, and Cloudflare state. release gate is the authoritative fresh-user proof: it builds the distribution, audits wheel contents, creates a dependency wheelhouse, installs with pip --no-index --find-links, then runs profile, repair dry-run, apply, constraint review, audit, revert, and post-revert audit from outside the source checkout.

Configure pending trusted publishers for dataforge_07 on TestPyPI and PyPI before tagging. The real PyPI workflow refuses pre-release metadata and should only run after trusted publishing, attestations, and fresh-install evidence are verified. dataforge release full-vision --json is expected to fail until PyPI publication evidence, dbt-duckdb proof, design-partner evidence, and model-family evidence all exist; none of these is met yet.

Windows setup:

winget install -e --id Python.Python.3.12
winget install -e --id ezwinports.make
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -e ".[all]"
make lint
make type
make test

Environment Variables

Provider keys belong in a root .env file, which is gitignored and loaded with python-dotenv where needed.

GROQ_API_KEY
GEMINI_API_KEY
CEREBRAS_API_KEY
OPENROUTER_API_KEY
HF_TOKEN
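
A quick way to see which of these variables are set is sketched below. It loads a .env file from the current working directory with python-dotenv when that package is installed, which assumes the script is run from the repository root; the missing_keys helper is hypothetical.

```python
import os

PROVIDER_KEYS = [
    "GROQ_API_KEY",
    "GEMINI_API_KEY",
    "CEREBRAS_API_KEY",
    "OPENROUTER_API_KEY",
    "HF_TOKEN",
]

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package

    # Assumes the current directory is the repository root; a missing file is ignored.
    load_dotenv(".env")
except ImportError:
    pass  # fall back to whatever is already in the process environment


def missing_keys(environ=os.environ) -> list:
    """Return the names of provider keys that are unset or empty."""
    return [name for name in PROVIDER_KEYS if not environ.get(name)]


print(missing_keys())
```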

When DataForge Is The Wrong Tool

Do not use DataForge for streaming data, very large warehouse tables, regulated workflows where every fix must be human-authored, strict low-latency SLAs, or teams already well served by maintained Great Expectations/dbt suites. DataForge is currently best suited to local CSV profiling, repair experiments, benchmark runs, and training/evaluation research.

Repository Docs

.cursor/rules/dataforge.md - always-applied contribution rules
ARCHITECTURE.md - current system architecture and dependencies
DECISIONS.md - technical decision log
CONTRIBUTING.md - workflow and code standards
CLAUDE.md - living gotcha log for agent sessions
CURSOR_MASTER.md - context and prompt pack
META_CONTEXT.md - project meta-context
FILE_STRUCTURE.md - current and planned directory map
SECURITY.md - vulnerability reporting policy
specs/SPEC_TEMPLATE.md - template for new module specs

License

Apache-2.0. See LICENSE.

Release history

0.1.0 (this version), released Jun 12, 2026

Download files

Source Distribution

dataforge_07-0.1.0.tar.gz (166.9 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

dataforge_07-0.1.0-py3-none-any.whl (214.8 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file dataforge_07-0.1.0.tar.gz.

File metadata

Download URL: dataforge_07-0.1.0.tar.gz
Upload date: Jun 12, 2026
Size: 166.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dataforge_07-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9948b8192300edd2a8058ee09b7b40d716d5748982e0524b83f21a08204e553f`
MD5	`bc149ae004932f0d734cf1636422b253`
BLAKE2b-256	`dd2e08cc488b6aa7cdd35cc1e0ea7aad94ffc7298072ff40b5521fa0fbc261fc`
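
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the expected digest is copied from the table, while sha256_of and the small demo file are illustrative, not part of DataForge.

```python
import hashlib
import tempfile
from pathlib import Path

# SHA256 of dataforge_07-0.1.0.tar.gz as listed in the table above.
EXPECTED_SDIST_SHA256 = "9948b8192300edd2a8058ee09b7b40d716d5748982e0524b83f21a08204e553f"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large downloads are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a tiny temporary file; for the real check, point sha256_of at the
# downloaded archive and compare the result with EXPECTED_SDIST_SHA256.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"abc")
    demo_digest = sha256_of(demo)
```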

Provenance

The following attestation bundles were made for dataforge_07-0.1.0.tar.gz:

Publisher: publish-dataforge.yml on Aegis15/dataforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataforge_07-0.1.0.tar.gz
- Subject digest: 9948b8192300edd2a8058ee09b7b40d716d5748982e0524b83f21a08204e553f
- Sigstore transparency entry: 1804372477
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: Aegis15/dataforge@d498b656734241e343673fafe1b11676b475bf60
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Aegis15
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-dataforge.yml@d498b656734241e343673fafe1b11676b475bf60
- Trigger Event: workflow_dispatch

File details

Details for the file dataforge_07-0.1.0-py3-none-any.whl.

File metadata

Download URL: dataforge_07-0.1.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 214.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for dataforge_07-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f30dc5d8da690dad41e596b14e7ec5b49567eed6288a6d8bed6ce31b9d19bbcf`
MD5	`019f69f455ee6c98d99702974ddb8091`
BLAKE2b-256	`47f990e52c2a336688cda8555ef3b50d543bf202c7a2b8f9f008d832983290cf`

Provenance

The following attestation bundles were made for dataforge_07-0.1.0-py3-none-any.whl:

Publisher: publish-dataforge.yml on Aegis15/dataforge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataforge_07-0.1.0-py3-none-any.whl
- Subject digest: f30dc5d8da690dad41e596b14e7ec5b49567eed6288a6d8bed6ce31b9d19bbcf
- Sigstore transparency entry: 1804372905
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: Aegis15/dataforge@d498b656734241e343673fafe1b11676b475bf60
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Aegis15
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-dataforge.yml@d498b656734241e343673fafe1b11676b475bf60
- Trigger Event: workflow_dispatch
