DataForge: CLI-first data-quality detection and reversible repair for tabular data.
DataForge
DataForge is a CLI-first data-quality repair toolkit for tabular data. It detects common CSV issues, proposes deterministic repairs, checks proposed changes through safety and verification gates, and records applied changes in a reversible transaction log.
The final public product name is DataForge. The PyPI/TestPyPI distribution
family is dataforge_07* because the unqualified dataforge project name is
occupied by unrelated packages. Installing dataforge_07 still provides the
dataforge import namespace and dataforge CLI. dataforge15 is only a
temporary staging alias retained for local compatibility.
The current repository is an alpha implementation. It also contains the OpenEnv-compatible training environment, the SFT warmup workflow, a local MCP server package, and playground/demo sources. Warehouse integrations and production model-quality claims remain future work.
Before any public release, review THREAT_MODEL.md and docs/docs/release.md.
They define the security, supply-chain, and evidence gates that separate the
current alpha from the full original DataForge vision.
Current Status
Shipped in the current worktree:
- dataforge profile, dataforge repair, dataforge revert, dataforge watch, dataforge audit, and dataforge bench
- Three detector families: type_mismatch, decimal_shift, fd_violation
- Reviewable schema inference in profile --json, including inferred column types, domains, regex candidates, uniqueness, and FD candidates
- Pending constraint review artifacts via profile --constraints-out, which can feed repair only after individual candidates are marked accepted
- Matching deterministic repairers wired through SafetyFilter -> SMTVerifier
- Backend-neutral PatchPlan and TableStore contracts for CSV, DuckDB, and dry-run-only cloud warehouse boundaries
- Reversible hash-chained transaction journals with immutable source snapshots
- Public backend repair engine at dataforge.engine.repair
- Real-world benchmark harness for Hospital, Flights, and Beers
- OpenEnv-compatible HTTP environment with eight typed actions, including read-only ROOT_CAUSE
- Causal root-cause analyzer for cascading data-quality errors
- Standalone dataforge-mcp package exposing DataForge tools over MCP
- Week 9 SFT oracle trajectory workflow, readiness gate, Kaggle notebook, and release verifier
- Separate Gradio model-demo Space source for the published 0.5B SFT smoke checkpoint
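To make the fd_violation detector family more concrete, here is a minimal sketch of how a functional-dependency check can work in general. It is not DataForge's implementation: the function name find_fd_violations, the row format, and the sample zip/city values are all assumptions made for illustration.

```python
from collections import defaultdict


def find_fd_violations(rows, determinant, dependent):
    """Return {determinant value: sorted dependent values} for every group
    that breaks the dependency determinant -> dependent.

    Hypothetical helper for illustration; not part of the DataForge API.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    # A dependency holds only if each determinant maps to one dependent value.
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Made-up sample rows: zip 35233 maps to two different city spellings.
rows = [
    {"zip": "35233", "city": "birmingham"},
    {"zip": "35233", "city": "birmingxam"},
    {"zip": "36301", "city": "dothan"},
]
violations = find_fd_violations(rows, "zip", "city")
print(violations)
```

A check like this only flags the conflicting group. Deciding which value is correct is a separate repair step.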
Not shipped yet:
- published dataforge_07, dataforge_07_mcp, dataforge_07_evals, dataforge_07_dbt, and dataforge_07_agent_patterns packages
- committed production verification for the Cloudflare Workers playground
- warehouse-native or external adapter packages
- credentialed Snowflake, BigQuery, or Databricks apply/revert conformance
- design-partner, pilot-user, or customer validation evidence
- a production-quality trained model family
- autonomous repair in the playground or model demo
Quickstart
python -m pip install -e ".[dev]"
dataforge profile fixtures/hospital_10rows.csv --schema fixtures/hospital_schema.yaml
dataforge profile fixtures/hospital_10rows.csv --constraints-out constraints.json
dataforge constraints review constraints.json
dataforge repair fixtures/hospital_10rows.csv --schema fixtures/hospital_schema.yaml --dry-run
dataforge repair fixtures/hospital_10rows.csv --constraints constraints.json --dry-run
dataforge watch fixtures/hospital_10rows.csv --schema fixtures/hospital_schema.yaml --once --json
dataforge bench --methods random,heuristic --datasets hospital,flights,beers --seeds 3 --seed-list 0,1,2
dataforge15 remains a temporary staging compatibility alias, but public docs
and release evidence must use dataforge_07 for PyPI distribution identity and
dataforge for the installed CLI/import identity.
To apply repairs, use --apply. Applied repairs write a transaction journal and
source snapshot before mutating the CSV, so they can be reverted:
dataforge repair path/to/file.csv --schema path/to/schema.yaml --apply
dataforge audit <txn-id>
dataforge revert <txn-id>
dataforge revert <txn-id> --search-root path/to --json
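The ordering described above (journal and snapshot first, mutation last) can be sketched in a few lines. This is an illustrative sketch only: the directory layout, the file names source.snapshot and journal.json, and both function names are assumptions, not DataForge's actual on-disk format.

```python
import json
import shutil
import tempfile
import uuid
from pathlib import Path


def apply_with_snapshot(csv_path: Path, new_text: str, journal_dir: Path) -> str:
    """Snapshot the source and write a journal entry before mutating the CSV.

    Hypothetical layout: <journal_dir>/<txn_id>/{source.snapshot, journal.json}.
    """
    txn_id = uuid.uuid4().hex
    txn_dir = journal_dir / txn_id
    txn_dir.mkdir(parents=True)
    snapshot = txn_dir / "source.snapshot"
    shutil.copy2(csv_path, snapshot)  # 1. snapshot the untouched source
    entry = {"txn_id": txn_id, "target": str(csv_path), "snapshot": str(snapshot)}
    (txn_dir / "journal.json").write_text(json.dumps(entry), encoding="utf-8")  # 2. journal
    csv_path.write_text(new_text, encoding="utf-8")  # 3. mutate last
    return txn_id


def revert(txn_id: str, journal_dir: Path) -> None:
    """Restore the target file from the snapshot recorded for txn_id."""
    entry = json.loads((journal_dir / txn_id / "journal.json").read_text(encoding="utf-8"))
    shutil.copy2(entry["snapshot"], entry["target"])


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "data.csv"
    original = "id,score\n1,9.5\n2,950\n"
    target.write_text(original, encoding="utf-8")
    txn = apply_with_snapshot(target, "id,score\n1,9.5\n2,9.50\n", root / "journal")
    after_apply = target.read_text(encoding="utf-8")
    revert(txn, root / "journal")
    after_revert = target.read_text(encoding="utf-8")
```

Because the snapshot exists before the first write to the target, a failure at any later step still leaves enough on disk to restore the original file.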
Warehouse targets use warehouse:// URIs and always emit a patch_plan_v1
contract before any mutation. DuckDB is the local conformance backend; cloud
warehouse adapters are dry-run-only boundaries until credentialed apply,
audit, and rollback suites are enabled:
dataforge repair "warehouse://duckdb?database=dev.duckdb&relation=main.model&row_id=id" --dry-run --json
dataforge repair "warehouse://snowflake?relation=PUBLIC.MODEL&row_id=ID" --dry-run --json
DuckDB --apply requires a stable row identity, records the patch plan in the
transaction journal, and can be reverted through the same audit and revert
commands. Snowflake, BigQuery, and Databricks apply are intentionally refused
until their conformance gates prove reversible transactions.
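The warehouse:// URIs shown above follow ordinary URL syntax, so they can be taken apart with the Python standard library. The sketch below is an assumption about how such a URI could be parsed and how the apply restriction could be expressed. parse_warehouse_uri, apply_supported, and the returned dictionary shape are hypothetical and not DataForge's API.

```python
from urllib.parse import parse_qs, urlsplit

# Per the text above, only DuckDB accepts --apply; cloud backends are dry-run-only.
APPLY_CAPABLE_BACKENDS = {"duckdb"}


def parse_warehouse_uri(uri: str) -> dict:
    """Split a warehouse:// URI into its backend name and query parameters."""
    parts = urlsplit(uri)
    if parts.scheme != "warehouse":
        raise ValueError(f"expected a warehouse:// URI, got scheme {parts.scheme!r}")
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"backend": parts.netloc, **params}


def apply_supported(target: dict) -> bool:
    """Return True only for backends that are allowed to mutate data."""
    return target["backend"] in APPLY_CAPABLE_BACKENDS


duck = parse_warehouse_uri(
    "warehouse://duckdb?database=dev.duckdb&relation=main.model&row_id=id"
)
snow = parse_warehouse_uri("warehouse://snowflake?relation=PUBLIC.MODEL&row_id=ID")
print(duck, apply_supported(duck))
print(snow, apply_supported(snow))
```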
New transaction logs are local tamper-evident hash chains. dataforge audit
verifies the chain head, event order, replayability, and revert prerequisites;
legacy v1 logs remain replayable but are reported as unverified because they do
not contain event hashes.
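For readers unfamiliar with hash chains, the following sketch shows the general technique: each event stores the hash of its predecessor, so editing any earlier event invalidates every later hash. The event fields, the all-zero genesis value, and the function names are assumptions for illustration and do not describe DataForge's journal format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous event"


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain: list, payload: dict) -> None:
    """Append an event linked to the current chain head."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": event_hash(prev, payload)})


def verify_chain(chain: list) -> bool:
    """Recompute every link in order; any edit or reordering breaks the chain."""
    prev = GENESIS
    for event in chain:
        if event["prev"] != prev or event["hash"] != event_hash(prev, event["payload"]):
            return False
        prev = event["hash"]
    return True


chain = []
append_event(chain, {"op": "snapshot", "file": "data.csv"})
append_event(chain, {"op": "set_cell", "row": 2, "column": "score", "new": "9.50"})
intact = verify_chain(chain)

chain[1]["payload"]["new"] = "95.0"  # simulate tampering with a recorded event
tampered_ok = verify_chain(chain)
print(intact, tampered_ok)
```

A local chain like this is tamper-evident, not tamper-proof: someone who can rewrite the whole file can also recompute every hash. A log without event hashes cannot be checked at all, which is why such logs can only be reported as unverified.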
Week 9 SFT Warmup
The current SFT workflow builds split-safe expert_v1 trajectory records from
dirty/clean CSV diffs. Exact repairs in the primary dataset are labeled
oracle_from_clean_diff, not inferred from Groq, Cerebras, or Gemini teacher
guesses. Clean train chunks are retained as finish examples so the model
learns when no repair is justified.
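The idea of deriving exact repairs from a dirty/clean pair can be sketched as a cell-by-cell diff. This is a simplified illustration, not the project's build script: the function name diff_repairs, the record fields, and the sample data are assumptions, and it presumes both files have the same rows in the same order.

```python
import csv
import io


def diff_repairs(dirty_text: str, clean_text: str) -> list:
    """Return one repair record per cell that differs between dirty and clean.

    Hypothetical record shape; only the label string comes from the text above.
    """
    dirty = list(csv.DictReader(io.StringIO(dirty_text)))
    clean = list(csv.DictReader(io.StringIO(clean_text)))
    if len(dirty) != len(clean):
        raise ValueError("dirty and clean files must have the same number of rows")
    repairs = []
    for index, (dirty_row, clean_row) in enumerate(zip(dirty, clean)):
        for column, old in dirty_row.items():
            new = clean_row[column]
            if old != new:
                repairs.append({
                    "row": index,
                    "column": column,
                    "old": old,
                    "new": new,
                    "label": "oracle_from_clean_diff",
                })
    return repairs


dirty_csv = "id,city\n1,birmingxam\n2,dothan\n"
clean_csv = "id,city\n1,birmingham\n2,dothan\n"
repairs = diff_repairs(dirty_csv, clean_csv)
no_repairs = diff_repairs(clean_csv, clean_csv)  # a clean chunk yields no repairs
print(repairs, no_repairs)
```

An empty result corresponds to the finish examples mentioned above, where the correct action is to make no repair.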
$env:HF_TOKEN="..."
.\.venv\Scripts\python.exe scripts\data\build_oracle_sft_trajectories.py
.\.venv\Scripts\python.exe scripts\data\validate_sft_readiness.py
This writes local ignored JSONL at data/sft_traj/expert_v1.jsonl and an
auditable row split at data/sft_traj/split_manifest.json. Push the dataset
bundle only after the readiness gate passes:
$env:HF_TOKEN="..."
.\.venv\Scripts\python.exe scripts\data\build_oracle_sft_trajectories.py --push-to-hub --hf-dataset-repo Praneshrajan15/dataforge-sft-trajectories
The current public smoke checkpoint is
Praneshrajan15/DataForge-0.5B-SFT, with trajectories at
Praneshrajan15/dataforge-sft-trajectories. It proves the dataset, Kaggle
training, merge, evaluation, and Hub upload path; it is not a production
model-quality claim. Verify release artifacts before citing them:
.\.venv\Scripts\python.exe scripts\model\verify_sft_release.py --output eval\results\sft_release_v0_smoke.json
.\.venv\Scripts\python.exe scripts\model\verify_sft_release.py --min-dataset-records 272 --require-sha-metrics --output eval\results\sft_release_contract_v2_20260515.json
Week 12 GRPO Path
The repository now contains a gated GRPO post-training path for free-tier experiments:
- training/configs/grpo_05b.yaml targets DataForge-0.5B-SFT -> DataForge-0.5B-GRPO.
- training/configs/grpo_15b.yaml requires a verified DataForge-1.5B-SFT prerequisite before attempting DataForge-1.5B-GRPO.
- training/rewards/dataforge_reward.py scores completions locally through the repair_contract_v1 exact-repair contract.
- training/kaggle/grpo_kaggle.ipynb blocks Hub upload unless GRPO beats SFT by at least 3 absolute F1 points on DataForge-Bench-light-verified.
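The upload gate in the last item amounts to a threshold comparison. The sketch below assumes F1 is reported on a 0 to 1 scale, so that 3 absolute points equals a difference of 0.03. The function name and the example scores are hypothetical and are not measured results.

```python
def grpo_upload_allowed(sft_f1: float, grpo_f1: float, min_gain_points: float = 3.0) -> bool:
    """Return True when GRPO beats SFT by at least min_gain_points absolute F1 points.

    Assumes both scores are fractions in [0, 1]; rounding absorbs float noise.
    """
    gain_points = round((grpo_f1 - sft_f1) * 100.0, 6)
    return gain_points >= min_gain_points


# Made-up scores purely to exercise the gate.
passes = grpo_upload_allowed(sft_f1=0.40, grpo_f1=0.44)   # +4 points
blocked = grpo_upload_allowed(sft_f1=0.40, grpo_f1=0.42)  # +2 points
print(passes, blocked)
```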
No GRPO checkpoint is described as a quality milestone in this README until
scripts/model/verify_grpo_release.py produces committed verification
evidence. Refresh benchmark tables only from generated JSON:
After GRPO eval evidence exists:
.\.venv\Scripts\python.exe scripts\bench\refresh_benchmark_table.py --skip-agent-run --trained-model-json eval\results\grpo_model_comparison.json
MCP Server
The nested dataforge-mcp/ source directory builds the standalone
dataforge_07_mcp distribution. It is not published yet, so install it from
source while release ownership is pending:
cd dataforge-mcp
python -m pip install -e ".[dev]"
dataforge-mcp serve
Tools: dataforge_profile, dataforge_detect_errors,
dataforge_verify_fix, dataforge_apply_repairs, and dataforge_revert.
The default transport is stdio. MCP reads and writes are sandboxed to configured
allowed roots; dry-run works by default, while apply requires --enable-apply.
Streamable HTTP is available for local experiments.
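Sandboxing file access to allowed roots usually comes down to resolving the requested path and checking that it sits under one of them. The sketch below shows that general pattern with pathlib. is_within_allowed_roots is a hypothetical helper and says nothing about how dataforge-mcp actually enforces its sandbox.

```python
import tempfile
from pathlib import Path


def is_within_allowed_roots(candidate, allowed_roots) -> bool:
    """Return True if candidate resolves to a location under any allowed root.

    Resolving first collapses '..' segments and symlinks, so a path such as
    root/../secrets.csv cannot escape the sandbox by string tricks.
    """
    resolved = Path(candidate).resolve()
    return any(resolved.is_relative_to(Path(root).resolve()) for root in allowed_roots)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "workspace"
    root.mkdir()
    inside = is_within_allowed_roots(root / "data" / "file.csv", [root])
    escaped = is_within_allowed_roots(root / ".." / "secrets.csv", [root])
    print(inside, escaped)
```

Path.is_relative_to requires Python 3.9 or later, which fits within the Python range this project supports.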
The monorepo packages/ directory contains the side-package release sources
for dataforge_07_evals, dataforge_07_dbt, and
dataforge_07_agent_patterns.
Playground And Model Demo
- playground/api/ is the API backend for the CSV playground. Public Space deployments use dataforge-playground.
- playground/web/ is the static browser UI deployed through Cloudflare Workers Static Assets. Its primary workflow is POST /api/analyze: upload a CSV, review categorical risk and pending inferred constraints, inspect verified dry-run repairs and non-repairs, then export a receipt with the local CLI apply/audit/revert command shape.
- The current verified public playground URL is https://dataforge.praneshrajan15.workers.dev/playground, backed by https://Praneshrajan15-dataforge-playground.hf.space.
- That Workers URL is the production playground surface for the full original vision; this is the release URL.
- playground-model/ is a separate Gradio Space demo for the published DataForge-0.5B-SFT smoke checkpoint. It accepts small CSV snippets and is intentionally limited to demo use.
The playground does not persist uploaded files, does not use browser storage, does not mutate data in the hosted flow, and does not call an LLM unless a backend provider key is explicitly configured.
Benchmark Results
Generated from eval/results/agent_comparison.json (schema dataforge_benchmark_run_v2, seeds 0, 1, 2, git dbd1bed0a03c, dirty true).
| Method | Precision | Recall | F1 | Avg Steps | Quota Units | GPU Hours |
|---|---|---|---|---|---|---|
| heuristic | 0.3167 | 0.3025 | 0.2772 | 374.33 | 0.0000 | 0.0000 |
| random | 0.0038 | 0.0003 | 0.0005 | 150.33 | 0.0000 | 0.0000 |
See BENCHMARK_REPORT.md for per-dataset tables, error bars, and citation-only SOTA rows.
Dataset bytes are pinned to BigDaMa/raha revision 7be1334b8c7bbdac3f47ef514fb3e1e8c5fc181c for hospital, flights, beers; dirty/clean SHA-256s are recorded in the JSON metadata.
Local Setup
make setup
make lint
make type
make test
make backend-gate
make release-gate
Verification works on Linux, macOS, and Windows; on Windows, Git Bash must be
available for the GNU Make recipes. Python support is >=3.11,<3.13.
profile --constraints-out writes a strict constraint_review_v1 JSON artifact.
Every inferred candidate starts as pending; repair ignores pending and
rejected candidates. In v1, only accepted column_type, domain_bound, and
functional_dependency candidates affect repair. Accepted regex and uniqueness
candidates remain review evidence until verifier support is added. Use
dataforge constraints review constraints.json for the Textual review UI, or
use deterministic CI flags such as --accept cnd-... --no-tui --json.
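The v1 rule above (only accepted candidates of three kinds affect repair) can be expressed as a simple filter. The candidate dictionaries below use hypothetical field names (id, kind, status) and made-up IDs. The actual constraint_review_v1 schema is not shown here, so treat this as a sketch of the rule, not of the file format.

```python
# Kinds that affect repair in v1, per the paragraph above.
REPAIR_KINDS = {"column_type", "domain_bound", "functional_dependency"}


def active_constraints(candidates: list) -> list:
    """Keep only accepted candidates whose kind the repairer acts on."""
    return [
        candidate
        for candidate in candidates
        if candidate["status"] == "accepted" and candidate["kind"] in REPAIR_KINDS
    ]


# Hypothetical candidates covering each outcome.
candidates = [
    {"id": "cnd-001", "kind": "column_type", "status": "accepted"},            # used
    {"id": "cnd-002", "kind": "regex", "status": "accepted"},                  # evidence only
    {"id": "cnd-003", "kind": "domain_bound", "status": "pending"},            # ignored
    {"id": "cnd-004", "kind": "functional_dependency", "status": "rejected"},  # ignored
]
active_ids = [candidate["id"] for candidate in active_constraints(candidates)]
print(active_ids)
```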
make backend-gate is the release-quality backend check: lint, format, strict
mypy, root tests, MCP tests, README truth, benchmark truth, OpenAPI snapshot
drift, secret scan, dependency audit availability, SBOM generation
availability, and package build availability for both dataforge_07 and
dataforge_07_mcp. The gate covers the core dataforge_07 distribution and
release surfaces; the historical
data_quality_env namespace remains source-tree regression coverage, not part
of the dataforge wheel or source distribution.
Before release, run scripts/ci/backend_gate.py --require-optional so
dependency audit, SBOM generation, and package builds are hard failures rather
than availability checks.
Release doctor scopes:
dataforge release doctor --core --json
dataforge release doctor --maintainer-deploy --json
dataforge release gate --json
dataforge release full-vision --json
--core is the default OSS release check. --maintainer-deploy additionally
checks maintainer-specific Hugging Face, Kaggle OAuth plus clean-config Kaggle
CLI execution, and Cloudflare state.
release gate is the authoritative fresh-user proof: it builds the
distribution, audits wheel contents, creates a dependency wheelhouse, installs
with pip --no-index --find-links, then runs profile, repair dry-run, apply,
constraint review, audit, revert, and post-revert audit from outside the source
checkout.
Configure pending trusted publishers for dataforge_07 on TestPyPI and PyPI
before tagging. The real PyPI workflow refuses pre-release metadata and should
only run after trusted publishing, attestations, and fresh-install evidence are
verified. dataforge release full-vision --json is expected to fail until PyPI
publication evidence, dbt-duckdb proof, design-partner evidence, and
model-family evidence are real.
Windows setup:
winget install -e --id Python.Python.3.12
winget install -e --id ezwinports.make
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install -e ".[all]"
make lint && make type && make test
Environment Variables
Provider keys belong in a root .env file, which is gitignored and loaded with
python-dotenv where needed.
- GROQ_API_KEY
- GEMINI_API_KEY
- CEREBRAS_API_KEY
- OPENROUTER_API_KEY
- HF_TOKEN
When DataForge Is The Wrong Tool
Do not use DataForge for streaming data, very large warehouse tables, regulated workflows where every fix must be human-authored, strict low-latency SLAs, or teams already well served by maintained Great Expectations/dbt suites. DataForge is currently best suited to local CSV profiling, repair experiments, benchmark runs, and training/evaluation research.
Repository Docs
- .cursor/rules/dataforge.md - always-applied contribution rules
- ARCHITECTURE.md - current system architecture and dependencies
- DECISIONS.md - technical decision log
- CONTRIBUTING.md - workflow and code standards
- CLAUDE.md - living gotcha log for agent sessions
- CURSOR_MASTER.md - context and prompt pack
- META_CONTEXT.md - project meta-context
- FILE_STRUCTURE.md - current and planned directory map
- SECURITY.md - vulnerability reporting policy
- specs/SPEC_TEMPLATE.md - template for new module specs
License
Apache-2.0. See LICENSE.
File details
Details for the file dataforge_07-0.1.0.tar.gz.
File metadata
- Download URL: dataforge_07-0.1.0.tar.gz
- Upload date:
- Size: 166.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9948b8192300edd2a8058ee09b7b40d716d5748982e0524b83f21a08204e553f |
| MD5 | bc149ae004932f0d734cf1636422b253 |
| BLAKE2b-256 | dd2e08cc488b6aa7cdd35cc1e0ea7aad94ffc7298072ff40b5521fa0fbc261fc |
Provenance
The following attestation bundles were made for dataforge_07-0.1.0.tar.gz:
Publisher: publish-dataforge.yml on Aegis15/dataforge
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataforge_07-0.1.0.tar.gz
- Subject digest: 9948b8192300edd2a8058ee09b7b40d716d5748982e0524b83f21a08204e553f
- Sigstore transparency entry: 1804372477
- Sigstore integration time:
- Permalink: Aegis15/dataforge@d498b656734241e343673fafe1b11676b475bf60
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Aegis15
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-dataforge.yml@d498b656734241e343673fafe1b11676b475bf60
- Trigger Event: workflow_dispatch
File details
Details for the file dataforge_07-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dataforge_07-0.1.0-py3-none-any.whl
- Upload date:
- Size: 214.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f30dc5d8da690dad41e596b14e7ec5b49567eed6288a6d8bed6ce31b9d19bbcf |
| MD5 | 019f69f455ee6c98d99702974ddb8091 |
| BLAKE2b-256 | 47f990e52c2a336688cda8555ef3b50d543bf202c7a2b8f9f008d832983290cf |
Provenance
The following attestation bundles were made for dataforge_07-0.1.0-py3-none-any.whl:
Publisher: publish-dataforge.yml on Aegis15/dataforge
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: dataforge_07-0.1.0-py3-none-any.whl
- Subject digest: f30dc5d8da690dad41e596b14e7ec5b49567eed6288a6d8bed6ce31b9d19bbcf
- Sigstore transparency entry: 1804372905
- Sigstore integration time:
- Permalink: Aegis15/dataforge@d498b656734241e343673fafe1b11676b475bf60
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Aegis15
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-dataforge.yml@d498b656734241e343673fafe1b11676b475bf60
- Trigger Event: workflow_dispatch