Recipe-driven data preparation and caching for machine learning

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

pointmatic

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- MacOS
- POSIX :: Linux
Programming Language
- Python :: 3.12
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

DataRefinery

Recipe-driven data preparation and caching for machine learning.

DataRefinery compiles a single YAML recipe into a materialized instance: the recipe, the prepared dataset, the fitted statistics produced during preparation, and a report describing both. Re-running an unchanged recipe over unchanged inputs returns the cached instance unchanged; any semantic edit invalidates and rebuilds.

The first-class category is image classification; tabular and text plugin stubs exist to keep the abstractions honest. See the concept, features, and tech-spec documents for design depth.

Why DataRefinery

Data prep for ML training usually lives in throwaway notebooks: steps decay between projects, train/inference skew creeps in, splits leak, and reproducibility relies on muscle memory. DataRefinery replaces that with one declarative artifact (the recipe) and a deterministic materialization path. Highlights:

- Reproducibility contract. Same recipe + same inputs + same seed produces a byte-identical instance. Every stochastic operation is seeded; fitted statistics from the training split are persisted so inference replays the exact transformations.
- Semantic cache identity. Cache addressing combines a canonical hash of the recipe (whitespace- and key-order-insensitive) with a hash of the raw inputs and the seed. Cosmetic edits don't rebuild; meaningful edits do.
- Atomic temp-then-promote. Materialization writes into a temp directory and promotes only after success, so partial instances never appear in the cache.
- Recipe is authoritative. The recipe is the single source of truth for pipeline semantics. CLI flags and env vars only control execution context (cache root, log level, workers, plugin path). The one sanctioned override is --seed for ad-hoc runs.
- Library + CLI as co-equal surfaces. Same recipe, same operations, no second code path to drift from at serving time.
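
The semantic cache identity described above can be illustrated with a minimal sketch. This is not DataRefinery's actual canonicalization; it assumes the recipe has already been parsed into a plain dict, and the function names canonical_recipe_hash and cache_key are hypothetical. Serializing with sorted keys and fixed separators is one common way to make a hash insensitive to key order and whitespace.

```python
import hashlib
import json


def canonical_recipe_hash(recipe: dict) -> str:
    """Hash a parsed recipe so that key order and whitespace do not matter."""
    # sort_keys plus fixed separators yields one canonical string per recipe.
    canonical = json.dumps(recipe, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key(recipe: dict, input_hash: str, seed: int) -> tuple[str, str, int]:
    """Combine the three identity components: recipe hash, input hash, seed."""
    return (canonical_recipe_hash(recipe), input_hash, seed)


a = {"seed": 0, "Splits": {"ratios": {"train": 0.7, "val": 0.3}}}
b = {"Splits": {"ratios": {"val": 0.3, "train": 0.7}}, "seed": 0}  # keys reordered
c = {"seed": 0, "Splits": {"ratios": {"train": 0.8, "val": 0.2}}}  # semantic edit

print(canonical_recipe_hash(a) == canonical_recipe_hash(b))  # True
print(canonical_recipe_hash(a) == canonical_recipe_hash(c))  # False
```

Because YAML whitespace disappears at parse time, hashing the parsed structure rather than the file bytes is what makes cosmetic edits free in this sketch.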

Installation

DataRefinery requires Python 3.12.

pip install ml-datarefinery

The PyPI distribution name is ml-datarefinery; the Python import name and console script remain datarefinery (e.g. import datarefinery and datarefinery --help). This mirrors scikit-learn, which is installed as scikit-learn but imported as sklearn.

For the optional LLM-enhancement layer (FR-17), install with the [llm] extra:

pip install 'ml-datarefinery[llm]'

For the robustness-evaluation imagecorruptions_apply Generation op (FR-GEN-1), which applies Hendrycks-Dietterich (ICLR 2019) image corruptions, install with the [corruptions] extra:

pip install 'ml-datarefinery[corruptions]'

From source (development)

DataRefinery uses pyve to manage two isolated environments: one for the runtime package and one for dev tooling (ruff, mypy, pytest).

git clone https://github.com/pointmatic/datarefinery.git
cd datarefinery

# Runtime env
pyve init
pyve run pip install -e /path/to/datarefinery

# Dev tooling env (one-time)
pyve testenv init
pyve testenv install -r requirements-dev.txt
pyve testenv run pip install -e /path/to/datarefinery

# Run the test suite
pyve test

Quickstart

DataRefinery's documented user journey is init → validate → materialize → status. The example below uses an image_classification layout: a directory of class-named subdirectories, each holding image files.

my-images/
  cat/
    cat_001.png
    cat_002.png
    ...
  dog/
    dog_001.png
    ...

# 1. Scaffold a starter recipe from raw images (deterministic, offline).
datarefinery init --input my-images --output recipe.yaml

# 2. (Optional) Review the recipe and uncomment any suggested
#    Transformations (e.g. resize, normalize).

# 3. Validate the recipe against the schema and FR-2 static checks.
datarefinery validate recipe.yaml

# 4. Materialize the pipeline end-to-end. The first run is a cache
#    miss; the cached instance is promoted atomically on success.
datarefinery --cache-root ./cache materialize recipe.yaml

# 5. Inspect the instance summary (cache hit on a re-run).
datarefinery --cache-root ./cache status recipe.yaml
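
The atomic promotion mentioned in step 4 follows a general temp-then-rename pattern. The sketch below shows that pattern only, under the assumption that the temp directory lives next to the final directory (same filesystem) and that the final directory does not exist yet. The function name materialize_atomically and the build callback are hypothetical and are not part of the DataRefinery API.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def materialize_atomically(final_dir: Path, build: Callable[[Path], None]) -> Path:
    """Build into a temp directory, then rename it into place on success."""
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    # Same parent directory means same filesystem, so the rename is atomic on POSIX.
    tmp_dir = Path(tempfile.mkdtemp(prefix=".tmp-", dir=final_dir.parent))
    try:
        build(tmp_dir)                 # any failure here leaves final_dir untouched
        os.rename(tmp_dir, final_dir)  # promote
    except BaseException:
        shutil.rmtree(tmp_dir, ignore_errors=True)  # never leave a partial instance
        raise
    return final_dir


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "instances" / "example"
    materialize_atomically(target, lambda d: (d / "manifest.json").write_text("{}"))
    print(sorted(p.name for p in target.iterdir()))  # ['manifest.json']
```

A reader that only ever looks at the final path sees either nothing or a complete instance, which is the property the quickstart relies on.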

After a successful materialize, the cache layout looks like:

cache/instances/<recipe-hash>/<input-hash>/<seed>/
├── recipe.yaml              # exact recipe used (canonicalized for the cache key)
├── manifest.json            # full hashes, record counts, schema version
├── dataset/                 # prepared dataset (e.g. <split>.jsonl)
├── fitted_statistics/       # statistics fitted on the training split
└── report/
    ├── report.md            # human-readable summary
    ├── drift.json           # stable contract for downstream drift tooling
    └── visualizations/      # PNGs declared in the recipe

The <recipe-hash> and <input-hash> directory names use the first 16 hex characters of each SHA-256; the full hashes are recorded in manifest.json.
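
As a small illustration of the naming rule above, the following sketch derives an instance directory from two full SHA-256 digests and a seed. The helper name instance_dir is hypothetical, and the byte strings being hashed are placeholders; how DataRefinery actually computes the recipe and input digests is not shown here.

```python
import hashlib
from pathlib import Path


def instance_dir(cache_root: Path, recipe_sha256: str, input_sha256: str, seed: int) -> Path:
    """Build cache/instances/<recipe-hash>/<input-hash>/<seed>/ from full digests."""
    # Directory names keep only the first 16 hex characters of each digest.
    return cache_root / "instances" / recipe_sha256[:16] / input_sha256[:16] / str(seed)


# Placeholder inputs, purely for illustration.
recipe_sha = hashlib.sha256(b"canonical recipe bytes").hexdigest()
input_sha = hashlib.sha256(b"raw input bytes").hexdigest()

path = instance_dir(Path("cache"), recipe_sha, input_sha, seed=0)
print(path)
```

Truncated names keep paths short, and the full digests in manifest.json remain available for exact verification.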

Alternative layout: flat directory + sidecar labels

If your dataset is a flat directory of images plus a separate manifest of labels (the common third-party shape — Kaggle CSVs, re-labeled datasets, etc.), declare the source as image_flat and point its label_from at the manifest:

my-dataset/
  images/
    img_001.png
    img_002.png
    ...
  labels.csv         # filename,class

Input:
  sources:
    - name: images
      type: image_flat
      path: ./my-dataset/images
      label_from:
        path: ./my-dataset/labels.csv
        join: by_id
        id_field: filename
        label_field: class
Labels:
  field: label
  source: { kind: direct }

The loader joins each image's filename stem against the manifest's filename column and writes the matching class value into the record's label field at load time. validate enforces the join (check 19): missing ids, duplicate ids, and column-name typos are caught before materialize runs. See docs/guides/recipe-authoring.md for headerless manifests and by_row_order (CIFAR-style) variants.
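
The by_id join and the failure modes that check 19 guards against can be sketched as follows. This is an illustrative approximation, not the loader's code: the function name join_labels is hypothetical, and it assumes manifest ids may be written with or without a file extension, so both sides are reduced to the filename stem.

```python
import csv
import io
from pathlib import Path


def join_labels(image_paths: list[str], manifest_csv: str,
                id_field: str = "filename", label_field: str = "class") -> dict[str, str]:
    """Map each image path to its label by joining on the filename stem."""
    reader = csv.DictReader(io.StringIO(manifest_csv))
    missing_cols = {id_field, label_field} - set(reader.fieldnames or [])
    if missing_cols:  # catches column-name typos in the recipe
        raise ValueError(f"manifest is missing columns: {sorted(missing_cols)}")

    labels: dict[str, str] = {}
    for row in reader:
        key = Path(row[id_field]).stem  # assumption: ids may carry an extension
        if key in labels:
            raise ValueError(f"duplicate id in manifest: {key}")
        labels[key] = row[label_field]

    joined: dict[str, str] = {}
    for image_path in image_paths:
        stem = Path(image_path).stem
        if stem not in labels:
            raise ValueError(f"no label for image: {image_path}")
        joined[image_path] = labels[stem]
    return joined


manifest = "filename,class\nimg_001.png,cat\nimg_002.png,dog\n"
print(join_labels(["images/img_001.png", "images/img_002.png"], manifest))
```

Failing fast on all three error classes before any image is decoded is the reason such a check belongs at validate time rather than at materialize time.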

Pre-partitioned sources (Kaggle-style train/test)

Most third-party datasets ship pre-partitioned: a train/ directory authored by the publisher and a test/ directory intended to remain heldout from training. Declare each source's role with partition:

my-dataset/
  train/cat/, train/dog/, …            # ImageFolder layout per partition
  test/cat/,  test/dog/,  …

Input:
  sources:
    - name: train_data
      type: image_folder
      path: ./my-dataset/train
      partition: train
    - name: test_data
      type: image_folder
      path: ./my-dataset/test
      partition: test
Splits:
  ratios: { train: 0.85, val: 0.15 }
  applies_to: train                     # carve val from train; test stays heldout
  stratify_by: label
  seed: 7

The materialized instance contains three splits: train and val (sub-partitioned from the source's train directory) and test (passed through verbatim from the source's test directory). Omitting Splits (or writing Splits: {}) honors the source partitions as the final splits without sub-partitioning. Validator check 20 enforces consistency — every record's partition declaration is honored end-to-end.
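
To make the applies_to: train behavior concrete, here is a minimal sketch of a seeded, stratified sub-partition that leaves the test partition untouched. It is not DataRefinery's splitter: the function name stratified_split is hypothetical, records are assumed to be plain dicts with a label key, and the rounding rule for per-class counts is an assumption.

```python
import random
from collections import defaultdict


def stratified_split(records: list[dict], ratios: dict[str, float],
                     seed: int, key: str = "label") -> dict[str, list[dict]]:
    """Split records per class so each split keeps roughly the same class mix."""
    by_class: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        by_class[record[key]].append(record)

    rng = random.Random(seed)  # one seeded generator makes the split repeatable
    names = list(ratios)
    out: dict[str, list[dict]] = {name: [] for name in names}
    for label in sorted(by_class):  # sorted for a deterministic class order
        group = by_class[label]
        rng.shuffle(group)
        start, cumulative = 0, 0.0
        for i, name in enumerate(names):
            cumulative += ratios[name]
            # The last split takes the remainder so no record is dropped.
            end = len(group) if i == len(names) - 1 else round(cumulative * len(group))
            out[name].extend(group[start:end])
            start = end
    return out


train_records = [{"path": f"{c}_{i:03d}.png", "label": c} for c in ("cat", "dog") for i in range(20)]
test_records = [{"path": f"test_{i:03d}.png", "label": "cat"} for i in range(5)]

splits = stratified_split(train_records, {"train": 0.85, "val": 0.15}, seed=7)
splits["test"] = test_records  # passed through verbatim, never re-split
print({name: len(rows) for name, rows in splits.items()})
```

Only the records declared as partition: train enter the splitter, which is what keeps the publisher's test set held out.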

Unlabeled partitions (Kaggle-style test set with no labels)

The classic Kaggle shape ships a labeled training set together with an unlabeled heldout test partition. Declare the unlabeled source with type: image_flat (the heldout side has no class subdirectories) and unlabeled: true:

Input:
  sources:
    - name: train_data
      type: image_folder
      path: ./my-dataset/train
      partition: train
    - name: test_data
      type: image_flat                  # flat layout, no labels
      path: ./my-dataset/test
      partition: test
      unlabeled: true
Labels:
  field: label
  source: { kind: direct }              # labels exist for labeled partitions
Splits:
  ratios: { train: 0.85, val: 0.15 }
  applies_to: train                     # only sub-partition the labeled side
  stratify_by: label

Records loaded from test_data land without a label field. They flow through label-independent stages (resize, normalize) normally; label-dependent stages (stratify_by on an unlabeled partition, filter_by_label, label-reading featurizations) are rejected at validate time (check 21). report.md flags the unlabeled split with *(unlabeled)*; drift.json reports class_distribution: null with a "skipped: unlabeled" note. The materialized dataset/test.jsonl is ready for downstream inference — train a model on train+val, predict against test, and submit. (Inference itself is external to DataRefinery.)

Recipe anatomy

A recipe is a single YAML file. Field names match the section set used by the validator and runner; each operation declares the stages and splits it applies to so train-only behavior is explicit.

schema_version: 1
plugin: image_classification
seed: 0

Input:
  sources:
    - name: train
      type: image_folder
      path: my-images

Output:
  record_schema:
    image: { dtype: uint8, shape: [32, 32, 3] }
    label: { dtype: str }
    path:  { dtype: str }

Labels:
  field: label
  source: { kind: derived, derivation: parent_directory_name }

Splits:
  ratios: { train: 0.7, val: 0.15, test: 0.15 }
  seed: 11
  stratify_by: label

Transformations:
  - name: resize
    op: resize
    params: { size: 32, method: bilinear }
    splits: [train, val, test]
  - name: normalize
    op: normalize
    fit_source: train          # statistics fit on train, applied everywhere
    splits: [train, val, test]

Featurizations:
  - name: derive_label
    inputs: [path]
    output_field: label
    op: label_from_path
    params: { source: parent_directory_name }
    splits: [train, val, test]

Visualizations:
  - name: class_distribution
    op: class_distribution_histogram
    stage: post_pipeline
    mode: reporting
  - name: samples
    op: sample_grid
    params: { n: 16, per_class: true }
    stage: post_pipeline
    mode: reporting

variants:
  no_augment:
    Augmentations: []

Section roles at a glance:

Section	Purpose
`Input`	Declared raw sources (files, directories, parquet, etc.)
`Output`	Record schema the materialized dataset must satisfy
`Labels`	Where labels come from (column, derivation, sidecar)
`SampleData`	A small inlined sample for documentation/testing
`InputContracts`	Pre-pipeline assertions on raw input shape/values
`Filters`	Row-removal policy (e.g. class-imbalance via subsetting)
`Generation`	Synthetic record generation
`Splits`	Train/val/test ratios, stratification, seed
`Transformations`	Deterministic ops; `fit_source: train` persists stats
`Augmentations`	Stochastic, train-only by default
`Featurizations`	Derive new fields from existing ones
`OutputExpectations`	Post-pipeline assertions on materialized data
`Visualizations`	Exploration (on-demand) or reporting (persisted) views
`variants`	Named overlays on any section (experiment knobs)

Select a variant at materialize time with --variant no_augment. Variants change the canonical hash (and therefore the cache identity).

For a section-by-section walk-through — including fit-on-train discipline, variants, contracts/expectations, and the Filters-vs-Splits choice for class imbalance — see the Recipe authoring guide.
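
The fit-on-train discipline behind fit_source: train can be shown with a deliberately tiny sketch on scalar values. It is not the normalize op itself: the helper names are hypothetical, real image normalization works per channel, and JSON is used here only as a stand-in for whatever format the fitted statistics are persisted in.

```python
import json
import statistics


def fit_normalize(train_values: list[float]) -> dict[str, float]:
    """Fit mean and standard deviation on the training split only."""
    return {
        "mean": statistics.fmean(train_values),
        "std": statistics.pstdev(train_values),
    }


def apply_normalize(values: list[float], stats: dict[str, float]) -> list[float]:
    """Apply previously fitted statistics to any split, or at inference time."""
    std = stats["std"] or 1.0  # guard against a constant training column
    return [(v - stats["mean"]) / std for v in values]


train = [0.0, 2.0, 4.0]
val = [2.0, 6.0]

stats = fit_normalize(train)               # val never influences the statistics
persisted = json.loads(json.dumps(stats))  # round-trip as if written to disk

print(stats["mean"])                       # 2.0
print(apply_normalize(val, persisted))
```

Fitting on val or test as well would leak information from held-out data into the transformation, which is the skew the recipe is designed to rule out.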

CLI verbs

datarefinery --help

Verb	Purpose	FR
`check`	Report environment soundness (Python, deps, plugins discovered).	FR-18
`init`	Scaffold a starter recipe deterministically from raw inputs.	FR-17
`validate`	Schema + 22 enumerated static logical checks.	FR-2
`materialize`	Run the pipeline end-to-end against the recipe's inputs.	FR-3
`status`	Summarize a materialized instance or resolve a recipe to one.	FR-19
`report`	Re-render `report.md`, `drift.json`, and reporting visualizations.	FR-15
`inspect`	Read-only views of a materialized instance.	FR-20
`clean`	Remove cached instances and orphan temp directories.	FR-21

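Global flags. The first five set execution context only and never alter pipeline semantics; --variant and --seed are the exceptions, since both change what is built and therefore the cache identity: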

Flag	Env var	Effect
`--cache-root`	`DATAREFINERY_CACHE_ROOT`	Root directory for the cache.
`--log-level`	`DATAREFINERY_LOG_LEVEL`	Operational log level.
`--log-target`	`DATAREFINERY_LOG_TARGET`	Log routing target (reserved).
`--plugin-path`	`DATAREFINERY_PLUGIN_PATH`	Extra plugin discovery paths.
`--workers`	`DATAREFINERY_WORKERS`	Process-pool worker count.
`--variant`	—	Recipe variant to apply before canonicalization.
`--seed`	—	Override the recipe seed (changes cache identity).
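
The flag and environment-variable pairs in the table suggest a simple resolution order. The sketch below assumes that an explicit flag wins over the environment variable, which the table does not state, and it models only two of the options. The parser is a stand-in built with argparse, not the real datarefinery CLI.

```python
import argparse
from collections.abc import Mapping


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser covering two of the execution-context flags."""
    parser = argparse.ArgumentParser(prog="datarefinery-sketch")
    parser.add_argument("--cache-root", default=None)
    parser.add_argument("--workers", type=int, default=None)
    return parser


def resolve(argv: list[str], environ: Mapping[str, str]) -> dict:
    """Resolve settings with the assumed precedence: flag, then env var."""
    args = build_parser().parse_args(argv)
    cache_root = args.cache_root or environ.get("DATAREFINERY_CACHE_ROOT")
    workers = args.workers
    if workers is None and "DATAREFINERY_WORKERS" in environ:
        workers = int(environ["DATAREFINERY_WORKERS"])
    return {"cache_root": cache_root, "workers": workers}


env = {"DATAREFINERY_CACHE_ROOT": "./env-cache", "DATAREFINERY_WORKERS": "4"}
print(resolve([], env))
print(resolve(["--cache-root", "./flag-cache", "--workers", "2"], env))
```

Passing the environment in as a mapping rather than reading os.environ directly keeps the resolution logic easy to test.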

Plugin model

A plugin contributes the operations that make sense for one data category. Plugins register through the datarefinery.plugins entry-point group and are discovered automatically.

v1 ships:

- image_classification: first-class, full operation set (resize, normalize, class-distribution and sample-grid visualizations, label-from-path featurization, etc.).
- tabular: stub plugin exercising the abstractions; no working ops.
- text: stub plugin exercising the abstractions; no working ops.

A plugin declares, for each operation, an OperationSpec covering parameter schema, fit_on_train flag, applicable splits, and applicable recipe sections. The validator cross-checks recipe operations against these specs (FR-2 check 18). See src/datarefinery/plugins/base.py for the protocol, and the plugin authoring guide for a walk-through of writing your own.

Library API

Library and CLI are co-equal surfaces driven by the same recipe.

from pathlib import Path

from datarefinery import DataRefinery, materialize
from datarefinery.core.config import RuntimeConfig

config = RuntimeConfig(cache_root=Path("./cache"))

# High-level: one-shot materialize a recipe path.
instance = materialize("recipe.yaml", config=config)
print(instance.manifest.record_counts)

# Lower-level: load once, then call verbs against the loaded recipe.
dr = DataRefinery.from_recipe("recipe.yaml", config=config)
report = dr.validate()
instance = dr.materialize()

v1 scope and non-goals

In scope for v1:

- Recipe-driven pipeline with the section set above; explicit per-operation stage/split applicability.
- Schema-versioned YAML recipes; load-time refusal of unknown versions; documented migration path between versions.
- Materialized instance = recipe + dataset + fitted statistics + report. No statistical artifacts persisted outside the report.
- Semantic cache identity (canonical recipe hash ⊕ raw-input hash ⊕ seed). Whitespace/key-order edits do not trigger rebuilds.
- Atomic temp-then-promote materialization; no partial instances in cache.
- Named variants within a recipe; experiment knobs are variants, not separate recipes.
- Image plugin scoped to classification; tabular and text stubs.
- Deterministic init scaffolder; optional LLM enhancement layer via lmentry as an extra.
- Stable drift-relevant report subsection for downstream tooling.

Non-goals for v1:

- Image tasks beyond classification (detection, segmentation).
- Model framework abstraction, training, evaluation, inference.
- Production streaming and drift-detection logic.
- Persisted statistical artifacts beyond the report (no sidecar pickles, no separate stats files).
- Recipe inheritance or multi-file recipe composition (variants suffice).
- Resume-from-stage during materialization.
- Hard LLM dependency (DataRefinery must work fully offline).
- Tabular and text plugin implementations (stubs only).
- Hard performance targets (reactive performance work only).
- init for non-image categories.

For the full requirements, see docs/specs/features.md. For implementation details, see docs/specs/tech-spec.md.

License

Licensed under the Apache License, Version 2.0.

Release history

0.16.0

May 23, 2026

This version

0.15.0

May 23, 2026

0.14.1

May 23, 2026

0.14.0

May 23, 2026

0.9.4

May 13, 2026

Download files

Source Distribution

ml_datarefinery-0.15.0.tar.gz (2.2 MB)

Uploaded May 23, 2026 Source

Built Distribution

ml_datarefinery-0.15.0-py3-none-any.whl (2.2 MB)

Uploaded May 23, 2026 Python 3

File details

Details for the file ml_datarefinery-0.15.0.tar.gz.

File metadata

Download URL: ml_datarefinery-0.15.0.tar.gz
Upload date: May 23, 2026
Size: 2.2 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ml_datarefinery-0.15.0.tar.gz
Algorithm	Hash digest
SHA256	`805dee265d8c795f6ff2ca7cac98160925b12ae3cd7e3c95dd765a3c0ff57adb`
MD5	`f7192f8f2da86850008f29ea946a07b7`
BLAKE2b-256	`1486cf5b8d1fe35624c166204b6515cd4ba90289f32b535ac7272c6c1edde78c`

Provenance

The following attestation bundles were made for ml_datarefinery-0.15.0.tar.gz:

Publisher: publish.yml on pointmatic/datarefinery

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ml_datarefinery-0.15.0.tar.gz
- Subject digest: 805dee265d8c795f6ff2ca7cac98160925b12ae3cd7e3c95dd765a3c0ff57adb
- Sigstore transparency entry: 1614523206
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: pointmatic/datarefinery@9f54399cad2014c5efcbd9583c1ed135b5f52e16
- Branch / Tag: refs/tags/v0.15.0
- Owner: https://github.com/pointmatic
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9f54399cad2014c5efcbd9583c1ed135b5f52e16
- Trigger Event: push

File details

Details for the file ml_datarefinery-0.15.0-py3-none-any.whl.

File metadata

Download URL: ml_datarefinery-0.15.0-py3-none-any.whl
Upload date: May 23, 2026
Size: 2.2 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for ml_datarefinery-0.15.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`be5825ae38612e6ae6d3a0bc4a300a5e5ed6c5230ce1aa64bdd725fd6959f43f`
MD5	`a58cb2d33110c8c0901551d91b483d0c`
BLAKE2b-256	`f2160953763a163edd6a1158b0bdee35dd7a2ae021cf051e5cbfc7c92cb9a458`

Provenance

The following attestation bundles were made for ml_datarefinery-0.15.0-py3-none-any.whl:

Publisher: publish.yml on pointmatic/datarefinery

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: ml_datarefinery-0.15.0-py3-none-any.whl
- Subject digest: be5825ae38612e6ae6d3a0bc4a300a5e5ed6c5230ce1aa64bdd725fd6959f43f
- Sigstore transparency entry: 1614523297
- Sigstore integration time: May 23, 2026
Source repository:
- Permalink: pointmatic/datarefinery@9f54399cad2014c5efcbd9583c1ed135b5f52e16
- Branch / Tag: refs/tags/v0.15.0
- Owner: https://github.com/pointmatic
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9f54399cad2014c5efcbd9583c1ed135b5f52e16
- Trigger Event: push

ml-datarefinery 0.15.0
