Simple data versioning

Rair - Research Archival & Integrity Recorder

Rair is a CLI tool for simple data versioning.

When running experiments, model parameters and model details need to be tweaked frequently. Committing every minimal change clutters the git history and makes it hard to track actual program modifications. Rair lets you link computation results to exact code versions without committing all changes to git. To do so, it stores code diffs alongside the git commit reference. It also tracks input and intermediate data to guarantee full reproducibility for every run.

Thanks to built-in heuristics, Rair works in many scenarios without any manual configuration.

Usage example

Suppose you have a simple script mymodel.py committed to git, looking like this:

import time

print('Test text output to stdio')
p1 = 5.9
p2 = 9.5

with open("test_result.txt", "w") as f:
    current_time = time.strftime("%Y-%m-%d %H:%M:%S")
    f.write(f"Current time: {current_time}\n")
    f.write(f"result = {p1 + p2}\n")

You tune some parameters (here p1 and p2) without committing the changes to git.

Now you run your script like this:

rair mymodel.py

Rair runs the script and creates an archive with the following structure:

rairarchive/
├── data/
│   └── 9157ce88256e95668977_test_result.txt  # deduplicated data file
└── runs/
    └── 20260603-001-023aa51f/
        ├── info.md          # human-readable run overview
        ├── run.json         # machine-readable run metadata
        ├── out.txt          # captured stdout/stderr
        ├── git_diff.patch   # uncommitted changes (patch format)
        └── test_result.txt  # output file (hardlink)
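
The layout above pairs a content-addressed data/ directory with per-run hardlinks. The following is a minimal sketch of how such a scheme can work, not Rair's actual implementation: the function names, the use of SHA-256, and the 20-character prefix in the stored file name are assumptions made for illustration.

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def archive_output(src: Path, data_dir: Path, run_dir: Path) -> Path:
    """Store src once under data_dir and hardlink it into run_dir."""
    data_dir.mkdir(parents=True, exist_ok=True)
    run_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: digest prefix plus the original file name.
    stored = data_dir / f"{file_digest(src)[:20]}_{src.name}"
    if not stored.exists():
        # Identical content maps to the same name, so it is copied only once.
        shutil.copy2(src, stored)
    link = run_dir / src.name
    if link.exists():
        link.unlink()
    # A hardlink is a second directory entry for the same data on disk.
    os.link(stored, link)
    return stored
```

Archiving the same unchanged file for two runs yields one file in data/ and one hardlink in each run folder, so repeated runs do not multiply disk usage.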

The file info.md gives an overview of the run:

# Run Information
- Start time: 2026-06-03 11:53:09
- Execution time: 0.185 s
- Command: `python mymodel.py`
- Run hash: `023aa51fa4981ebe097f2045947d2108cff014c42332d5f6ef5a9d71cbf5273b`

## Git Information
- Commit: `95aa8c491f8a3e5c44890ea3c6616e123692c4cd`
- Short git hash: `95aa8c4`
- Branch: `main`
- Tracking URL: `no-upstream`

## Uncommitted Changes
in mymodel.py:
p1 = 7.1
p2 = 3.3

## Output Files
- `test_result.txt` -> `rairarchive/data/9157ce88256e95668977_test_result.txt` (hash: `9157ce88`)

The "Run hash" captures the git hash, code diff, command line parameters and input file content.
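
As a rough illustration of how a single hash can cover these four ingredients, here is a minimal sketch. The function name, the JSON serialization, and the choice of SHA-256 are assumptions; Rair's actual hashing scheme may differ.

```python
import hashlib
import json


def run_hash(commit: str, diff: str, command: list[str],
             input_digests: dict[str, str]) -> str:
    """Combine everything that defines a run into one SHA-256 digest."""
    payload = json.dumps(
        {
            "commit": commit,          # git commit hash
            "diff": diff,              # uncommitted changes as patch text
            "command": command,        # command line, including arguments
            "inputs": input_digests,   # input file name -> content hash
        },
        sort_keys=True,                # stable key order gives a stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs share a hash only if the commit, diff, command line, and input contents are all identical, so changing a single parameter in the working tree produces a different run hash.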

Install

Rair can be installed with pip. It is tested on Windows and Unix:

pip install rair

Features

  • Auto-discovery: Automatically discover input/output files using hash-based change detection for outputs
  • Hash caching: Cache hash calculations in .rair_cache/ for fast operation on large data files
  • Git diff tracking: Track uncommitted changes alongside git commits
  • Archive format: Human-readable markdown and machine-readable JSON
  • Flexible configuration: Configure via CLI flags, a .rair.toml config file, or pyproject.toml
  • Output capture: Captures stdout/stderr to a file
  • Deduplication: Avoids storing duplicate data files by using content hashes
  • Selective tracking: Use --no-auto-discover to require explicit --input/--output
  • Output hardlinks: --output-files-in-run creates hardlinks for easy access
  • Default command: Configure a default command to run when no script is specified
  • Hierarchical config: Local configs override project settings
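
Hash-based change detection for outputs, as mentioned in the auto-discovery feature, can be sketched as follows: hash every file before the run, hash again afterwards, and treat new or changed files as outputs. This is an illustrative sketch only; the function names are hypothetical and Rair's own heuristics are not documented here.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file below root (relative path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return files that are new or whose content hash differs after the run."""
    return sorted(
        name for name, digest in after.items() if before.get(name) != digest
    )
```

Taking one snapshot before the command runs and one after, then passing both to changed_files, yields the candidate output files. For large data files, caching digests (as the hash caching feature does) avoids rehashing unchanged files.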

Running Rair

# Run a Python script with automatic tracking of all data files
# in the project directory
rair myscript.py

# Run a Python script with script arguments
rair myscript.py arg1 arg2

# Run with explicit command
rair python3 mymodel.py arg1 arg2

# The first argument can be a Python script or any arbitrary command
rair make --all

# Manually specify which files to track
rair --input "data/*.csv" --output "results/*.json" myscript.py

# If only input files are specified, outputs are auto-discovered
rair --input "data/*.csv" --input parameters.txt myscript.py

# Selective tracking - specify exactly which files to track
# Use --no-auto-discover to require explicit --input and --output
rair --no-auto-discover --input "data/*.csv" --output "results/*.json" myscript.py

# Run the default command set in config file and add
# a comment that is stored with the results
rair --comment "experiment 1"

All CLI flags

--config FILE              Path to config file
--input TEXT               Glob pattern for input files to track
--output TEXT              Glob pattern for output files to track
--exclude TEXT             Glob pattern to exclude from tracking
--archive-dir DIRECTORY    Directory for archive data (default: rairarchive)
--autodata DIRECTORY       Directory for auto-discovering input/output files
--capture-output/--no-capture-output
                            Capture stdout/stderr to out.txt [default: enabled]
--auto-discover/--no-auto-discover
                            Enable/disable auto-discovery [default: enabled]
--output-files-in-run      Create hardlinks to output files in run folder
--comment TEXT             Add a comment to the run
--setup                    Run interactive setup dialog
--help                     Show help message

Configuration

As an alternative to CLI parameters, configuration can be provided via a .rair.toml file or in pyproject.toml under [tool.rair]:

.rair.toml:

archive_dir = "rairarchive"
input_glob = ["data/*.csv", "cache/*.pkl"]
output_glob = ["results/*.json", "logs/*.txt"]
exclude_glob = ["data/temp/*"]
autodata_dir = "./data"
capture_output = true
auto_discover = true          # Enable auto-discovery (default)
output_files_in_run = false   # Create hardlinks to outputs in run folder
default_command = "make"      # Default command when no script specified

pyproject.toml:

[tool.rair]
archive_dir = "rairarchive"
input = ["data/*.csv", "cache/*.pkl"]
output = ["results/*.json", "logs/*.txt"]
exclude = ["data/temp/*"]
autodata_dir = "./data"
capture_output = true
auto_discover = true          # Enable auto-discovery
output_files_in_run = false   # Create hardlinks to outputs in run folder
default_command = "make"      # Default command when no script specified

Hierarchical Configuration

You can have different configurations for different directories:

  • A .rair.toml in the current directory overrides project-level config
  • Use rair --setup in subdirectories to create local configs
  • Run rair --setup and choose "(c)urrent directory" or "(p)roject"

Example directory structure:

project/
├── .rair.toml          # Project config
└── experiments/
    ├── .rair.toml      # Overrides project config
    └── train.py

Developer Guide

Feedback and contributions are welcome - please open an issue or submit a pull request on GitHub.

To get started with development, first clone the repository:

git clone https://github.com/DLR-Institute-of-Future-Fuels/rair.git
cd rair

You may set up a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: `.venv\Scripts\activate`

Build and install the package and dev dependencies:

pip install -e ".[dev]"

Run the tests:

pytest

License

This project is licensed under the MIT license - see the LICENSE file for details.
