Simple data versioning
Rair - Research Archival & Integrity Recorder
Rair is a CLI tool for simple data versioning.
When running experiments, model parameters and model details need to be tweaked frequently. Committing every minimal change clutters the git history and makes it hard to track actual program modifications. Rair links computation results to exact code versions without requiring you to commit all changes to git. To do so, it stores code diffs alongside the git commit reference. It also tracks input and intermediate data to guarantee full reproducibility for every run.
Using heuristics, Rair can be used in many scenarios without any manual configuration.
Usage example
Suppose you have a simple script mymodel.py committed to git, looking like this:
import time
print('Test text output to stdio')
p1 = 5.9
p2 = 9.5
with open("test_result.txt", "w") as f:
    current_time = time.strftime("%Y-%m-%d %H:%M:%S")
    f.write(f"Current time: {current_time}\n")
    f.write(f"result = {p1 + p2}\n")
You tune some parameters (here p1 and p2) without committing the changes to git.
Now you run your script like this:
rair mymodel.py
Rair runs the script and creates an archive with the following structure:
rairarchive/
├── data/
│   └── 9157ce88256e95668977_test_result.txt   # deduplicated data file
└── runs/
    └── 20260603-001-023aa51f/
        ├── info.md           # human-readable run overview
        ├── run.json          # machine-readable run metadata
        ├── out.txt           # captured stdout/stderr
        ├── git_diff.patch    # uncommitted changes (patch format)
        └── test_result.txt   # output file (hardlink)
The file info.md gives an overview of the run:
# Run Information
- Start time: 2026-06-03 11:53:09
- Execution time: 0.185 s
- Command: `python mymodel.py`
- Run hash: `023aa51fa4981ebe097f2045947d2108cff014c42332d5f6ef5a9d71cbf5273b`
## Git Information
- Commit: `95aa8c491f8a3e5c44890ea3c6616e123692c4cd`
- Short git hash: `95aa8c4`
- Branch: `main`
- Tracking URL: `no-upstream`
## Uncommitted Changes
in mymodel.py:
p1 = 7.1
p2 = 3.3
## Output Files
- `test_result.txt` -> `rairarchive/data/9157ce88256e95668977_test_result.txt` (hash: `9157ce88`)
The "Run hash" captures the git commit hash, the code diff, the command-line parameters and the content of the input files.
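To make the idea of such a combined hash concrete, here is a minimal sketch. It is not Rair's actual algorithm: the function name run_hash, the use of SHA-256 and the way the fields are combined are all assumptions chosen for illustration.

```python
import hashlib


def run_hash(commit: str, diff: str, command: list[str],
             input_files: dict[str, bytes]) -> str:
    """Combine commit, diff, command and input contents into one hex digest.

    Hypothetical helper for illustration; Rair may combine these differently.
    """
    h = hashlib.sha256()

    def feed(data: bytes) -> None:
        # Length-prefix every field so adjacent fields cannot run together.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)

    feed(commit.encode())
    feed(diff.encode())
    for arg in command:
        feed(arg.encode())
    # Sort by path so the result does not depend on dictionary order.
    for path in sorted(input_files):
        feed(path.encode())
        feed(hashlib.sha256(input_files[path]).digest())
    return h.hexdigest()
```

With a scheme like this, two runs share a hash only if the commit, the uncommitted diff, the command and every input file are identical, so changing a single tuned parameter yields a different hash.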
Install
Rair can be installed with pip. It is tested on Windows and Unix:
pip install rair
Features
- Auto-discovery: Automatically discover input/output files using hash-based change detection for outputs
- Hash caching: Cache hash calculations in .rair_cache/ for fast operation on large data files
- Git diff tracking: Track uncommitted changes alongside git commits
- Archive format: Human-readable markdown and machine-readable JSON
- Flexible configuration: Configure via CLI, or config file .rair.toml, or pyproject.toml
- Output capture: Captures stdout/stderr to a file
- Deduplication: Avoids storing duplicate data files by using content hashes
- Selective tracking: Use --no-auto-discover to require explicit --input/--output
- Output hardlinks: --output-files-in-run creates hardlinks for easy access
- Default command: Configure a default command to run when no script specified
- Hierarchical config: Local configs override project settings
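The deduplication and hardlink features can be pictured with a small content-addressed store. The sketch below is an assumption about how such a store could work, not Rair's implementation: the helper store_deduplicated is hypothetical, SHA-256 is an assumed hash function, and the 20-character prefix merely mirrors the file name shown in the usage example.

```python
import hashlib
import os
import shutil
from pathlib import Path


def store_deduplicated(src: Path, data_dir: Path, run_dir: Path) -> Path:
    """Store src once under a content-derived name and hardlink it into run_dir."""
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    data_dir.mkdir(parents=True, exist_ok=True)
    run_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: 20-character hash prefix plus the original name.
    stored = data_dir / f"{digest[:20]}_{src.name}"
    if not stored.exists():
        # Identical content maps to the same name, so it is copied only once.
        shutil.copy2(src, stored)
    link = run_dir / src.name
    if not link.exists():
        # A hardlink is a second name for the same data on disk.
        os.link(stored, link)
    return stored
```

Hardlinks only work within a single filesystem, which is one reason to keep the data and run folders under the same archive directory.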
Running Rair
# Run a Python script with automatic tracking of all data files
# in the project directory
rair myscript.py
# Run a Python script with script arguments
rair myscript.py arg1 arg2
# Run with explicit command
rair python3 mymodel.py arg1 arg2
# The first argument can be a Python script or any arbitrary command
rair make --all
# Manually specify which files to track
rair --input "data/*.csv" --output "results/*.json" myscript.py
# If only input files are specified, outputs are auto-discovered
rair --input "data/*.csv" --input parameters.txt myscript.py
# Selective tracking - specify exactly which files to track
# Use --no-auto-discover to require explicit --input and --output
rair --no-auto-discover --input "data/*.csv" --output "results/*.json" myscript.py
# Run the default command set in config file and add
# a comment that is stored with the results
rair --comment "experiment 1"
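Auto-discovery of outputs relies on hash-based change detection. One plausible way to do this is to hash all files before and after the command runs and to treat new or changed files as outputs. The following sketch illustrates that idea only; the function names snapshot and changed_files are hypothetical, SHA-256 is an assumed hash function, and Rair's real heuristics may differ.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file below root (relative path) to the SHA-256 of its content."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return files that are new or whose content hash differs."""
    return sorted(path for path, digest in after.items()
                  if before.get(path) != digest)
```

Taking one snapshot before the command and one after it, then passing both to changed_files, yields the candidate output files. Files whose hash is unchanged are left out, so untouched inputs are not mistaken for outputs.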
All CLI flags
--config FILE Path to config file
--input TEXT Glob pattern for input files to track
--output TEXT Glob pattern for output files to track
--exclude TEXT Glob pattern to exclude from tracking
--archive-dir DIRECTORY Directory for archive data (default: rairarchive)
--autodata DIRECTORY Directory for auto-discovering input/output files
--capture-output/--no-capture-output
    Capture stdout to out.txt [default: enabled]
--auto-discover/--no-auto-discover
    Enable/disable auto-discovery [default: enabled]
--output-files-in-run Create hardlinks to output files in run folder
--comment TEXT Add a comment to the run
--setup Run interactive setup dialog
--help Show help message
Configuration
As an alternative to CLI parameters, configuration can be provided via a .rair.toml file or in pyproject.toml under [tool.rair]:
.rair.toml:
archive_dir = "rairarchive"
input_glob = ["data/*.csv", "cache/*.pkl"]
output_glob = ["results/*.json", "logs/*.txt"]
exclude_glob = ["data/temp/*"]
autodata_dir = "./data"
capture_output = true
auto_discover = true # Enable auto-discovery (default)
output_files_in_run = false # Create hardlinks to outputs in run folder
default_command = "make" # Default command when no script specified
pyproject.toml:
[tool.rair]
archive_dir = "rairarchive"
input = ["data/*.csv", "cache/*.pkl"]
output = ["results/*.json", "logs/*.txt"]
exclude = ["data/temp/*"]
autodata_dir = "./data"
capture_output = true
auto_discover = true # Enable auto-discovery
output_files_in_run = false # Create hardlinks to outputs in run folder
default_command = "make" # Default command when no script specified
Hierarchical Configuration
You can have different configurations for different directories:
- A .rair.toml in the current directory overrides project-level config
- Use rair --setup in subdirectories to create local configs
- Run rair --setup and choose "(c)urrent directory" or "(p)roject"
Example directory structure:
project/
├── .rair.toml          # Project config
└── experiments/
    ├── .rair.toml      # Overrides project config
    └── train.py
Developer Guide
Feedback and contributions are welcome - please open an issue or submit a pull request on GitHub.
To get started with development, first clone the repository:
git clone https://github.com/DLR-Institute-of-Future-Fuels/rair.git
cd rair
You may set up a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: `.venv\Scripts\activate`
Build and install the package and dev dependencies:
pip install -e ".[dev]"
Run the tests:
pytest
License
This project is licensed under the MIT license - see the LICENSE file for details.