Skip to main content

Universal scientific data I/O with plugin registry

Project description

SciTeX

scitex-io

Universal scientific data I/O with plugin registry

Full Documentation · pip install scitex-io

PyPI Python Tests Install Test Coverage Docs License: AGPL v3


Problem and Solution

# Problem Solution
1 Format zoo -- save/load scattered across pd.read_csv, np.load, pickle, json, h5py, torch.save, cv2.imread, etc. Every format = a different API One call -- stx.io.save(obj, "x.ext") / stx.io.load("x.ext") routes by extension across 30+ formats; plugin registry lets users register custom handlers
2 FileNotFoundError after save -- save() auto-routes to {script}_out/ but load() resolves cwd-relative, so the naive round-trip breaks for new users Predictable paths -- symlink_from_cwd=True flag, CONFIG.SDIR_RUN session path, or absolute path on both sides — documented prominently in the skill
3 Figure + data diverge -- save a PNG; the underlying DataFrame lives in a separate .csv that goes out of sync Auto-CSV export -- stx.io.save(fig, "plot.png") writes plot.png + plot.csv + plot.yaml (figrecipe recipe) atomically, hash-tracked by Clew

Problem

Three problems recur in every scientific Python project:

  1. Format fragmentation. Loading a CSV requires pandas.read_csv(), an HDF5 file requires h5py.File(), a NumPy array requires numpy.load(). Each format demands its own library, its own API, and its own boilerplate. Operating systems solved this decades ago — double-click any file and the OS dispatches to the right application. Python has no equivalent.

  2. Hard-coded parameters scattered across scripts. Sample rates, thresholds, model hyperparameters, plot dimensions — magic numbers buried in code, duplicated across files, impossible to track or share. Changing one parameter means grepping through the entire project.

  3. Figures without provenance. A saved PNG has no record of the code, parameters, or session that produced it. Months later, reproducing a figure means reverse-engineering which script with which settings generated it.

Solution

scitex-io addresses all three:

  • save()/load() — One interface for 30+ formats with automatic extension-based dispatch. A plugin registry lets you add custom formats without modifying the library.
  • load_configs() — Loads all YAML files from a config/ directory into a single DotDict with dot-notation access. Parameters are version-controlled, centralized, and separate from code.
  • embed_metadata()/read_metadata() — Embeds provenance (timestamps, session IDs, parameters) directly into image and PDF files. The figure carries its own history.
Supported Formats (30+)
Category Extensions
Spreadsheet .csv, .tsv, .xlsx, .xls, .xlsm, .xlsb
Scientific .npy, .npz, .mat, .hdf5, .h5, .zarr
Serialization .pkl, .pickle, .pkl.gz, .joblib
ML/DL .pth, .pt, .cbm
Config .json, .yaml, .yml, .xml
Database .db (SQLite3)
Documents .txt, .md, .pdf, .docx, .tex, .log
Code .py, .sh, .css, .js
Images .png, .jpg, .jpeg, .gif, .tiff, .tif, .svg
Media .mp4
Web .html
Bibliography .bib
EEG .vhdr, .vmrk, .edf, .bdf, .gdf, .cnt, .egi, .eeg, .set, .con

Installation

Requires Python >= 3.9.

pip install scitex-io

For MCP server support:

pip install scitex-io[mcp]

Architecture

scitex_io/
├── _save.py / _save_modules/      ← extension-based dispatcher + per-format savers
├── _loading/ / _load_modules/     ← unified load() + per-format loaders
├── _registry.py                   ← plugin registry (register_saver / register_loader)
├── _cache.py / _flush.py / _reload.py  ← path+mtime cache
├── _glob.py                       ← natural-sorted glob, brace expansion, parse_glob
├── _path.py                       ← auto-routing path resolver (script_out/, etc.)
├── _metadata.py / _metadata_modules/   ← embed_metadata / read_metadata (PNG/JPEG/SVG/PDF)
├── _builtin_handlers.py           ← bundled handlers wired to the registry
├── _image_csv_handler.py          ← figure → png + csv + yaml triplet
├── _linter_plugin.py              ← STX-IO001..007 rule plugins for scitex-linter
├── _cli/                          ← `scitex-io` Click CLI
├── _mcp/                          ← MCP server tools
└── _skills/                       ← agent-facing skill pages

save() and load() route by file extension through _registry; every format is a small handler module discovered eagerly at import. The registry is the extension point — see "Custom Format Registration" below.

Quickstart

Save and Load

from scitex_io import save, load

# Universal save/load — format auto-detected from extension
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
save(df, "data.csv")
loaded = load("data.csv")

# 30+ formats work the same way
import numpy as np
save(np.array([1, 2, 3]), "data.npy")
save({"key": "value"}, "config.yaml")
save({"nested": [1, 2]}, "data.json")

Project Configuration

Hard-coded parameters belong in config files, not in code. Use UPPER_CASE keys — Python's convention for constants — to signal that these are user-defined values:

project/
  config/
    PATHS.yaml          # DATA_DIR: /data/experiment_01
    PREPROCESS.yaml     # SAMPLE_RATE: 1000, BANDPASS: [0.5, 40]
    MODEL.yaml          # HIDDEN_DIM: 256, DROPOUT: 0.3
    PLOT.yaml           # FIGSIZE: [180, 60], DPI: 300
    IS_DEBUG.yaml       # IS_DEBUG: true
from scitex_io import load_configs

CONFIG = load_configs()          # loads ./config/*.yaml
CONFIG.PATHS.DATA_DIR            # "/data/experiment_01"
CONFIG.PREPROCESS.SAMPLE_RATE    # 1000
CONFIG.MODEL.HIDDEN_DIM          # 256

# Debug mode: DEBUG_ prefixed keys override their counterparts
# In MODEL.yaml: { HIDDEN_DIM: 256, DEBUG_HIDDEN_DIM: 32 }
CONFIG = load_configs(IS_DEBUG=True)
CONFIG.MODEL.HIDDEN_DIM          # 32 (debug value promoted)

Returns a DotDict — a nested dictionary with dot-notation access. Parameters become version-controlled, shareable, and separate from code.

Metadata Embedding

Embed provenance into figures so they carry their own history:

from scitex_io import embed_metadata, read_metadata, has_metadata

# Embed metadata into an image
embed_metadata("figure.png", {
    "experiment": "exp_042",
    "model": "resnet50",
    "accuracy": 0.94,
    "timestamp": "2026-03-11",
})

# Read it back — months later, from the file alone
meta = read_metadata("figure.png")
print(meta["experiment"])    # "exp_042"

# Check if a file has embedded metadata
has_metadata("figure.png")   # True

Supports PNG (tEXt chunks), JPEG (EXIF), SVG (XML metadata), and PDF (XMP metadata).

Advanced Save Features

save() auto-routes relative paths based on execution context and supports symlinks and dry runs:

from scitex_io import save

# Auto path routing — relative paths resolve based on context:
#   Script analysis.py  → analysis_out/results.csv
#   Notebook exp.ipynb  → exp_out/results.csv
#   Interactive/IPython → /tmp/{USER}/results.csv
#   Absolute paths      → used as-is
save(df, "results.csv")

# Create symlink from cwd to the auto-routed save location
save(df, "results.csv", symlink_from_cwd=True)

# Create symlink at a specific path
save(fig, "fig1.png", symlink_to="/data/latest/fig1.png")

# Skip auto CSV export for image saves
save(fig, "plot.png", no_csv=True)

# use_caller_path=True — resolve path from the calling script,
# not the immediate caller. Essential when save() is wrapped by a library.
save(df, "results.csv", use_caller_path=True)

# Dry run — print resolved path without writing
save(df, "results.csv", dry_run=True)

Glob and Caching

from scitex_io import glob, parse_glob, load

# Natural-sorted file matching (1, 2, 10 — not 1, 10, 2)
paths = glob("data/**/*.csv")
paths = glob("results/{exp1,exp2}/*.npy")  # brace expansion

# Parse named placeholders from paths
paths, parsed = parse_glob("sub_{id}/ses_{session}/*.vhdr")
# parsed = [{'id': '001', 'session': 'pre'}, ...]

# Glob patterns work directly in load()
dfs = load("results/*.csv")  # → list of DataFrames

# Caching is automatic (by path + mtime)
data = load("large.hdf5")        # disk read
data = load("large.hdf5")        # cache hit (instant)
Custom Format Registration
from scitex_io import register_saver, register_loader, save, load

@register_saver(".custom")
def save_custom(obj, path, **kwargs):
    with open(path, "w") as f:
        f.write(str(obj))

@register_loader(".custom")
def load_custom(path, **kwargs):
    with open(path) as f:
        return f.read()

save("hello", "data.custom")
assert load("data.custom") == "hello"

Four Interfaces

Python API
from scitex_io import save, load, list_formats, register_saver, register_loader
from scitex_io import load_configs, DotDict
from scitex_io import embed_metadata, read_metadata, has_metadata

save(obj, "path.ext")        # Save any object
data = load("path.ext")      # Load any file
fmts = list_formats()        # Show all registered formats
cfg  = load_configs()        # Load ./config/*.yaml as DotDict
embed_metadata("fig.png", d) # Embed provenance into figure

Full API reference

CLI Commands
scitex-io --help-recursive          # Show all commands
scitex-io info                      # Show registered formats
scitex-io configs                   # Load and display project configs
scitex-io configs -d ./my_configs   # Custom config directory
scitex-io configs --json            # Output as JSON
scitex-io list-python-apis -vv      # List Python APIs with signatures
scitex-io --version                 # Show version
scitex-io mcp start                 # Start MCP server
scitex-io mcp doctor                # Check MCP health
scitex-io mcp list-tools -vv        # List MCP tools with parameters

Full CLI reference

MCP Server — for AI Agents

AI agents can save, load, and discover formats autonomously.

Tool Description
io_list_formats List all registered save/load formats
io_load Load data from any supported format
io_save Save data to any supported format
io_load_configs Load YAML project configurations
io_register_info Show how to register custom formats
scitex-io mcp start

Full MCP specification

Skills — for AI Agent Discovery

Skills provide structured documentation that AI agents can query to discover package capabilities, API signatures, and usage patterns.

scitex-io skills list              # List available skill pages
scitex-io skills get save-and-load # Get detailed save/load documentation
scitex-io skills get glob          # Get glob/parse_glob patterns
scitex-io skills get supported-formats  # Get all format tables
Skill Content
save-and-load Core API, path routing, symlinks, use_caller_path
centralized-config load_configs(), DotDict, DEBUG_ override
metadata-embedding Provenance in PNG/JPEG/SVG/PDF
cache Load caching, reload, flush
glob Pattern matching with natural sort and parsing
linting-rules STX-IO001–007 lint rules
supported-formats All 30+ format tables
path-resolution Auto save-path routing, scitex.path utilities

Also available via MCP: io_skills_list() / io_skills_get(name).

Demo

flowchart LR
    A["scitex_io.save(obj, 'x.ext')"] --> R["registry: lookup(.ext)"]
    R --> H["per-format saver"]
    H --> F["file on disk + sidecar (csv/yaml)"]
    L["scitex_io.load('x.ext')"] --> R2["registry: lookup(.ext)"]
    R2 --> H2["per-format loader"]
    H2 --> O["Python object<br/>(DataFrame, ndarray, ...)"]
    G["glob('data/**/*.csv')"] --> N["natural-sorted paths"]
    N --> L
>>> import pandas as pd, scitex_io as sio
>>> df = pd.DataFrame({"x": [1, 2, 3]})
>>> sio.save(df, "out.csv")          # routes by extension
>>> sio.load("out.csv").equals(df)
True
>>> sio.list_formats()[:5]
['.csv', '.tsv', '.xlsx', '.npy', '.npz']

A figure save additionally emits the underlying CSV + a figrecipe YAML sidecar, keeping figure-and-data atomically in sync.

Lint Rules

Detected by scitex-linter when this package is installed.

Rule Severity Message
STX-IO001 warning np.save() detected — use stx.io.save() for provenance tracking
STX-IO002 warning np.load() detected — use stx.io.load() for provenance tracking
STX-IO003 warning pd.read_csv() detected — use stx.io.load() for provenance tracking
STX-IO004 warning .to_csv() detected — use stx.io.save() for provenance tracking
STX-IO005 warning pickle.dump() detected — use stx.io.save() for provenance tracking
STX-IO006 warning json.dump() detected — use stx.io.save() for provenance tracking
STX-IO007 warning .savefig() detected — use stx.io.save(fig, path) for metadata embedding

Part of SciTeX

scitex-io is part of SciTeX. Install via the umbrella with pip install scitex[io] to use as scitex.io (Python) or scitex io ... (CLI).

import scitex

@scitex.session
def main(CONFIG=scitex.INJECTED):
    data = scitex.io.load("input.csv")     # auto-tracked by clew
    result = process(data)
    scitex.io.save(result, "output.csv")   # auto-tracked by clew
    return 0

scitex.io delegates to scitex_io — they share the same API and registry.

The SciTeX system follows the Four Freedoms for Research below, inspired by the Free Software Definition:

Four Freedoms for Research

  1. The freedom to run your research anywhere — your machine, your terms.
  2. The freedom to study how every step works — from raw data to final manuscript.
  3. The freedom to redistribute your workflows, not just your papers.
  4. The freedom to modify any module and share improvements with the community.

AGPL-3.0 — because we believe research infrastructure deserves the same freedoms as the software it runs on.


SciTeX

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scitex_io-0.2.12.tar.gz (11.2 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

scitex_io-0.2.12-py3-none-any.whl (5.8 MB view details)

Uploaded Python 3

File details

Details for the file scitex_io-0.2.12.tar.gz.

File metadata

  • Download URL: scitex_io-0.2.12.tar.gz
  • Upload date:
  • Size: 11.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scitex_io-0.2.12.tar.gz
Algorithm Hash digest
SHA256 a7da17c5340a2d65d79676f6e67b73e9e62ec7f98edb412d4a49788930befa48
MD5 36dbc71bff523d88531c15f8a0615d57
BLAKE2b-256 0a8fe50365ffad854038e48d8001078fe218df0516eb33a7765eb3f3788f76a5

See more details on using hashes here.

Provenance

The following attestation bundles were made for scitex_io-0.2.12.tar.gz:

Publisher: publish-pypi.yml on ywatanabe1989/scitex-io

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scitex_io-0.2.12-py3-none-any.whl.

File metadata

  • Download URL: scitex_io-0.2.12-py3-none-any.whl
  • Upload date:
  • Size: 5.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scitex_io-0.2.12-py3-none-any.whl
Algorithm Hash digest
SHA256 c0bcbbc04aa38017c65b535cd2e5a628cd079e340b08028618b184c402abdd12
MD5 4826e76ec7d98c34c047a8ba3dc60a06
BLAKE2b-256 f40a44d54c46a2b13fea9be7aa3691deb07ba39277ee6055342900ffca998d09

See more details on using hashes here.

Provenance

The following attestation bundles were made for scitex_io-0.2.12-py3-none-any.whl:

Publisher: publish-pypi.yml on ywatanabe1989/scitex-io

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page