xdflow

Dimension-aware ML pipelines for scientific data

License: MIT | Python 3.11+

xdflow is a machine learning framework designed for structured, multidimensional scientific data. Built on xarray, it brings reproducible, metadata-aware pipelines to domains where sklearn falls short: neuroscience, sensor arrays, time series, medical imaging, and any field working with labeled, high-dimensional data.


The Problem

If you work with scientific data, you've probably hit these walls:

sklearn pipelines break on structured data

# Your data has dimensions: (trials × channels × time × frequency)
# sklearn expects: (samples × features)
# You spend hours reshaping, lose metadata, break reproducibility
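
The reshaping cost can be seen in a minimal NumPy sketch. The array shape, channel names, and feature index below are made up for illustration, and nothing here is xdflow API.

```python
import numpy as np

# Hypothetical recording: 20 trials × 8 channels × 50 time points × 4 frequency bands
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 8, 50, 4))
channel_names = [f"ch{i}" for i in range(8)]  # labels live outside the array

# sklearn-style input: one row per trial, every other axis flattened into features
X = data.reshape(data.shape[0], -1)
print(X.shape)  # (20, 1600)

# To learn what feature 1234 means, the original shape must be carried around by hand
ch, t, f = np.unravel_index(1234, data.shape[1:])
print(channel_names[ch], int(t), int(f))  # ch6 8 2
```

Once `X` is handed to an estimator, the mapping from column index back to (channel, time, frequency) exists only in whatever bookkeeping the caller keeps alongside it.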

No standard way to handle trial structure, sessions, or groups

# You need: "fit PCA per subject, then pool for classifier"
# sklearn offers: global fit() or manual loops
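
A sketch of the manual loop this usually turns into, using plain NumPy. The `fit_pca` and `apply_pca` helpers and the `subject` labels are hypothetical and only illustrate the "fit per subject, then pool" pattern; they are not how xdflow implements it.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return (mean, components) of a PCA fitted to X with shape (samples, features)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project centered samples onto the fitted principal axes."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))                 # 60 trials, 10 features
subject = np.repeat(["s1", "s2", "s3"], 20)   # hypothetical subject label per trial

pooled = np.empty((X.shape[0], 3))
for s in np.unique(subject):
    mask = subject == s
    mean, comps = fit_pca(X[mask], n_components=3)   # fitted on this subject only
    pooled[mask] = apply_pca(X[mask], mean, comps)   # rows keep their original order

print(pooled.shape)  # (60, 3)
```

The pooled matrix can then go to a single classifier. Inside cross-validation, each per-subject fit would additionally have to be restricted to the training trials of the current fold, which is where hand-written loops tend to go wrong.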

Cross-validation doesn't respect your data's structure

# You need: "leave-one-session-out, stratify by condition"
# sklearn offers: splitters (GroupKFold, LeaveOneGroupOut, StratifiedGroupKFold) fed flat group/label arrays you extract and align by hand
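
As a rough illustration of what "leave-one-session-out" means, here is a pure-Python sketch. The `session` and `condition` lists are invented per-trial labels, and `leave_one_group_out` is a hypothetical helper, not an xdflow function. Because whole sessions are held out, this sketch only reports the condition balance of each test fold rather than enforcing it.

```python
from collections import Counter

def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each distinct group label."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test

# Hypothetical per-trial labels
session = ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"]
condition = ["x", "y", "x", "y", "x", "x", "y", "y", "x", "y", "y", "x"]

for held_out, train, test in leave_one_group_out(session):
    # Per-fold condition counts reveal whether each test fold is balanced
    counts = Counter(condition[i] for i in test)
    print(held_out, len(train), len(test), dict(counts))
```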

Transforms don't preserve metadata

# After 5 pipeline steps, you've lost track of which channel is which
# Debugging is impossible, reproducibility is a prayer

The Solution

xdflow provides:

  • Dimension-aware transforms that preserve labeled axes
  • Reproducible pipelines with deterministic state tracking
  • Sophisticated cross-validation that respects trial/session/subject structure
  • First-class metadata propagation through every step
  • Flexible composition patterns (sequential, parallel, conditional, per-group)
  • Native xarray integration with seamless sklearn interop
  • Experiment tracking with MLflow out of the box


Quick Example

import xdflow as xf
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from xdflow.cv.kfold import KFoldValidator
from xdflow.transforms.basic_transforms import FlattenTransform
from xdflow.transforms.cleaning import CARTransform
from xdflow.transforms.normalization import ZScoreTransform
from xdflow.transforms.sklearn_transform import SKLearnPredictor, SKLearnTransformer
from xdflow.transforms.spectral import MultiTaperTransform

# Your data: xarray.DataArray with dims (trial, channel, time)
# Coords include 'session', 'stimulus', etc.
data_container = xf.DataContainer(your_xarray_data)

freq_ranges = {"theta": [4, 8], "alpha": [8, 12], "beta": [12, 30], "gamma": [30, 58]}

# Build a pipeline: CAR → Z-score → Spectral features → Flatten → PCA → Classifier
pipeline = xf.Pipeline(
    name="decode_stimulus",
    steps=[
        ("car", CARTransform(car_method="all")),
        ("zscore", ZScoreTransform(by_dim=["trial"])),
        ("multitaper", MultiTaperTransform(
            fs=data_container.attrs["sampling_frequency_hz"],
            num_time_windows=4,
            time_halfbandwidth_product=2,
            avg_over_time_windows=True,
            avg_over_freq_bands=True,
            freq_ranges=freq_ranges,
        )),
        ("flatten", FlattenTransform(dims=("channel", "freq_band"))),
        ("pca", SKLearnTransformer(
            estimator_cls=PCA, sample_dim="trial",
            output_dim_name="feature", n_components=30,
        )),
        ("logreg", SKLearnPredictor(
            estimator_cls=LogisticRegression, sample_dim="trial",
            target_coord="stimulus", max_iter=500,
        )),
    ],
)

# Cross-validate with structure-aware semantics
cv = KFoldValidator(n_splits=5, shuffle=True, random_state=0, test_size=0.2)
cv.set_pipeline(pipeline)

score = cv.cross_validate(data_container, verbose=False)
print(f"Weighted F1: {score:.3f}")

# Stateless transforms (CAR, z-score, spectral) computed once
# Stateful transforms (PCA, classifier) refitted per fold
# Metadata preserved throughout every step

Installation

# Core framework (minimal dependencies)
pip install xdflow

# With hyperparameter tuning (quotes keep shells such as zsh from expanding the brackets)
pip install "xdflow[tuning]"

# With all extras (LightGBM, visualization, MLflow, spectral analysis)
pip install "xdflow[all]"

Development Setup

This project uses uv for dependency management.

# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/canaery/xdflow.git
cd xdflow
uv sync --all-extras    # creates .venv and installs everything

# Run commands via uv
uv run pytest            # run tests
uv run ruff check        # lint

# Or activate the venv directly
source .venv/bin/activate
pytest

Requirements: Python 3.11+


Key Features

1. Transform System

All transforms follow a fit() / transform() / fit_transform() contract with:

  • Automatic input/output dimension validation
  • Deterministic state serialization (every transform is exactly reproducible)
  • Metadata preservation (channel names, coordinates, etc. flow through)
  • Immutability (safe for parallel execution, nested CV)
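
One way to read the immutability point is that fitting returns a new object instead of mutating the existing one. The sketch below illustrates that reading with a frozen dataclass. `ZScore` is a hypothetical stand-in written for this example, not an xdflow class, and the library's actual contract may differ in detail.

```python
from dataclasses import dataclass, replace
from statistics import fmean, pstdev
from typing import Optional

@dataclass(frozen=True)
class ZScore:
    """Hypothetical stand-in for a stateful transform; not an xdflow class."""
    mean: Optional[float] = None
    std: Optional[float] = None

    def fit(self, values):
        # Return a new fitted instance instead of mutating self
        return replace(self, mean=fmean(values), std=pstdev(values))

    def transform(self, values):
        if self.mean is None or self.std is None:
            raise RuntimeError("transform() called before fit()")
        return [(v - self.mean) / self.std for v in values]

    def fit_transform(self, values):
        fitted = self.fit(values)
        return fitted, fitted.transform(values)

unfitted = ZScore()
fitted, scaled = unfitted.fit_transform([1.0, 2.0, 3.0, 4.0])
print(unfitted.mean is None)   # True: the original instance is untouched
print(fitted.mean, fitted.std)
```

Since the unfitted object never changes, the same instance can be shared across folds or worker processes without one fit leaking into another.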

2. Composite Transforms

Build complex pipelines with:

  • Pipeline: Sequential composition (A → B → C)
  • TransformUnion: Parallel feature extraction ([A, B, C] → concatenate)
  • SwitchTransform: Conditional selection (if condition: A else: B)
  • GroupApplyTransform: Per-group fitting (fit PCA separately per subject)
  • OptionalTransform: Toggle transforms on/off for ablation studies
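
The sequential, parallel, and optional patterns can be sketched with plain functions on lists. The helper names `sequential`, `union`, and `optional` are invented for this example and only mirror the ideas behind Pipeline, TransformUnion, and OptionalTransform; they say nothing about those classes' real signatures.

```python
def sequential(steps):
    """A → B → C: feed each step's output into the next."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run

def union(branches):
    """[A, B] → concatenate: run every branch on the same input and join the outputs."""
    def run(x):
        out = []
        for branch in branches:
            out.extend(branch(x))
        return out
    return run

def optional(step, enabled):
    """Toggle a step on or off; a disabled step passes its input through unchanged."""
    return step if enabled else (lambda x: x)

def center(x):
    m = sum(x) / len(x)
    return [v - m for v in x]

def square(x):
    return [v * v for v in x]

def total(x):
    return [sum(x)]

pipeline = sequential([
    optional(center, enabled=True),
    union([square, total]),
])
print(pipeline([1.0, 2.0, 3.0]))  # [1.0, 0.0, 1.0, 0.0]
```

Flipping `enabled` to `False` gives the ablated variant of the same pipeline without touching any other step.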

3. Intelligent Cross-Validation

  • Automatically separates stateless preprocessing (computed once) from stateful models (refitted per fold)
  • Expensive stateless transforms (spectrograms, wavelets) run once instead of once per fold, which can cut cross-validation time by orders of magnitude
  • Supports grouping, stratification, custom CV strategies
  • Out-of-fold predictions for stacking/ensembles
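
The stateless/stateful split can be illustrated with a toy example that counts function calls. Everything below (the feature function, the threshold "model", the data, and the fold assignment) is hypothetical and only demonstrates the caching idea, not xdflow's internals.

```python
calls = {"stateless": 0, "stateful_fit": 0}

def expensive_stateless(trial):
    """Per-trial feature extraction: the output depends only on the trial itself."""
    calls["stateless"] += 1
    return sum(v * v for v in trial) / len(trial)   # mean power as a toy feature

def fit_threshold(features, labels):
    """Toy stateful model: threshold halfway between the two class means."""
    calls["stateful_fit"] += 1
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

trials = [[0.1, 0.2], [0.2, 0.1], [1.0, 1.2], [1.1, 0.9], [0.3, 0.1], [0.9, 1.3]]
labels = [0, 0, 1, 1, 0, 1]

# Stateless step: run once on all trials, outside the fold loop
features = [expensive_stateless(t) for t in trials]

n_folds = 3
accuracies = []
for k in range(n_folds):
    test_idx = [i for i in range(len(trials)) if i % n_folds == k]
    train_idx = [i for i in range(len(trials)) if i % n_folds != k]
    # Stateful step: refit on the training trials of every fold
    threshold = fit_threshold(
        [features[i] for i in train_idx], [labels[i] for i in train_idx]
    )
    correct = sum((features[i] > threshold) == bool(labels[i]) for i in test_idx)
    accuracies.append(correct / len(test_idx))

print(calls)  # {'stateless': 6, 'stateful_fit': 3}
```

Moving `expensive_stateless` inside the loop would call it for every trial in every fold (18 times here instead of 6) with identical results, because its output never depends on which trials are in the training set.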

4. Hyperparameter Tuning

  • Optuna integration with Bayesian optimization
  • Multi-pipeline comparison (compare architectures, not just hyperparams)
  • Automatic MLflow logging
  • Seed management for reproducibility
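
Seed management can be sketched without any tuning library. The example below swaps in a plain random search for Bayesian optimization and does not use Optuna. The `objective` function, the `lr` parameter, and the seed-derivation scheme are all assumptions made for illustration.

```python
import random

def objective(params, seed):
    """Toy noisy score peaking at lr=0.1; stands in for a cross-validated metric."""
    noise = random.Random(seed).gauss(0.0, 0.01)
    return -(params["lr"] - 0.1) ** 2 + noise

def random_search(n_trials, master_seed):
    """Random search in which every source of randomness derives from master_seed."""
    master = random.Random(master_seed)
    best = None
    for _ in range(n_trials):
        params = {"lr": master.uniform(0.0, 1.0)}
        trial_seed = master.randrange(2**32)   # per-trial seed derived from the master seed
        score = objective(params, trial_seed)
        if best is None or score > best[0]:
            best = (score, params, trial_seed)
    return best

first = random_search(20, master_seed=0)
second = random_search(20, master_seed=0)
print(first == second)  # True: same master seed, same search, same winner
```

Recording the master seed with the results is then enough to replay the whole search, including the noise seen by each individual trial.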

5. Multi-Output Support

  • Native multi-target regression (predict multiple outputs simultaneously)
  • Proper handling of sample weights
  • Classification and regression in unified interface
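
For readers unfamiliar with the combination, multi-target regression with sample weights can be sketched in NumPy as weighted least squares. The data, the weights, and the true coefficient matrix are synthetic, and this is not the estimator xdflow uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
W_true = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])   # 3 features → 2 targets
Y = X @ W_true                                              # noiseless targets, shape (100, 2)
weights = rng.uniform(0.5, 2.0, size=100)                   # hypothetical per-sample weights

# Weighted least squares: scale each row by sqrt(weight), then solve all targets at once
s = np.sqrt(weights)[:, None]
W_hat, *_ = np.linalg.lstsq(X * s, Y * s, rcond=None)

print(W_hat.shape)                  # (3, 2)
print(np.allclose(W_hat, W_true))   # True
```

Each column of `W_hat` is the coefficient vector for one target, so a single solve covers all outputs while every sample contributes in proportion to its weight.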

Who Should Use This?

xdflow is designed for researchers and engineers working with:

Domains:

  • Neuroscience (EEG, ECoG, MEG, calcium imaging, spike trains)
  • Biosignals (ECG, EMG, respiration)
  • Sensor arrays (industrial IoT, environmental monitoring)
  • Medical time series (sleep studies, patient monitoring)
  • Geophysical signals (seismology, climate data)
  • Any labeled, multidimensional scientific data

Use Cases:

  • You have metadata that must flow through your pipeline
  • You need cross-validation that respects experiment structure
  • You want reproducible experiments without custom infrastructure
  • You're tired of reshaping data to fit sklearn's assumptions
  • You need to compare dozens of pipeline architectures systematically

Comparison to Other Tools

Feature xdflow sklearn Kedro/ZenML
Dimension-aware transforms
Metadata preservation ⚠️ (manual)
Structured CV semantics ⚠️ (basic)
xarray-native
Stateful/stateless optimization
Reproducible by default ⚠️ (manual)
Scientific data focus
Learning curve Medium Low High

  • When to use sklearn: Tabular data, classic ML problems, well-established workflows
  • When to use Kedro/ZenML: Large-scale MLOps, multi-team production deployments
  • When to use xdflow: Structured scientific data, experiment reproducibility, metadata-aware pipelines


Documentation

Full Documentation (coming soon)


Contributing

We welcome contributions! Whether you're:

  • Adding a new transform
  • Improving documentation
  • Reporting bugs
  • Requesting features
  • Sharing use cases

See CONTRIBUTING.md for guidelines.

Early adopters are especially welcome: the API is still stabilizing, and your feedback will shape where the project goes.


License

MIT License - see LICENSE for details.


Acknowledgments

Built on the shoulders of giants:

Inspired by the needs of the scientific computing community and years of building neural decoding pipelines.



Built by scientists, for scientists. Let's make reproducible ML the default.
