xdflow
Dimension-aware ML pipelines for scientific data
xdflow is a machine learning framework designed for structured, multidimensional scientific data. Built on xarray, it brings reproducible, metadata-aware pipelines to domains where sklearn falls short: neuroscience, sensor arrays, time series, medical imaging, and any field working with labeled, high-dimensional data.
The Problem
If you work with scientific data, you've probably hit these walls:
sklearn pipelines break on structured data
# Your data has dimensions: (trials × channels × time × frequency)
# sklearn expects: (samples × features)
# You spend hours reshaping, lose metadata, break reproducibility
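To make the reshaping cost concrete, here is a minimal NumPy sketch (illustrative only, not xdflow code; the array sizes, channel names, and time values are invented). It flattens a (trial, channel, time) array into the (samples, features) layout and rebuilds the feature labels by hand.

```python
import itertools

import numpy as np

# Hypothetical toy recording: 6 trials x 3 channels x 4 time points.
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 3, 4))
channels = ["Fz", "Cz", "Pz"]
times = [0.0, 0.1, 0.2, 0.3]

# sklearn wants (samples, features), so the labeled axes are flattened away.
X = data.reshape(data.shape[0], -1)

# The labels must be rebuilt by hand, in the same C order that reshape used.
feature_labels = list(itertools.product(channels, times))

# Column 5 corresponds to channel index 1 and time index 1.
print(X.shape)            # (6, 12)
print(feature_labels[5])  # ('Cz', 0.1)
```

Every later step that reorders or drops columns has to update `feature_labels` too, which is where the bookkeeping usually goes wrong.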
No standard way to handle trial structure, sessions, or groups
# You need: "fit PCA per subject, then pool for classifier"
# sklearn offers: global fit() or manual loops
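The manual loop mentioned above might look like the following sketch. It uses a small SVD-based PCA written in NumPy so that it stays self-contained; the helper names, the `subject` array, and the data are all hypothetical and do not come from xdflow or sklearn.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return (mean, components) of a PCA fitted with an SVD."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X, mean, components):
    """Project centered data onto the fitted components."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))                # 12 trials x 5 features
subject = np.repeat(["s1", "s2", "s3"], 4)  # hypothetical subject label per trial

# Fit one PCA per subject, then write the projections back in trial order.
pooled = np.empty((X.shape[0], 2))
for s in np.unique(subject):
    mask = subject == s
    mean, comps = fit_pca(X[mask], n_components=2)
    pooled[mask] = apply_pca(X[mask], mean, comps)

print(pooled.shape)  # (12, 2)
```

Inside cross-validation, the per-subject fit would additionally have to be restricted to the training trials of each fold, which is the part that is easy to get wrong by hand.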
Cross-validation doesn't respect your data's structure
# You need: "leave-one-session-out, stratify by condition"
# sklearn offers: K-fold and group splitters driven by separate label/group arrays you maintain by hand
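A leave-one-session-out split is easy to state in terms of index arrays. The NumPy sketch below is a hand-rolled illustration, not xdflow's validator; the function name and the toy `session` labels are assumptions, and stratification by condition is not covered.

```python
import numpy as np


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each unique group."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.flatnonzero(groups == g)
        train = np.flatnonzero(groups != g)
        yield g, train, test


# Hypothetical session label for each of six trials.
session = np.array(["a", "a", "b", "b", "b", "c"])

for held_out, train, test in leave_one_group_out(session):
    # No session appears on both sides of the split.
    assert set(session[train]).isdisjoint(session[test])
    print(held_out, train.tolist(), test.tolist())
```

The indices are only valid while the trial order stays fixed, so any upstream step that drops or reorders trials must keep the `session` array in sync.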
Transforms don't preserve metadata
# After 5 pipeline steps, you've lost track of which channel is which
# Debugging is impossible, reproducibility is a prayer
The Solution
xdflow provides:
✅ Dimension-aware transforms that preserve labeled axes
✅ Reproducible pipelines with deterministic state tracking
✅ Sophisticated cross-validation that respects trial/session/subject structure
✅ First-class metadata propagation through every step
✅ Flexible composition patterns (sequential, parallel, conditional, per-group)
✅ Native xarray integration with seamless sklearn interop
✅ Experiment tracking with MLflow out of the box
Quick Example
import xdflow as xf
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from xdflow.cv.kfold import KFoldValidator
from xdflow.transforms.basic_transforms import FlattenTransform
from xdflow.transforms.cleaning import CARTransform
from xdflow.transforms.normalization import ZScoreTransform
from xdflow.transforms.sklearn_transform import SKLearnPredictor, SKLearnTransformer
from xdflow.transforms.spectral import MultiTaperTransform
# Your data: xarray.DataArray with dims (trial, channel, time)
# Coords include 'session', 'stimulus', etc.
data_container = xf.DataContainer(your_xarray_data)
freq_ranges = {"theta": [4, 8], "alpha": [8, 12], "beta": [12, 30], "gamma": [30, 58]}
# Build a pipeline: CAR → Z-score → Spectral features → PCA → Classifier
pipeline = xf.Pipeline(
    name="decode_stimulus",
    steps=[
        ("car", CARTransform(car_method="all")),
        ("zscore", ZScoreTransform(by_dim=["trial"])),
        ("multitaper", MultiTaperTransform(
            fs=data_container.attrs["sampling_frequency_hz"],
            num_time_windows=4,
            time_halfbandwidth_product=2,
            avg_over_time_windows=True,
            avg_over_freq_bands=True,
            freq_ranges=freq_ranges,
        )),
        ("flatten", FlattenTransform(dims=("channel", "freq_band"))),
        ("pca", SKLearnTransformer(
            estimator_cls=PCA, sample_dim="trial",
            output_dim_name="feature", n_components=30,
        )),
        ("logreg", SKLearnPredictor(
            estimator_cls=LogisticRegression, sample_dim="trial",
            target_coord="stimulus", max_iter=500,
        )),
    ],
)
# Cross-validate with structure-aware semantics
cv = KFoldValidator(n_splits=5, shuffle=True, random_state=0, test_size=0.2)
cv.set_pipeline(pipeline)
score = cv.cross_validate(data_container, verbose=False)
print(f"Weighted F1: {score:.3f}")
# Stateless transforms (CAR, z-score, spectral) computed once
# Stateful transforms (PCA, classifier) refitted per fold
# Metadata preserved throughout every step
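The example above assumes that `your_xarray_data` already exists. A toy stand-in can be built with xarray alone, as sketched below. The dimension names, the `session` and `stimulus` coordinates, and the `sampling_frequency_hz` attribute follow the example; the sizes, channel names, and sampling rate are invented, and how `DataContainer` exposes these fields is not shown here.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
n_trials, n_channels, n_times = 20, 8, 500
fs = 250.0  # made-up sampling rate in Hz

your_xarray_data = xr.DataArray(
    rng.normal(size=(n_trials, n_channels, n_times)),
    dims=("trial", "channel", "time"),
    coords={
        "trial": np.arange(n_trials),
        "channel": [f"ch{i}" for i in range(n_channels)],
        "time": np.arange(n_times) / fs,
        # Non-index coordinates attached to the trial dimension.
        "session": ("trial", np.repeat(["s1", "s2"], n_trials // 2)),
        "stimulus": ("trial", np.tile(["A", "B"], n_trials // 2)),
    },
    attrs={"sampling_frequency_hz": fs},
)

# Label-based selection keeps the remaining coordinates attached.
one_channel = your_xarray_data.sel(channel="ch3")
print(one_channel.dims)  # ('trial', 'time')
print(one_channel.coords["session"].values[:3])
```

Random noise will of course decode at chance level; the point is only to have an object with the right dimensions and coordinates to pass through a pipeline.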
Installation
# Core framework (minimal dependencies)
pip install xdflow
# With hyperparameter tuning (the quotes stop shells such as zsh from expanding the brackets)
pip install "xdflow[tuning]"
# With all extras (LightGBM, visualization, MLflow, spectral analysis)
pip install "xdflow[all]"
Development Setup
This project uses uv for dependency management.
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and install
git clone https://github.com/canaery/xdflow.git
cd xdflow
uv sync --all-extras # creates .venv and installs everything
# Run commands via uv
uv run pytest # run tests
uv run ruff check # lint
# Or activate the venv directly
source .venv/bin/activate
pytest
Requirements: Python 3.11+
Key Features
1. Transform System
All transforms follow a fit() / transform() / fit_transform() contract with:
- Automatic input/output dimension validation
- Deterministic state serialization (every transform is exactly reproducible)
- Metadata preservation (channel names, coordinates, etc. flow through)
- Immutability (safe for parallel execution, nested CV)
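One way to picture the contract is a small frozen dataclass whose `fit()` returns a new fitted instance rather than mutating itself. This is a hypothetical stand-in written with NumPy, not an xdflow class, and returning a `(fitted, output)` pair from `fit_transform()` is a design choice of this sketch rather than something the library documents.

```python
from dataclasses import dataclass, replace
from typing import Optional

import numpy as np


@dataclass(frozen=True, eq=False)
class ZScore:
    """Hypothetical stateful transform that follows a fit/transform contract."""

    axis: int = 0
    mean_: Optional[np.ndarray] = None
    std_: Optional[np.ndarray] = None

    def fit(self, X):
        # Return a new fitted instance instead of mutating self.
        return replace(
            self,
            mean_=X.mean(axis=self.axis, keepdims=True),
            std_=X.std(axis=self.axis, keepdims=True),
        )

    def transform(self, X):
        if self.mean_ is None:
            raise RuntimeError("call fit() before transform()")
        return (X - self.mean_) / self.std_

    def fit_transform(self, X):
        fitted = self.fit(X)
        return fitted, fitted.transform(X)


rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

unfitted = ZScore(axis=0)
fitted, Z = unfitted.fit_transform(X)

print(unfitted.mean_ is None)            # True: the original is untouched
print(np.allclose(Z.mean(axis=0), 0.0))  # True
```

Because the unfitted object never changes, the same instance can be handed to several folds or worker processes without one fit leaking into another.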
2. Composite Transforms
Build complex pipelines with:
- Pipeline: Sequential composition (A → B → C)
- TransformUnion: Parallel feature extraction ([A, B, C] → concatenate)
- SwitchTransform: Conditional selection (if condition: A else: B)
- GroupApplyTransform: Per-group fitting (fit PCA separately per subject)
- OptionalTransform: Toggle transforms on/off for ablation studies
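The sequential, parallel, and conditional patterns can be sketched with plain functions. The helpers below are hypothetical and stateless, written with NumPy only to show the data flow; they are not the xdflow classes listed above, which also carry fitted state and metadata.

```python
import numpy as np


def sequential(*steps):
    """A -> B -> C: feed each step's output into the next."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run


def union(*branches):
    """[A, B, C] -> concatenate branch outputs along the feature axis."""
    def run(x):
        return np.concatenate([branch(x) for branch in branches], axis=1)
    return run


def switch(condition, if_true, if_false):
    """Pick one branch based on a predicate of the input."""
    def run(x):
        return if_true(x) if condition(x) else if_false(x)
    return run


def center(x):
    return x - x.mean(axis=0)


def square(x):
    return x ** 2


def identity(x):
    return x


pipeline = sequential(
    center,
    union(identity, square),  # doubles the feature count
    switch(lambda x: x.shape[1] > 4, identity, square),
)

x = np.arange(12, dtype=float).reshape(4, 3)
print(pipeline(x).shape)  # (4, 6)
```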
3. Intelligent Cross-Validation
- Automatically separates stateless preprocessing (computed once) from stateful models (refitted per fold)
- Large speedups on expensive transforms (spectrograms, wavelets), which run once instead of once per fold
- Supports grouping, stratification, custom CV strategies
- Out-of-fold predictions for stacking/ensembles
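The stateless/stateful split can be illustrated with a small NumPy sketch. The function names, the call counter, and the toy data are assumptions made for this illustration; xdflow's own mechanism is not shown here.

```python
import numpy as np

calls = {"stateless": 0, "stateful_fit": 0}


def expensive_stateless(X):
    """Per-trial feature (log power); each trial is processed independently."""
    calls["stateless"] += 1
    return np.log((X ** 2).mean(axis=-1))


def fit_scaler(features):
    """Stateful step: statistics are learned from the training trials only."""
    calls["stateful_fit"] += 1
    return features.mean(axis=0), features.std(axis=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4, 100))  # trial x channel x time
folds = np.array_split(rng.permutation(20), 5)

features = expensive_stateless(X)  # computed once, outside the fold loop
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(20), test_idx)
    mean, std = fit_scaler(features[train_idx])  # refit in every fold
    test_scaled = (features[test_idx] - mean) / std

print(calls)  # {'stateless': 1, 'stateful_fit': 5}
```

Hoisting the first step out of the loop is safe only because it uses no information from other trials; anything that learns statistics across trials has to stay inside the fold loop to avoid leakage.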
4. Hyperparameter Tuning
- Optuna integration with Bayesian optimization
- Multi-pipeline comparison (compare architectures, not just hyperparams)
- Automatic MLflow logging
- Seed management for reproducibility
5. Multi-Output Support
- Native multi-target regression (predict multiple outputs simultaneously)
- Proper handling of sample weights
- Classification and regression in unified interface
Who Should Use This?
xdflow is designed for researchers and engineers working with:
Domains:
- Neuroscience (EEG, ECoG, MEG, calcium imaging, spike trains)
- Biosignals (ECG, EMG, respiration)
- Sensor arrays (industrial IoT, environmental monitoring)
- Medical time series (sleep studies, patient monitoring)
- Geophysical signals (seismology, climate data)
- Any labeled, multidimensional scientific data
Use Cases:
- You have metadata that must flow through your pipeline
- You need cross-validation that respects experiment structure
- You want reproducible experiments without custom infrastructure
- You're tired of reshaping data to fit sklearn's assumptions
- You need to compare dozens of pipeline architectures systematically
Comparison to Other Tools
| Feature | xdflow | sklearn | Kedro/ZenML |
|---|---|---|---|
| Dimension-aware transforms | ✅ | ❌ | ❌ |
| Metadata preservation | ✅ | ❌ | ⚠️ (manual) |
| Structured CV semantics | ✅ | ⚠️ (basic) | ❌ |
| xarray-native | ✅ | ❌ | ❌ |
| Stateful/stateless optimization | ✅ | ❌ | ❌ |
| Reproducible by default | ✅ | ⚠️ (manual) | ✅ |
| Scientific data focus | ✅ | ❌ | ❌ |
| Learning curve | Medium | Low | High |
When to use sklearn: Tabular data, classic ML problems, well-established workflows.
When to use Kedro/ZenML: Large-scale MLOps, multi-team production deployments.
When to use xdflow: Structured scientific data, experiment reproducibility, metadata-aware pipelines.
Documentation
Full Documentation (coming soon)
Quick Links:
- Data & Dimension Contract — understand the container + transform rules
- Installation & Setup
- Core Concepts
- Tutorials
- API Reference
Contributing
We welcome contributions, whether you're:
- Adding a new transform
- Improving documentation
- Reporting bugs
- Requesting features
- Sharing use cases
See CONTRIBUTING.md for guidelines.
Early adopters are especially welcome: the API is still stabilizing, and your feedback shapes the future of this project.
License
MIT License - see LICENSE for details.
Acknowledgments
Built on the shoulders of giants:
- xarray - Labeled multidimensional arrays
- scikit-learn - ML fundamentals
- Optuna - Hyperparameter optimization
- MLflow - Experiment tracking
Inspired by the needs of the scientific computing community and years of building neural decoding pipelines.
Contact
- GitHub Issues: Report bugs or request features
- Discussions: Ask questions, share ideas
Built by scientists, for scientists. Let's make reproducible ML the default.