msdatasets
A unified dataset framework for mass spectrometry.
msdatasets is a Python client and CLI for downloading mass spectrometry
datasets from the msdatasets server. Datasets are fetched by server UUID or
by repository accession (PRIDE, MassIVE), cached on disk, and optionally
loaded as a PyTorch Dataset for training pipelines.
Features
- Download by server UUID or by PRIDE / MassIVE accession; the server imports and converts remote projects on demand
- Choose the on-disk format per download: `mszx` (raw archive), `msz` (inner compressed MS data), or `mzml` (fully decompressed)
- Parallel downloads with a live progress bar
- Filename subsets via `accession[file1.raw,file2.mzML]` syntax
- Server-side extraction is tracked over SSE until files are ready
- Optional PyTorch integration via the `torch` extra
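Tracking server-side extraction over SSE comes down to reading a `text/event-stream` response and reacting to each event. The sketch below parses such a stream from an iterable of lines using the standard SSE framing (`event:` and `data:` fields, a blank line ends an event). The event names (`progress`, `ready`) and the JSON payload shape are hypothetical, chosen only for illustration; the real server schema may differ.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:  # a blank line dispatches the buffered event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):  # comment line, often used as keep-alive
            continue
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def wait_until_ready(lines: Iterable[str]) -> list[str]:
    """Consume events until a (hypothetical) 'ready' event and return its files."""
    for name, payload in iter_sse_events(lines):
        if name == "ready":
            return json.loads(payload)["files"]
    raise RuntimeError("stream ended before extraction finished")


# Hypothetical stream, for illustration only.
sample = [
    ": keep-alive\n",
    "event: progress\n",
    'data: {"done": 1, "total": 2}\n',
    "\n",
    "event: ready\n",
    'data: {"files": ["19HCD_3.mzML"]}\n',
    "\n",
]
files = wait_until_ready(sample)
```

In a real client the `lines` iterable would come from a streaming HTTP response rather than a list.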
Installation
pip install msdatasets # base install
pip install 'msdatasets[torch]' # with PyTorch integration
Quick start
CLI
# By server UUID
msdatasets download 550e8400-e29b-41d4-a716-446655440000
# From a PRIDE project
msdatasets download pride/PXD075509
# Subset of files, stored as mzML (quoted so the shell does not glob the brackets)
msdatasets download 'pride/PXD075509[19HCD_3.mzML]' --store-as mzml
# Write directly to a directory instead of the shared cache
msdatasets download massive/MSV000101460 -o ./my-data
Python
from msdatasets import download_dataset, download_repo_dataset
# By UUID
ds = download_dataset("550e8400-e29b-41d4-a716-446655440000")
print(ds.dataset_name, len(ds), "files")
for path in ds:
    ...
# By PRIDE accession (filename subset, stored as mzML)
ds = download_repo_dataset(
    "pride",
    "PXD075509",
    filenames=["19HCD_3.mzML"],
    store_as="mzml",
)
PyTorch
from msdatasets import load_dataset
# Returns an mscompress.datasets.torch.MSCompressDataset.
# Accepts UUIDs and repository specs.
dataset = load_dataset("pride/PXD075509[19HCD_3.mzML]")
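Because `load_dataset` accepts both UUIDs and repository specs, a caller-side check can be useful for validating input before any network request. The following is a minimal sketch of how such an identifier might be classified; `RepoSpec` and `parse_identifier` are hypothetical names, and the grammar is inferred from the `repo/accession[file1,file2]` examples in this README, not taken from the library's actual parser.

```python
import re
import uuid
from dataclasses import dataclass, field
from typing import Union

# Assumed grammar: <repo>/<accession> with an optional [comma,separated,files] suffix.
_SPEC = re.compile(
    r"^(?P<repo>[a-z]+)/(?P<accession>[^\[\]/]+)(?:\[(?P<files>[^\[\]]*)\])?$"
)


@dataclass
class RepoSpec:
    repo: str
    accession: str
    filenames: list[str] = field(default_factory=list)


def parse_identifier(text: str) -> Union[uuid.UUID, RepoSpec]:
    """Return a UUID if the text is one, otherwise a parsed repository spec."""
    try:
        return uuid.UUID(text)
    except ValueError:
        pass
    match = _SPEC.match(text)
    if match is None:
        raise ValueError(f"not a UUID or repository spec: {text!r}")
    files = [f.strip() for f in (match["files"] or "").split(",") if f.strip()]
    return RepoSpec(match["repo"], match["accession"], files)


spec = parse_identifier("pride/PXD075509[19HCD_3.mzML]")
```

Here `spec.filenames` holds the requested subset, and an empty list stands for "all files" under this sketch's convention.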
Configuration
| Environment variable | Purpose | Default |
|---|---|---|
| `MS_API_URL` | Server base URL | `https://datasets.lab.gy` |
| `MS_DATASETS_CACHE` | Explicit cache directory | (none) |
| `MS_HOME` | Alternative cache root (`$MS_HOME/datasets`) | `~/.ms` |
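One plausible reading of the table is that `MS_DATASETS_CACHE` wins when set, and otherwise the cache lives under `$MS_HOME/datasets`, with `MS_HOME` falling back to `~/.ms`. The sketch below encodes that reading; the precedence order is an assumption, and `resolve_cache_dir` is a hypothetical helper, not part of the msdatasets API.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the dataset cache directory from environment variables.

    Assumed precedence: MS_DATASETS_CACHE, then $MS_HOME/datasets,
    then ~/.ms/datasets.
    """
    env = os.environ if env is None else env
    explicit = env.get("MS_DATASETS_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    home = env.get("MS_HOME") or "~/.ms"
    return Path(home).expanduser() / "datasets"


# Passing a mapping instead of reading os.environ keeps the example deterministic.
explicit_dir = resolve_cache_dir({"MS_DATASETS_CACHE": "/tmp/ms-cache"})
home_dir = resolve_cache_dir({"MS_HOME": "/data/ms"})
default_dir = resolve_cache_dir({})
```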
Full CLI reference, storage-format details, and Python API are in the documentation.
Development
git clone https://github.com/chrisagrams/msdatasets.git
cd msdatasets
uv sync --extra dev --extra docs
uv run pre-commit install
uv run pytest
Pre-commit runs ruff, mypy, and pytest (90% coverage gate). CI runs on
Python 3.10, 3.11, and 3.12.
License
MIT