msdatasets

A unified dataset framework for mass spectrometry.

msdatasets is a Python client and CLI for downloading mass spectrometry datasets from the msdatasets server. Datasets are fetched by server UUID, by repository accession (PRIDE, MassIVE), or from a HuggingFace Hub dataset repo of .mszx files. They are cached on disk and can optionally be loaded as a PyTorch Dataset for training pipelines.

Features

  • Download by server UUID or by PRIDE / MassIVE accession — the server imports and converts remote projects on demand
  • Choose the on-disk format per download: mszx (raw archive), msz (inner compressed MS data), or mzml (fully decompressed)
  • Parallel downloads with a live progress bar
  • Filename subsets via accession[file1.raw,file2.mzML] syntax
  • Server-side extraction is tracked over Server-Sent Events (SSE) until files are ready
  • Optional PyTorch integration via the torch extra
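
The extraction-tracking feature relies on Server-Sent Events, a line-oriented text stream in which each event is a group of `field: value` lines terminated by a blank line. The following is a minimal sketch of parsing such a stream with the standard library only. The event names (`progress`, `ready`) and the JSON payload keys are invented for illustration; the event schema actually used by the msdatasets server is not described here, and the client handles this stream for you.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


# Hypothetical stream: event names and payload keys are illustrative only.
sample = [
    ": keep-alive\n",
    "event: progress\n",
    'data: {"done": 1, "total": 2}\n',
    "\n",
    "event: ready\n",
    'data: {"files": ["19HCD_3.mzML"]}\n',
    "\n",
]

events = [(name, json.loads(data)) for name, data in iter_sse_events(sample)]
```

A caller would typically keep reading until an event signalling completion arrives, then start the file downloads.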

Installation

pip install msdatasets              # base install
pip install 'msdatasets[torch]'     # with PyTorch integration
pip install 'msdatasets[hf]'        # with HuggingFace Hub support

Quick start

CLI

# By server UUID
msdatasets download 550e8400-e29b-41d4-a716-446655440000

# From a PRIDE project
msdatasets download pride/PXD075509

# Subset of files, stored as mzML (quote the spec so the shell does not glob the brackets)
msdatasets download 'pride/PXD075509[19HCD_3.mzML]' --store-as mzml

# Write directly to a directory instead of the shared cache
msdatasets download massive/MSV000101460 -o ./my-data

# From a HuggingFace dataset repo of .mszx files
msdatasets download hf/myorg/proteomics-bench
msdatasets download 'hf/myorg/proteomics-bench[run_01.mszx,run_02.mszx]'
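
The spec strings used in the commands above follow a small grammar: an optional source prefix (`pride/`, `massive/`, `hf/`), an identifier, and an optional bracketed, comma-separated filename list. The sketch below shows one way such a string could be split into its parts. `Spec` and `parse_spec` are hypothetical names introduced for illustration and are not part of the msdatasets API; the library's own parsing rules may differ.

```python
import re
import uuid
from typing import NamedTuple, Optional


class Spec(NamedTuple):
    source: str                      # "uuid", "pride", "massive", or "hf"
    identifier: str                  # UUID, accession, or "<owner>/<repo>"
    filenames: Optional[list[str]]   # None means "all files"


# Body without brackets, followed by an optional [a,b,c] suffix.
_SPEC_RE = re.compile(r"^(?P<body>[^\[\]]+)(?:\[(?P<files>[^\[\]]*)\])?$")


def parse_spec(spec: str) -> Spec:
    """Split a spec such as 'pride/PXD075509[a.mzML,b.raw]' into its parts."""
    match = _SPEC_RE.match(spec.strip())
    if match is None:
        raise ValueError(f"malformed spec: {spec!r}")
    body, files = match["body"], match["files"]
    filenames = None
    if files is not None:
        filenames = [f.strip() for f in files.split(",") if f.strip()]
    source, sep, identifier = body.partition("/")
    if not sep:
        # No prefix: assume the whole body is a server UUID.
        uuid.UUID(body)  # raises ValueError if it is not a valid UUID
        return Spec("uuid", body, filenames)
    if source not in {"pride", "massive", "hf"} or not identifier:
        raise ValueError(f"unknown source in spec: {spec!r}")
    return Spec(source, identifier, filenames)


print(parse_spec("hf/myorg/proteomics-bench[run_01.mszx,run_02.mszx]"))
```

Because square brackets are glob characters in most shells, specs with a filename list should be quoted on the command line.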

Python

from msdatasets import download_dataset, download_repo_dataset

# By UUID
ds = download_dataset("550e8400-e29b-41d4-a716-446655440000")
print(ds.dataset_name, len(ds), "files")
for path in ds:
    ...

# By PRIDE accession (filename subset, stored as mzML)
ds = download_repo_dataset(
    "pride",
    "PXD075509",
    filenames=["19HCD_3.mzML"],
    store_as="mzml",
)
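
Since iterating a downloaded dataset yields file paths (as in the `for path in ds` loop above), ordinary path handling applies to the result. The helper below, `count_by_suffix`, is a hypothetical illustration and not part of msdatasets. It assumes only that each yielded item is a string or path-like object, and it is exercised on a plain list so that it runs without contacting the server.

```python
import os
from collections import Counter
from pathlib import Path
from typing import Iterable, Union


def count_by_suffix(paths: Iterable[Union[str, os.PathLike]]) -> Counter:
    """Count files per lowercase extension, e.g. '.mzml' or '.mszx'."""
    return Counter(Path(p).suffix.lower() for p in paths)


# Stand-in for `ds`; with msdatasets installed, pass the dataset itself.
example_paths = ["run_01.mszx", "run_02.mszx", "19HCD_3.mzML"]
counts = count_by_suffix(example_paths)
print(dict(counts))
```

A check like this can confirm that a `store_as` choice produced the expected file types before handing the files to a training pipeline.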

PyTorch

from msdatasets import load_dataset

# Returns an mscompress.datasets.torch.MSCompressDataset.
# Accepts UUIDs, repository specs, and HuggingFace specs.
dataset = load_dataset("pride/PXD075509[19HCD_3.mzML]")
dataset = load_dataset("hf/myorg/proteomics-bench")

HuggingFace Hub

from msdatasets import download_hf_dataset

ds = download_hf_dataset(
    "myorg/proteomics-bench",
    filenames=["run_01.mszx", "run_02.mszx"],   # optional subset
    revision="v1.0",                              # optional branch/tag/commit
    token=None,                                   # falls back to HF_TOKEN
)
print(ds.cache_dir, len(ds), "files")

HF downloads land at ~/.ms/datasets/hf/<owner>/<repo>/ by default. Files are stored as-is: conversion with --store-as is not supported for HF specs in this version.

Configuration

| Environment variable | Purpose | Default |
|---|---|---|
| MS_API_URL | Server base URL | https://datasets.lab.gy |
| MS_DATASETS_CACHE | Explicit cache directory | |
| MS_HOME | Alternative cache root ($MS_HOME/datasets) | ~/.ms |
Full CLI reference, storage-format details, and Python API are in the documentation.

Development

git clone https://github.com/chrisagrams/msdatasets.git
cd msdatasets
uv sync --extra dev --extra docs
uv run pre-commit install
uv run pytest

Pre-commit runs ruff, mypy, and pytest (90% coverage gate). CI runs on Python 3.10, 3.11, and 3.12.

License

MIT
