A unified dataset framework for mass spectrometry

msdatasets

msdatasets is a Python client and CLI for downloading mass spectrometry datasets from the msdatasets server. Datasets are fetched by server UUID or by repository accession (PRIDE, MassIVE), cached on disk, and optionally loaded as a PyTorch Dataset for training pipelines.

Features

  • Download by server UUID or by PRIDE / MassIVE accession — the server imports and converts remote projects on demand
  • Choose the on-disk format per download: mszx (raw archive), msz (inner compressed MS data), or mzml (fully decompressed)
  • Parallel downloads with a live progress bar
  • Filename subsets via accession[file1.raw,file2.mzML] syntax
  • Server-side extraction is tracked over Server-Sent Events (SSE) until the files are ready
  • Optional PyTorch integration via the torch extra
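
To illustrate what tracking extraction over SSE involves, here is a simplified parser for the Server-Sent Events wire format, in which each event is a group of `field: value` lines terminated by a blank line. It is a sketch only: `iter_sse_events` is a hypothetical helper, not part of msdatasets, and the server's actual event names and payloads are not described in this README. The real client may well use an HTTP library for this rather than hand-rolled parsing.

```python
from typing import Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event from an iterable of decoded SSE lines.

    Simplified: only the `event`, `id`, and `data` fields are kept.
    """
    event: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; emit it if it carried data.
            if data:
                event["data"] = "\n".join(data)
                yield event
            event, data = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if name == "data":
            data.append(value)
        elif name in ("event", "id"):
            event[name] = value
```

A caller would feed this the decoded lines of a streaming HTTP response and stop once an event signals that the files are ready. Which event carries that signal is server-specific and not assumed here.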

Installation

pip install msdatasets              # base install
pip install 'msdatasets[torch]'     # with PyTorch integration

Quick start

CLI

# By server UUID
msdatasets download 550e8400-e29b-41d4-a716-446655440000

# From a PRIDE project
msdatasets download pride/PXD075509

# Subset of files, stored as mzML (quoted so the shell does not treat [...] as a glob)
msdatasets download 'pride/PXD075509[19HCD_3.mzML]' --store-as mzml

# Write directly to a directory instead of the shared cache
msdatasets download massive/MSV000101460 -o ./my-data
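
The identifier forms used above (a bare server UUID, `repo/accession`, and `repo/accession[file1,file2]`) can be told apart with a few lines of standard-library Python. The sketch below is illustrative only: `DatasetSpec` and `parse_spec` are hypothetical names, not part of the msdatasets API, and the grammar is inferred from the examples. The real client may parse or validate specs differently.

```python
import re
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetSpec:
    """Hypothetical container for a parsed dataset identifier."""
    kind: str                       # "uuid" or "repo"
    identifier: str                 # UUID string or repository accession
    repo: Optional[str] = None      # e.g. "pride" or "massive"
    filenames: List[str] = field(default_factory=list)


# repo/accession with an optional [file1,file2] suffix (assumed grammar)
_REPO_SPEC = re.compile(
    r"^(?P<repo>[A-Za-z]+)/(?P<acc>[^\[\]/]+)(?:\[(?P<files>[^\[\]]*)\])?$"
)


def parse_spec(spec: str) -> DatasetSpec:
    """Classify a spec string as a server UUID or a repository accession."""
    try:
        return DatasetSpec(kind="uuid", identifier=str(uuid.UUID(spec)))
    except ValueError:
        pass  # not a UUID, fall through to the repository form
    match = _REPO_SPEC.match(spec)
    if match is None:
        raise ValueError(f"unrecognized dataset spec: {spec!r}")
    # Split the optional bracketed list on commas, dropping empty entries.
    files = [f.strip() for f in (match["files"] or "").split(",") if f.strip()]
    return DatasetSpec("repo", match["acc"], match["repo"].lower(), files)
```

For example, `parse_spec("pride/PXD075509[19HCD_3.mzML]")` yields a spec with `repo="pride"`, `identifier="PXD075509"`, and a single filename.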

Python

from msdatasets import download_dataset, download_repo_dataset

# By UUID
ds = download_dataset("550e8400-e29b-41d4-a716-446655440000")
print(ds.dataset_name, len(ds), "files")
for path in ds:
    ...

# By PRIDE accession (filename subset, stored as mzML)
ds = download_repo_dataset(
    "pride",
    "PXD075509",
    filenames=["19HCD_3.mzML"],
    store_as="mzml",
)

PyTorch

from msdatasets import load_dataset

# Returns an mscompress.datasets.torch.MSCompressDataset.
# Accepts UUIDs and repository specs.
dataset = load_dataset("pride/PXD075509[19HCD_3.mzML]")

Configuration

Environment variable | Purpose                                    | Default
MS_API_URL           | Server base URL                            | https://datasets.lab.gy
MS_DATASETS_CACHE    | Explicit cache directory                   |
MS_HOME              | Alternative cache root ($MS_HOME/datasets) | ~/.ms
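
One plausible reading of these settings is that `MS_DATASETS_CACHE` wins when it is set, and otherwise the cache lives under `$MS_HOME/datasets`, with `MS_HOME` falling back to `~/.ms`. The sketch below encodes that reading. The precedence order and the function name `resolve_cache_dir` are assumptions made for illustration, not documented behavior of the library.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the dataset cache directory under an assumed precedence order."""
    env = os.environ if env is None else env
    explicit = env.get("MS_DATASETS_CACHE")
    if explicit:
        # Assumption: an explicit cache directory overrides everything else.
        return Path(explicit).expanduser()
    # Assumption: otherwise use $MS_HOME/datasets, with MS_HOME defaulting to ~/.ms.
    home = env.get("MS_HOME") or "~/.ms"
    return Path(home).expanduser() / "datasets"
```

Passing the environment as a mapping keeps the function easy to exercise without mutating `os.environ`.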

The full CLI reference, storage-format details, and Python API reference are in the documentation.

Development

git clone https://github.com/chrisagrams/msdatasets.git
cd msdatasets
uv sync --extra dev --extra docs
uv run pre-commit install
uv run pytest

Pre-commit runs ruff, mypy, and pytest (90% coverage gate). CI runs on Python 3.10, 3.11, and 3.12.

License

MIT
