Download, normalize metadata, and convert public sc/snRNA-seq + spatial datasets to standardized .h5ad (AnnData).

h5adify

h5adify is a small Python library + CLI to search, download, and convert public single-cell / spatial datasets into standardized .h5ad (AnnData) with consistent metadata fields (.obs).
It can also merge multiple datasets (even across sources) into a single .h5ad.

Best-effort by design: public portals vary wildly. Some provide direct .h5ad files, others provide 10x MTX/H5 matrices, and many clinical datasets are controlled-access. h5adify focuses on workflows that can be automated reliably without proprietary tooling, so that a very large number of datasets can be downloaded and annotated automatically and homogeneously.

Supported sources

GEO (GSE/GSM)
Downloads processed supplementary matrices (10x MTX/H5, etc.) and converts them to .h5ad (does not require SRA).
CZ CELLxGENE Discover
Accepts dataset UUIDs or direct .h5ad URLs.
Search is best-effort (API schema can vary and may return different JSON shapes depending on endpoint/proxy).
Zenodo
Best-effort download via public endpoints / direct file links (when available).
UCSC Cell Browser (single-cell + some spatial datasets)
Search via UCSC dataset registry, and download when a dataset exposes a direct .h5ad in the dataset directory.
EMA (EBI) — BioStudies / ArrayExpress
Search via EBI BioStudies API (ArrayExpress collection).
Download works only when a study provides an attached .h5ad file.

Install (local)

1) Clone + venv

git clone <your-fork-or-local-repo>
cd h5adify
python -m venv .venv
source .venv/bin/activate
pip install -U pip

2) Install h5adify

```bash
pip install -e .          # core
pip install -e ".[docs]"  # docs build dependencies (optional)
```

Install (from pip)

pip install h5adify

Quickstart (CLI)

1) Search datasets

# GEO
h5adify search geo --query "human brain spatial transcriptomics" --max-results 20

# CELLxGENE
h5adify search cellxgene --query "human brain spatial transcriptomics" --max-results 20

# UCSC Cell Browser
h5adify search ucsc --query "human hippocampus" --max-results 20

# EMA / EBI (BioStudies / ArrayExpress)
h5adify search ema --query "single cell brain" --max-results 20

2) Download + convert (per dataset -> one .h5ad)

# GEO: converts all samples with parseable supplementary matrices
h5adify download geo --gse GSE229409 --outdir data/out

# CELLxGENE: dataset UUID or direct .h5ad URL
h5adify download cellxgene --id e52ed1cc-d59f-4bf5-9716-8d81f14a89fd --outdir data/out
h5adify download cellxgene --id https://datasets.cellxgene.cziscience.com/e52ed1cc-d59f-4bf5-9716-8d81f14a89fd.h5ad --outdir data/out

# SODB: dataset-level (downloads all experiments -> one merged file)
h5adify download sodb --id "Mouse brain atlas" --outdir data/out

# SODB: single experiment
h5adify download sodb --id "Mouse brain atlas::exp_001" --outdir data/out

# UCSC: dataset id from search results (download works when a .h5ad is exposed)
h5adify download ucsc --id human-hippo-axis --outdir data/out

# EMA: E-MTAB / E-XXXX study accession (download works when an attached .h5ad is present)
h5adify download ema --id E-MTAB-XXXX --outdir data/out
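
The CELLxGENE commands above pass either a dataset UUID or a direct .h5ad URL to --id. A minimal sketch of how a wrapper script might tell the two apart before calling the CLI is shown here. The function name `classify_cellxgene_id` is hypothetical and not part of h5adify; it only uses the Python standard library.

```python
import uuid
from urllib.parse import urlparse


def classify_cellxgene_id(value: str) -> str:
    """Return 'url' for a direct .h5ad link, 'uuid' for a dataset UUID.

    Hypothetical helper for illustration; h5adify may decide differently.
    """
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.path.endswith(".h5ad"):
        return "url"
    try:
        uuid.UUID(value)  # raises ValueError for anything that is not a UUID
    except ValueError:
        raise ValueError(f"not a dataset UUID or .h5ad URL: {value!r}") from None
    return "uuid"


print(classify_cellxgene_id("e52ed1cc-d59f-4bf5-9716-8d81f14a89fd"))
print(classify_cellxgene_id(
    "https://datasets.cellxgene.cziscience.com/e52ed1cc-d59f-4bf5-9716-8d81f14a89fd.h5ad"
))
```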

3) Multi-source batch + merge

h5adify batch \
  --ids geo:GSE229409 \
       cellxgene:e52ed1cc-d59f-4bf5-9716-8d81f14a89fd \
       sodb:"Mouse brain atlas::exp_001" \
  --outdir data/out \
  --merge-out data/out/merged_all.h5ad
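
Each entry passed to --ids has the form source:id. Because some identifiers contain colons themselves (the SODB experiment id uses `::`, and CELLxGENE accepts URLs), a parser should split on the first colon only. The sketch below illustrates that idea; `DatasetSpec` and `parse_spec` are hypothetical names, not h5adify's own API.

```python
from typing import NamedTuple


class DatasetSpec(NamedTuple):
    source: str
    dataset_id: str


def parse_spec(spec: str) -> DatasetSpec:
    """Split 'source:id' on the first colon, keeping later colons in the id."""
    source, sep, dataset_id = spec.partition(":")
    if not sep or not source or not dataset_id:
        raise ValueError(f"expected 'source:id', got {spec!r}")
    return DatasetSpec(source.lower(), dataset_id)


specs = [
    "geo:GSE229409",
    "cellxgene:e52ed1cc-d59f-4bf5-9716-8d81f14a89fd",
    "sodb:Mouse brain atlas::exp_001",
]
for spec in specs:
    print(parse_spec(spec))
```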

4) Batch multiple files from different databases

h5adify batch \
  --ids geo:GSE229409 \
        cellxgene:e52ed1cc-d59f-4bf5-9716-8d81f14a89fd \
  --outdir data/out \
  --merge-out data/out/merged.h5ad

5) Build a manifest from a folder of .h5ad files

h5adify manifest --root data/stereo_seq_mouse_embryo/ \
                 --out data/stereo_seq_mouse_embryo/out

This writes a .csv file and a .jsonl file, which let you analyze the metadata of a large set of samples.

6) Query the metadata of a list of h5ad files

In this example, the folder contains two .h5ad files:

h5adify query --root data/stereo_seq_mouse_embryo/
UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
[
  {
    "path": "data/stereo_seq_mouse_embryo/mouse_embryo_all_slices.h5ad",
    "filename": "mouse_embryo_all_slices.h5ad",
    "n_obs": 176711,
    "n_vars": 1923,
    "x_dtype": "float32",
    "is_sparse": false,
    "has_raw_counts": false,
    "has_spatial": true,
    "layers": "count,norm",
    "obsm": "spatial,spatial_aligned,spatial_pair",
    "source": "",
    "dataset_id": "",
    "species": "",
    "technology": "",
    "condition": "",
    "disease": "",
    "batch": "real",
    "checksum_sha256": ""
  },
  {
    "path": "data/stereo_seq_mouse_embryo/E16.5_E1S3_cell_bin.h5ad",
    "filename": "E16.5_E1S3_cell_bin.h5ad",
    "n_obs": 281377,
    "n_vars": 28103,
    "x_dtype": "float32",
    "is_sparse": false,
    "has_raw_counts": false,
    "has_spatial": true,
    "layers": "counts",
    "obsm": "spatial",
    "source": "",
    "dataset_id": "",
    "species": "",
    "technology": "",
    "condition": "",
    "disease": "",
    "batch": "",
    "checksum_sha256": ""
  }
]
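
The JSON above can be post-processed to see which standardized fields still need values. The following is a minimal sketch that assumes the query output has been captured as a JSON string; the record contents are a trimmed copy of the output shown above, and `empty_fields` is a hypothetical helper.

```python
import json

# Metadata keys that appear in the `h5adify query` output shown above.
STD_FIELDS = ["source", "dataset_id", "species", "technology",
              "condition", "disease", "batch"]


def empty_fields(record):
    """Return the standardized fields that are missing or empty in one record."""
    return [field for field in STD_FIELDS if not record.get(field)]


# Trimmed copy of the two records printed above.
query_output = """
[
  {"filename": "mouse_embryo_all_slices.h5ad", "source": "", "dataset_id": "",
   "species": "", "technology": "", "condition": "", "disease": "", "batch": "real"},
  {"filename": "E16.5_E1S3_cell_bin.h5ad", "source": "", "dataset_id": "",
   "species": "", "technology": "", "condition": "", "disease": "", "batch": ""}
]
"""

report = {rec["filename"]: empty_fields(rec) for rec in json.loads(query_output)}
for filename, fields in report.items():
    print(f"{filename}: {', '.join(fields) or 'complete'}")
```

Fields reported as empty are natural candidates for the --set overrides described under "Standardized metadata".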

7) Inspect the metadata of a single .h5ad file

h5adify inspect --path data/stereo_seq_mouse_embryo/mouse_embryo_all_slices.h5ad
UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")

{
  "path": "/home/aalentorn/Téléchargements/data/stereo_seq_mouse_embryo/mouse_embryo_all_slices.h5ad",
  "n_obs": 176711,
  "n_vars": 1923,
  "obs_cols": [
    "n_genes_by_counts",
    "log1p_n_genes_by_counts",
    "total_counts",
    "log1p_total_counts",
    "annotation"
  ],
  "var_cols": [],
  "layers": [
    "count",
    "norm"
  ],
  "obsm": [
    "spatial",
    "spatial_aligned",
    "spatial_pair"
  ],
  "uns": [],
  "has_spatial": true,
  "has_raw_counts": false,
  "x_dtype": "float32",
  "x_is_sparse": false,
  "missing_std_fields": {
    "source": 1.0,
    "dataset_id": 1.0,
    "species": 1.0,
    "technology": 1.0,
    "sex": 1.0,
    "age": 1.0,
    "condition": 1.0,
    "disease": 1.0,
    "batch": 0.0
  }
}

Standardized metadata (`.obs`)

By default, h5adify tries to fill a standard set of .obs fields where possible, e.g.:

species, technology, sex, age, condition, disease, batch, source, dataset_id

You can override any field with the repeatable --set option:

h5adify download geo --gse GSE229409 --outdir data/out \
  --set species=human --set condition=control --set technology=10x_visium
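
To illustrate how a repeatable key=value option of this kind can be collected into a dictionary, here is a small sketch using argparse. It is not h5adify's actual parser; `parse_overrides` and the `overrides` destination name are assumptions made for the example.

```python
import argparse


def parse_overrides(pairs):
    """Turn ['species=human', ...] into {'species': 'human', ...}."""
    overrides = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        overrides[key] = value  # a later --set for the same key wins
    return overrides


parser = argparse.ArgumentParser()
# action="append" collects every occurrence of --set into one list.
parser.add_argument("--set", dest="overrides", action="append",
                    default=[], metavar="KEY=VALUE")

args = parser.parse_args([
    "--set", "species=human",
    "--set", "condition=control",
    "--set", "technology=10x_visium",
])
print(parse_overrides(args.overrides))
```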

Python usage (notebook)

from h5adify import download, merge_h5ads

# Download one dataset into standardized .h5ad
paths = download("geo", gse="GSE229409", outdir="data/out")

# Merge multiple .h5ad files
merged = merge_h5ads(["data/out/A.h5ad", "data/out/B.h5ad"], join="outer")
merged.write_h5ad("data/out/merged.h5ad")
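
The join argument decides which genes end up in the merged file. Assuming it has the same meaning as the join parameter of anndata.concat (an assumption, not confirmed by this README), "outer" keeps the union of the gene names and "inner" keeps only the genes shared by every dataset. The sketch below shows that difference on plain lists of gene names; `merged_var_names` is a hypothetical helper for illustration.

```python
def merged_var_names(var_names_per_dataset, join="outer"):
    """Return the sorted gene names that survive an outer or inner join."""
    sets = [set(names) for names in var_names_per_dataset]
    if not sets:
        return []
    if join == "outer":
        combined = set().union(*sets)        # every gene seen in any dataset
    elif join == "inner":
        combined = set.intersection(*sets)   # only genes present in all datasets
    else:
        raise ValueError(f"join must be 'outer' or 'inner', got {join!r}")
    return sorted(combined)


dataset_a = ["GFAP", "SOX2", "OLIG2"]
dataset_b = ["SOX2", "OLIG2", "MBP"]

print(merged_var_names([dataset_a, dataset_b], join="outer"))
print(merged_var_names([dataset_a, dataset_b], join="inner"))
```

When the inputs differ a lot in gene coverage (for example the 1923 and 28103 variables of the two files queried earlier), an outer join keeps everything at the cost of many unmeasured entries, while an inner join yields a smaller but fully measured matrix.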

Notes on GEO (GSE) conversion

h5adify download geo focuses on processed supplementary matrices (e.g., 10x MTX/H5).

If a GEO series only provides raw SRA data, you’ll need a dedicated pipeline (SRA → FASTQ → Cell Ranger/STARsolo → matrix). h5adify detects such “raw-only” cases and reports what is missing.
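
As a rough illustration of what a "raw-only" check could look like, the sketch below scans a list of supplementary file names for common processed-matrix formats. The patterns and the function `has_processed_matrix` are assumptions made for this example and are not taken from h5adify's implementation; the file names are made up.

```python
import re

# Hypothetical filename patterns for processed matrices; not taken from h5adify.
PROCESSED_PATTERNS = [
    re.compile(r"matrix\.mtx(\.gz)?$", re.IGNORECASE),  # 10x MTX triplet
    re.compile(r"\.h5$", re.IGNORECASE),                # 10x HDF5 matrix
    re.compile(r"\.h5ad(\.gz)?$", re.IGNORECASE),       # ready-made AnnData
]


def has_processed_matrix(filenames):
    """True if at least one file name looks like a processed count matrix."""
    return any(p.search(name) for name in filenames for p in PROCESSED_PATTERNS)


# Made-up file listings for illustration.
with_matrix = ["sample1_barcodes.tsv.gz", "sample1_features.tsv.gz",
               "sample1_matrix.mtx.gz"]
without_matrix = ["README.txt"]

for listing in (with_matrix, without_matrix):
    status = "convertible" if has_processed_matrix(listing) else "raw-only"
    print(listing, "->", status)
```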

License

MIT

Release history

0.1.1 (this version), released Mar 1, 2026

Download files

Source distribution: h5adify-0.1.1.tar.gz (26.5 kB), uploaded Mar 1, 2026
Built distribution: h5adify-0.1.1-py3-none-any.whl (31.6 kB), uploaded Mar 1, 2026, Python 3

File details

Details for the file h5adify-0.1.1.tar.gz.

File metadata

Download URL: h5adify-0.1.1.tar.gz
Upload date: Mar 1, 2026
Size: 26.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for h5adify-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`c94dc21f7093d797559c15eeb1eca6a37e84621371e779d8e94917342b4a3292`
MD5	`3f9d7b59a8c98d8616540f56d98627e7`
BLAKE2b-256	`77583f7729f1d827161b16affea8ebcc05636c47f06b0d5b822ebd3d91b8fa33`

File details

Details for the file h5adify-0.1.1-py3-none-any.whl.

File metadata

Download URL: h5adify-0.1.1-py3-none-any.whl
Upload date: Mar 1, 2026
Size: 31.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for h5adify-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`beb408f871d43ff55118357e1f95884682c243ec2ee7487a1028ceb3cfe7b897`
MD5	`9bdd8b0d923a9ecf1ba4cd1aefc76856`
BLAKE2b-256	`ccf93f3970d7456dc6b01b3620e82ead41fa7560973ff11bc583adf08cf8efa1`
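
The SHA256 digests listed above can be checked against a locally downloaded file. The following sketch uses only hashlib from the standard library; it assumes the files sit in the current directory under the names given above, and `sha256_of` and `verify` are helper names chosen for this example.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the tables above.
EXPECTED_SHA256 = {
    "h5adify-0.1.1.tar.gz":
        "c94dc21f7093d797559c15eeb1eca6a37e84621371e779d8e94917342b4a3292",
    "h5adify-0.1.1-py3-none-any.whl":
        "beb408f871d43ff55118357e1f95884682c243ec2ee7487a1028ceb3cfe7b897",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Compare a file's digest with the expected value for its file name."""
    path = Path(path)
    return sha256_of(path) == EXPECTED_SHA256[path.name]


for name in EXPECTED_SHA256:
    if Path(name).exists():
        print(name, "OK" if verify(name) else "MISMATCH")
    else:
        print(name, "not found in the current directory")
```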
