Download, normalize metadata, and convert public sc/snRNA-seq + spatial datasets to standardized .h5ad (AnnData).
h5adify
h5adify is a small Python library + CLI to search, download, and convert public single-cell / spatial datasets into standardized .h5ad (AnnData) with consistent metadata fields (.obs).
It can also merge multiple datasets (even across sources) into a single .h5ad.
Best-effort by design: public portals vary wildly. Some provide direct .h5ad files, others provide 10x MTX/H5 matrices, and many clinical datasets are controlled-access. h5adify focuses on workflows that can be automated reliably without proprietary tooling, so that a very large number of datasets can be downloaded and annotated automatically and homogeneously.
Supported sources
- GEO (GSE/GSM)
  Downloads processed supplementary matrices (10x MTX/H5, etc.) and converts them to .h5ad (does not require SRA).
- CZ CELLxGENE Discover
  Accepts dataset UUIDs or direct .h5ad URLs.
  Search is best-effort (the API schema can vary and may return different JSON shapes depending on endpoint/proxy).
- Zenodo
  Best-effort download via public endpoints / direct file links (when available).
- UCSC Cell Browser (single-cell + some spatial datasets)
  Search via the UCSC dataset registry, and download when a dataset exposes a direct .h5ad in the dataset directory.
- EMA (EBI): BioStudies / ArrayExpress
  Search via the EBI BioStudies API (ArrayExpress collection).
  Download works only when a study provides an attached .h5ad file.
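Because each source expects a differently shaped identifier, it can help to sanity-check an ID before passing it to the CLI. The sketch below is not part of h5adify: `guess_source` and its regular expressions are hypothetical, inferred only from the identifier examples in this README (GSE/GSM accessions, CELLxGENE UUIDs, ArrayExpress-style `E-XXXX-n` accessions, direct `.h5ad` URLs).

```python
import re

# Hypothetical patterns, inferred from the example identifiers in this README.
_GEO = re.compile(r"^GS[EM]\d+$", re.IGNORECASE)
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)
_ARRAYEXPRESS = re.compile(r"^E-[A-Z]{4}-\d+$", re.IGNORECASE)


def guess_source(identifier: str) -> str:
    """Return a best-guess source label for an identifier string."""
    ident = identifier.strip()
    if _GEO.match(ident):
        return "geo"
    if _UUID.match(ident):
        return "cellxgene"
    if _ARRAYEXPRESS.match(ident):
        return "ema"
    lowered = ident.lower()
    if lowered.startswith(("http://", "https://")) and lowered.endswith(".h5ad"):
        return "direct_h5ad"
    # Free-text IDs (e.g. UCSC dataset names) cannot be recognized by shape.
    return "unknown"
```

Identifiers that carry no recognizable shape, such as UCSC dataset names, fall through to `"unknown"`, so the source still has to be given explicitly in those cases.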
Install (local)
1) Clone + venv
git clone <your-fork-or-local-repo>
cd h5adify
python -m venv .venv
source .venv/bin/activate
pip install -U pip
2) Install h5adify
pip install -e . # core
pip install -e ".[docs]" # docs build dependencies (optional)
Install (from pip)
pip install h5adify
Quickstart (CLI)
1) Search datasets
# GEO
h5adify search geo --query "human brain spatial transcriptomics" --max-results 20
# CELLxGENE
h5adify search cellxgene --query "human brain spatial transcriptomics" --max-results 20
# UCSC Cell Browser
h5adify search ucsc --query "human hippocampus" --max-results 20
# EMA / EBI (BioStudies / ArrayExpress)
h5adify search ema --query "single cell brain" --max-results 20
2) Download + convert (per dataset -> one .h5ad)
# GEO: converts all samples with parseable supplementary matrices
h5adify download geo --gse GSE229409 --outdir data/out
# CELLxGENE: dataset UUID or direct .h5ad URL
h5adify download cellxgene --id e52ed1cc-d59f-4bf5-9716-8d81f14a89fd --outdir data/out
h5adify download cellxgene --id https://datasets.cellxgene.cziscience.com/e52ed1cc-d59f-4bf5-9716-8d81f14a89fd.h5ad --outdir data/out
# SODB: dataset-level (downloads all experiments -> one merged file)
h5adify download sodb --id "Mouse brain atlas" --outdir data/out
# SODB: single experiment
h5adify download sodb --id "Mouse brain atlas::exp_001" --outdir data/out
# UCSC: dataset id from search results (download works when a .h5ad is exposed)
h5adify download ucsc --id human-hippo-axis --outdir data/out
# EMA: E-MTAB / E-XXXX study accession (download works when an attached .h5ad is present)
h5adify download ema --id E-MTAB-XXXX --outdir data/out
3) Multi-source batch + merge
h5adify batch \
--ids geo:GSE229409 \
cellxgene:e52ed1cc-d59f-4bf5-9716-8d81f14a89fd \
sodb:"Mouse brain atlas::exp_001" \
--outdir data/out \
--merge-out data/out/merged_all.h5ad
4) Batch multiple files from different databases
h5adify batch --ids geo:GSE229409 \
cellxgene:e52ed1cc-d59f-4bf5-9716-8d81f14a89fd \
--outdir data/out \
--merge-out data/out/merged.h5ad
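The `--ids` values above follow a `source:id` shape. As an illustration of how such a spec can be split safely, here is a minimal sketch; `DatasetSpec` and `parse_spec` are hypothetical names and this is not h5adify's own parser. It splits on the first colon only, so IDs that themselves contain colons (a URL, or an SODB `dataset::experiment` ID) survive intact.

```python
from typing import NamedTuple


class DatasetSpec(NamedTuple):
    source: str
    dataset_id: str


def parse_spec(spec: str) -> DatasetSpec:
    """Split 'source:id' on the FIRST colon; the id may contain more colons."""
    source, sep, dataset_id = spec.partition(":")
    if not sep or not source.strip() or not dataset_id.strip():
        raise ValueError(f"expected 'source:id', got {spec!r}")
    return DatasetSpec(source.strip().lower(), dataset_id.strip())


specs = [
    parse_spec("geo:GSE229409"),
    parse_spec("sodb:Mouse brain atlas::exp_001"),
]
```

Using `str.partition` rather than `str.split(":")` is the design choice that matters here: a plain split would break the SODB experiment separator and any `https://` URL.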
5) Build a manifest for a folder of .h5ad files
h5adify manifest --root data/stereo_seq_mouse_embryo/ \
--out data/stereo_seq_mouse_embryo/out
This writes a .csv file and a .jsonl file, which make it easy to analyze the metadata of a large collection of samples.
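A minimal sketch of reading such a manifest back in Python, using only the standard library. It assumes (this is not confirmed by the README) that the .jsonl file holds one JSON object per line with fields such as `filename` and `has_spatial`, like the records printed by `h5adify query`; the function names and the manifest path you pass in are hypothetical.

```python
import json
from pathlib import Path


def load_manifest(jsonl_path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def spatial_files(records):
    """Return filenames of records flagged as having spatial coordinates.

    Assumes each record carries 'filename' and 'has_spatial' keys.
    """
    return [r["filename"] for r in records if r.get("has_spatial")]
```

The same records can be passed to any table library for larger-scale filtering; plain dicts are used here to keep the example dependency-free.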
6) Query the metadata of a folder of .h5ad files
In this example, the folder contains two .h5ad files:
h5adify query --root data/stereo_seq_mouse_embryo/
UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
[
{
"path": "data/stereo_seq_mouse_embryo/mouse_embryo_all_slices.h5ad",
"filename": "mouse_embryo_all_slices.h5ad",
"n_obs": 176711,
"n_vars": 1923,
"x_dtype": "float32",
"is_sparse": false,
"has_raw_counts": false,
"has_spatial": true,
"layers": "count,norm",
"obsm": "spatial,spatial_aligned,spatial_pair",
"source": "",
"dataset_id": "",
"species": "",
"technology": "",
"condition": "",
"disease": "",
"batch": "real",
"checksum_sha256": ""
},
{
"path": "data/stereo_seq_mouse_embryo/E16.5_E1S3_cell_bin.h5ad",
"filename": "E16.5_E1S3_cell_bin.h5ad",
"n_obs": 281377,
"n_vars": 28103,
"x_dtype": "float32",
"is_sparse": false,
"has_raw_counts": false,
"has_spatial": true,
"layers": "counts",
"obsm": "spatial",
"source": "",
"dataset_id": "",
"species": "",
"technology": "",
"condition": "",
"disease": "",
"batch": "",
"checksum_sha256": ""
}
]
7) Inspect the metadata of a single .h5ad file
h5adify inspect --path data/stereo_seq_mouse_embryo/mouse_embryo_all_slices.h5ad
UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
{
"path": "/home/aalentorn/Téléchargements/data/stereo_seq_mouse_embryo/mouse_embryo_all_slices.h5ad",
"n_obs": 176711,
"n_vars": 1923,
"obs_cols": [
"n_genes_by_counts",
"log1p_n_genes_by_counts",
"total_counts",
"log1p_total_counts",
"annotation"
],
"var_cols": [],
"layers": [
"count",
"norm"
],
"obsm": [
"spatial",
"spatial_aligned",
"spatial_pair"
],
"uns": [],
"has_spatial": true,
"has_raw_counts": false,
"x_dtype": "float32",
"x_is_sparse": false,
"missing_std_fields": {
"source": 1.0,
"dataset_id": 1.0,
"species": 1.0,
"technology": 1.0,
"sex": 1.0,
"age": 1.0,
"condition": 1.0,
"disease": 1.0,
"batch": 0.0
}
}
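The `missing_std_fields` values in the output above look like per-field fractions of observations with no value (1.0 = entirely missing, 0.0 = fully populated). That reading is an assumption, not something the README states. Under that assumption, a comparable number can be computed as in this sketch, where `obs` is a plain mapping from column name to a list of values and `missing_fraction` is a hypothetical helper, not h5adify's implementation.

```python
STD_FIELDS = [
    "source", "dataset_id", "species", "technology",
    "sex", "age", "condition", "disease", "batch",
]


def _is_missing(value):
    """Treat None, NaN, and empty/whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    return str(value).strip() == ""


def missing_fraction(obs, fields=STD_FIELDS):
    """Return {field: fraction of missing values}; absent columns count as 1.0."""
    result = {}
    for field in fields:
        values = obs.get(field)
        if not values:
            result[field] = 1.0
            continue
        result[field] = sum(_is_missing(v) for v in values) / len(values)
    return result
```

For a real AnnData object, each column could be passed as `list(adata.obs[col])`; that conversion is left out so the sketch runs without extra dependencies.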
Standardized metadata (.obs)
By default, h5adify tries to fill a standard set of .obs fields where possible, e.g.:
species
technology
sex
age
condition
disease
batch
source
dataset_id
You can override any field with the repeatable --set option:
h5adify download geo --gse GSE229409 --outdir data/out \
--set species=human --set condition=control --set technology=10x_visium
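For readers wondering how a repeatable `KEY=VALUE` option of this kind is typically collected, the sketch below uses `argparse` with `action="append"`. It illustrates the general pattern only and is not h5adify's actual CLI code; `parse_overrides` and the `overrides` destination name are hypothetical.

```python
import argparse


def parse_overrides(pairs):
    """Turn ['species=human', ...] into {'species': 'human', ...}."""
    overrides = {}
    for pair in pairs or []:
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        # Later occurrences of the same key overwrite earlier ones.
        overrides[key.strip()] = value.strip()
    return overrides


parser = argparse.ArgumentParser()
# action="append" collects every occurrence of --set into one list.
parser.add_argument("--set", dest="overrides", action="append",
                    default=[], metavar="KEY=VALUE")

args = parser.parse_args(
    ["--set", "species=human", "--set", "condition=control",
     "--set", "technology=10x_visium"]
)
overrides = parse_overrides(args.overrides)
```

Splitting on the first `=` only means values that contain `=` themselves are preserved unchanged.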
Python usage (notebook)
from h5adify import download, merge_h5ads
# Download one dataset into standardized .h5ad
paths = download("geo", gse="GSE229409", outdir="data/out")
# Merge multiple .h5ad files
merged = merge_h5ads(["data/out/A.h5ad", "data/out/B.h5ad"], join="outer")
merged.write_h5ad("data/out/merged.h5ad")
Notes on GEO (GSE) conversion
h5adify download geo focuses on processed supplementary matrices (e.g., 10x MTX/H5).
If a GEO series only provides raw SRA, you’ll need a dedicated pipeline (SRA → FASTQ → CellRanger/STARsolo → matrix). h5adify will detect “raw-only” cases and explain what’s missing.
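To make the "processed versus raw-only" distinction concrete, here is a rough filename-based heuristic. It is a hypothetical illustration, not the detection logic h5adify uses: the suffix lists and the function names `classify` and `is_raw_only` are assumptions chosen for this example.

```python
# Hypothetical suffix lists; real GEO supplementary files are more varied.
PROCESSED_SUFFIXES = (".mtx", ".mtx.gz", ".h5", ".h5ad")
RAW_SUFFIXES = (".sra", ".fastq", ".fastq.gz", ".fq", ".fq.gz", ".bam")


def classify(filename):
    """Label a supplementary file as 'processed', 'raw', or 'other' by suffix."""
    name = filename.lower()
    if name.endswith(PROCESSED_SUFFIXES):
        return "processed"
    if name.endswith(RAW_SUFFIXES):
        return "raw"
    return "other"


def is_raw_only(filenames):
    """True when no file looks like a processed matrix (also true for an empty list)."""
    return all(classify(name) != "processed" for name in filenames)
```

A series flagged this way would need the SRA → FASTQ → alignment pipeline described above before any .h5ad can be produced.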
License
MIT
File details
Details for the file h5adify-0.1.1.tar.gz.
File metadata
- Download URL: h5adify-0.1.1.tar.gz
- Upload date:
- Size: 26.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c94dc21f7093d797559c15eeb1eca6a37e84621371e779d8e94917342b4a3292 |
| MD5 | 3f9d7b59a8c98d8616540f56d98627e7 |
| BLAKE2b-256 | 77583f7729f1d827161b16affea8ebcc05636c47f06b0d5b822ebd3d91b8fa33 |
File details
Details for the file h5adify-0.1.1-py3-none-any.whl.
File metadata
- Download URL: h5adify-0.1.1-py3-none-any.whl
- Upload date:
- Size: 31.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | beb408f871d43ff55118357e1f95884682c243ec2ee7487a1028ceb3cfe7b897 |
| MD5 | 9bdd8b0d923a9ecf1ba4cd1aefc76856 |
| BLAKE2b-256 | ccf93f3970d7456dc6b01b3620e82ead41fa7560973ff11bc583adf08cf8efa1 |
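The SHA256 digests listed in the file details can be checked against a locally downloaded archive. The sketch below uses only `hashlib` from the standard library; the helper names `sha256_of` and `matches` are hypothetical, and the file path is whatever location you saved the archive to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Reading in fixed-size chunks keeps memory use flat, which matters more for large .h5ad files than for the small package archives listed here.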