anndata-metadata
anndata-metadata is a Python library and CLI tool for extracting metadata from AnnData .h5ad files, both locally and on S3. When reading from S3, it downloads only the parts of each file it needs instead of the whole file, which makes extraction much faster.
It provides utilities to summarize cell, gene, and matrix information, and supports batch processing of directories.
It can create a .parquet index of the metadata for all of the files in a directory (S3 or local).
Library Overview
The core library is in src/anndata_metadata/ and provides:
- Metadata extraction: Functions to extract key metadata (cell count, gene count, matrix format, group contents, etc.) from AnnData .h5ad files.
- S3 and local support: Utilities to process files both on local disk and in S3 buckets.
- JSON-serializable output: All metadata is returned as Python dictionaries with native types.
Installing
pip install anndata-metadata
CLI Usage
Usage:
usage: anndata-metadata [-h] [-o OBS] [-c COUNT] [-f FILE_LIST] [-p S3_PREFIX]
[-m OBS_MAX_CARDINALITY] [-w WORKERS] [-b S3_BLOCK_SIZE]
[-e {thread,process}]
[input_paths ...] output
Extract AnnData metadata from file(s) or S3 object(s).
positional arguments:
input_paths Input file(s), directory, or S3 URI(s)/directory (may be
combined with --file-list)
output Output filename (JSON for a single file, Parquet for
multiple/--file-list, '-' for stdout)
options:
-h, --help show this help message and exit
-o OBS, --obs OBS Observation column to count (can be specified multiple times)
-c COUNT, --count COUNT
Maximum number of files to process
-f FILE_LIST, --file-list FILE_LIST
Path to a TSV (with a 'file' column) or newline-delimited list
of files to index; entries are prefixed with --s3-prefix
-p S3_PREFIX, --s3-prefix S3_PREFIX
Prefix prepended to each --file-list entry (e.g. s3://bucket/prefix/)
-m OBS_MAX_CARDINALITY, --obs-max-cardinality OBS_MAX_CARDINALITY
Auto-count value distributions for every obs column with at most
this many distinct values (the obs index column is always skipped)
-w WORKERS, --workers WORKERS
Number of concurrent workers for multi-file/--file-list mode
(default 1)
-b S3_BLOCK_SIZE, --s3-block-size S3_BLOCK_SIZE
s3fs read-ahead block size in bytes. Small (e.g. 262144) is best
over a high-latency link; larger (e.g. 1048576) is best in-region.
-e {thread,process}, --executor {thread,process}
Concurrency model for multi-file mode. Reading H5AD metadata is
GIL-bound, so 'process' scales near-linearly with cores.
A single input file produces JSON; multiple inputs (several paths, a directory, or
--file-list) produce a resumable Parquet index. Extracted metadata includes the
detected organism and ensembl_prefix (inferred from Ensembl gene-ID prefixes).
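As a rough illustration of how an organism could be inferred from Ensembl gene-ID prefixes, here is a minimal sketch. It is not the library's actual implementation: the function name `infer_ensembl_prefix`, the `KNOWN_PREFIXES` table, and the majority-vote rule are all assumptions made for this example.

```python
import re
from collections import Counter

# Illustrative subset of Ensembl gene-ID prefixes (assumption: the real
# tool may recognise a different or larger set).
KNOWN_PREFIXES = {
    "ENSG": "Homo sapiens",
    "ENSMUSG": "Mus musculus",
    "ENSRNOG": "Rattus norvegicus",
    "ENSDARG": "Danio rerio",
}

# An Ensembl gene ID is a prefix ending in "G" followed by 11 digits,
# optionally with a ".version" suffix (e.g. ENSG00000141510.17).
_GENE_ID = re.compile(r"^(ENS[A-Z]*G)\d{11}(?:\.\d+)?$")


def infer_ensembl_prefix(gene_ids):
    """Return (prefix, organism) for the most common Ensembl prefix.

    Hypothetical helper: IDs that do not look like Ensembl gene IDs are
    ignored; (None, None) is returned when nothing matches.
    """
    counts = Counter()
    for gene_id in gene_ids:
        match = _GENE_ID.match(gene_id)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return None, None
    prefix, _ = counts.most_common(1)[0]
    # Unknown prefixes still return the prefix, with organism None.
    return prefix, KNOWN_PREFIXES.get(prefix)
```

Taking the most common prefix rather than the first one makes the guess tolerant of a few stray or non-Ensembl identifiers in the gene index.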
Examples:
anndata-metadata data/myfile.h5ad metadata.json
anndata-metadata data/ metadata.parquet
anndata-metadata s3://my-bucket/ metadata.parquet
# Multiple mixed inputs in one run
anndata-metadata a.h5ad b.h5ad s3://bucket/dir/ metadata.parquet
# Index a curated list of files from S3, counting low-cardinality obs columns.
# Resumable: re-running skips files already present in the output Parquet.
# In-region (e.g. on EC2) use process workers + a larger block size:
anndata-metadata --file-list files.tsv \
--s3-prefix s3://my-bucket/prefix/ \
--obs-max-cardinality 1000 \
--workers 12 --executor process --s3-block-size 1048576 \
metadata.parquet
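The --file-list and resume behaviour used in the last example can be pictured with the sketch below. It only mirrors what the help text describes (a TSV with a 'file' column or a newline-delimited list, each entry prefixed with --s3-prefix, and already-indexed files skipped on re-run). The helper names `read_file_list` and `pending_files` are hypothetical and the tool's real parsing rules may differ.

```python
import csv
import io


def read_file_list(text, s3_prefix=""):
    """Parse a file list and prepend s3_prefix to every entry.

    Assumption: the list is treated as a TSV when its first line has a
    'file' column; otherwise each non-empty line is one entry.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    if "file" in lines[0].split("\t"):
        reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
        entries = [row["file"] for row in reader]
    else:
        entries = [line.strip() for line in lines]
    return [s3_prefix + entry for entry in entries]


def pending_files(all_files, already_indexed):
    """Return the files that still need processing, preserving order.

    already_indexed stands in for the paths found in an existing output
    Parquet; reading that file is left out to keep the sketch stdlib-only.
    """
    done = set(already_indexed)
    return [path for path in all_files if path not in done]
```

For example, `read_file_list("file\tsize\na.h5ad\t10\n", "s3://my-bucket/prefix/")` yields `["s3://my-bucket/prefix/a.h5ad"]`.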
Development
Setup
This project uses uv for fast Python environment management.
- Install dependencies:
uv sync              # installs the dependencies needed to run the command
uv sync --group dev  # installs the dev dependencies for testing and formatting
- Run tests:
uv run pytest
- Format code:
uv run yapf --recursive . --in-place
- Type check (mypy):
uv run mypy
- Run the CLI:
PYTHONPATH=src uv run python -m anndata_metadata
- Build and test the wheel:
uv run python -m build
Then test it using:
python -m venv testenv
source testenv/bin/activate
pip install dist/anndata_metadata-*.whl --force-reinstall
You will now be able to run the CLI command like this:
anndata-metadata
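Once the command is installed, it can also be driven from a Python script. The sketch below builds the argument list from the flags documented in the CLI help and reads the result back from stdout by passing '-' as the output. The function names and the `cell_type` column are hypothetical, and it assumes that stdout contains only the JSON document for a single input file.

```python
import json
import subprocess


def build_argv(input_path, obs_columns=(), output="-"):
    """Assemble the anndata-metadata command line for one input.

    Uses only options shown in the CLI help: --obs (repeatable) and the
    positional input/output arguments, with '-' meaning stdout.
    """
    argv = ["anndata-metadata"]
    for column in obs_columns:
        argv += ["--obs", column]
    argv += [input_path, output]
    return argv


def extract_metadata(input_path, obs_columns=()):
    """Run the CLI on a single file and parse its JSON output.

    Assumption: with '-' as the output, nothing but JSON is written to
    stdout. check=True raises CalledProcessError on a non-zero exit.
    """
    result = subprocess.run(
        build_argv(input_path, obs_columns),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Calling `extract_metadata("data/myfile.h5ad", ["cell_type"])` would then return the metadata as a Python dictionary, provided the file exists and the command is on the PATH.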
Project Structure
.
├── src/
│ └── anndata_metadata/
│ ├── extract.py # Core metadata extraction logic
│ └── main.py # CLI entry point
├── test/ # Unit tests for extraction functions and CLI
├── README.md # Project documentation
└── pyproject.toml # Project metadata and dependencies
TODO
- add mypy support
- add a wheel and submit to PyPI
- CI/CD pipeline for updating PyPI
- write partial results and skip previously written values
- Add module level documentation