anndata-metadata

anndata-metadata is a Python library and CLI tool for extracting metadata from AnnData .h5ad files, both locally and on S3. For files on S3, it uses partial downloads, fetching only part of each object instead of the whole file, which makes extraction much faster.

It provides utilities to summarize cell, gene, and matrix information, and supports batch processing of directories.

It can create a .parquet index of the metadata for all of the files in a directory (S3 or local).

Library Overview

The core library is in src/anndata_metadata/ and provides:

Metadata extraction: Functions to extract key metadata (cell count, gene count, matrix format, group contents, etc.) from AnnData .h5ad files.
S3 and local support: Utilities to process files both on local disk and in S3 buckets.
JSON-serializable output: All metadata is returned as Python dictionaries with native types.
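
To illustrate what "native types" means for the output, here is a minimal sketch of a converter that makes a nested metadata dictionary safe for `json.dumps`. The helper name `to_native` and the example keys are hypothetical and not part of the library's API; the sketch assumes NumPy-style scalars and arrays expose `.item()` and `.tolist()`.

```python
import json
from typing import Any


def to_native(value: Any) -> Any:
    """Recursively convert a value into JSON-serializable built-in types.

    Hypothetical helper for illustration; not part of anndata-metadata's API.
    """
    if isinstance(value, dict):
        return {str(k): to_native(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(v) for v in value]
    if isinstance(value, bytes):
        # HDF5 attributes are often stored as bytes.
        return value.decode("utf-8", errors="replace")
    # NumPy arrays (and scalars) expose .tolist(); other scalar wrappers expose .item().
    if hasattr(value, "tolist"):
        return to_native(value.tolist())
    if hasattr(value, "item"):
        return value.item()
    return value


# Example keys are illustrative only.
metadata = {"cell_count": 2700, "matrix_shape": (2700, 32738), "encoding": b"csr_matrix"}
print(json.dumps(to_native(metadata)))
```

Keeping the output to dicts, lists, strings, and numbers means the same structure can be written as JSON for a single file or flattened into a Parquet row without custom encoders.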

Installing

pip install anndata-metadata

CLI Usage

Usage:

usage: anndata-metadata [-h] [-o OBS] [-c COUNT] [-f FILE_LIST] [-p S3_PREFIX]
                        [-m OBS_MAX_CARDINALITY] [-w WORKERS] [-b S3_BLOCK_SIZE]
                        [-e {thread,process}]
                        [input_paths ...] output

Extract AnnData metadata from file(s) or S3 object(s).

positional arguments:
  input_paths           Input file(s), directory, or S3 URI(s)/directory (may be
                        combined with --file-list)
  output                Output filename (JSON for a single file, Parquet for
                        multiple/--file-list, '-' for stdout)

options:
  -h, --help            show this help message and exit
  -o OBS, --obs OBS     Observation column to count (can be specified multiple times)
  -c COUNT, --count COUNT
                        Maximum number of files to process
  -f FILE_LIST, --file-list FILE_LIST
                        Path to a TSV (with a 'file' column) or newline-delimited list
                        of files to index; entries are prefixed with --s3-prefix
  -p S3_PREFIX, --s3-prefix S3_PREFIX
                        Prefix prepended to each --file-list entry (e.g. s3://bucket/prefix/)
  -m OBS_MAX_CARDINALITY, --obs-max-cardinality OBS_MAX_CARDINALITY
                        Auto-count value distributions for every obs column with at most
                        this many distinct values (the obs index column is always skipped)
  -w WORKERS, --workers WORKERS
                        Number of concurrent workers for multi-file/--file-list mode
                        (default 1)
  -b S3_BLOCK_SIZE, --s3-block-size S3_BLOCK_SIZE
                        s3fs read-ahead block size in bytes. Small (e.g. 262144) is best
                        over a high-latency link; larger (e.g. 1048576) is best in-region.
  -e {thread,process}, --executor {thread,process}
                        Concurrency model for multi-file mode. Reading H5AD metadata is
                        GIL-bound, so 'process' scales near-linearly with cores.
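
The cardinality rule behind `--obs-max-cardinality` can be sketched as follows. This is a minimal illustration, assuming the obs table has already been loaded as a plain dict of column name to list of values; the function name `count_low_cardinality` and the `index_column` default are hypothetical and do not describe the tool's internals.

```python
from collections import Counter
from typing import Any


def count_low_cardinality(
    obs: dict[str, list[Any]],
    max_cardinality: int,
    index_column: str = "_index",  # assumed name of the obs index column
) -> dict[str, dict[str, int]]:
    """Count values for every column with at most max_cardinality distinct values."""
    counts: dict[str, dict[str, int]] = {}
    for name, values in obs.items():
        if name == index_column:
            continue  # the index is unique per cell, so its counts carry no information
        distribution = Counter(str(v) for v in values)
        if len(distribution) <= max_cardinality:
            counts[name] = dict(distribution)
    return counts


obs = {
    "_index": ["cell1", "cell2", "cell3"],
    "cell_type": ["B", "T", "T"],
    "n_genes": [812, 1040, 977],
}
print(count_low_cardinality(obs, max_cardinality=2))
# {'cell_type': {'B': 1, 'T': 2}}
```

A threshold like this keeps categorical columns such as cell type or donor while skipping near-unique numeric columns, whose value counts would bloat the index.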

A single input file produces JSON; multiple inputs (several paths, a directory, or --file-list) produce a resumable Parquet index. Extracted metadata includes the detected organism and ensembl_prefix (inferred from Ensembl gene-ID prefixes).
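
As a rough illustration of inferring an organism from Ensembl gene-ID prefixes, the sketch below takes the most common prefix among the gene IDs and looks it up in a small table. The function name, the regular expression, and the three-entry mapping are assumptions made for this example; the library's actual detection logic and prefix table may differ.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Illustrative subset of Ensembl gene-ID prefixes; the library's table may differ.
ORGANISM_BY_PREFIX = {
    "ENSG": "Homo sapiens",
    "ENSMUSG": "Mus musculus",
    "ENSDARG": "Danio rerio",
}

# "ENS", an optional 3-4 letter species code, "G" for gene, then 11 digits.
GENE_ID = re.compile(r"^(ENS(?:[A-Z]{3,4})?G)\d{11}")


def infer_ensembl_prefix(gene_ids: Iterable[str]) -> Optional[str]:
    """Return the most common Ensembl gene-ID prefix, or None if no ID matches."""
    prefixes = Counter(
        m.group(1) for m in map(GENE_ID.match, gene_ids) if m is not None
    )
    if not prefixes:
        return None
    return prefixes.most_common(1)[0][0]


prefix = infer_ensembl_prefix(["ENSMUSG00000059552", "ENSMUSG00000051951", "Gm1992"])
print(prefix, ORGANISM_BY_PREFIX.get(prefix))
# ENSMUSG Mus musculus
```

Taking the majority prefix rather than the first match tolerates a few non-Ensembl identifiers mixed into the gene index.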

Examples:

anndata-metadata data/myfile.h5ad metadata.json
anndata-metadata data/ metadata.parquet
anndata-metadata s3://my-bucket/ metadata.parquet

# Multiple mixed inputs in one run
anndata-metadata a.h5ad b.h5ad s3://bucket/dir/ metadata.parquet

# Index a curated list of files from S3, counting low-cardinality obs columns.
# Resumable: re-running skips files already present in the output Parquet.
# In-region (e.g. on EC2) use process workers + a larger block size:
anndata-metadata --file-list files.tsv \
  --s3-prefix s3://my-bucket/prefix/ \
  --obs-max-cardinality 1000 \
  --workers 12 --executor process --s3-block-size 1048576 \
  metadata.parquet
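
The file-list and resume behavior in the last example can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, a TSV is recognized by a `file` entry in its first line, and the set of already-indexed files is passed in directly rather than read from the existing Parquet output.

```python
import csv
from pathlib import Path


def read_file_list(path: str, s3_prefix: str = "") -> list[str]:
    """Read a TSV with a 'file' column, or a newline-delimited list of files."""
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    if not lines:
        return []
    if "file" in lines[0].split("\t"):
        # Assumption: a header line containing 'file' marks the input as a TSV.
        rows = csv.DictReader(lines, delimiter="\t")
        names = [row["file"] for row in rows]
    else:
        names = [ln.strip() for ln in lines]
    # Every entry gets the prefix, mirroring --s3-prefix.
    return [s3_prefix + name for name in names]


def pending_files(candidates: list[str], already_indexed: set[str]) -> list[str]:
    """Drop files that a previous run already wrote to the index."""
    return [f for f in candidates if f not in already_indexed]
```

With this split, a re-run only needs the set of file names already present in the output to decide what is left to process.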

Development

Setup

This project uses uv for fast Python environment management.

Install dependencies:

uv sync # installs the dependencies you need to run the command
uv sync --group dev # installs the dev dependencies for testing and formatting

Run tests:
```
uv run pytest
```
Format code:
```
uv run yapf --recursive . --in-place
```
Type check (mypy):
```
uv run mypy
```

Run CLI

PYTHONPATH=src uv run python -m anndata_metadata

Build and test the wheel

uv run python -m build

Then test it in a fresh virtual environment:

 python -m venv testenv
 source testenv/bin/activate
 pip install dist/anndata_metadata-*.whl --force-reinstall

You can now run the CLI command like this:

 anndata-metadata

Project Structure

.
├── src/
│ └── anndata_metadata/
│   ├── extract.py # Core metadata extraction logic
│   └── main.py # CLI entry point
├── test/ # Unit tests for extraction functions and CLI
├── README.md # Project documentation
└── pyproject.toml # Project metadata and dependencies

TODO

Add mypy support
Build a wheel and submit it to PyPI
CI/CD pipeline for updating PyPI
Write partial results and skip previously written values
Add module-level documentation

Project details

Release history

0.1.3: Jul 3, 2026
0.1.2: May 21, 2025
0.1.1: May 18, 2025
0.1.0: May 18, 2025

File details

Details for the file anndata_metadata-0.1.3-py3-none-any.whl.

File metadata

Download URL: anndata_metadata-0.1.3-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 13.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.12

File hashes

Hashes for anndata_metadata-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a4044b934c4dd7fae374b451b15127c63f4cf484053ba373c47564449e9c5e32`
MD5	`e1969325b10a67f710e5ce9b4e7422ac`
BLAKE2b-256	`3e73a53ab8f1c7400938c6eb858e69ba17a04ed0da35b656803adf6492ddae6c`
