anndata-metadata

anndata-metadata is a Python library and CLI tool for extracting metadata from AnnData .h5ad files, both locally and on S3. When extracting metadata from S3, it uses partial downloads instead of fetching whole files, which dramatically speeds up extraction.

It provides utilities to summarize cell, gene, and matrix information, and supports batch processing of directories.

It can create a .parquet index of the metadata for all of the files in a directory (S3 or local).

Library Overview

The core library is in src/anndata_metadata/ and provides:

  • Metadata extraction: Functions to extract key metadata (cell count, gene count, matrix format, group contents, etc.) from AnnData .h5ad files.
  • S3 and local support: Utilities to process files both on local disk and in S3 buckets.
  • JSON-serializable output: All metadata is returned as Python dictionaries with native types.

Installing

pip install anndata-metadata

CLI Usage

Usage:

usage: anndata-metadata [-h] [-o OBS] [-c COUNT] [-f FILE_LIST] [-p S3_PREFIX]
                        [-m OBS_MAX_CARDINALITY] [-w WORKERS] [-b S3_BLOCK_SIZE]
                        [-e {thread,process}]
                        [input_paths ...] output

Extract AnnData metadata from file(s) or S3 object(s).

positional arguments:
  input_paths           Input file(s), directory, or S3 URI(s)/directory (may be
                        combined with --file-list)
  output                Output filename (JSON for a single file, Parquet for
                        multiple/--file-list, '-' for stdout)

options:
  -h, --help            show this help message and exit
  -o OBS, --obs OBS     Observation column to count (can be specified multiple times)
  -c COUNT, --count COUNT
                        Maximum number of files to process
  -f FILE_LIST, --file-list FILE_LIST
                        Path to a TSV (with a 'file' column) or newline-delimited list
                        of files to index; entries are prefixed with --s3-prefix
  -p S3_PREFIX, --s3-prefix S3_PREFIX
                        Prefix prepended to each --file-list entry (e.g. s3://bucket/prefix/)
  -m OBS_MAX_CARDINALITY, --obs-max-cardinality OBS_MAX_CARDINALITY
                        Auto-count value distributions for every obs column with at most
                        this many distinct values (the obs index column is always skipped)
  -w WORKERS, --workers WORKERS
                        Number of concurrent workers for multi-file/--file-list mode
                        (default 1)
  -b S3_BLOCK_SIZE, --s3-block-size S3_BLOCK_SIZE
                        s3fs read-ahead block size in bytes. Small (e.g. 262144) is best
                        over a high-latency link; larger (e.g. 1048576) is best in-region.
  -e {thread,process}, --executor {thread,process}
                        Concurrency model for multi-file mode. Reading H5AD metadata is
                        GIL-bound, so 'process' scales near-linearly with cores.
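
The --obs-max-cardinality rule described above can be pictured with a small sketch. This is not the tool's implementation: it assumes the obs columns are already available as plain Python lists keyed by column name, and the function name count_low_cardinality, its parameters, and the sample columns are hypothetical.

```python
from collections import Counter


def count_low_cardinality(obs_columns, max_cardinality, index_column=None):
    """Count value distributions for columns with few distinct values.

    obs_columns is a hypothetical mapping of column name -> list of values.
    Columns with more than max_cardinality distinct values are skipped,
    and so is the index column.
    """
    counts = {}
    for name, values in obs_columns.items():
        if name == index_column:
            continue  # the index column is never counted
        distribution = Counter(values)
        if len(distribution) <= max_cardinality:
            counts[name] = dict(distribution)
    return counts


# Made-up sample data for illustration only.
obs = {
    "cell_id": ["c1", "c2", "c3", "c4"],
    "cell_type": ["T cell", "B cell", "T cell", "T cell"],
    "barcode": ["AAAC", "AAAG", "AACA", "AACC"],
}
result = count_low_cardinality(obs, max_cardinality=2, index_column="cell_id")
print(result)
```

Here only cell_type is counted: barcode has four distinct values, which exceeds the limit of 2, and cell_id is skipped as the index.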

A single input file produces JSON; multiple inputs (several paths, a directory, or --file-list) produce a resumable Parquet index. Extracted metadata includes the detected organism and ensembl_prefix (inferred from Ensembl gene-ID prefixes).
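
As an illustration of how an organism could be inferred from Ensembl gene-ID prefixes, here is a minimal sketch. It is an assumption about the general approach, not the library's code: the prefix table is deliberately small and incomplete, and the names ENSEMBL_PREFIXES and infer_organism are hypothetical.

```python
from collections import Counter

# A few well-known Ensembl gene-ID prefixes (illustrative, not exhaustive).
ENSEMBL_PREFIXES = {
    "ENSG": "Homo sapiens",
    "ENSMUSG": "Mus musculus",
    "ENSRNOG": "Rattus norvegicus",
    "ENSDARG": "Danio rerio",
}


def infer_organism(gene_ids):
    """Return (organism, prefix) for the most common known prefix.

    Returns (None, None) when no gene ID matches a known prefix.
    """
    # Try longer prefixes first so a more specific prefix would win
    # if two entries in the table ever overlapped.
    ordered = sorted(ENSEMBL_PREFIXES, key=len, reverse=True)
    votes = Counter()
    for gene_id in gene_ids:
        for prefix in ordered:
            if gene_id.startswith(prefix):
                votes[prefix] += 1
                break
    if not votes:
        return None, None
    prefix, _ = votes.most_common(1)[0]
    return ENSEMBL_PREFIXES[prefix], prefix


organism, prefix = infer_organism(
    ["ENSG00000141510", "ENSG00000012048", "ENSMUSG00000059552"]
)
print(organism, prefix)
```

A majority vote keeps the result stable when a file contains a handful of IDs from another species or plain gene symbols that match no prefix.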

Examples:

anndata-metadata data/myfile.h5ad metadata.json
anndata-metadata data/ metadata.parquet
anndata-metadata s3://my-bucket/ metadata.parquet

# Multiple mixed inputs in one run
anndata-metadata a.h5ad b.h5ad s3://bucket/dir/ metadata.parquet

# Index a curated list of files from S3, counting low-cardinality obs columns.
# Resumable: re-running skips files already present in the output Parquet.
# In-region (e.g. on EC2) use process workers + a larger block size:
anndata-metadata --file-list files.tsv \
  --s3-prefix s3://my-bucket/prefix/ \
  --obs-max-cardinality 1000 \
  --workers 12 --executor process --s3-block-size 1048576 \
  metadata.parquet
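
The --file-list example above combines a list of entries with --s3-prefix. The sketch below shows one way such a list might be resolved into full paths. It is an assumption, not the tool's actual parsing logic: the function name resolve_file_list is hypothetical, and it treats the input as a TSV whenever the first line contains a 'file' column.

```python
import csv
import io


def resolve_file_list(text, s3_prefix=""):
    """Parse a file list and prepend s3_prefix to every entry.

    Accepts either a TSV with a 'file' header column or a plain
    newline-delimited list of paths. Blank lines are ignored.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("\t")
    if "file" in header:
        # TSV input: read only the 'file' column.
        reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
        entries = [row["file"] for row in reader]
    else:
        # Plain list: one path per line.
        entries = [line.strip() for line in lines]
    return [s3_prefix + entry for entry in entries]


# Made-up file names for illustration only.
tsv_text = "file\tsize\na.h5ad\t10\nb.h5ad\t20\n"
paths = resolve_file_list(tsv_text, "s3://my-bucket/prefix/")
print(paths)
```

Because the prefix is joined by plain string concatenation in this sketch, it should end with a slash, as in the CLI example.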

Development

Setup

This project uses uv for fast Python environment management.

  1. Install dependencies:

    uv sync # installs the dependencies you need to run the command
    uv sync --group dev # installs the dev dependencies for testing and formatting
    
  2. Run tests:

    uv run pytest
    
  3. Format code:

    uv run yapf --recursive . --in-place
    
  4. Type check (mypy):

    uv run mypy
    
  5. Run the CLI:

    PYTHONPATH=src uv run python -m anndata_metadata
    
  6. Build and test the wheel:

    uv run python -m build

    Then test it in a fresh virtual environment:

     python -m venv testenv
     source testenv/bin/activate
     pip install dist/anndata_metadata-*.whl --force-reinstall

    You can now run the CLI command like this:

     anndata-metadata
    

Project Structure

.
├── src/
│ └── anndata_metadata/
│   ├── extract.py # Core metadata extraction logic
│   └── main.py # CLI entry point
├── test/ # Unit tests for extraction functions and CLI
├── README.md # Project documentation
└── pyproject.toml # Project metadata and dependencies

TODO

  • Add mypy support
  • Add a wheel and submit to PyPI
  • CI/CD pipeline for updating PyPI
  • Write partial results and skip previously written values
  • Add module-level documentation

Built Distribution

anndata_metadata-0.1.3-py3-none-any.whl (13.8 kB)

Hashes for anndata_metadata-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 a4044b934c4dd7fae374b451b15127c63f4cf484053ba373c47564449e9c5e32
MD5 e1969325b10a67f710e5ce9b4e7422ac
BLAKE2b-256 3e73a53ab8f1c7400938c6eb858e69ba17a04ed0da35b656803adf6492ddae6c
