HCA schema validation for single-cell datasets

HCA Schema Validator

HCA-specific extensions for cellxgene schema validation.

Installation

From PyPI (Recommended)

pip install hca-schema-validator

From Source (Development)

# Clone the repository
git clone https://github.com/clevercanary/hca-validation-tools.git
cd hca-validation-tools/packages/hca-schema-validator

# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies and package
poetry install

# Run tests
poetry run pytest tests/

Usage

from hca_schema_validator import HCAValidator

# Create validator instance
validator = HCAValidator()

# Validate an h5ad file
is_valid = validator.validate_adata("path/to/file.h5ad")

# Check results
if is_valid:
    print("✅ Validation passed!")
else:
    print("❌ Validation failed:")
    for error in validator.errors:
        print(f"  - {error}")
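
To validate more than one file, the same calls can be wrapped in a small loop. The following is a minimal sketch and not part of the package: `validate_directory` and its `make_validator` parameter are hypothetical names, and the code relies only on the `validate_adata` return value and the `errors` attribute shown above. It creates a fresh validator per file on the assumption that `errors` accumulates on the instance.

```python
from pathlib import Path
from typing import Callable, Dict, List


def validate_directory(directory: str, make_validator: Callable[[], object]) -> Dict[str, List[str]]:
    """Validate every .h5ad file in `directory`; map file name -> error messages."""
    results: Dict[str, List[str]] = {}
    for path in sorted(Path(directory).glob("*.h5ad")):
        # Fresh instance per file (assumption: errors accumulate on the validator).
        validator = make_validator()
        is_valid = validator.validate_adata(str(path))
        results[path.name] = [] if is_valid else [str(error) for error in validator.errors]
    return results


# Usage (requires hca-schema-validator to be installed):
#   from hca_schema_validator import HCAValidator
#   report = validate_directory("path/to/h5ad_dir", HCAValidator)
#   failed = {name: errors for name, errors in report.items() if errors}
```

Passing the class (or any zero-argument factory) rather than an instance keeps the helper easy to exercise with a stub in tests.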

Development Status

Current Version: 0.1.0 - Minimal passthrough implementation

Currently a passthrough wrapper around cellxgene-schema Validator. HCA-specific validation rules will be added incrementally.

Testing

cd packages/hca-schema-validator  # from the repository root
poetry run pytest tests/

Project Structure

hca_schema_validator/
├── src/
│   └── hca_schema_validator/
│       ├── __init__.py       # Package exports
│       └── validator.py      # HCAValidator class
├── tests/
│   └── test_validator.py # Unit tests
├── pyproject.toml        # Poetry configuration & dependencies
└── README.md             # This file

Ontology Data Overlay

The validator depends on cellxgene-ontology-guide for ontology term lookups. When that package is missing terms we need (e.g., newly added CL or EFO terms), we generate updated ontology data files and overlay them at runtime.

How it works

_vendored/cellxgene_schema/ontology_parser.py monkey-patches two functions from cellxgene_ontology_guide.supported_versions:

  • load_supported_versions() loads the upstream version data and patches only the ontology versions listed in _ONTOLOGY_VERSION_OVERRIDES. All other ontologies, and any new entries added by future package releases, are preserved unchanged.
  • load_ontology_file(file_name) checks ontology_data/ first for a .json.zst file, falling back to the package's bundled data.

Current overlays

Ontology | Overlay Version | Bundled Version | Why
CL       | v2025-12-17     | v2025-07-30     | Missing salivary gland cell types (CL:4052065-4052069)

How to add/update an ontology overlay

Prerequisites: Python 3.10+, Docker, and about 1 GB of free disk space for the OWL files.

  1. Clone CZI's ontology-guide repo (contains the build pipeline):

    cd /tmp && mkdir ontology-guide-build && cd ontology-guide-build
    git clone --depth 1 https://github.com/chanzuckerberg/cellxgene-ontology-guide.git
    
  2. Set up build environment:

    python3 -m venv venv && source venv/bin/activate
    pip install owlready2==0.48 zstandard jsonschema semantic-version referencing cellxgene-ontology-guide
    docker pull obolibrary/robot:v1.9.8
    
  3. Create a targeted ontology_info JSON with only the ontology to build. Save as cellxgene-ontology-guide/ontology-assets/ontology_info_custom.json:

    {
      "7.0.0": {
        "ontologies": {
          "EFO": {
            "version": "v3.86.0",
            "source": "https://github.com/EBISPOT/efo/releases/download/{version}/{filename}",
            "filename": "efo.owl"
          }
        }
      }
    }
    

    Find the latest release version on the ontology's GitHub releases page (e.g., CL releases, EFO releases).

    Copy the ontology entry from the existing ontology_info.json in ontology-assets/ and update the version field. Keep source, filename, and any other fields (like cross_ontology_mapping) the same.

  4. Run the build script:

    #!/usr/bin/env python3
    import logging
    import os
    import sys
    
    logging.basicConfig(level=logging.INFO)
    
    # Clone from step 1; the builder modules live under tools/ontology-builder/src.
    REPO_DIR = "/tmp/ontology-guide-build/cellxgene-ontology-guide"
    sys.path.insert(0, os.path.join(REPO_DIR, "tools/ontology-builder/src"))
    
    # Override the builder's settings before importing the generator.
    import env
    env.ONTOLOGY_INFO_FILE = os.path.join(REPO_DIR, "ontology-assets/ontology_info_custom.json")
    env.ONTOLOGY_ASSETS_DIR = os.path.join(REPO_DIR, "ontology-assets")
    
    from all_ontology_generator import _download_ontologies, _parse_ontologies, get_ontology_info_file
    
    # Download the OWL files, then parse them into ontology data files.
    onto_info = get_ontology_info_file(env.ONTOLOGY_INFO_FILE)["7.0.0"]["ontologies"]
    _download_ontologies(onto_info)
    for f in _parse_ontologies(onto_info):
        logging.info("Generated: %s", f)
    
  5. Copy the .json.zst output into src/hca_schema_validator/ontology_data/.

  6. Add an entry to _ONTOLOGY_VERSION_OVERRIDES in ontology_parser.py:

    _ONTOLOGY_VERSION_OVERRIDES = {
        ("7.0.0", "CL"): "v2025-12-17",
        ("7.0.0", "EFO"): "v3.86.0",  # new
    }
    
  7. Verify and test:

    poetry run python -c "
    from hca_schema_validator._vendored.cellxgene_schema.ontology_parser import ONTOLOGY_PARSER
    print(ONTOLOGY_PARSER.is_valid_term_id('CL:4052065'))  # True
    "
    poetry run pytest tests/ -v
    
  8. Clean up: rm -rf /tmp/ontology-guide-build
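
Step 3 above (copy the existing entry, change only its version) can also be scripted instead of edited by hand. This is a minimal sketch, not part of the upstream tooling: the function names are hypothetical, and it assumes the existing ontology_info.json is keyed by schema version in the same way as the custom file shown in step 3.

```python
import copy
import json
from pathlib import Path


def build_custom_ontology_info(ontology_info: dict, schema_version: str,
                               ontology: str, new_version: str) -> dict:
    """Copy one ontology entry and change only its version field."""
    entry = copy.deepcopy(ontology_info[schema_version]["ontologies"][ontology])
    entry["version"] = new_version  # source, filename, and other fields stay as they were
    return {schema_version: {"ontologies": {ontology: entry}}}


def write_custom_ontology_info(assets_dir: str, schema_version: str,
                               ontology: str, new_version: str) -> Path:
    """Read ontology_info.json from `assets_dir` and write ontology_info_custom.json."""
    assets = Path(assets_dir)
    source = json.loads((assets / "ontology_info.json").read_text())
    custom = build_custom_ontology_info(source, schema_version, ontology, new_version)
    target = assets / "ontology_info_custom.json"
    target.write_text(json.dumps(custom, indent=2) + "\n")
    return target


# Usage, with the paths from step 1:
#   write_custom_ontology_info(
#       "/tmp/ontology-guide-build/cellxgene-ontology-guide/ontology-assets",
#       "7.0.0", "EFO", "v3.86.0")
```

A missing schema version or ontology name raises KeyError, which surfaces a typo before the longer download and parse step runs.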

Removing the overlay

Once cellxgene-ontology-guide publishes a version that includes all the terms we need:

  1. Delete the overlay files from ontology_data/ (keep only __init__.py)
  2. Revert ontology_parser.py to its original form:
    from cellxgene_ontology_guide.ontology_parser import OntologyParser
    ONTOLOGY_PARSER = OntologyParser(schema_version="v7.0.0")
    
  3. Bump cellxgene-ontology-guide version in pyproject.toml
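
Before removing the overlay, it can help to confirm that the upstream package really does recognise every term the overlay was added for. The sketch below is an assumption-labelled helper, not part of the package: `missing_terms` and `REQUIRED_TERMS` are hypothetical names, and the only parser method used is `is_valid_term_id`, as in the verification step earlier.

```python
from typing import Iterable, List


def missing_terms(parser, term_ids: Iterable[str]) -> List[str]:
    """Return the term IDs that `parser` does not recognise, in input order."""
    return [term_id for term_id in term_ids if not parser.is_valid_term_id(term_id)]


# The salivary gland cell types that motivated the CL overlay (CL:4052065-4052069).
REQUIRED_TERMS = [f"CL:{number}" for number in range(4052065, 4052070)]

# Usage with the upstream parser (requires cellxgene-ontology-guide):
#   from cellxgene_ontology_guide.ontology_parser import OntologyParser
#   parser = OntologyParser(schema_version="v7.0.0")
#   print(missing_terms(parser, REQUIRED_TERMS))  # an empty list means nothing is missing
```

Run against the plain upstream parser rather than the patched ONTOLOGY_PARSER, an empty result suggests the overlay for those terms is no longer needed.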

License

MIT
