Skip to main content

HCA schema validation for single-cell datasets

Project description

HCA Schema Validator

HCA-specific extensions for cellxgene schema validation.

Installation

From PyPI (Recommended)

pip install hca-schema-validator

From Source (Development)

# Clone the repository
git clone https://github.com/clevercanary/hca-validation-tools.git
cd hca-validation-tools/packages/hca-schema-validator

# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -

# Install dependencies and package
poetry install

# Run tests
poetry run pytest tests/

Usage

from hca_schema_validator import HCAValidator

# Create validator instance
validator = HCAValidator()

# Validate an h5ad file
is_valid = validator.validate_adata("path/to/file.h5ad")

# Check results
if is_valid:
    print("✅ Validation passed!")
else:
    print("❌ Validation failed:")
    for error in validator.errors:
        print(f"  - {error}")

Development Status

Current Version: 0.1.0 - Minimal passthrough implementation

Currently a passthrough wrapper around cellxgene-schema Validator. HCA-specific validation rules will be added incrementally.

Testing

cd hca_schema_validator
poetry run pytest tests/

Project Structure

hca_schema_validator/
├── src/
│   └── hca_schema_validator/
│       ├── __init__.py       # Package exports
│       └── validator.py      # HCAValidator class
├── tests/
│   └── test_validator.py # Unit tests
├── pyproject.toml        # Poetry configuration & dependencies
└── README.md            # This file

Ontology Data Overlay

The validator depends on cellxgene-ontology-guide for ontology term lookups. When that package is missing terms we need (e.g., newly added CL or EFO terms), we generate updated ontology data files and overlay them at runtime.

How it works

_vendored/cellxgene_schema/ontology_parser.py monkey-patches two functions from cellxgene_ontology_guide.supported_versions:

  • load_supported_versions() loads upstream version data and patches only the ontology versions listed in _ONTOLOGY_VERSION_OVERRIDES — all other ontologies and any new entries added by future package releases are preserved unchanged.
  • load_ontology_file(file_name) checks ontology_data/ first for a .json.zst file, falling back to the package's bundled data.

Current overlays

Ontology Overlay Version Bundled Version Why
CL v2025-12-17 v2025-07-30 Missing salivary gland cell types (CL:4052065-4052069)

How to add/update an ontology overlay

Prerequisites: Python 3.10+, Docker, ~1GB disk for OWL files.

  1. Clone CZI's ontology-guide repo (contains the build pipeline):

    cd /tmp && mkdir ontology-guide-build && cd ontology-guide-build
    git clone --depth 1 https://github.com/chanzuckerberg/cellxgene-ontology-guide.git
    
  2. Set up build environment:

    python3 -m venv venv && source venv/bin/activate
    pip install owlready2==0.48 zstandard jsonschema semantic-version referencing cellxgene-ontology-guide
    docker pull obolibrary/robot:v1.9.8
    
  3. Create a targeted ontology_info JSON with only the ontology to build. Save as cellxgene-ontology-guide/ontology-assets/ontology_info_custom.json:

    {
      "7.0.0": {
        "ontologies": {
          "EFO": {
            "version": "v3.86.0",
            "source": "https://github.com/EBISPOT/efo/releases/download/{version}/{filename}",
            "filename": "efo.owl"
          }
        }
      }
    }
    

    Find the latest release version on the ontology's GitHub releases page (e.g., CL releases, EFO releases).

    Copy the ontology entry from the existing ontology_info.json in ontology-assets/ and update the version field. Keep source, filename, and any other fields (like cross_ontology_mapping) the same.

  4. Run the build script:

    #!/usr/bin/env python3
    import json, logging, os, sys
    logging.basicConfig(level=logging.INFO)
    
    REPO_DIR = "/tmp/ontology-guide-build/cellxgene-ontology-guide"
    sys.path.insert(0, os.path.join(REPO_DIR, "tools/ontology-builder/src"))
    import env
    env.ONTOLOGY_INFO_FILE = os.path.join(REPO_DIR, "ontology-assets/ontology_info_custom.json")
    env.ONTOLOGY_ASSETS_DIR = os.path.join(REPO_DIR, "ontology-assets")
    
    from all_ontology_generator import _download_ontologies, _parse_ontologies, get_ontology_info_file
    onto_info = get_ontology_info_file(env.ONTOLOGY_INFO_FILE)["7.0.0"]["ontologies"]
    _download_ontologies(onto_info)
    for f in _parse_ontologies(onto_info):
        logging.info(f"Generated: {f}")
    
  5. Copy the .json.zst output into src/hca_schema_validator/ontology_data/.

  6. Add an entry to _ONTOLOGY_VERSION_OVERRIDES in ontology_parser.py:

    _ONTOLOGY_VERSION_OVERRIDES = {
        ("7.0.0", "CL"): "v2025-12-17",
        ("7.0.0", "EFO"): "v3.86.0",  # new
    }
    
  7. Verify and test:

    poetry run python -c "
    from hca_schema_validator._vendored.cellxgene_schema.ontology_parser import ONTOLOGY_PARSER
    print(ONTOLOGY_PARSER.is_valid_term_id('CL:4052065'))  # True
    "
    poetry run pytest tests/ -v
    
  8. Clean up: rm -rf /tmp/ontology-guide-build

Removing the overlay

Once cellxgene-ontology-guide publishes a version that includes all the terms we need:

  1. Delete the overlay files from ontology_data/ (keep only __init__.py)
  2. Revert ontology_parser.py to its original form:
    from cellxgene_ontology_guide.ontology_parser import OntologyParser
    ONTOLOGY_PARSER = OntologyParser(schema_version="v7.0.0")
    
  3. Bump cellxgene-ontology-guide version in pyproject.toml

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hca_schema_validator-0.12.0.tar.gz (6.2 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hca_schema_validator-0.12.0-py3-none-any.whl (6.2 MB view details)

Uploaded Python 3

File details

Details for the file hca_schema_validator-0.12.0.tar.gz.

File metadata

  • Download URL: hca_schema_validator-0.12.0.tar.gz
  • Upload date:
  • Size: 6.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for hca_schema_validator-0.12.0.tar.gz
Algorithm Hash digest
SHA256 3d358db9f59beec972edd8d2c3c167e659980a578caefa8b1565c2b4d6d90fd2
MD5 e805c121add74ad6700b50ed52c687aa
BLAKE2b-256 eb4a081246b7759da94538836ae4a69d2ff666149423b5e14ea4c44cb3468216

See more details on using hashes here.

Provenance

The following attestation bundles were made for hca_schema_validator-0.12.0.tar.gz:

Publisher: release-please.yml on clevercanary/hca-validation-tools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file hca_schema_validator-0.12.0-py3-none-any.whl.

File metadata

File hashes

Hashes for hca_schema_validator-0.12.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e26db3996c899a3d057098fe99766c3ed85397ea1542c4d6fa3a6485a6b82942
MD5 baaa53cf8bbf43282f99493d490de634
BLAKE2b-256 112b9bbf7e2e764241029c74e2a4c6ceabf65ebdfec61b12558c2455a0bf17cc

See more details on using hashes here.

Provenance

The following attestation bundles were made for hca_schema_validator-0.12.0-py3-none-any.whl:

Publisher: release-please.yml on clevercanary/hca-validation-tools

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page