HCA schema validation for single-cell datasets
HCA Schema Validator
HCA-specific extensions for cellxgene schema validation.
Installation
From PyPI (Recommended)
pip install hca-schema-validator
From Source (Development)
# Clone the repository
git clone https://github.com/clevercanary/hca-validation-tools.git
cd hca-validation-tools/packages/hca-schema-validator
# Install Poetry if you haven't already
curl -sSL https://install.python-poetry.org | python3 -
# Install dependencies and package
poetry install
# Run tests
poetry run pytest tests/
Usage
from hca_schema_validator import HCAValidator
# Create validator instance
validator = HCAValidator()
# Validate an h5ad file
is_valid = validator.validate_adata("path/to/file.h5ad")
# Check results
if is_valid:
    print("✅ Validation passed!")
else:
    print("❌ Validation failed:")
    for error in validator.errors:
        print(f" - {error}")
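When several files need checking, it can help to collect the results per file. The following is a minimal sketch and not part of the package: `validate_many` and `make_validator` are hypothetical names, and the code assumes only what the usage above shows, namely an object with `validate_adata(path)` returning a boolean and an `errors` list. A fresh validator is created for each file so the sketch does not depend on whether `errors` is reset between calls.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def validate_many(
    make_validator: Callable[[], object],
    paths: Iterable[str],
) -> Dict[str, Tuple[bool, List[str]]]:
    """Validate each path and return {path: (is_valid, [error messages])}.

    `make_validator` is any zero-argument callable that returns an object
    exposing `validate_adata(path)` and an `errors` list (an assumption
    based on the usage shown above).
    """
    results: Dict[str, Tuple[bool, List[str]]] = {}
    for path in paths:
        validator = make_validator()  # fresh instance per file
        is_valid = bool(validator.validate_adata(path))
        # Copy the messages so the record is independent of the validator object.
        results[path] = (is_valid, [str(e) for e in validator.errors])
    return results
```

With the real package this would presumably be called as `validate_many(HCAValidator, ["a.h5ad", "b.h5ad"])`, since `HCAValidator()` takes no arguments in the usage example.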
Development Status
Current Version: 0.1.0 - Minimal passthrough implementation
Currently a passthrough wrapper around cellxgene-schema Validator. HCA-specific validation rules will be added incrementally.
Testing
cd hca_schema_validator
poetry run pytest tests/
Project Structure
hca_schema_validator/
├── src/
│   └── hca_schema_validator/
│       ├── __init__.py        # Package exports
│       └── validator.py       # HCAValidator class
├── tests/
│   └── test_validator.py      # Unit tests
├── pyproject.toml             # Poetry configuration & dependencies
└── README.md                  # This file
Ontology Data Overlay
The validator depends on cellxgene-ontology-guide for ontology term lookups. When that
package is missing terms we need (e.g., newly added CL or EFO terms), we generate updated
ontology data files and overlay them at runtime.
How it works
_vendored/cellxgene_schema/ontology_parser.py monkey-patches two functions from
cellxgene_ontology_guide.supported_versions:
- load_supported_versions() loads upstream version data and patches only the ontology versions listed in _ONTOLOGY_VERSION_OVERRIDES; all other ontologies and any new entries added by future package releases are preserved unchanged.
- load_ontology_file(file_name) checks ontology_data/ first for a .json.zst file, falling back to the package's bundled data.
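The two behaviours described above can be illustrated with a small standalone sketch. This is not the code in `ontology_parser.py`: the function names `patch_versions` and `resolve_ontology_file` are hypothetical, and the shape of the version data (`{schema_version: {"ontologies": {name: {"version": ...}}}}`) is an assumption based on the `ontology_info` JSON shown later in this README.

```python
import copy
from pathlib import Path
from typing import Dict, Tuple

# Override table keyed by (schema version, ontology name), as in the README.
OVERRIDES: Dict[Tuple[str, str], str] = {("7.0.0", "CL"): "v2025-12-17"}


def patch_versions(upstream: dict, overrides: Dict[Tuple[str, str], str]) -> dict:
    """Return a copy of the upstream data with only the overridden versions changed."""
    patched = copy.deepcopy(upstream)  # never mutate the package's own data
    for (schema_version, ontology), version in overrides.items():
        entry = patched.get(schema_version, {}).get("ontologies", {}).get(ontology)
        if entry is not None:  # ignore overrides for entries upstream does not list
            entry["version"] = version
    return patched


def resolve_ontology_file(file_name: str, overlay_dir: Path, bundled_dir: Path) -> Path:
    """Prefer a file in the overlay directory, else fall back to the bundled copy."""
    overlay = overlay_dir / file_name
    return overlay if overlay.is_file() else bundled_dir / file_name
```

Because the patch works on a deep copy and touches only the listed keys, any ontology that is not overridden passes through exactly as the upstream package provides it.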
Current overlays
| Ontology | Overlay Version | Bundled Version | Why |
|---|---|---|---|
| CL | v2025-12-17 | v2025-07-30 | Missing salivary gland cell types (CL:4052065-4052069) |
How to add/update an ontology overlay
Prerequisites: Python 3.10+, Docker, ~1GB disk for OWL files.
- Clone CZI's ontology-guide repo (contains the build pipeline):
    cd /tmp && mkdir ontology-guide-build && cd ontology-guide-build
    git clone --depth 1 https://github.com/chanzuckerberg/cellxgene-ontology-guide.git
- Set up build environment:
    python3 -m venv venv && source venv/bin/activate
    pip install owlready2==0.48 zstandard jsonschema semantic-version referencing cellxgene-ontology-guide
    docker pull obolibrary/robot:v1.9.8
- Create a targeted ontology_info JSON with only the ontology to build. Save as cellxgene-ontology-guide/ontology-assets/ontology_info_custom.json:
    {
      "7.0.0": {
        "ontologies": {
          "EFO": {
            "version": "v3.86.0",
            "source": "https://github.com/EBISPOT/efo/releases/download/{version}/{filename}",
            "filename": "efo.owl"
          }
        }
      }
    }
  Find the latest release version on the ontology's GitHub releases page (e.g., CL releases, EFO releases). Copy the ontology entry from the existing ontology_info.json in ontology-assets/ and update the version field. Keep source, filename, and any other fields (like cross_ontology_mapping) the same.
- Run the build script:
    #!/usr/bin/env python3
    import json, logging, os, sys

    logging.basicConfig(level=logging.INFO)
    REPO_DIR = "/tmp/ontology-guide-build/cellxgene-ontology-guide"
    sys.path.insert(0, os.path.join(REPO_DIR, "tools/ontology-builder/src"))

    import env
    env.ONTOLOGY_INFO_FILE = os.path.join(REPO_DIR, "ontology-assets/ontology_info_custom.json")
    env.ONTOLOGY_ASSETS_DIR = os.path.join(REPO_DIR, "ontology-assets")

    from all_ontology_generator import _download_ontologies, _parse_ontologies, get_ontology_info_file

    onto_info = get_ontology_info_file(env.ONTOLOGY_INFO_FILE)["7.0.0"]["ontologies"]
    _download_ontologies(onto_info)
    for f in _parse_ontologies(onto_info):
        logging.info(f"Generated: {f}")
- Copy the .json.zst output into src/hca_schema_validator/ontology_data/.
- Add an entry to _ONTOLOGY_VERSION_OVERRIDES in ontology_parser.py:
    _ONTOLOGY_VERSION_OVERRIDES = {
        ("7.0.0", "CL"): "v2025-12-17",
        ("7.0.0", "EFO"): "v3.86.0",  # new
    }
- Verify and test:
    poetry run python -c "
    from hca_schema_validator._vendored.cellxgene_schema.ontology_parser import ONTOLOGY_PARSER
    print(ONTOLOGY_PARSER.is_valid_term_id('CL:4052065'))  # True
    "
    poetry run pytest tests/ -v
- Clean up:
    rm -rf /tmp/ontology-guide-build
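The "create a targeted ontology_info JSON" step can also be done programmatically instead of by hand. The sketch below is an illustration only: `build_custom_info` is a hypothetical helper, not part of the ontology-guide build pipeline, and it assumes the existing `ontology_info.json` has the same nesting as the custom file shown above (schema version, then `ontologies`, then the ontology name).

```python
import copy


def build_custom_info(full_info: dict, schema_version: str, ontology: str, new_version: str) -> dict:
    """Return an ontology_info dict holding only `ontology`, with its version replaced.

    `full_info` is the parsed content of the existing ontology_info.json
    (assumed layout: {schema_version: {"ontologies": {name: {...}}}}).
    """
    entry = copy.deepcopy(full_info[schema_version]["ontologies"][ontology])
    # Only the version changes; source, filename, and any other fields
    # (such as cross_ontology_mapping) are carried over untouched.
    entry["version"] = new_version
    return {schema_version: {"ontologies": {ontology: entry}}}
```

The result can be written with `json.dump` to `ontology-assets/ontology_info_custom.json`. A missing schema version or ontology name raises `KeyError`, which is deliberate in this sketch so that a typo does not silently produce an empty build.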
Removing the overlay
Once cellxgene-ontology-guide publishes a version that includes all the terms we need:
- Delete the overlay files from ontology_data/ (keep only __init__.py)
- Revert ontology_parser.py to its original form:
    from cellxgene_ontology_guide.ontology_parser import OntologyParser
    ONTOLOGY_PARSER = OntologyParser(schema_version="v7.0.0")
- Bump cellxgene-ontology-guide version in pyproject.toml
License
MIT