Validating external terms

linkml-term-validator

Validating LinkML schemas and datasets that depend on external terms

A collection of LinkML ValidationPlugin implementations for validating ontology term references:

  1. Schema Validation: Validate meaning fields in enum permissible values
  2. Data Validation: Validate data against dynamic enums and binding constraints

Features

  • ✅ Three composable validation plugins for the LinkML validator framework
  • ✅ Validates meaning fields in permissible_values in LinkML schemas
  • ✅ Validates data against dynamic enums (reachable_from, matches, concepts)
  • ✅ Validates binding constraints on nested object fields
  • ✅ Supports multiple ontology sources via OAK (Ontology Access Kit)
  • ✅ Multi-level caching (in-memory + file-based) for fast repeated validation
  • ✅ Configurable per-prefix validation via oak_config.yaml
  • ✅ Standalone CLI + LinkML validator integration
  • ✅ Tracks unknown ontology prefixes

Installation

pip install linkml-term-validator

Or with uv:

uv add linkml-term-validator

Quick Start

For interactive tutorials, see the Jupyter notebooks in the notebooks/ directory.

Validate Schemas

Check that meaning fields in your schema reference valid ontology terms:

linkml-term-validator validate-schema schema.yaml

Validate Data

Validate data instances against dynamic enums and binding constraints:

linkml-term-validator validate-data data.yaml --schema schema.yaml

The validate-data command checks:

  • Dynamic enums - values match reachable_from, matches, or concepts definitions
  • Binding constraints - nested object fields satisfy binding ranges
  • Labels (optional, with --labels) - labels in the data match the ontology's labels for the referenced terms

Examples

Schema Validation

Here's a LinkML schema that uses ontology terms:

id: https://example.org/my-schema
name: my-schema
prefixes:
  GO: http://purl.obolibrary.org/obo/GO_
  CHEBI: http://purl.obolibrary.org/obo/CHEBI_

enums:
  BiologicalProcessEnum:
    description: Examples of biological processes
    permissible_values:
      BIOLOGICAL_PROCESS:
        title: biological process
        meaning: GO:0008150
      CELL_CYCLE:
        title: cell cycle
        meaning: GO:0007049

  ChemicalEntityEnum:
    description: Examples of chemical entities
    permissible_values:
      WATER:
        title: water
        meaning: CHEBI:15377
      GLUCOSE:
        title: glucose
        meaning: CHEBI:17234

When you run validation:

linkml-term-validator validate-schema my-schema.yaml

The validator will:

  1. Check that GO:0008150 exists and has label "biological_process" (or "biological process")
  2. Check that GO:0007049 exists and has label "cell cycle"
  3. Check that CHEBI:15377 exists and has label "water"
  4. Check that CHEBI:17234 exists and has label "glucose"
  5. Report any mismatches or missing terms
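
The checks above amount to a term lookup followed by a tolerant label comparison. The following is a minimal sketch, not the validator's own code: `ONTOLOGY_LABELS` is a hypothetical in-memory table standing in for a real OAK lookup, and `normalize` is an assumed rule that treats underscores and spaces alike. The validator's actual comparison rules may differ.

```python
from typing import Optional

# Hypothetical stand-in for an ontology lookup; the real validator queries OAK.
ONTOLOGY_LABELS = {
    "GO:0008150": "biological_process",
    "GO:0007049": "cell cycle",
    "CHEBI:15377": "water",
    "CHEBI:17234": "glucose",
}


def normalize(label: str) -> str:
    """Lower-case a label and treat underscores as spaces (assumed rule)."""
    return " ".join(label.replace("_", " ").lower().split())


def check_meaning(curie: str, title: str) -> Optional[str]:
    """Return an issue message, or None when the term exists and the label matches."""
    found = ONTOLOGY_LABELS.get(curie)
    if found is None:
        return f"{curie}: term not found"
    if normalize(found) != normalize(title):
        return f"{curie}: expected label {title!r}, found {found!r}"
    return None


# One matching term, one label mismatch, one missing term.
for curie, title in [
    ("GO:0008150", "biological process"),
    ("GO:0007049", "cell division"),
    ("GO:9999999", "made up"),
]:
    print(check_meaning(curie, title) or f"{curie}: ok")
```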

Example Output

Validation Results for my-schema.yaml
============================================================
Enums checked: 2
Values checked: 4
Meanings validated: 4

✅ No issues found!

Or if there's an issue:

⚠️  WARNING: Label mismatch
    Enum: BiologicalProcessEnum
    Value: BIOLOGICAL_PROCESS
    Expected label: biological process
    Found label: biological_process
    Meaning: GO:0008150

Validation Results for my-schema.yaml
============================================================
Enums checked: 2
Values checked: 4
Meanings validated: 4

Issues found: 1
  Warnings: 1
  Errors: 0

Data Validation

Example 1: Dynamic Enums

Schema with a dynamic enum using reachable_from:

enums:
  NeuronTypeEnum:
    description: Any neuron type
    reachable_from:
      source_ontology: obo:cl
      source_nodes:
        - CL:0000540  # neuron
      relationship_types:
        - rdfs:subClassOf

Data file with neuron instances:

neurons:
  - id: "1"
    cell_type: CL:0000540  # neuron - valid
  - id: "2"
    cell_type: CL:0000100  # motor neuron - valid (descendant)
  - id: "3"
    cell_type: GO:0008150  # biological process - INVALID

Validate:

linkml-term-validator validate-data neurons.yaml --schema schema.yaml

Output:

❌ Validation failed with 1 issue(s):

❌ ERROR: Value 'GO:0008150' not in dynamic enum NeuronTypeEnum
    Expected one of the descendants of CL:0000540
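
Conceptually, a reachable_from enum expands to the source nodes plus everything that reaches them over the listed relationship types. The sketch below illustrates that expansion on a tiny hand-written graph. `SUBCLASS_OF` is a hypothetical toy edge table, not the real CL hierarchy, and including the source node itself is an assumption taken from the example above, where CL:0000540 is accepted.

```python
from collections import deque

# Toy subClassOf edges (child -> parents); simplified and hypothetical.
SUBCLASS_OF = {
    "CL:0000100": ["CL:0000540"],  # a descendant of the source node in this toy graph
    "CL:0000540": ["CL:0000000"],
    "GO:0008150": [],
}


def expand_reachable_from(source_nodes, edges):
    """Return the source nodes plus every term that reaches them via the edges."""
    # Invert child -> parents into parent -> children for a downward walk.
    children = {}
    for child, parents in edges.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    seen = set(source_nodes)
    queue = deque(source_nodes)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


allowed = expand_reachable_from(["CL:0000540"], SUBCLASS_OF)
for value in ["CL:0000540", "CL:0000100", "GO:0008150"]:
    print(value, "valid" if value in allowed else "INVALID")
```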

Example 2: Binding Constraints

Schema with binding constraints:

classes:
  GeneAnnotation:
    slots:
      - gene
      - go_term
    slot_usage:
      go_term:
        range: GOTerm
        bindings:
          - binds_value_of: id
            range: BiologicalProcessEnum

  GOTerm:
    slots:
      - id
      - label

Data file:

annotations:
  - gene: BRCA1
    go_term:
      id: GO:0008150  # biological_process
      label: biological process

Validate with label checking:

linkml-term-validator validate-data annotations.yaml --schema schema.yaml --labels

Caching

The validator uses multi-level caching to speed up repeated validations:

In-Memory Cache

During a single validation run, ontology labels and expanded dynamic enums are cached in memory.

File-Based Cache

Labels are persisted to CSV files in the cache directory (default: cache/). Dynamic enums are cached separately under cache/enums/.

Label cache layout:

cache/
├── go/
│   └── terms.csv      # GO term labels
├── chebi/
│   └── terms.csv      # CHEBI term labels
└── uberon/
    └── terms.csv      # UBERON term labels

Each CSV contains:

curie,label,retrieved_at
GO:0008150,biological_process,2025-11-15T10:30:00
GO:0007049,cell cycle,2025-11-15T10:30:01
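
Because the label cache is plain CSV with the three columns shown above, it can be inspected with the standard library. A minimal sketch follows; `load_label_cache` is a hypothetical helper written for illustration, not part of the package.

```python
import csv
import io

# Same columns and rows as the sample above.
SAMPLE = """curie,label,retrieved_at
GO:0008150,biological_process,2025-11-15T10:30:00
GO:0007049,cell cycle,2025-11-15T10:30:01
"""


def load_label_cache(handle):
    """Read a label cache CSV into a {curie: label} dict."""
    return {row["curie"]: row["label"] for row in csv.DictReader(handle)}


labels = load_label_cache(io.StringIO(SAMPLE))
print(labels["GO:0007049"])
```

To read a real cache file, pass an open file handle instead, for example `open("cache/go/terms.csv", newline="")`.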

Cache Behavior

  • First run: Queries ontology databases and stores positive enum membership hits
  • Subsequent runs: Loads warm label caches and any previously seen enum hits from disk
  • Closed enum caches: Use --saturate-enum-caches or cache_strategy=greedy to materialize full closures and write an explicit .complete marker
  • Cache location: Configurable via --cache-dir flag
  • Disable all file caching: Use --no-cache
  • Disable only enum expansion caching: Use --no-cache-enum-expansions
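
One way to read the difference between a partial enum cache (positive hits only) and a closed cache (marked complete) is as a three-valued lookup: a hit is always conclusive, but a miss is conclusive only when the cache is known to hold the full closure. The sketch below is an interpretation of the bullets above, not the validator's implementation. `cached_membership` and its arguments are hypothetical names.

```python
from typing import Optional, Set


def cached_membership(curie: str, hits: Set[str], complete: bool) -> Optional[bool]:
    """Answer enum membership from a cache, or None when the cache cannot decide."""
    if curie in hits:
        return True   # positive hits are conclusive in either kind of cache
    if complete:
        return False  # a closed cache lists every member, so a miss is conclusive
    return None       # partial cache: the ontology would have to be queried


hits = {"CL:0000540"}
print(cached_membership("CL:0000540", hits, complete=False))
print(cached_membership("CL:0000100", hits, complete=False))
print(cached_membership("CL:0000100", hits, complete=True))
```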

When to Clear Cache

You might want to clear the cache if:

  • Ontology databases have been updated
  • You suspect stale or incorrect labels

# Clear cache for specific ontology
rm -rf cache/go/

# Clear entire cache
rm -rf cache/

Advanced Configuration

Per-Prefix Adapter Configuration

Create an oak_config.yaml to control which ontologies are validated:

ontology_adapters:
  GO: sqlite:obo:go           # Use local GO database
  CHEBI: sqlite:obo:chebi     # Use local CHEBI database
  UBERON: sqlite:obo:uberon   # Use local UBERON database
  CUSTOM: ""                   # Skip validation for CUSTOM prefix

Then validate with this config:

linkml-term-validator validate-schema --config oak_config.yaml schema.yaml

Important: When using oak_config.yaml, ONLY the prefixes listed in the config will be validated. Any prefix not in the config will be tracked as "unknown" and reported at the end of validation.

Default Behavior (No Config File)

Without an oak_config.yaml, the validator uses sqlite:obo: as the default adapter. This automatically creates per-prefix adapters:

  • GO:0008150 → uses sqlite:obo:go
  • CHEBI:15377 → uses sqlite:obo:chebi
  • UBERON:0000468 → uses sqlite:obo:uberon

This works for any OBO ontology that has been downloaded via OAK.
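
The two lookup modes described in this section can be summarized in a few lines. This is a sketch under stated assumptions: `resolve_adapter` is a hypothetical helper, and the return conventions (an empty string for "skip", None for "unknown prefix") are chosen for illustration rather than taken from the package.

```python
from typing import Dict, Optional

DEFAULT_ADAPTER = "sqlite:obo:"


def resolve_adapter(curie: str, config: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Pick an OAK adapter string for a CURIE.

    Without a config, the adapter is derived from the lower-cased prefix.
    With a config, only listed prefixes resolve: "" means skip validation,
    and None means the prefix is unknown and should be reported.
    """
    prefix = curie.split(":", 1)[0]
    if config is None:
        return DEFAULT_ADAPTER + prefix.lower()
    return config.get(prefix)


config = {"GO": "sqlite:obo:go", "CUSTOM": ""}
print(resolve_adapter("UBERON:0000468"))         # default mode
print(repr(resolve_adapter("CUSTOM:1", config)))  # listed, but skipped
print(resolve_adapter("EX:123", config))          # not listed: unknown
```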

Usage

linkml-term-validator supports two main validation use cases:

1. Schema Validation

Validates meaning fields in enum permissible values.

CLI:

# Validate schema permissible values
linkml-term-validator validate-schema schema.yaml

# With strict mode (warnings become errors)
linkml-term-validator validate-schema --strict schema.yaml

# With custom config
linkml-term-validator validate-schema --config oak_config.yaml schema.yaml

Python API:

from linkml.validator import Validator
from linkml_term_validator.plugins import PermissibleValueMeaningPlugin

plugin = PermissibleValueMeaningPlugin(
    oak_adapter_string="sqlite:obo:",
    strict_mode=False
)

validator = Validator(schema="schema.yaml", validation_plugins=[plugin])
report = validator.validate("schema.yaml")

if len(report.results) == 0:
    print("Valid!")
else:
    for result in report.results:
        print(f"{result.severity}: {result.message}")

2. Data Validation

Validates data instances against dynamic enums and binding constraints.

CLI:

# Validate data (checks both dynamic enums and bindings)
linkml-term-validator validate-data data.yaml --schema schema.yaml

# With specific target class
linkml-term-validator validate-data data.yaml -s schema.yaml -t Person

# Also validate labels match ontology
linkml-term-validator validate-data data.yaml -s schema.yaml --labels

# Only check bindings, skip dynamic enums
linkml-term-validator validate-data data.yaml -s schema.yaml --no-dynamic-enums

# Only check dynamic enums, skip bindings
linkml-term-validator validate-data data.yaml -s schema.yaml --no-bindings

Data validation includes two aspects:

Dynamic Enums

Validates against enums defined via reachable_from, matches, concepts.

Example schema:

enums:
  NeuronTypeEnum:
    reachable_from:
      source_ontology: obo:cl
      source_nodes: [CL:0000540]  # neuron
      relationship_types: [rdfs:subClassOf]

Python API:

from linkml.validator import Validator
from linkml_term_validator.plugins import DynamicEnumPlugin

plugin = DynamicEnumPlugin(oak_adapter_string="sqlite:obo:")
validator = Validator(schema="schema.yaml", validation_plugins=[plugin])
report = validator.validate("data.yaml")

Binding Constraints

Validates nested object fields against binding constraints.

Example schema:

classes:
  Annotation:
    slots:
      - term
    slot_usage:
      term:
        range: Term
        bindings:
          - binds_value_of: id
            range: GOTermEnum

Python API:

from linkml.validator import Validator
from linkml_term_validator.plugins import BindingValidationPlugin

plugin = BindingValidationPlugin(
    validate_labels=True  # Also check labels match ontology
)
validator = Validator(schema="schema.yaml", validation_plugins=[plugin])
report = validator.validate("data.yaml")

Combining Multiple Validations

CLI:

# Validate data with both dynamic enums and bindings (default)
linkml-term-validator validate-data data.yaml --schema schema.yaml

# With label validation enabled
linkml-term-validator validate-data data.yaml -s schema.yaml --labels

Python API:

from linkml.validator import Validator
from linkml.validator.plugins import JsonschemaValidationPlugin
from linkml_term_validator.plugins import (
    DynamicEnumPlugin,
    BindingValidationPlugin,
)

# Comprehensive validation pipeline
plugins = [
    JsonschemaValidationPlugin(closed=True),  # Structural validation
    DynamicEnumPlugin(),                       # Dynamic enum validation
    BindingValidationPlugin(validate_labels=True),  # Binding validation
]

validator = Validator(schema="schema.yaml", validation_plugins=plugins)
report = validator.validate("data.yaml")

Integration with linkml-validate

The linkml-term-validator plugins can be used directly with the standard linkml-validate command via configuration files.

Using Config Files

Create a validation config file (e.g., validation_config.yaml):

# Validation configuration for linkml-validate
schema: schema.yaml
target_class: Person

data_sources:
  - data.yaml

plugins:
  # Standard JSON Schema validation
  JsonschemaValidationPlugin:
    closed: true

  # Ontology term validation for dynamic enums
  "linkml_term_validator.plugins.DynamicEnumPlugin":
    oak_adapter_string: "sqlite:obo:"
    cache_labels: true
    cache_enum_expansions: true
    saturate_enum_caches: false
    cache_dir: cache

  # Binding constraint validation
  "linkml_term_validator.plugins.BindingValidationPlugin":
    oak_adapter_string: "sqlite:obo:"
    validate_labels: true
    cache_labels: true
    cache_enum_expansions: true
    saturate_enum_caches: false
    cache_dir: cache

Then run validation:

linkml-validate --config validation_config.yaml

Example Files

See the examples/ directory for complete examples.

Plugin Configuration Options

DynamicEnumPlugin

"linkml_term_validator.plugins.DynamicEnumPlugin":
  oak_adapter_string: "sqlite:obo:"  # OAK adapter (default: sqlite:obo:)
  cache_labels: true                  # Enable label caching (default: true)
  cache_enum_expansions: true         # Enable enum expansion caching (default: true)
  saturate_enum_caches: false         # Materialize full closures and mark caches complete
  cache_dir: cache                    # Cache directory (default: cache)
  oak_config_path: oak_config.yaml    # Optional: custom OAK config

BindingValidationPlugin

"linkml_term_validator.plugins.BindingValidationPlugin":
  oak_adapter_string: "sqlite:obo:"  # OAK adapter (default: sqlite:obo:)
  validate_labels: true               # Check labels match ontology (default: true)
  cache_labels: true                  # Enable label caching (default: true)
  cache_enum_expansions: true         # Enable enum expansion caching (default: true)
  saturate_enum_caches: false         # Materialize full closures and mark caches complete
  cache_dir: cache                    # Cache directory (default: cache)
  oak_config_path: oak_config.yaml    # Optional: custom OAK config

Programmatic Usage

You can also use the plugins programmatically:

from linkml.validator import Validator
from linkml.validator.plugins import JsonschemaValidationPlugin
from linkml_term_validator.plugins import (
    DynamicEnumPlugin,
    BindingValidationPlugin,
)

# Build validation pipeline
plugins = [
    JsonschemaValidationPlugin(closed=True),
    DynamicEnumPlugin(oak_adapter_string="sqlite:obo:"),
    BindingValidationPlugin(validate_labels=True),
]

# Create validator
validator = Validator(
    schema="schema.yaml",
    validation_plugins=plugins,
)

# Validate
report = validator.validate("data.yaml")

# Check results
if len(report.results) == 0:
    print("✅ Validation passed")
else:
    for result in report.results:
        print(f"{result.severity.name}: {result.message}")

Developer Tools

Several predefined command recipes are available for the just command runner. To list them, run just or just --list.

Anti-Hallucination Guardrails for Agentic AI

Although linkml-term-validator is designed for standard data validation, it can also act as an anti-hallucination guardrail for agentic AI pipelines that generate ontology term references.

The Problem: LLMs Hallucinate Identifiers

Language models frequently hallucinate identifiers like gene IDs, ontology terms, and other structured references. These fake identifiers often appear structurally correct (e.g., GO:9999999, CHEBI:88888) but don't actually exist in the source ontologies.

The Solution: Dual Validation Pattern

A robust guardrail requires dual validation—forcing the AI to provide both the identifier and its canonical label, then validating that they match:

Instead of accepting:

term: GO:0005515  # Single piece of information - easy to hallucinate

Require and validate:

term:
  id: GO:0005515
  label: protein binding  # Must match canonical label in ontology

This makes hallucinations easier to catch because the AI must get two interdependent facts right at once: a fabricated identifier rarely comes with the label the ontology actually assigns to it, and a real identifier paired with the wrong label is flagged as well.

Implementation in AI Pipelines

Use linkml-term-validator to embed validation directly into your agentic workflow:

1. Define schemas with binding constraints:

classes:
  GeneAnnotation:
    slots:
      - gene
      - go_term
    slot_usage:
      go_term:
        range: GOTerm
        bindings:
          - binds_value_of: id
            range: BiologicalProcessEnum

  GOTerm:
    slots:
      - id        # AI must provide both
      - label     # fields correctly

2. Validate AI-generated outputs before committing:

from linkml.validator import Validator
from linkml_term_validator.plugins import BindingValidationPlugin

# Create validator with label checking enabled
plugin = BindingValidationPlugin(validate_labels=True)
validator = Validator(schema="schema.yaml", validation_plugins=[plugin])

# Validate AI-generated data
report = validator.validate(ai_generated_data)

if len(report.results) > 0:
    # Reject hallucinated terms, prompt AI to regenerate
    raise ValueError("Invalid ontology terms detected")

3. Use validation during generation (not just post-hoc):

The most effective approach embeds validation during AI generation rather than treating it as a filtering step afterward. This transforms hallucination resistance from a detection problem into a generation constraint.
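
One simple way to move validation into the generation step is a generate, validate, retry loop that feeds the validation messages back to the model. The sketch below is illustrative only: `KNOWN_TERMS` is a toy lookup standing in for the validator, `fake_model` is a stub in place of a real LLM call, and `generate_with_guardrail` is a hypothetical helper name.

```python
from typing import Callable, Dict, List, Optional

# Toy stand-in for an ontology-backed check of an id/label pair.
KNOWN_TERMS = {"GO:0005515": "protein binding"}


def validate_term(term: Dict[str, str]) -> List[str]:
    """Return a list of problems with an id/label pair (empty when valid)."""
    term_id = term.get("id", "")
    label = KNOWN_TERMS.get(term_id)
    if label is None:
        return [f"unknown identifier {term_id!r}"]
    if label != term.get("label"):
        return [f"label mismatch for {term_id}: expected {label!r}"]
    return []


def generate_with_guardrail(
    generate: Callable[[List[str]], Dict[str, str]],
    max_attempts: int = 3,
) -> Optional[Dict[str, str]]:
    """Call the generator until its output validates, passing back prior errors."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = validate_term(candidate)
        if not feedback:
            return candidate
    return None  # give up; the caller decides how to handle the failure


def fake_model(feedback: List[str]) -> Dict[str, str]:
    """Stub generator: hallucinates first, then corrects itself after feedback."""
    if not feedback:
        return {"id": "GO:9999999", "label": "protein binding"}
    return {"id": "GO:0005515", "label": "protein binding"}


print(generate_with_guardrail(fake_model))
```

In a real pipeline, `validate_term` would be replaced by a call into the validator, with label checking enabled, and `generate` by the model invocation.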

Real-World Benefits

  • Prevents fake identifiers from entering curated datasets
  • Catches label mismatches where AI uses real IDs but wrong labels
  • Validates dynamic constraints (e.g., only disease terms, only neuron types)
  • Enables reliable automation of curation tasks traditionally requiring human experts

Learn More

For detailed patterns and best practices on making ontology IDs hallucination-resistant in AI workflows, see:
