Compliance analysis tool for LinkML data files - measures recommended field population.

linkml-data-qc

A compliance analysis tool for LinkML data files. It measures how completely your data populates the slots marked recommended: true in a LinkML schema.

Features

  • Hierarchical scoring: Calculate compliance at multiple levels (global, path-level, per-item)
  • Aggregated list scoring: Roll up scores across list elements using jq-style [] notation
  • Configurable weights: Assign importance weights to paths and slots
  • Threshold violations: Set minimum compliance requirements and detect violations
  • Multiple output formats: JSON, CSV, and human-readable text
  • Multi-file reports: Aggregate compliance across an entire knowledge base

Installation

pip install linkml-data-qc

Or with uv:

uv add linkml-data-qc

Quick Start

Python API

from linkml_data_qc import ComplianceAnalyzer

# Basic usage
analyzer = ComplianceAnalyzer("path/to/schema.yaml")
report = analyzer.analyze_file("path/to/data.yaml", "TargetClass")

print(f"Global compliance: {report.global_compliance:.1f}%")
print(f"Total checks: {report.total_checks}")
print(f"Total populated: {report.total_populated}")

# With configuration for weights and thresholds
from linkml_data_qc import QCConfig, SlotQCConfig

config = QCConfig(
    default_weight=1.0,
    slots={
        "term": SlotQCConfig(weight=2.0, min_compliance=80.0),
        "description": SlotQCConfig(weight=0.5)
    }
)
analyzer = ComplianceAnalyzer("schema.yaml", config)
report = analyzer.analyze_file("data.yaml", "Disease")

if report.threshold_violations:
    print(f"Found {len(report.threshold_violations)} violations!")

Command Line

# Single file analysis
linkml-data-qc data.yaml -s schema.yaml -t TargetClass -f text

# Analyze all files in a directory
linkml-data-qc data/ -s schema.yaml -t TargetClass -f json

# With configuration and threshold enforcement
linkml-data-qc data/ -s schema.yaml -t TargetClass \
    -c qc_config.yaml --fail-on-violations

CLI Options

Option                  Description
DATA_PATH...            Data file(s) or directory to analyze (positional)
-s, --schema            Path to LinkML schema YAML (required)
-t, --target-class      Target class name for validation (required)
-c, --config            Path to QC configuration YAML file
-f, --format            Output format: json, csv, text (default: text)
-o, --output            Output file path (default: stdout)
--min-compliance        Minimum global compliance percentage (exit 1 if below)
--fail-on-violations    Exit with error code if any threshold violations occur
--pattern               Glob pattern for directory search (default: *.yaml)

How It Works

Schema Introspection

The tool uses LinkML's SchemaView to identify slots marked with recommended: true:

# In your LinkML schema
slots:
  description:
    description: Human-readable description
    recommended: true  # This slot will be tracked

  term:
    description: Ontology term binding
    recommended: true  # This slot will be tracked
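
To make the selection rule concrete, here is a minimal sketch that works on plain dictionaries rather than on SchemaView objects. It is not the tool's implementation: the schema_slots mapping, the extra name slot, and the recommended_slot_names helper are assumptions made up for illustration.

```python
# Hypothetical, already-parsed "slots" section of a schema, held as plain
# dicts. The real tool reads this through LinkML's SchemaView instead.
schema_slots = {
    "description": {"description": "Human-readable description", "recommended": True},
    "term": {"description": "Ontology term binding", "recommended": True},
    "name": {"description": "Display name"},  # illustrative slot, not marked
}


def recommended_slot_names(slots):
    """Return the sorted names of slots whose definition sets recommended: true."""
    return sorted(
        name for name, definition in slots.items()
        if definition.get("recommended") is True
    )


tracked = recommended_slot_names(schema_slots)
print(tracked)
```

Only slots that explicitly set the flag are tracked; a slot with no recommended key is ignored.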

Recursive Analysis

The analyzer recursively traverses your data, tracking:

  • Which recommended slots are present at each location
  • The path to each object (e.g., pathophysiology[0].cell_types[2])
  • The LinkML class of each object
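
The traversal described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: the RECOMMENDED and RANGES lookup tables stand in for schema queries, the Mechanism class name and the sample data are invented, and a slot counts as populated when its value is not None or empty.

```python
from typing import Iterator

# Hypothetical stand-ins for schema lookups.
# class name -> recommended slots on that class
RECOMMENDED = {"Disease": ["description"], "Mechanism": ["description", "term"]}
# (class name, slot name) -> class of the nested object(s) in that slot
RANGES = {("Disease", "pathophysiology"): "Mechanism"}


def walk(obj: dict, cls: str, path: str = "") -> Iterator[tuple]:
    """Yield (path, class, slot, populated) for each recommended slot found."""
    for slot in RECOMMENDED.get(cls, []):
        value = obj.get(slot)
        yield (path, cls, slot, value not in (None, "", [], {}))
    for slot, value in obj.items():
        child_cls = RANGES.get((cls, slot))
        if child_cls is None:
            continue  # scalar slot or unknown range: nothing to descend into
        child_path = f"{path}.{slot}" if path else slot
        if isinstance(value, list):
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    yield from walk(item, child_cls, f"{child_path}[{index}]")
        elif isinstance(value, dict):
            yield from walk(value, child_cls, child_path)


data = {
    "name": "Asthma",
    "description": "Chronic airway inflammation",
    "pathophysiology": [
        {"description": "Airway remodeling", "term": "EX:0001"},
        {"description": "Mucus hypersecretion"},  # term is missing here
    ],
}

checks = list(walk(data, "Disease"))
for check in checks:
    print(check)
```

Each yielded tuple records where the object sits, which class it was interpreted as, and whether the recommended slot was filled, which is the raw material for the scores described next.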

Aggregation Levels

Results are computed at multiple levels:

  1. Per-item scores: Each object gets compliance scores for its recommended slots
  2. Aggregated list scores: Rolled up by normalized path with [] notation
  3. Global scores: Overall compliance across all paths
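
As a rough sketch of the roll-up, the snippet below normalizes concrete list indices to [] and sums populated and total counts per normalized path. The normalize helper, the sample checks, and the unweighted populated/total formula for the global score are assumptions for illustration; the package may compute these differently.

```python
import re
from collections import defaultdict


def normalize(path: str) -> str:
    """Replace concrete list indices with [], e.g. a[0].b[2] -> a[].b[]."""
    return re.sub(r"\[\d+\]", "[]", path)


# Hypothetical per-item results: (object path, slot, populated).
checks = [
    ("pathophysiology[0]", "term", True),
    ("pathophysiology[1]", "term", False),
    ("pathophysiology[0].cell_types[2]", "term", True),
]

# normalized path -> [populated count, total count]
totals = defaultdict(lambda: [0, 0])
for path, slot, populated in checks:
    key = f"{normalize(path)}.{slot}"
    totals[key][0] += int(populated)
    totals[key][1] += 1

# Percent compliance per aggregated list path.
aggregated = {key: 100.0 * p / t for key, (p, t) in totals.items()}

# Unweighted global score: all populated checks over all checks.
global_score = (
    100.0 * sum(p for p, _ in totals.values()) / sum(t for _, t in totals.values())
)

print(aggregated)
print(round(global_score, 1))
```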

Configuration

Create a YAML configuration file to customize weights and thresholds:

# qc_config.yaml
default_weight: 1.0
default_min_compliance: null

# Per-slot configuration
slots:
  term:
    weight: 2.0
    min_compliance: 80.0
  description:
    weight: 0.5

# Per-path overrides
paths:
  "phenotypes[].phenotype_term.term":
    weight: 3.0
    min_compliance: 95.0

Configuration Precedence

  1. Path-specific config (highest priority)
  2. Slot-specific config
  3. Default values
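
The precedence order can be illustrated with a small lookup that tries the path entry first, then the slot entry, then the defaults. This is a sketch only: the resolve function is hypothetical, the values mirror the qc_config.yaml example above, and falling back field by field (rather than taking the first matching entry whole) is an assumption about how unset fields behave.

```python
# Values copied from the qc_config.yaml example; plain dicts for illustration.
DEFAULTS = {"weight": 1.0, "min_compliance": None}
SLOTS = {
    "term": {"weight": 2.0, "min_compliance": 80.0},
    "description": {"weight": 0.5},
}
PATHS = {
    "phenotypes[].phenotype_term.term": {"weight": 3.0, "min_compliance": 95.0},
}


def resolve(path: str) -> dict:
    """Resolve weight and threshold for a normalized path (hypothetical helper)."""
    slot = path.rsplit(".", 1)[-1]  # the last path segment is the slot name
    layers = [PATHS.get(path, {}), SLOTS.get(slot, {}), DEFAULTS]

    def first_set(key):
        # Take the value from the highest-priority layer that sets it.
        for layer in layers:
            if layer.get(key) is not None:
                return layer[key]
        return None

    return {"weight": first_set("weight"), "min_compliance": first_set("min_compliance")}


print(resolve("phenotypes[].phenotype_term.term"))  # path entry wins
print(resolve("pathophysiology[].term"))            # slot entry applies
print(resolve("pathophysiology[].name"))            # defaults apply
```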

Output Formats

Text Output

Compliance Report: data/Asthma.yaml
Target Class: Disease
Global Compliance: 65.3% (125/191)
Weighted Compliance: 71.2%

Summary by Slot:
  description: 78.4%
  term: 72.1%

Aggregated Scores by List Path:
  pathophysiology[].description: 100.0% (5/5)
  pathophysiology[].term: 80.0% (4/5)

JSON Output

{
  "file_path": "data/Asthma.yaml",
  "target_class": "Disease",
  "global_compliance": 65.3,
  "weighted_compliance": 71.2,
  "total_checks": 191,
  "total_populated": 125,
  "summary_by_slot": {
    "description": 78.4,
    "term": 72.1
  }
}
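
A JSON report like the one above is straightforward to post-process. The sketch below embeds the sample report as a string so that it runs on its own; in practice you would probably read a file written with -f json -o. The variable names are illustrative, and only the fields shown in the sample are assumed to exist.

```python
import json

# The sample report from above, embedded so the example is self-contained.
report_json = """
{
  "file_path": "data/Asthma.yaml",
  "target_class": "Disease",
  "global_compliance": 65.3,
  "weighted_compliance": 71.2,
  "total_checks": 191,
  "total_populated": 125,
  "summary_by_slot": {
    "description": 78.4,
    "term": 72.1
  }
}
"""

report = json.loads(report_json)

# Number of recommended slots that were checked but left empty.
missing = report["total_checks"] - report["total_populated"]

# Slot with the lowest compliance, a natural place to start curating.
by_slot = report["summary_by_slot"]
worst_slot = min(by_slot, key=by_slot.get)

print(f"{report['file_path']}: {missing} unpopulated recommended slots")
print(f"Lowest-scoring slot: {worst_slot} ({by_slot[worst_slot]}%)")
```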

CI/CD Integration

Use exit codes for CI integration:

# Fail if global compliance is below 70%
linkml-data-qc data/ -s schema.yaml -t Disease --min-compliance 70

# Fail if any configured threshold is violated
linkml-data-qc data/ -s schema.yaml -t Disease \
    -c qc_config.yaml --fail-on-violations

Exit codes:

  • 0: All checks passed
  • 1: Compliance below threshold or violations detected
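
The decision behind these exit codes can be sketched as a small function. It is a guess at the logic implied by the two options above, not the CLI's actual code; the exit_code name and its parameters are hypothetical, and how the tool combines results across multiple files is not covered here.

```python
def exit_code(global_compliance, violations, min_compliance=None,
              fail_on_violations=False):
    """Return 0 if all checks pass, 1 otherwise (hypothetical helper)."""
    # Mirrors --min-compliance: fail when the global score is below the minimum.
    if min_compliance is not None and global_compliance < min_compliance:
        return 1
    # Mirrors --fail-on-violations: fail only when the flag is set and
    # at least one threshold violation was recorded.
    if fail_on_violations and violations:
        return 1
    return 0


print(exit_code(65.3, [], min_compliance=70))             # below the minimum
print(exit_code(65.3, ["term"], fail_on_violations=True)) # violation with flag set
print(exit_code(75.0, ["term"], min_compliance=70))       # flag not set
```

Note that under this reading a threshold violation only affects the exit code when --fail-on-violations is passed.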

Documentation

https://linkml.github.io/linkml-data-qc

Development

# Install dependencies
uv sync --group dev

# Run tests
just test

# Run doctests only
just doctest

# Run type checking
just mypy

# Run linting
just format

Credits

This project uses the monarch-project-copier template.
