Compliance analysis tool for LinkML data files - measures recommended field population.
linkml-data-qc
A compliance analysis tool for LinkML data files. Measures how well your data populates recommended: true slots defined in LinkML schemas.
Features
- Hierarchical scoring: Calculate compliance at multiple levels (global, path-level, per-item)
- Aggregated list scoring: Roll up scores across list elements using jq-style [] notation
- Configurable weights: Assign importance weights to paths and slots
- Threshold violations: Set minimum compliance requirements and detect violations
- Multiple output formats: JSON, CSV, and human-readable text
- Multi-file reports: Aggregate compliance across an entire knowledge base
Installation
pip install linkml-data-qc
Or with uv:
uv add linkml-data-qc
Quick Start
Python API
from linkml_data_qc import ComplianceAnalyzer
# Basic usage
analyzer = ComplianceAnalyzer("path/to/schema.yaml")
report = analyzer.analyze_file("path/to/data.yaml", "TargetClass")
print(f"Global compliance: {report.global_compliance:.1f}%")
print(f"Total checks: {report.total_checks}")
print(f"Total populated: {report.total_populated}")
# With configuration for weights and thresholds
from linkml_data_qc import QCConfig, SlotQCConfig
config = QCConfig(
    default_weight=1.0,
    slots={
        "term": SlotQCConfig(weight=2.0, min_compliance=80.0),
        "description": SlotQCConfig(weight=0.5)
    }
)
analyzer = ComplianceAnalyzer("schema.yaml", config)
report = analyzer.analyze_file("data.yaml", "Disease")
if report.threshold_violations:
    print(f"Found {len(report.threshold_violations)} violations!")
Command Line
# Single file analysis
linkml-data-qc data.yaml -s schema.yaml -t TargetClass -f text
# Analyze all files in a directory
linkml-data-qc data/ -s schema.yaml -t TargetClass -f json
# With configuration and threshold enforcement
linkml-data-qc data/ -s schema.yaml -t TargetClass \
    -c qc_config.yaml --fail-on-violations
CLI Options
| Option | Description |
|---|---|
| DATA_PATH... | Data file(s) or directory to analyze (positional) |
| -s, --schema | Path to LinkML schema YAML (required) |
| -t, --target-class | Target class name for validation (required) |
| -c, --config | Path to QC configuration YAML file |
| -f, --format | Output format: json, csv, text (default: text) |
| -o, --output | Output file path (default: stdout) |
| --min-compliance | Minimum global compliance percentage (exit 1 if below) |
| --fail-on-violations | Exit with error code if any threshold violations occur |
| --pattern | Glob pattern for directory search (default: *.yaml) |
How It Works
Schema Introspection
The tool uses LinkML's SchemaView to identify slots marked with recommended: true:
# In your LinkML schema
slots:
  description:
    description: Human-readable description
    recommended: true  # This slot will be tracked
  term:
    description: Ontology term binding
    recommended: true  # This slot will be tracked
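As a rough illustration of the idea, the sketch below picks out recommended slots from a schema that has already been parsed into a plain dict. It deliberately does not use SchemaView and is not the tool's implementation: the `recommended_slots` helper and the extra `notes` slot are hypothetical, and only the top-level `slots` section is inspected (class-level `attributes` and `slot_usage` are ignored).

```python
from typing import Any


def recommended_slots(schema: dict[str, Any]) -> list[str]:
    """Return the names of top-level slots marked `recommended: true`."""
    slots = schema.get("slots") or {}
    return sorted(
        name
        for name, definition in slots.items()
        # A slot body can be empty (None) in YAML, so check the type first.
        if isinstance(definition, dict) and definition.get("recommended") is True
    )


# Dict equivalent of the schema fragment above, plus one untracked slot
# ("notes" is a made-up slot added only for contrast).
schema = {
    "slots": {
        "description": {"description": "Human-readable description", "recommended": True},
        "term": {"description": "Ontology term binding", "recommended": True},
        "notes": {"description": "Free-text notes"},
    }
}

print(recommended_slots(schema))
```

Slots without the flag, such as `notes` here, are simply left out of the result and would not count toward compliance.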
Recursive Analysis
The analyzer recursively traverses your data, tracking:
- Which recommended slots are present at each location
- The path to each object (e.g., pathophysiology[0].cell_types[2])
- The LinkML class of each object
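The traversal can be pictured with a small recursive generator. This is a simplified sketch, not the analyzer's actual code: it assumes every object is checked against the same fixed tuple of slots (the real tool resolves recommended slots per LinkML class), and the `walk` function, the `RECOMMENDED` constant, and the sample data (including the placeholder term `EX:0001`) are hypothetical.

```python
from typing import Any, Iterator

# Assumption: one global set of recommended slots, applied to every object.
RECOMMENDED = ("description", "term")


def walk(node: Any, path: str = "") -> Iterator[tuple[str, str, bool]]:
    """Yield (object_path, slot, is_populated) for every dict in the data."""
    if isinstance(node, dict):
        for slot in RECOMMENDED:
            # Missing, None, and empty values all count as not populated.
            populated = node.get(slot) not in (None, "", [], {})
            yield path or "(root)", slot, populated
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from walk(item, f"{path}[{index}]")


data = {
    "name": "Asthma",
    "description": "Chronic airway disease",
    "pathophysiology": [
        {"description": "Airway inflammation", "term": "EX:0001"},
        {"description": "Bronchoconstriction"},
    ],
}

for object_path, slot, populated in walk(data):
    print(f"{object_path}: {slot} -> {'populated' if populated else 'missing'}")
```

Each yielded tuple is one compliance check, so counting tuples gives the total number of checks and counting the populated ones gives the numerator.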
Aggregation Levels
Results are computed at multiple levels:
- Per-item scores: Each object gets compliance scores for its recommended slots
- Aggregated list scores: Rolled up by normalized path with [] notation
- Global scores: Overall compliance across all paths
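One way to picture the roll-up is to strip concrete list indices from each path and count checks per normalized key. The sketch below assumes that interpretation; the `normalize` and `aggregate` helpers and the sample checks are hypothetical and are not taken from the package.

```python
import re
from collections import defaultdict

_INDEX = re.compile(r"\[\d+\]")


def normalize(path: str) -> str:
    """Replace concrete list indices with [] so sibling items share one key."""
    return _INDEX.sub("[]", path)


def aggregate(checks):
    """Roll up (path, populated) pairs into {normalized_path: (populated, total)}."""
    counts = defaultdict(lambda: [0, 0])
    for path, populated in checks:
        entry = counts[normalize(path)]
        entry[0] += int(populated)
        entry[1] += 1
    return {key: (pop, total) for key, (pop, total) in counts.items()}


checks = [
    ("pathophysiology[0].term", True),
    ("pathophysiology[1].term", False),
    ("pathophysiology[0].cell_types[2].term", True),
]

for key, (pop, total) in aggregate(checks).items():
    print(f"{key}: {100 * pop / total:.1f}% ({pop}/{total})")
```

Summing the populated and total counts over all keys would give an unweighted global figure in the same way.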
Configuration
Create a YAML configuration file to customize weights and thresholds:
# qc_config.yaml
default_weight: 1.0
default_min_compliance: null
# Per-slot configuration
slots:
  term:
    weight: 2.0
    min_compliance: 80.0
  description:
    weight: 0.5
# Per-path overrides
paths:
  "phenotypes[].phenotype_term.term":
    weight: 3.0
    min_compliance: 95.0
Configuration Precedence
- Path-specific config (highest priority)
- Slot-specific config
- Default values
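The precedence order can be sketched as a lookup that tries the path entry, then the slot entry, then the default. This is an illustration only: the `resolve` function and the plain-dict `CONFIG` are hypothetical, and the sketch assumes fallthrough happens per field (a path entry that omits a field defers to the slot entry), which the text above does not state.

```python
from typing import Optional

# Plain-dict stand-in for the YAML configuration shown earlier.
CONFIG = {
    "default_weight": 1.0,
    "default_min_compliance": None,
    "slots": {
        "term": {"weight": 2.0, "min_compliance": 80.0},
        "description": {"weight": 0.5},
    },
    "paths": {
        "phenotypes[].phenotype_term.term": {"weight": 3.0, "min_compliance": 95.0},
    },
}


def resolve(config: dict, path: str, key: str) -> Optional[float]:
    """Look up `key` for a normalized path: path entry, slot entry, then default."""
    slot = path.rsplit(".", 1)[-1]  # last path segment is the slot name
    for section, name in (("paths", path), ("slots", slot)):
        entry = config.get(section, {}).get(name, {})
        if key in entry:
            return entry[key]
    return config.get(f"default_{key}")


print(resolve(CONFIG, "phenotypes[].phenotype_term.term", "weight"))  # path override
print(resolve(CONFIG, "pathophysiology[].term", "weight"))            # slot config
print(resolve(CONFIG, "pathophysiology[].name", "weight"))            # default
```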
Output Formats
Text Output
Compliance Report: data/Asthma.yaml
Target Class: Disease
Global Compliance: 65.3% (125/191)
Weighted Compliance: 71.2%
Summary by Slot:
description: 78.4%
term: 72.1%
Aggregated Scores by List Path:
pathophysiology[].description: 100.0% (5/5)
pathophysiology[].term: 80.0% (4/5)
JSON Output
{
  "file_path": "data/Asthma.yaml",
  "target_class": "Disease",
  "global_compliance": 65.3,
  "weighted_compliance": 71.2,
  "total_checks": 191,
  "total_populated": 125,
  "summary_by_slot": {
    "description": 78.4,
    "term": 72.1
  }
}
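Because the JSON report is plain data, it can be post-processed with the standard library. The sketch below lists slots whose score falls under a chosen cutoff; it relies only on the keys shown in the sample above, and the `slots_below` helper and the 75.0 cutoff are hypothetical.

```python
import json

# Sample report copied from the JSON output above.
report_json = """
{
  "file_path": "data/Asthma.yaml",
  "target_class": "Disease",
  "global_compliance": 65.3,
  "weighted_compliance": 71.2,
  "total_checks": 191,
  "total_populated": 125,
  "summary_by_slot": {
    "description": 78.4,
    "term": 72.1
  }
}
"""


def slots_below(report: dict, threshold: float) -> list[str]:
    """Return slot names whose summary score is strictly below `threshold`."""
    return sorted(
        slot
        for slot, score in report["summary_by_slot"].items()
        if score < threshold
    )


report = json.loads(report_json)
print(slots_below(report, 75.0))
```

In practice the report would come from a file written with `-f json -o`, loaded with `json.load` instead of the inline string.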
CI/CD Integration
Use exit codes for CI integration:
# Fail if global compliance is below 70%
linkml-data-qc data/ -s schema.yaml -t Disease --min-compliance 70
# Fail if any configured threshold is violated
linkml-data-qc data/ -s schema.yaml -t Disease \
    -c qc_config.yaml --fail-on-violations
Exit codes:
- 0: All checks passed
- 1: Compliance below threshold or violations detected
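For pipelines driven from Python rather than a shell, the same gate can be wrapped with `subprocess`. This is a minimal sketch that uses only the flags documented above; the `build_command` and `run_gate` helpers are hypothetical and are not part of the package.

```python
import subprocess
from typing import Optional, Sequence


def build_command(
    data_path: str,
    schema: str,
    target_class: str,
    min_compliance: Optional[float] = None,
    config: Optional[str] = None,
    fail_on_violations: bool = False,
) -> list[str]:
    """Assemble a linkml-data-qc argument list from the documented options."""
    cmd = ["linkml-data-qc", data_path, "-s", schema, "-t", target_class]
    if min_compliance is not None:
        cmd += ["--min-compliance", str(min_compliance)]
    if config is not None:
        cmd += ["-c", config]
    if fail_on_violations:
        cmd.append("--fail-on-violations")
    return cmd


def run_gate(cmd: Sequence[str]) -> bool:
    """Run the command; True means exit code 0 (all checks passed)."""
    return subprocess.run(cmd).returncode == 0


print(build_command("data/", "schema.yaml", "Disease", min_compliance=70))
```

Calling `run_gate(build_command(...))` requires the CLI to be installed and on the PATH; a `False` result corresponds to exit code 1 in the list above.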
Documentation
https://linkml.github.io/linkml-data-qc
Development
# Install dependencies
uv sync --group dev
# Run tests
just test
# Run doctests only
just doctest
# Run type checking
just mypy
# Run linting
just format
Credits
This project uses the monarch-project-copier template.