SDC4 XML validation with ExceptionalValue recovery

SDCvalidator

XML Schema Validation with Semantic Data Charter (SDC4) ExceptionalValue Recovery

License: MIT. Python 3.9+.

Overview

SDCvalidator is a specialized XML Schema validation library designed for Semantic Data Charter Release 4 (SDC4) data models. It extends standard XML Schema 1.1 validation with automatic ExceptionalValue injection for validation errors, implementing the SDC4 "quarantine-and-tag" pattern.

When validation errors occur, SDCvalidator:

  1. Preserves the invalid data in the XML instance
  2. Inserts SDC4 ExceptionalValue elements to flag the errors
  3. Classifies errors into 15 ISO 21090-based ExceptionalValue types
  4. Enables data quality tracking and auditing workflows

This library is based on the excellent xmlschema library by Davide Brunato and SISSA.

Key Features

  • SDC4 ExceptionalValue Recovery: Automatic error classification and injection
  • Full XML Schema 1.1 Support: XSD 1.0 and 1.1 validation
  • Data Quality Tracking: 15 ISO 21090 NULL Flavor-based ExceptionalValue types
  • Quarantine-and-Tag Pattern: Preserves invalid data for forensic analysis
  • Extensible Error Mapping: Customizable error-to-ExceptionalValue rules
  • High-Level API: Simple SDC4Validator interface for common workflows
  • Comprehensive Validation Reports: Detailed error summaries with ExceptionalValue classifications

Installation

pip install sdcvalidator

Quick Start

Basic SDC4 Validation with Recovery

from sdcvalidator import SDC4Validator

# Initialize validator with your SDC4 data model schema
validator = SDC4Validator('my_sdc4_datamodel.xsd')

# Validate XML instance and inject ExceptionalValues for errors
recovered_tree = validator.validate_with_recovery('my_instance.xml')

# Save the recovered XML with ExceptionalValue elements
validator.save_recovered_xml('recovered_instance.xml', 'my_instance.xml')

Generate Validation Reports

from sdcvalidator import SDC4Validator

validator = SDC4Validator('my_schema.xsd')
report = validator.validate_and_report('my_instance.xml')

print(f"Valid: {report['valid']}")
print(f"Error count: {report['error_count']}")
print(f"ExceptionalValue types: {report['exceptional_value_type_counts']}")

# Examine individual errors
for error in report['errors']:
    print(f"{error['xpath']}: {error['exceptional_value_type']} - {error['reason']}")
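The per-error entries lend themselves to further grouping with the standard library. The sketch below assumes the report is a plain dict with the keys used above (`errors`, `xpath`, `exceptional_value_type`, `reason`) and that the type is stored as a string code; the helper name `summarize_by_type` and the sample data are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def summarize_by_type(report):
    """Group error XPaths by ExceptionalValue type code."""
    grouped = defaultdict(list)
    for error in report.get("errors", []):
        grouped[error["exceptional_value_type"]].append(error["xpath"])
    return dict(grouped)


# Hypothetical report, shaped like the dict printed in the example above.
sample_report = {
    "valid": False,
    "error_count": 3,
    "errors": [
        {"xpath": "/root/a", "exceptional_value_type": "INV", "reason": "not an integer"},
        {"xpath": "/root/b", "exceptional_value_type": "NI", "reason": "missing element"},
        {"xpath": "/root/c", "exceptional_value_type": "INV", "reason": "pattern mismatch"},
    ],
}

summary = summarize_by_type(sample_report)
for code, xpaths in sorted(summary.items()):
    print(f"{code}: {len(xpaths)} error(s) at {', '.join(xpaths)}")
```

If the real report stores the type as an enum member rather than a string, the same grouping should still work as long as the members are hashable.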

Standard XML Schema Validation

SDCvalidator also supports traditional XML Schema validation:

from sdcvalidator import Schema, validate, is_valid

# Create schema (XSD 1.1 by default)
schema = Schema('my_schema.xsd')

# Validate instances
is_valid('my_instance.xml', schema)
validate('my_instance.xml', schema)

# Decode XML to dictionaries
data = schema.to_dict('my_instance.xml')

SDC4 ExceptionalValue Types

SDCvalidator maps validation errors to 15 ISO 21090 NULL Flavor-based ExceptionalValue types:

Code  Name               Description                                  Typical Use Case
----  -----------------  -------------------------------------------  -----------------------------------
INV   Invalid            Value not a member of permitted data values  Type violations, pattern mismatches
OTH   Other              Value not in coding system                   Enumeration violations
NI    No Information     Missing/omitted value                        Missing required elements
NA    Not Applicable     No proper value applicable                   Unexpected content
UNC   Unencoded          Raw source information                       Encoding/format errors
UNK   Unknown            Proper value applicable but not known        -
ASKU  Asked but Unknown  Information sought but not found             -
ASKR  Asked and Refused  Information sought but refused               -
NASK  Not Asked          Information not sought                       -
NAV   Not Available      Information not available                    -
MSK   Masked             Information masked for privacy/security      -
DER   Derived            Derived or calculated value                  -
PINF  Positive Infinity  Positive infinity                            -
NINF  Negative Infinity  Negative infinity                            -
TRC   Trace              Trace amount detected                        -
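The codes and their typical use cases can be mirrored as a small lookup, which may help when reasoning about how an error category ends up with a particular code. This is a standalone sketch, not the library's own `ExceptionalValueType`: the enum name `EVCode`, the category strings, and the fallback to `INV` are all assumptions made for illustration.

```python
from enum import Enum


class EVCode(Enum):
    """Stand-in for the 15 codes listed above (hypothetical name)."""
    INV = "Invalid"
    OTH = "Other"
    NI = "No Information"
    NA = "Not Applicable"
    UNC = "Unencoded"
    UNK = "Unknown"
    ASKU = "Asked but Unknown"
    ASKR = "Asked and Refused"
    NASK = "Not Asked"
    NAV = "Not Available"
    MSK = "Masked"
    DER = "Derived"
    PINF = "Positive Infinity"
    NINF = "Negative Infinity"
    TRC = "Trace"


# Hypothetical category names, derived from the "Typical Use Case" column.
CATEGORY_TO_CODE = {
    "type_violation": EVCode.INV,
    "pattern_mismatch": EVCode.INV,
    "enumeration_violation": EVCode.OTH,
    "missing_required_element": EVCode.NI,
    "unexpected_content": EVCode.NA,
    "encoding_error": EVCode.UNC,
}


def classify(category):
    """Return the code for a category; unknown categories fall back to INV (assumed)."""
    return CATEGORY_TO_CODE.get(category, EVCode.INV)


code = classify("enumeration_violation")
print(code.name, "-", code.value)
```

Only the five codes with a listed use case appear in the mapping; the remaining codes have no typical validation trigger in the table, so the sketch leaves them unmapped.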

ExceptionalValue Injection Example

When validation errors occur, SDCvalidator inserts ExceptionalValue elements while preserving the invalid data:

Input XML (invalid):

<sdc4:AdultPopulation>
    <label>Adult Population</label>
    <xdcount-value>not_a_number</xdcount-value>
    <xdcount-units>
        <label>Count Units</label>
        <xdstring-value>people</xdstring-value>
    </xdcount-units>
</sdc4:AdultPopulation>

Output XML (after recovery):

<sdc4:AdultPopulation>
    <label>Adult Population</label>

    <!-- ExceptionalValue inserted to flag the error -->
    <sdc4:INV>
        <sdc4:ev-name>Invalid</sdc4:ev-name>
        <!-- Validation error: not a valid value for type xs:integer -->
    </sdc4:INV>

    <!-- Invalid value preserved for auditing -->
    <xdcount-value>not_a_number</xdcount-value>

    <xdcount-units>
        <label>Count Units</label>
        <xdstring-value>people</xdstring-value>
    </xdcount-units>
</sdc4:AdultPopulation>
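The transformation above can be approximated with the standard library to make the quarantine-and-tag idea concrete. This is a minimal sketch, not SDCvalidator's implementation: the namespace URI is a placeholder, the helper `tag_invalid_child` is hypothetical, and the error is hard-coded rather than produced by schema validation.

```python
import xml.etree.ElementTree as ET

SDC4_NS = "urn:example:sdc4"  # placeholder URI, not the real SDC4 namespace
ET.register_namespace("sdc4", SDC4_NS)


def tag_invalid_child(parent, child_tag, code, name, reason):
    """Insert an ExceptionalValue element just before the invalid child."""
    for index, child in enumerate(list(parent)):
        if child.tag == child_tag:
            ev = ET.Element(f"{{{SDC4_NS}}}{code}")
            ev_name = ET.SubElement(ev, f"{{{SDC4_NS}}}ev-name")
            ev_name.text = name
            ev.append(ET.Comment(f" Validation error: {reason} "))
            parent.insert(index, ev)  # the invalid child itself is left untouched
            return ev
    return None


source = f"""
<sdc4:AdultPopulation xmlns:sdc4="{SDC4_NS}">
    <label>Adult Population</label>
    <xdcount-value>not_a_number</xdcount-value>
</sdc4:AdultPopulation>
"""
root = ET.fromstring(source.strip())
tag_invalid_child(root, "xdcount-value", "INV", "Invalid",
                  "not a valid value for type xs:integer")
print(ET.tostring(root, encoding="unicode"))
```

The key property is that the original `xdcount-value` text survives unchanged next to the marker, so downstream tools can audit the rejected value.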

Command-Line Interface

Validate and recover XML instances from the command line:

# Validate with ExceptionalValue recovery
sdcvalidate --recover my_instance.xml -o recovered.xml --schema my_schema.xsd

# Generate validation report
sdcvalidate --report my_instance.xml --schema my_schema.xsd

# Standard validation (no recovery)
sdcvalidate my_instance.xml --schema my_schema.xsd

Convert between XML and JSON:

# XML to JSON
sdcvalidator-xml2json my_instance.xml -o output.json --schema my_schema.xsd

# JSON to XML
sdcvalidator-json2xml my_data.json -o output.xml --schema my_schema.xsd

Advanced Usage

Custom Error Mapping Rules

from sdcvalidator import SDC4Validator, ErrorMapper, ExceptionalValueType

# Create custom error mapper
error_mapper = ErrorMapper()

# Add custom rule for confidential data errors
def is_confidential_error(error):
    return error.reason and 'confidential' in error.reason.lower()

error_mapper._rules.insert(0, (is_confidential_error, ExceptionalValueType.MSK))

# Use custom mapper
validator = SDC4Validator('my_schema.xsd', error_mapper=error_mapper)
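Inserting the rule at index 0 suggests that rules are tried in order and the first match wins. The sketch below illustrates that evaluation strategy on its own; it is an assumption about how `ErrorMapper` behaves, not its actual code, and `FakeError`, `first_match`, and the string codes are hypothetical stand-ins. Since `_rules` has a leading underscore, it is private by Python convention and may change between releases.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class FakeError:
    """Hypothetical stand-in for a validation error with a `reason` string."""
    reason: Optional[str]


Rule = Tuple[Callable[[FakeError], bool], str]


def first_match(rules: List[Rule], error: FakeError, default: str = "INV") -> str:
    """Return the code of the first rule whose predicate accepts the error."""
    for predicate, code in rules:
        if predicate(error):
            return code
    return default


# A general rule that is already present.
rules: List[Rule] = [
    (lambda e: bool(e.reason) and "not a valid value" in e.reason, "INV"),
]


def is_confidential_error(error: FakeError) -> bool:
    return bool(error.reason) and "confidential" in error.reason.lower()


rules.insert(0, (is_confidential_error, "MSK"))  # checked before the general rule

error = FakeError("Confidential field: not a valid value for type xs:string")
print(first_match(rules, error))
```

Both predicates accept this error, so the result depends purely on order: with the custom rule first the sketch prints `MSK`, whereas appending it at the end would yield `INV`.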

Filtering Valid Data for Analytics

To select only valid data (excluding elements with ExceptionalValues):

from xml.etree import ElementTree as ET

def has_exceptional_value(element):
    """Check if element contains any ExceptionalValue."""
    for child in element:
        local_name = child.tag.split('}')[1] if '}' in child.tag else child.tag
        if local_name in ['INV', 'OTH', 'NI', 'NA', 'UNC', 'UNK', 'MSK',
                          'ASKU', 'ASKR', 'NASK', 'NAV', 'DER',
                          'PINF', 'NINF', 'TRC', 'QS']:
            return True
    return False

# Filter valid elements
tree = ET.parse('recovered_instance.xml')
valid_elements = [elem for elem in tree.iter() if not has_exceptional_value(elem)]
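To see how such a filter behaves on a recovered document, the following self-contained sketch runs a similar check against an in-memory example. The namespace URI, the `ChildPopulation` element, and the helper `split_by_flag` are hypothetical; unlike the snippet above, this variant also skips the ExceptionalValue marker elements themselves.

```python
import xml.etree.ElementTree as ET

EV_CODES = {"INV", "OTH", "NI", "NA", "UNC", "UNK", "MSK", "ASKU",
            "ASKR", "NASK", "NAV", "DER", "PINF", "NINF", "TRC", "QS"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.split("}", 1)[1] if "}" in tag else tag


def split_by_flag(root):
    """Return (flagged, clean) element lists, ignoring the marker elements."""
    flagged, clean = [], []

    def walk(element):
        if local_name(element.tag) in EV_CODES:
            return  # the ExceptionalValue marker is not data
        has_ev = any(local_name(child.tag) in EV_CODES for child in element)
        (flagged if has_ev else clean).append(element)
        for child in element:
            walk(child)

    walk(root)
    return flagged, clean


doc = ET.fromstring(
    '<root xmlns:sdc4="urn:example:sdc4">'
    '<sdc4:AdultPopulation>'
    '<sdc4:INV><sdc4:ev-name>Invalid</sdc4:ev-name></sdc4:INV>'
    '<xdcount-value>not_a_number</xdcount-value>'
    '</sdc4:AdultPopulation>'
    '<sdc4:ChildPopulation><xdcount-value>42</xdcount-value></sdc4:ChildPopulation>'
    '</root>'
)
flagged, clean = split_by_flag(doc)
print([local_name(e.tag) for e in flagged])
print([local_name(e.tag) for e in clean])
```

Because the check looks only at direct children, the `xdcount-value` holding `not_a_number` still lands in the clean list; an analytics pipeline would probably want to skip the entire subtree of a flagged element instead.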

Architecture

SDCvalidator consists of:

  1. Core Validation (sdcvalidator.core): Full XML Schema 1.0/1.1 validation engine
  2. SDC4 Module (sdcvalidator.sdc4): ExceptionalValue injection and error mapping
  3. Resources (sdcvalidator.resources): XML resource loading and caching
  4. Converters (sdcvalidator.converters): XML ↔ Python data conversion
  5. XPath (sdcvalidator.xpath): XPath-based element selection

Documentation

User Documentation

Developer Documentation

Development

Contributing

We welcome contributions! Please see our comprehensive guides:

Running Tests

# Run all tests
pytest

# Run SDC4 tests only
pytest tests/sdc4/ -v

# Run with coverage
pytest --cov=sdcvalidator --cov-report=html

# Run linters
flake8 sdcvalidator
mypy sdcvalidator

Development Setup

# Clone repository
git clone https://github.com/Axius-SDC/sdcvalidator.git
cd sdcvalidator

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

See CLAUDE.md for the complete developer guide.

Credits

SDCvalidator is developed by Axius-SDC, Inc. and is based on the xmlschema library by:

  • Davide Brunato (brunato@sissa.it)
  • SISSA (International School for Advanced Studies)

The core XML Schema validation engine and much of the underlying architecture are from the xmlschema project.

License

This software is distributed under the terms of the MIT License.

Copyright (c) 2025, Axius-SDC, Inc.
Copyright (c) 2016-2024, SISSA (International School for Advanced Studies)

See the LICENSE file for details.

SDC4 Ecosystem

SDCvalidator is part of the SDC4 (Semantic Data Charter version 4) ecosystem:

  • SDCRM v4.0.0 - Reference model and schemas
  • SDCStudio v4.0.0 - Web application for model generation
  • SDCvalidator v4.0.1 - This library (validation and recovery)
  • Obsidian Template v4.0.0 - Markdown templates for dataset descriptions

All SDC4 projects use 4.x.x versioning; the MAJOR version (4) represents the SDC generation.

Support

Acknowledgments

Special thanks to:

  • Davide Brunato and SISSA for the excellent xmlschema library
  • The Semantic Data Charter community for the SDC4 specification
  • All contributors to the project
