Validate, repair, and normalize XML against a JSON ground-truth schema.

xmldev

A Python library and CLI for validating, repairing, and normalizing XML against a developer-defined JSON ground-truth schema. Because apparently the world still runs on XML, and apparently no one agrees on what that XML should look like.

Why This Exists

You have an XML feed. Your partner sends you XML. Your internal service produces XML. An LLM burps out XML. And somehow, none of it quite matches what the schema says it should look like.

The element is named <FullName> in prod, <fullname> in staging, and <full-name> in QA, and your downstream processor accepts exactly none of them. Someone typed twenty five into an integer field. Half the closing tags are missing — recovered by lxml but now the tree looks like abstract art. Your integration test has been broken for three sprints and nobody knows why.

xmldev is the library that fixes all of that. You tell it what the XML is supposed to look like (in a JSON file you write once), and it tells you what is wrong with what you got. Then it fixes it. Then it tells you exactly what it changed and why. No magic. No surprises. An LLM can optionally be involved, but only if you explicitly ask for it and actually configure it first.

This project exists because "just validate the XML" is the oldest lie in enterprise software, and someone had to write the thing that actually does it.

Installation

pip install xmldev

Or from source:

git clone https://github.com/yourname/xmldev.git
cd xmldev
pip install -e ".[dev]"

Python 3.10+ required. Dependencies: lxml, click, jsonschema, rapidfuzz, openai, prometheus-client, python-dateutil.

The LLM dependency (openai) is used only when you configure and enable LLM fallback. If you never touch the config file, no LLM calls are made. Ever.

Quick Start

Python API

from xmldev import Xmldev

xd = Xmldev()
schema = xd.load_schema_from_file("schema.person.json")

# Read the raw bytes inside a context manager so the file handle is closed promptly.
with open("broken.xml", "rb") as f:
    result = xd.validate_and_fix(f.read(), schema)

if result["ok"]:
    print(result["fixed_xml"])
else:
    print("Could not fully fix. Diagnostics:")
    for d in result["diagnostics"]:
        print(" ", d)

result is always a dict with:

| Key | Type | Description |
|---|---|---|
| ok | bool | True if the output passes schema validation |
| fixed_xml | str or None | The repaired XML string |
| patches | list[dict] | Every change made, with confidence and source |
| diagnostics | list[str] | Human-readable description of unfixable problems |
| provenance | dict | Which repair layer actually fixed it |
| original | str | The input as received, unmodified |

CLI

# Validate and repair a file
xmldev validate --schema schema.person.json --input broken.xml --output fixed.xml --audit audit.json

# Repair with auto-apply
xmldev repair --schema schema.person.json --input broken.xml --output fixed.xml --auto-apply

# Lint an entire directory
xmldev lint --schema schema.person.json --input ./xml_feeds/ --report report.json

# Start an HTTP server
xmldev serve --port 8080

Exit codes: 0 = pass, 1 = validation failed, 2 = fixed but has low-confidence patches that need review, 3 = fatal error (bad schema, I/O failure).

Defining Your Schema

The ground-truth schema is a JSON file you write once. It describes the structure of the XML you expect to receive. Here is the minimal schema for a person record:

{
  "root": {
    "name": "person",
    "attrs": {
      "id": { "name": "id", "type": "int", "required": true }
    },
    "children": [
      {
        "spec": {
          "name": "name",
          "text_type": "string",
          "aliases": ["fullname", "full-name", "full_name"]
        },
        "min_occurs": 1,
        "order_index": 0
      },
      {
        "spec": {
          "name": "age",
          "text_type": "int",
          "default": 0
        },
        "min_occurs": 0,
        "order_index": 1
      }
    ],
    "order_enforced": true
  }
}

Given this schema and the following broken XML:

<person id="12">
  <fullname>Bob</fullname>
  <age>twenty five</age>
</person>

xmldev will:

  1. Rename <fullname> to <name> (alias match, confidence 1.0, auto-applied).
  2. Coerce "twenty five" to 25 (words-to-number parser, confidence 0.6, flagged for review).
  3. Return ok: true and two patches in the audit.

Nothing was sent to any API. Nothing was guessed. The alias was declared. The type was declared. The fix was deterministic.

Schema Reference (Short Form)

Full meta-schema lives in xmldev/schema.py. The fields that matter most:

Element fields

| Field | Type | Description |
|---|---|---|
| name | string | Required. The expected tag name. |
| text_type | string | One of string, int, float, date, bool, enum, regex. Default string. |
| enum_values | list[str] | Required if text_type is enum. |
| pattern | string | Regex pattern for text_type: regex. |
| aliases | list[str] | Tag names that should silently rename to this element. |
| default | any | Value to insert when this element is missing and min_occurs >= 1. |
| attrs | dict | Map of attribute name to AttrSpec. |
| children | list | Array of ChildSpec objects. |
| order_enforced | bool | If true, children must appear in order_index order. |
| conditional_rules | list | Expression-based rules evaluated at validation time. |
| custom_validator | string | "module.path:callable" for custom validation logic. |

ChildSpec fields

| Field | Type | Description |
|---|---|---|
| spec | ElementSpec | The element definition (nested). |
| min_occurs | int | Minimum occurrences. 0 = optional. Default 0. |
| max_occurs | int or "unbounded" | Maximum occurrences. Default 1. |
| order_index | int | Position when order_enforced is true. |

Global config

{
  "root": { ... },
  "global": {
    "allow_unknown": "move_to_extension",
    "extension_element_name": "extensions",
    "fuzzy": {
      "name_threshold": 0.85,
      "permissive_threshold": 0.70,
      "max_renames_per_doc": 10
    }
  }
}

allow_unknown controls what happens to elements not in the schema:

  • keep — leave them alone.
  • drop — delete them, generate a patch.
  • move_to_extension (default) — move them into an <extensions> wrapper element.
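The three policies can be sketched roughly as follows. This is an illustration only, not xmldev's code: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, the function name apply_unknown_policy is hypothetical, and it does not record patches the way the real library does.

```python
import xml.etree.ElementTree as ET


def apply_unknown_policy(parent, known, policy="move_to_extension",
                         extension_name="extensions"):
    """Apply an allow_unknown-style policy to the direct children of parent."""
    if policy not in ("keep", "drop", "move_to_extension"):
        raise ValueError(f"unsupported policy: {policy!r}")
    unknown = [
        child for child in list(parent)
        if child.tag not in known and child.tag != extension_name
    ]
    if policy == "keep" or not unknown:
        return parent
    for child in unknown:
        parent.remove(child)              # "drop" stops here
    if policy == "move_to_extension":
        wrapper = parent.find(extension_name)
        if wrapper is None:
            wrapper = ET.SubElement(parent, extension_name)
        wrapper.extend(unknown)           # re-attach under the wrapper
    return parent


root = ET.fromstring("<person><name>Bob</name><nickname>B</nickname></person>")
apply_unknown_policy(root, known={"name", "age"})
print(ET.tostring(root, encoding="unicode"))
# <person><name>Bob</name><extensions><nickname>B</nickname></extensions></person>
```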

LLM Fallback

The LLM fallback is opt-in. It will not activate unless you create a config file and explicitly set XMDEV_ALLOW_LLM=true in it. There is no default behavior that sends data anywhere.

Create xmldev.config.env (see examples/xmldev.config.env.example):

LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key-here
LLM_MODEL_ID=gpt-4o-mini
LLM_TEMPERATURE=0.0
LLM_MAX_TOKENS=2048
XMDEV_ALLOW_LLM=true
LLM_REDACT=true

Then pass it at construction time:

xd = Xmldev(config_path="xmldev.config.env")
result = xd.validate_and_fix(xml, schema, allow_llm=True)

Or via CLI:

xmldev repair --schema schema.json --input broken.xml --output fixed.xml \
  --allow-llm --config xmldev.config.env

PII redaction is enabled by default when LLM is used. Tags and attributes matching a configurable list are replaced with [REDACTED] before the prompt is sent. Set LLM_REDACT=false at your own discretion and your own legal risk.

The LLM is called after deterministic and heuristic passes have both failed. It gets the broken XML, the relevant schema snippet, and the list of violations. If the LLM output passes schema validation, it is accepted. If not, one retry is attempted with a stricter prompt. If that also fails, the LLM result is discarded and diagnostics are returned. Total LLM calls per document: maximum 2.
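A minimal sketch of that two-call budget, assuming the model call and the schema check are passed in as plain callables. The names llm_repair_with_retry, call_llm, and passes_schema are hypothetical and do not come from xmldev.

```python
from typing import Callable, Optional


def llm_repair_with_retry(
    call_llm: Callable[[str], str],
    passes_schema: Callable[[str], bool],
    prompt: str,
    strict_prompt: str,
) -> Optional[str]:
    """Try the normal prompt, then one stricter retry; never more than 2 calls."""
    for attempt_prompt in (prompt, strict_prompt):
        candidate = call_llm(attempt_prompt)
        if passes_schema(candidate):
            return candidate
    return None  # caller discards the LLM output and falls back to diagnostics


# Fake model for illustration: only the stricter prompt produces usable XML.
calls = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "<person/>" if prompt == "strict" else "sorry, here is some prose"

fixed = llm_repair_with_retry(fake_llm, lambda s: s.startswith("<"), "normal", "strict")
print(fixed, len(calls))
```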


Developer Reference

This section is for people who are integrating xmldev into a pipeline, embedding it into a service, extending it with custom validators, or just trying to understand why it did what it did to your XML. It is long. That is intentional.

Architecture Overview

The library is a pipeline of sequential, independently testable stages. Each stage produces an output that the next stage consumes. No stage reaches back to modify a previous stage's output.

Input XML (str or bytes)
        |
        v
  [1] tolerant_parse        -- xmldev/parser.py
        |
        v
  [2] canonicalize           -- xmldev/parser.py
        |
        v
  [3] alias normalization   -- xmldev/repair.py  DeterministicRepairer
        |
        v
  [4] Validator              -- xmldev/validator.py
        |
  violations list
        |
        v
  [5] DeterministicRepairer  -- xmldev/repair.py   (up to 3 passes)
        |
        v
  [6] re-validate
        |
        +-- ok --> return result
        |
        v
  [7] HeuristicRepairer (if aggressive=True)
        |
        v
  [8] LLMClient (if configured and allow_llm=True)
        |
        v
  return result dict

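Each stage is implemented in the module listed beside it: parsing and canonicalization share xmldev/parser.py, and alias normalization and the repairers live in xmldev/repair.py. The orchestrator is xmldev/__init__.py (Xmldev.validate_and_fix). It contains no schema parsing, no repair logic, and no fuzzy matching; it only calls the right thing in the right order.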

Module Reference

xmldev/schema.py

What it does: Loads and validates the user-provided ground-truth JSON schema and converts it into a Python AST made of dataclasses.

Key classes:

  • SchemaLoader.load(source) — accepts a JSON string, bytes, or pathlib.Path. Validates against an internal meta-schema using jsonschema. Raises SchemaLoadError (code XM01) on any problem.

  • Schema — top-level container: root: ElementSpec, global_config: GlobalConfig.

  • ElementSpec — represents one XML element: name, text_type, enum_values, pattern, attrs, children, order_enforced, aliases, default, conditional_rules, custom_validator.

  • ChildSpec — wraps an ElementSpec with cardinality info: min_occurs, max_occurs, order_index, required_if.

  • AttrSpec — attribute definition: name, type, required, default, enum_values, pattern, aliases.

  • GlobalConfig — global behavior: allow_unknown, extension_element_name, name_threshold, permissive_threshold, max_renames_per_doc.

Important implementation note: The GlobalConfig.name_threshold field defaults to 0.85 (the schema default), not None. The orchestrator detects whether the user explicitly overrode this value by comparing against the GlobalConfig() default. If not explicitly set, the fuzzy profile's threshold is used. If you change GlobalConfig defaults, update Xmldev.validate_and_fix accordingly.


xmldev/parser.py

What it does: Turns a string or bytes into an lxml _Element tree.

Key functions:

  • tolerant_parse(xml) — returns a ParseResult(root, recovered, parse_errors). First tries a strict parse. On failure, retries with lxml.etree.XMLParser(recover=True). If recovery also fails or returns an empty tree, raises ParseError (code XM02).

  • canonicalize(root) — modifies the tree in-place. Strips insignificant whitespace from text and tail nodes. Does NOT remove intentional whitespace in text content — only leading/trailing whitespace is stripped. Returns the root for chaining.

  • xpath_path(elem, doc_root) — generates an XPath-style path string like /person[1]/name[1]. Used in patch records and violation messages. The path is position-indexed to be unambiguous even when siblings share a tag name.

  • _local_name(tag) — strips namespace prefix from {uri}localname format. Used throughout the codebase wherever a tag name needs to be compared to a schema name.
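The two path helpers can be approximated as below. This is a sketch under stated assumptions, not the library's implementation: it uses xml.etree.ElementTree instead of lxml (so it has to build its own parent map, where lxml offers getparent()), and it relies on the fact that namespaced tags are stored in the {uri}localname form, so what gets stripped is the namespace URI in braces.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{uri}' part of a namespaced tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


def xpath_path(elem, doc_root) -> str:
    """Build a position-indexed path such as /person[1]/name[2]."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in doc_root.iter() for child in parent}
    steps = []
    node = elem
    while node is not None:
        parent = parents.get(node)
        if parent is None:
            siblings = [node]
        else:
            siblings = [s for s in parent if s.tag == node.tag]
        steps.append(f"{local_name(node.tag)}[{siblings.index(node) + 1}]")
        node = parent
    return "/" + "/".join(reversed(steps))


root = ET.fromstring('<person xmlns="urn:x"><name>A</name><name>B</name></person>')
print(local_name(root.tag))        # person
print(xpath_path(root[1], root))   # /person[1]/name[2]
```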

Parser recovery behavior: lxml's recovery mode does NOT guarantee a valid tree. It makes a best effort. The recovered flag in ParseResult indicates that recovery was used. The HeuristicRepairer looks at this flag to decide whether additional structural reconstruction is needed.


xmldev/validator.py

What it does: Walks the parsed XML tree and produces a list of Violation objects. Does NOT modify the tree.

Key classes:

  • Violation — a dataclass with code, path, element (lxml element reference), message, and extra dict.

  • Validator(schema).validate(root) — entry point. Returns list[Violation].

Violation codes:

| Code | Meaning |
|---|---|
| UNKNOWN_TAG | Element not in schema and not an alias of any schema element |
| MISSING_REQUIRED | min_occurs > 0 and element not present (alias-aware count) |
| TOO_MANY | more occurrences than max_occurs |
| UNKNOWN_ATTR | Attribute not in schema |
| MISSING_REQUIRED_ATTR | Required attribute absent and not an alias of a present attr |
| TEXT_TYPE_ERROR | Leaf text content fails type validation |
| ATTR_TYPE_ERROR | Attribute value fails type validation |
| WRONG_ORDER | Children appear in wrong order when order_enforced is true |
| CONDITIONAL_RULE_FAILED | A conditional_rule expression evaluated to False |
| CUSTOM_VALIDATOR_FAILED | Custom validator returned an error string |
| CUSTOM_VALIDATOR_ERROR | Custom validator raised an exception |

Alias-aware cardinality counting: When a child element appears under an alias name (e.g., <fullname> for a field defined as name with alias fullname), the validator counts it toward the canonical name's cardinality. This means <fullname> satisfies min_occurs: 1 for name. The actual renormalization (tag rename) happens in the repair pre-pass, not in the validator.
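A small sketch of what alias-aware counting amounts to, working on plain tag names rather than elements. The function count_children and the shape of its aliases argument are assumptions made for this example.

```python
from collections import Counter


def count_children(child_tags, aliases):
    """Count child tags under their canonical names.

    aliases maps each canonical name to its declared alias list.
    """
    canonical = {
        alias: name
        for name, alias_list in aliases.items()
        for alias in alias_list
    }
    return Counter(canonical.get(tag, tag) for tag in child_tags)


counts = count_children(["fullname", "age"], {"name": ["fullname", "full-name"]})
min_occurs = {"name": 1, "age": 0}
missing = [name for name, low in min_occurs.items() if counts[name] < low]
print(counts["name"], missing)   # 1 []
```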

Order checking: Only fires a WRONG_ORDER violation when order_enforced: true is set on the parent element. The check filters the expected order list to only include names actually present in the document, then compares order with what was found.
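The filter-then-compare check can be sketched like this. It is an approximation: how the real validator treats repeated siblings is not stated here, so collapsing runs of the same tag is an assumption of this example, as is the function name wrong_order.

```python
from itertools import groupby


def wrong_order(found, expected):
    """Return True when known children do not follow the expected order."""
    present = set(found)
    # Keep only the expected names that actually occur in the document.
    expected_present = [name for name in expected if name in present]
    known = set(expected)
    # Collapse runs of the same tag so repeated siblings count once (assumption).
    found_known = [tag for tag, _ in groupby(t for t in found if t in known)]
    return found_known != expected_present


schema_order = ["name", "age", "email"]
print(wrong_order(["age", "name"], schema_order))     # True
print(wrong_order(["name", "email"], schema_order))   # False
```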

Custom validators: Any string of the form "module.path:function_name" in the custom_validator field on an ElementSpec will be dynamically imported at validation time. The function receives the lxml.etree._Element and should return None or True on success, or a non-empty string error message on failure.
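Resolving a "module.path:function_name" string is typically done with importlib. The sketch below shows one way to do it; load_validator is a hypothetical name and the error handling is an assumption, not a description of what xmldev raises.

```python
import importlib


def load_validator(spec: str):
    """Resolve a 'module.path:callable' string to the callable it names."""
    module_path, sep, attr = spec.partition(":")
    if not sep or not module_path or not attr:
        raise ValueError(f"expected 'module.path:callable', got {spec!r}")
    func = getattr(importlib.import_module(module_path), attr)
    if not callable(func):
        raise TypeError(f"{spec!r} does not name a callable")
    return func


# Demonstrated with a standard-library function so the example runs anywhere.
basename = load_validator("os.path:basename")
print(basename("/tmp/feed.xml"))   # feed.xml
```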


xmldev/fuzzy.py

What it does: Fuzzy string matching for tag and attribute name normalization.

Key classes:

  • FuzzyMatcher(profile, max_renames, name_threshold, permissive_threshold).

Profiles:

| Profile | name_threshold | permissive_threshold |
|---|---|---|
| strict | 0.95 | 0.80 |
| balanced (default) | 0.85 | 0.70 |
| permissive | 0.75 | 0.60 |

name_threshold controls tag/attribute name fuzzy matching. permissive_threshold controls enum value fuzzy matching.

Match priority (highest to lowest):

  1. Alias list match (score = 1.0, via_alias=True).
  2. Exact match after normalization (lowercase, remove punctuation, collapse separators, apply synonym mappings — see _SYNONYM_MAP for built-in synonyms like fullname -> name).
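  3. Fuzzy score via rapidfuzz.fuzz.ratio >= name_threshold. rapidfuzz reports ratio on a 0 to 100 scale, so the score is presumably divided by 100 before the comparison.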

Rename cap: max_renames (default 10 from schema global config) is a hard cap on total renames per document. Once hit, renames_exhausted returns True and all further match() calls return None. Call reset() at the start of each document.
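To make the match priority and the rename cap concrete, here is a rough sketch. It is not FuzzyMatcher: it swaps in difflib.SequenceMatcher from the standard library for rapidfuzz, leaves out the synonym map, and assumes a score of 1.0 for a normalized exact match. The class name SketchMatcher is hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop separators and punctuation: 'Full-Name' -> 'fullname'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


class SketchMatcher:
    def __init__(self, aliases, name_threshold=0.85, max_renames=10):
        self.aliases = aliases              # canonical name -> list of aliases
        self.name_threshold = name_threshold
        self.max_renames = max_renames
        self.renames = 0

    def reset(self):
        """Call at the start of each document."""
        self.renames = 0

    def match(self, tag):
        """Return (canonical_name, score), or None if nothing qualifies."""
        if self.renames >= self.max_renames:
            return None                     # rename budget exhausted
        best = self._best(tag)
        if best is not None:
            self.renames += 1
        return best

    def _best(self, tag):
        for name, alias_list in self.aliases.items():   # 1. declared alias
            if tag in alias_list:
                return name, 1.0
        norm = normalize(tag)
        for name in self.aliases:                       # 2. normalized exact match
            if normalize(name) == norm:
                return name, 1.0
        scored = [                                      # 3. fuzzy similarity
            (SequenceMatcher(None, norm, normalize(name)).ratio(), name)
            for name in self.aliases
        ]
        score, name = max(scored, default=(0.0, None))
        return (name, score) if score >= self.name_threshold else None


matcher = SketchMatcher({"name": ["fullname"], "address": []})
print(matcher.match("fullname"))   # ('name', 1.0)
print(matcher.match("zzz"))        # None
```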

Schema threshold vs profile threshold: The schema's global.fuzzy.name_threshold overrides the profile only when explicitly set by the user. If the schema does not include a global.fuzzy block (or uses the default value), the profile's threshold is used. This is handled in Xmldev.validate_and_fix by comparing against GlobalConfig() defaults before passing to FuzzyMatcher.
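The comparison against the default can be sketched as below, assuming the profile values from the table above. The function name effective_name_threshold is made up for this example. Note the side effect the paragraph hints at: a schema that explicitly sets 0.85 is indistinguishable from one that sets nothing.

```python
# name_threshold per profile, copied from the Profiles table.
PROFILE_NAME_THRESHOLDS = {"strict": 0.95, "balanced": 0.85, "permissive": 0.75}
SCHEMA_DEFAULT = 0.85   # GlobalConfig() default described above


def effective_name_threshold(schema_value: float, profile: str = "balanced") -> float:
    """Use the schema value only when it differs from the schema default."""
    if schema_value != SCHEMA_DEFAULT:
        return schema_value
    return PROFILE_NAME_THRESHOLDS[profile]


print(effective_name_threshold(0.85, "strict"))   # 0.95
print(effective_name_threshold(0.90, "strict"))   # 0.9
```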


xmldev/repair.py

What it does: Applies rule-based fixes to an XML tree. Does NOT validate. Does NOT call fuzzy directly — it receives a FuzzyMatcher instance.

Key classes:

  • DeterministicRepairer(schema, fuzzy, auto_apply_threshold=0.9)

  • HeuristicRepairer(schema, auto_apply_threshold=0.9)

DeterministicRepairer.repair(root, violations):

  1. Deep-copies the root. All mutations are on the copy.
  2. Alias pre-pass (_normalize_alias_tags): walks the copy tree and renames any element whose tag is in an alias list to the canonical name. This runs before re-validation so the alias count fix in the validator is not needed for repair.
  3. Re-validates the copy to get fresh violations with element references pointing into the copy (not the original).
  4. Dispatches each violation to the appropriate fixer method.
  5. Returns RepairResult(root=copy, patches=list[Patch], success=bool).

Why re-validate inside repair(): The caller passes violations generated against the original tree. After deep-copy, those element references are invalid. Re-validating against the copy ensures fixer methods receive element objects that are actually in the tree being modified.
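The stale-reference problem is easy to reproduce. The snippet below uses xml.etree.ElementTree as a stand-in for lxml; the point is only that an element reference taken before the deep copy still belongs to the original tree.

```python
import copy
import xml.etree.ElementTree as ET

original = ET.fromstring("<person><fullname>Bob</fullname></person>")
stale_ref = original[0]            # what a caller-supplied violation would point at

working = copy.deepcopy(original)  # the tree that repair actually mutates
stale_ref.tag = "name"             # a "fix" through the old reference

print(working[0].tag)    # fullname  (the copy was never touched)
print(original[0].tag)   # name      (the original was modified instead)
```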

Why alias normalization is also done in validate_and_fix(): Because the validator treats aliases as valid. A document containing only <fullname> (a declared alias of <name>) passes validation. No violations are generated, so repair() is never called. The alias pre-pass in validate_and_fix runs unconditionally on the live tree to ensure aliased tags are canonicalized regardless of whether there are other violations.

Fixer dispatch table:

| Violation code | Fixer method | Notes |
|---|---|---|
| UNKNOWN_TAG | _fix_unknown_tag | Fuzzy rename first; if no match, apply allow_unknown policy |
| MISSING_REQUIRED | _fix_missing_required | Inserts element with default value; skips if no default |
| MISSING_REQUIRED_ATTR | _fix_missing_attr | Inserts attribute with default; skips if no default |
| TEXT_TYPE_ERROR | _fix_text_type | Type coercion including words-to-number |
| ATTR_TYPE_ERROR | _fix_attr_type | Attribute value coercion |
| WRONG_ORDER | _fix_order | Removes and re-inserts children in schema order |
| UNKNOWN_ATTR | (inline) | Drops unknown attributes; confidence 0.95 |

Confidence scores and auto-apply: Patches with confidence >= auto_apply_threshold (default 0.9) are marked auto_applied=True. The CLI uses this to decide the exit code: if any patch is not auto-applied, exit code 2 is returned to signal human review is needed.
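In code, the rule reads roughly as follows. Patches are shown as plain dicts, and both function names are hypothetical; exit code 3 (fatal error) is outside the scope of this sketch.

```python
def mark_auto_applied(patches, auto_apply_threshold=0.9):
    """Flag each patch according to its confidence."""
    for patch in patches:
        patch["auto_applied"] = patch["confidence"] >= auto_apply_threshold
    return patches


def exit_code(ok: bool, patches) -> int:
    """0 = pass, 1 = validation failed, 2 = fixed but needs review."""
    if not ok:
        return 1
    if any(not patch["auto_applied"] for patch in patches):
        return 2
    return 0


patches = mark_auto_applied([
    {"type": "rename", "confidence": 1.0},   # alias match
    {"type": "coerce", "confidence": 0.6},   # words-to-number
])
print(exit_code(True, patches))   # 2
```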

HeuristicRepairer.repair(root, doc_root): Currently implements duplicate sibling merging (collapses multiple occurrences of an element that has max_occurs=1 by moving children of excess occurrences into the first). Patches are marked auto_applied=False. Intended as the extensible hook for DP-based reconstruction in future passes.
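Duplicate sibling merging, sketched with xml.etree.ElementTree instead of lxml. The function name is hypothetical, no patches are recorded, and any text directly inside an excess occurrence is simply dropped here, which may differ from the real behavior.

```python
import xml.etree.ElementTree as ET


def merge_duplicate_siblings(parent, tag):
    """Fold extra <tag> siblings into the first one; return how many were merged."""
    matches = parent.findall(tag)
    if len(matches) < 2:
        return 0
    first, extras = matches[0], matches[1:]
    for extra in extras:
        first.extend(list(extra))   # move the duplicate's children
        parent.remove(extra)        # then detach the now-redundant element
    return len(extras)


doc = ET.fromstring(
    "<person><address><city>Oslo</city></address>"
    "<address><zip>0150</zip></address></person>"
)
merged = merge_duplicate_siblings(doc, "address")
print(merged, ET.tostring(doc, encoding="unicode"))
# 1 <person><address><city>Oslo</city><zip>0150</zip></address></person>
```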

Words-to-number parsing: _words_to_int(text) handles English integer words like "twenty five" -> 25. Confidence for word-to-number coercions is 0.6 (below auto-apply threshold, always requires review). Decimal words and ordinals are not supported.
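A simplified take on English integer words is shown below. It is a sketch, not the library's _words_to_int: the function name, the supported vocabulary (up to thousands), and the lenient handling of word order are all assumptions of this example.

```python
from typing import Optional

_UNITS = {word: i for i, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
_TENS = {word: 10 * i for i, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def words_to_int(text: str) -> Optional[int]:
    """Parse simple English integer words; return None for anything else."""
    total = current = 0
    seen_number = False
    for token in text.lower().replace("-", " ").split():
        if token == "and":
            continue
        if token in _UNITS:
            current += _UNITS[token]
        elif token in _TENS:
            current += _TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            return None   # unknown word: let the caller report a type error
        seen_number = True
    return total + current if seen_number else None


print(words_to_int("twenty five"))   # 25
print(words_to_int("banana"))        # None
```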


xmldev/audit.py

What it does: Defines the Patch dataclass and the AuditWriter utility.

Patch fields:

@dataclass
class Patch:
    type: str          # rename | insert | delete | coerce | reorder | move
    path: str          # XPath-style: /person[1]/name[1]
    original: str
    replacement: str
    rule_id: str
    confidence: float
    source: str        # deterministic | heuristic | llm
    auto_applied: bool
    notes: str
    id: str            # UUIDv4, auto-generated
    timestamp: str     # ISO 8601 UTC, auto-generated

AuditWriter.write(patches, path): Writes a JSON file with:

  • summary: total_patches, auto_applied, pending_review, by_type, by_source.
  • patches: list of Patch.to_dict() for every patch.

AuditWriter.to_dict(patches): Returns the same structure as a Python dict without writing to disk. Used by cli.py for the --audit flag.
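The summary block can be derived from the patch list with a couple of counters. The sketch below works on plain dicts with the Patch field names listed above; the function name summarize is hypothetical.

```python
from collections import Counter


def summarize(patches):
    """Build an audit structure with a summary and the patch list."""
    auto = sum(1 for patch in patches if patch["auto_applied"])
    return {
        "summary": {
            "total_patches": len(patches),
            "auto_applied": auto,
            "pending_review": len(patches) - auto,
            "by_type": dict(Counter(patch["type"] for patch in patches)),
            "by_source": dict(Counter(patch["source"] for patch in patches)),
        },
        "patches": list(patches),
    }


audit = summarize([
    {"type": "rename", "source": "deterministic", "auto_applied": True},
    {"type": "coerce", "source": "deterministic", "auto_applied": False},
])
print(audit["summary"])
```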


xmldev/llm.py

What it does: Manages the LLM fallback. Contains config loading, PII redaction, prompt construction, and the API call.

LLMConfig.load(path):

  • Returns None if the file does not exist, or if XMDEV_ALLOW_LLM != true.
  • Parses KEY=VALUE lines. Strips quotes. Supports inline comments via #.
  • Required fields: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_ID.
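The loading rules above can be approximated as follows. This is a sketch with assumptions: it parses a string rather than a file, its comment handling is naive (a # inside a value would be cut), and raising ValueError for missing required fields is a guess, since the behavior in that case is not described here.

```python
REQUIRED = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL_ID")


def parse_env_config(text: str) -> dict:
    """Parse KEY=VALUE lines, dropping comments and surrounding quotes."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # naive inline-comment removal
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_llm_config(text: str):
    """Return the parsed config, or None when the LLM is not explicitly enabled."""
    values = parse_env_config(text)
    if values.get("XMDEV_ALLOW_LLM") != "true":
        return None
    missing = [key for key in REQUIRED if not values.get(key)]
    if missing:
        raise ValueError(f"missing required config keys: {missing}")
    return values


sample = """
# LLM settings
LLM_BASE_URL="https://api.openai.com/v1"
LLM_API_KEY=sk-your-key-here   # inline comment
LLM_MODEL_ID=gpt-4o-mini
XMDEV_ALLOW_LLM=true
"""
config = load_llm_config(sample)
print(config["LLM_BASE_URL"], config["LLM_API_KEY"])
```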

PIIRedactor(tags, attrs): Walks the XML string with a regex substitution strategy. Replaces text content of configured tags and values of configured attributes with [REDACTED]. Default sensitive tags: ssn, password, credit_card, secret, token, api_key. Override via LLM_REDACT_TAGS=tag1,tag2 in config.
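A regex-based redaction pass might look like the sketch below. The patterns are illustrative and are not the ones PIIRedactor uses; they skip self-closing tags and, like any regex approach to XML, will miss edge cases such as CDATA sections or nested elements with the same name.

```python
import re

DEFAULT_TAGS = ("ssn", "password", "credit_card", "secret", "token", "api_key")


def redact(xml_str, tags=DEFAULT_TAGS, attrs=()):
    """Replace the text of sensitive tags and the values of sensitive attributes."""
    for tag in tags:
        name = re.escape(tag)
        # Opening tag (not self-closing), shortest possible body, closing tag.
        pattern = re.compile(
            rf"(<{name}(?:\s[^>]*)?(?<!/)>)(.*?)(</{name}\s*>)", re.DOTALL
        )
        xml_str = pattern.sub(r"\1[REDACTED]\3", xml_str)
    for attr in attrs:
        # attr="value" or attr='value'; the closing quote must match the opening one.
        pattern = re.compile(rf"""(\b{re.escape(attr)}\s*=\s*)(["'])(.*?)\2""")
        xml_str = pattern.sub(r"\1\2[REDACTED]\2", xml_str)
    return xml_str


print(redact('<person id="1"><ssn>123-45-6789</ssn><name>Bob</name></person>'))
# <person id="1"><ssn>[REDACTED]</ssn><name>Bob</name></person>
```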

LLMClient.repair(schema, xml_str, diagnostics=None):

  1. Optionally redacts PII from xml_str.
  2. Calls build_prompt(schema, xml_str, diagnostics).
  3. Sends to openai.OpenAI(base_url=..., api_key=...).chat.completions.create(...).
  4. Validates response: must be parseable XML and the root tag must match schema.root.name.
  5. If invalid, retries once.
  6. Raises LLMError (code XM05) on persistent failure or if response contains <xmldev_error>.

build_prompt(schema, xml_str, diagnostics): Returns the prompt string. The schema is serialized to a compact but human-readable JSON snippet. The LLM is instructed to return only XML and nothing else. The explicit instruction to use <xmldev_error> for cases where it cannot repair is included.


xmldev/cli.py

What it does: Click-based CLI. Four commands: validate, repair, lint, serve.

validate: Runs the full validate_and_fix pipeline. Optionally writes fixed XML and an audit JSON. Exit codes 0/1/2/3 as defined above.

repair: Same as validate but focuses on writing the fixed XML. Flags:

  • --auto-apply: applies all patches regardless of confidence.
  • --auto-apply-threshold: custom confidence cutoff.
  • --aggressive: enables heuristic repair pass.

lint: Accepts a file or a directory. For directories, recursively finds *.xml. Runs the pipeline on each file and reports pass/fail per file. Writes a JSON report.

serve: Starts a minimal http.server.HTTPServer that handles POST /repair. It accepts a JSON body {"schema": {...}, "xml": "..."} and returns the validate_and_fix result as JSON. Prometheus metrics are exposed on port+1 if prometheus_client is importable.

Verbose mode (-v): Sets logging.getLogger("xmldev") to DEBUG level. Logs each patch type, confidence, and path. Useful for pipeline debugging.


Writing Custom Validators

To attach domain-specific validation logic to any element, set custom_validator in its schema ElementSpec:

{
  "name": "email",
  "text_type": "string",
  "custom_validator": "mypackage.validators:validate_email"
}

The referenced function must be importable when xmldev runs. It receives the lxml.etree._Element and should return:

  • None or True: validation passed.
  • A non-empty string: validation failed; the string becomes the violation message.

# mypackage/validators.py
import re

def validate_email(elem):
    text = (elem.text or "").strip()
    if not re.fullmatch(r"[^@]+@[^@]+\.[^@]+", text):
        return f"Invalid email address: '{text}'"
    return None

Custom validator exceptions are caught and surfaced as CUSTOM_VALIDATOR_ERROR violations rather than propagated. This ensures one bad validator does not crash the entire pipeline.
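The catch-and-report behavior can be pictured like this. The wrapper name run_custom_validator and the (code, message) return shape are assumptions for the example; how return values other than None, True, or a non-empty string are handled is not specified above, so this sketch treats them as a pass.

```python
def run_custom_validator(func, elem):
    """Return (code, message) for a failure, or None when validation passes."""
    try:
        outcome = func(elem)
    except Exception as exc:   # deliberately broad: a validator bug must not stop the run
        return "CUSTOM_VALIDATOR_ERROR", f"{type(exc).__name__}: {exc}"
    if isinstance(outcome, str) and outcome:
        return "CUSTOM_VALIDATOR_FAILED", outcome
    return None                # None, True, or anything else (assumption)


def broken_validator(elem):
    return elem.missing_attribute   # raises AttributeError on purpose


print(run_custom_validator(lambda e: None, object()))
print(run_custom_validator(lambda e: "Invalid email address", object()))
print(run_custom_validator(broken_validator, object())[0])
```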


Extending the Repair Pipeline

The repair pipeline is designed to be extended. The cleanest extension points are:

1. Adding a new violation code and fixer:

Add a new elif v.code == "MY_CODE": branch in DeterministicRepairer._fix_violations(). Add the corresponding violation generation in Validator._check_children() or Validator._validate_element(). Add a test.

2. Adding a new heuristic:

Subclass HeuristicRepairer or add a method to the existing class. Call it from HeuristicRepairer.repair(). Mark patches with source="heuristic" and set appropriate confidence values.

3. Replacing the LLM client:

LLMClient uses openai.OpenAI but only for the chat.completions.create call. Any OpenAI-compatible endpoint works (Ollama, LMStudio, Mistral, Azure OpenAI, etc.) by setting LLM_BASE_URL in the config. To replace the client entirely, subclass LLMClient and override repair().

4. Custom fuzzy synonyms:

Edit _SYNONYM_MAP in xmldev/fuzzy.py. The map applies during normalization, before similarity scoring. Keys and values are normalized strings.


Error Codes

| Code | When | Class |
|---|---|---|
| XM01 | Invalid or unreadable ground-truth schema | SchemaLoadError |
| XM02 | XML is so broken lxml cannot recover any tree | ParseError |
| XM03 | I/O error reading input or schema file | (OS-level, surfaced in CLI) |
| XM04 | LLM requested but not configured | (diagnostic string) |
| XM05 | LLM returned invalid or unrepairable XML | LLMError |
| XM06 | Validation still fails after all repair passes | (diagnostic strings in result) |

Exceptions carry .code, .message, .details, .suggested_action attributes.


Running Tests

pytest tests/ -v

Coverage report:

pytest tests/ --cov=xmldev --cov-report=term-missing

The test suite is entirely self-contained. No external services, no network calls. LLM tests use mocked openai.OpenAI clients. All tests run in under 5 seconds on a normal machine.

Test files:

| File | Coverage area |
|---|---|
| tests/test_schema.py | Schema loading, meta-schema validation, AST fields |
| tests/test_parser.py | Tolerant parse, recovery, canonicalization, XPath paths |
| tests/test_repair.py | Alias rename, fuzzy rename, type coercion, reorder, move/drop |
| tests/test_audit.py | Patch dataclass, AuditWriter, summary counts |
| tests/test_llm.py | Config loading, PII redaction, mocked LLM calls |
| tests/test_cli.py | Click test runner, exit codes, file outputs |

Linting

ruff check xmldev/ tests/

The project uses ruff for linting. Configuration is in pyproject.toml.


License

MIT. Go fix some XML.
