Validate, repair, and normalize XML against a JSON ground-truth schema.
xmldev
A Python library and CLI for validating, repairing, and normalizing XML against a developer-defined JSON ground-truth schema. Because apparently the world still runs on XML, and apparently no one agrees on what that XML should look like.
Why This Exists
You have an XML feed. Your partner sends you XML. Your internal service produces XML. An LLM burps out XML. And somehow, none of it quite matches what the schema says it should look like.
The element is named <FullName> in prod, <fullname> in staging, and <full-name>
in QA, and your downstream processor accepts exactly none of them. Someone typed
twenty five into an integer field. Half the closing tags are missing — recovered
by lxml but now the tree looks like abstract art. Your integration test has been
broken for three sprints and nobody knows why.
xmldev is the library that fixes all of that. You tell it what the XML is supposed to look like (in a JSON file you write once), and it tells you what is wrong with what you got. Then it fixes it. Then it tells you exactly what it changed and why. No magic. No surprises. An LLM can optionally be involved, but only if you explicitly ask for it and actually configure it first.
This project exists because "just validate the XML" is the oldest lie in enterprise software, and someone had to write the thing that actually does it.
Installation
pip install xmldev
Or from source:
git clone https://github.com/yourname/xmldev.git
cd xmldev
pip install -e ".[dev]"
Python 3.10+ required. Dependencies: lxml, click, jsonschema,
rapidfuzz, openai, prometheus-client, python-dateutil.
The LLM dependency (openai) is used only when you configure and enable LLM
fallback. If you never touch the config file, no LLM calls are made. Ever.
Quick Start
Python API
from xmldev import Xmldev

xd = Xmldev()
schema = xd.load_schema_from_file("schema.person.json")

# Read the input as bytes and close the file promptly.
with open("broken.xml", "rb") as f:
    result = xd.validate_and_fix(f.read(), schema)

if result["ok"]:
    print(result["fixed_xml"])
else:
    print("Could not fully fix. Diagnostics:")
    for d in result["diagnostics"]:
        print("  ", d)
result is always a dict with:
| Key | Type | Description |
|---|---|---|
| ok | bool | True if the output passes schema validation |
| fixed_xml | str or None | The repaired XML string |
| patches | list[dict] | Every change made, with confidence and source |
| diagnostics | list[str] | Human-readable description of unfixable problems |
| provenance | dict | Which repair layer actually fixed it |
| original | str | The input as received, unmodified |
CLI
# Validate and repair a file
xmldev validate --schema schema.person.json --input broken.xml --output fixed.xml --audit audit.json
# Repair with auto-apply
xmldev repair --schema schema.person.json --input broken.xml --output fixed.xml --auto-apply
# Lint an entire directory
xmldev lint --schema schema.person.json --input ./xml_feeds/ --report report.json
# Start an HTTP server
xmldev serve --port 8080
Exit codes: 0 = pass, 1 = validation failed, 2 = fixed but has low-confidence
patches that need review, 3 = fatal error (bad schema, I/O failure).
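The exit-code rules above can be read as a small decision function. The sketch below is illustrative only: `exit_code_for` is a hypothetical helper, not part of xmldev, and it assumes each patch dict carries an `auto_applied` boolean as described for the Patch record later in this document.

```python
def exit_code_for(result: dict) -> int:
    """Map a validate_and_fix-style result dict to the documented CLI exit codes.

    Hypothetical helper for illustration. Exit code 3 (fatal error) comes from
    schema or I/O failures that happen before a result exists, so it is not
    produced here.
    """
    if not result.get("ok", False):
        return 1  # validation failed
    # Any patch that was not auto-applied still needs human review.
    if any(not p.get("auto_applied", False) for p in result.get("patches", [])):
        return 2
    return 0


clean = {"ok": True, "patches": [{"auto_applied": True}]}
review = {"ok": True, "patches": [{"auto_applied": True}, {"auto_applied": False}]}
failed = {"ok": False, "patches": []}
print(exit_code_for(clean), exit_code_for(review), exit_code_for(failed))
```

A pipeline can use the same mapping to route documents: 0 goes straight through, 2 goes to a review queue, 1 is rejected.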
Defining Your Schema
The ground-truth schema is a JSON file you write once. It describes the structure of the XML you expect to receive. Here is the minimal schema for a person record:
{
  "root": {
    "name": "person",
    "attrs": {
      "id": { "name": "id", "type": "int", "required": true }
    },
    "children": [
      {
        "spec": {
          "name": "name",
          "text_type": "string",
          "aliases": ["fullname", "full-name", "full_name"]
        },
        "min_occurs": 1,
        "order_index": 0
      },
      {
        "spec": {
          "name": "age",
          "text_type": "int",
          "default": 0
        },
        "min_occurs": 0,
        "order_index": 1
      }
    ],
    "order_enforced": true
  }
}
Given this schema and the following broken XML:
<person id="12">
  <fullname>Bob</fullname>
  <age>twenty five</age>
</person>
xmldev will:
- Rename <fullname> to <name> (alias match, confidence 1.0, auto-applied).
- Coerce "twenty five" to 25 (words-to-number parser, confidence 0.6, flagged for review).
- Return ok: true and two patches in the audit.
Nothing was sent to any API. Nothing was guessed. The alias was declared. The type was declared. The fix was deterministic.
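To make the alias step concrete, here is a minimal sketch of a declared-alias rename. It uses the standard library's `xml.etree.ElementTree` rather than lxml so it runs anywhere, and `normalize_alias_tags` and `ALIASES` are names invented for this example; xmldev's own implementation may differ.

```python
import xml.etree.ElementTree as ET

# Alias table written by hand from the example schema: alias -> canonical name.
ALIASES = {"fullname": "name", "full-name": "name", "full_name": "name"}


def normalize_alias_tags(root: ET.Element, aliases: dict[str, str]) -> list[tuple[str, str]]:
    """Rename aliased tags in place and return (old, new) pairs for an audit trail."""
    renamed = []
    for elem in root.iter():
        canonical = aliases.get(elem.tag)
        if canonical is not None:
            renamed.append((elem.tag, canonical))
            elem.tag = canonical
    return renamed


root = ET.fromstring('<person id="12"><fullname>Bob</fullname><age>twenty five</age></person>')
changes = normalize_alias_tags(root, ALIASES)
print(changes)
print(ET.tostring(root, encoding="unicode"))
```

The rename is a pure table lookup, which is why the text above can assign it confidence 1.0.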
Schema Reference (Short Form)
Full meta-schema lives in xmldev/schema.py. The fields that matter most:
Element fields
| Field | Type | Description |
|---|---|---|
| name | string | Required. The expected tag name. |
| text_type | string | One of string, int, float, date, bool, enum, regex. Default string. |
| enum_values | list[str] | Required if text_type is enum. |
| pattern | string | Regex pattern for text_type: regex. |
| aliases | list[str] | Tag names that should silently rename to this element. |
| default | any | Value to insert when this element is missing and min_occurs >= 1. |
| attrs | dict | Map of attribute name to AttrSpec. |
| children | list | Array of ChildSpec objects. |
| order_enforced | bool | If true, children must appear in order_index order. |
| conditional_rules | list | Expression-based rules evaluated at validation time. |
| custom_validator | string | "module.path:callable" for custom validation logic. |
ChildSpec fields
| Field | Type | Description |
|---|---|---|
| spec | ElementSpec | The element definition (nested). |
| min_occurs | int | Minimum occurrences. 0 = optional. Default 0. |
| max_occurs | int or "unbounded" | Maximum occurrences. Default 1. |
| order_index | int | Position when order_enforced is true. |
Global config
{
  "root": { ... },
  "global": {
    "allow_unknown": "move_to_extension",
    "extension_element_name": "extensions",
    "fuzzy": {
      "name_threshold": 0.85,
      "permissive_threshold": 0.70,
      "max_renames_per_doc": 10
    }
  }
}
allow_unknown controls what happens to elements not in the schema:
- keep: leave them alone.
- drop: delete them, generate a patch.
- move_to_extension (default): move them into an <extensions> wrapper element.
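The three policies can be sketched with the standard library as follows. This is an illustration under assumptions, not xmldev's code: `apply_unknown_policy` is a hypothetical function, it only looks at direct children, and it uses `xml.etree.ElementTree` instead of lxml.

```python
import xml.etree.ElementTree as ET


def apply_unknown_policy(parent, known, policy="move_to_extension", extension_name="extensions"):
    """Apply an allow_unknown-style policy to the direct children of parent.

    Returns the tags that were dropped or moved (empty for "keep").
    """
    unknown = [c for c in list(parent) if c.tag not in known and c.tag != extension_name]
    if policy == "keep" or not unknown:
        return []
    if policy == "drop":
        for child in unknown:
            parent.remove(child)
    elif policy == "move_to_extension":
        wrapper = parent.find(extension_name)
        if wrapper is None:
            wrapper = ET.SubElement(parent, extension_name)
        for child in unknown:
            parent.remove(child)
            wrapper.append(child)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return [child.tag for child in unknown]


doc = ET.fromstring("<person><name>Bob</name><nickname>Bobby</nickname></person>")
moved = apply_unknown_policy(doc, known={"name", "age"})
print(moved, ET.tostring(doc, encoding="unicode"))
```

Moving rather than dropping keeps the unknown data available downstream while still giving the schema-defined part of the document a predictable shape.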
LLM Fallback
The LLM fallback is opt-in. It will not activate unless you create a config file
and explicitly set XMDEV_ALLOW_LLM=true in it. There is no default behavior that
sends data anywhere.
Create xmldev.config.env (see examples/xmldev.config.env.example):
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key-here
LLM_MODEL_ID=gpt-4o-mini
LLM_TEMPERATURE=0.0
LLM_MAX_TOKENS=2048
XMDEV_ALLOW_LLM=true
LLM_REDACT=true
Then pass it at construction time:
xd = Xmldev(config_path="xmldev.config.env")
result = xd.validate_and_fix(xml, schema, allow_llm=True)
Or via CLI:
xmldev repair --schema schema.json --input broken.xml --output fixed.xml \
--allow-llm --config xmldev.config.env
PII redaction is enabled by default when LLM is used. Tags and attributes matching
a configurable list are replaced with [REDACTED] before the prompt is sent.
Set LLM_REDACT=false at your own discretion and your own legal risk.
The LLM is called after deterministic and heuristic passes have both failed. It gets the broken XML, the relevant schema snippet, and the list of violations. If the LLM output passes schema validation, it is accepted. If not, one retry is attempted with a stricter prompt. If that also fails, the LLM result is discarded and diagnostics are returned. Total LLM calls per document: maximum 2.
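The two-call limit described above amounts to a loop over exactly two prompts. The sketch below shows that control flow with the LLM and the validator passed in as plain callables; `repair_with_llm` and its parameters are hypothetical names chosen for this example.

```python
from typing import Callable, Optional


def repair_with_llm(
    call_llm: Callable[[str], str],
    is_valid: Callable[[str], bool],
    prompt: str,
    strict_prompt: str,
) -> tuple[Optional[str], int]:
    """Try the normal prompt, then one stricter retry; return (xml or None, calls made)."""
    calls = 0
    for attempt_prompt in (prompt, strict_prompt):  # at most two LLM calls per document
        calls += 1
        candidate = call_llm(attempt_prompt)
        if is_valid(candidate):
            return candidate, calls
    return None, calls  # both attempts failed: discard the output


# Fake LLM that fails once, then returns something the fake validator accepts.
replies = iter(["not xml", "<person/>"])
xml, calls = repair_with_llm(lambda p: next(replies), lambda s: s.startswith("<"), "fix", "fix strictly")
print(xml, calls)
```

Injecting the callables keeps the retry policy testable without any network access.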
Developer Reference
This section is for people who are integrating xmldev into a pipeline, embedding it into a service, extending it with custom validators, or just trying to understand why it did what it did to your XML. It is long. That is intentional.
Architecture Overview
The library is a pipeline of sequential, independently testable stages. Each stage produces an output that the next stage consumes. No stage reaches back to modify a previous stage's output.
Input XML (str or bytes)
        |
        v
[1] tolerant_parse          -- xmldev/parser.py
        |
        v
[2] canonicalize            -- xmldev/parser.py
        |
        v
[3] alias normalization     -- xmldev/repair.py DeterministicRepairer
        |
        v
[4] Validator               -- xmldev/validator.py
        |
        violations list
        |
        v
[5] DeterministicRepairer   -- xmldev/repair.py (up to 3 passes)
        |
        v
[6] re-validate
        |
        +-- ok --> return result
        |
        v
[7] HeuristicRepairer       (if aggressive=True)
        |
        v
[8] LLMClient               (if configured and allow_llm=True)
        |
        v
return result dict
Each box is a separate function or class; several share a module (for example,
[1] and [2] both live in xmldev/parser.py). The orchestrator is xmldev/__init__.py
(Xmldev.validate_and_fix). The orchestrator holds no schema parsing, no
repair logic, and no fuzzy matching. It only calls the right thing in the
right order.
Module Reference
xmldev/schema.py
What it does: Loads and validates the user-provided ground-truth JSON schema and converts it into a Python AST made of dataclasses.
Key classes:
- SchemaLoader.load(source): accepts a JSON string, bytes, or pathlib.Path. Validates against an internal meta-schema using jsonschema. Raises SchemaLoadError (code XM01) on any problem.
- Schema: top-level container (root: ElementSpec, global_config: GlobalConfig).
- ElementSpec: represents one XML element (name, text_type, enum_values, pattern, attrs, children, order_enforced, aliases, default, conditional_rules, custom_validator).
- ChildSpec: wraps an ElementSpec with cardinality info (min_occurs, max_occurs, order_index, required_if).
- AttrSpec: attribute definition (name, type, required, default, enum_values, pattern, aliases).
- GlobalConfig: global behavior (allow_unknown, extension_element_name, name_threshold, permissive_threshold, max_renames_per_doc).
Important implementation note: The GlobalConfig.name_threshold field defaults
to 0.85 (the schema default), not None. The orchestrator detects whether the user
explicitly overrode this value by comparing against the GlobalConfig() default. If
not explicitly set, the fuzzy profile's threshold is used. If you change GlobalConfig
defaults, update Xmldev.validate_and_fix accordingly.
xmldev/parser.py
What it does: Turns a string or bytes into an lxml _Element tree.
Key functions:
- tolerant_parse(xml): returns a ParseResult(root, recovered, parse_errors). First tries a strict parse. On failure, retries with lxml.etree.XMLParser(recover=True). If recovery also fails or returns an empty tree, raises ParseError (code XM02).
- canonicalize(root): modifies the tree in place. Strips insignificant whitespace from text and tail nodes. Does NOT remove intentional whitespace in text content; only leading/trailing whitespace is stripped. Returns the root for chaining.
- xpath_path(elem, doc_root): generates an XPath-style path string like /person[1]/name[1]. Used in patch records and violation messages. The path is position-indexed to be unambiguous even when siblings share a tag name.
- _local_name(tag): strips the namespace prefix from the {uri}localname format. Used throughout the codebase wherever a tag name needs to be compared to a schema name.
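The last two helpers are easy to picture with a standard-library sketch. The versions below are written against `xml.etree.ElementTree`, which has no `getparent()`, so a child-to-parent map is built first; with lxml that map would not be needed. The function names mirror the ones above, but the bodies are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{uri}' namespace prefix, leaving the local tag name."""
    return tag.rsplit("}", 1)[-1] if tag.startswith("{") else tag


def xpath_path(elem: ET.Element, doc_root: ET.Element) -> str:
    """Build a position-indexed path such as /person[1]/name[1]."""
    # ElementTree elements do not know their parent, so map child -> parent.
    parents = {child: parent for parent in doc_root.iter() for child in parent}
    steps = []
    node = elem
    while node is not None:
        parent = parents.get(node)
        siblings = list(parent) if parent is not None else [node]
        same_tag = [s for s in siblings if s.tag == node.tag]
        # 1-based position among siblings that share this tag.
        position = next(i for i, s in enumerate(same_tag, start=1) if s is node)
        steps.append(f"{local_name(node.tag)}[{position}]")
        node = parent
    return "/" + "/".join(reversed(steps))


root = ET.fromstring("<person><name>A</name><name>B</name></person>")
print(xpath_path(root[1], root))
print(local_name("{http://example.com/ns}person"))
```

The position index is what distinguishes the second `<name>` from the first in a patch record.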
Parser recovery behavior: lxml's recovery mode does NOT guarantee a valid tree.
It makes a best effort. The recovered flag in ParseResult indicates that recovery
was used. The HeuristicRepairer looks at this flag to decide whether additional
structural reconstruction is needed.
xmldev/validator.py
What it does: Walks the parsed XML tree and produces a list of Violation objects.
Does NOT modify the tree.
Key classes:
- Violation: a dataclass with code, path, element (lxml element reference), message, and an extra dict.
- Validator(schema).validate(root): entry point. Returns list[Violation].
Violation codes:
| Code | Meaning |
|---|---|
| UNKNOWN_TAG | Element not in schema and not an alias of any schema element |
| MISSING_REQUIRED | min_occurs > 0 and element not present (alias-aware count) |
| TOO_MANY | More occurrences than max_occurs |
| UNKNOWN_ATTR | Attribute not in schema |
| MISSING_REQUIRED_ATTR | Required attribute absent and not an alias of a present attr |
| TEXT_TYPE_ERROR | Leaf text content fails type validation |
| ATTR_TYPE_ERROR | Attribute value fails type validation |
| WRONG_ORDER | Children appear in wrong order when order_enforced is true |
| CONDITIONAL_RULE_FAILED | A conditional_rule expression evaluated to False |
| CUSTOM_VALIDATOR_FAILED | Custom validator returned an error string |
| CUSTOM_VALIDATOR_ERROR | Custom validator raised an exception |
Alias-aware cardinality counting: When a child element appears under an alias name
(e.g., <fullname> for a field defined as name with alias fullname), the validator
counts it toward the canonical name's cardinality. This means <fullname> satisfies
min_occurs: 1 for name. The actual renormalization (tag rename) happens in the
repair pre-pass, not in the validator.
Order checking: Only fires a WRONG_ORDER violation when order_enforced: true
is set on the parent element. The check filters the expected order list to only include
names actually present in the document, then compares order with what was found.
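The order check described above can be sketched in a few lines. `is_wrong_order` is a hypothetical name, and collapsing repeated siblings to their first occurrence is an assumption of this sketch; the text does not say how the validator treats repeats.

```python
def is_wrong_order(expected: list[str], found: list[str]) -> bool:
    """Return True when the children in `found` break the schema order.

    `expected` holds the schema's child names sorted by order_index; `found`
    holds the child tag names in document order.
    """
    present = set(found)
    # Keep only the expected names that actually occur in the document.
    filtered = [name for name in expected if name in present]
    # Assumption: repeated siblings count once, at their first occurrence,
    # and names outside the schema are ignored.
    seen: list[str] = []
    for name in found:
        if name in filtered and name not in seen:
            seen.append(name)
    return seen != filtered


print(is_wrong_order(["name", "age", "email"], ["age", "name"]))
print(is_wrong_order(["name", "age", "email"], ["name", "email"]))
```

Filtering first means an absent optional child never causes an ordering violation by itself.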
Custom validators: Any string of the form "module.path:function_name" in the
custom_validator field on an ElementSpec will be dynamically imported at validation
time. The function receives the lxml._Element and should return None or True
on success, or a non-empty string error message on failure.
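A minimal sketch of how a "module.path:function_name" string can be resolved and its result classified, using only `importlib`. `load_callable` and `run_validator` are hypothetical names; treating any return value other than a non-empty string as success is an assumption of this sketch.

```python
import importlib


def load_callable(spec: str):
    """Resolve a 'module.path:callable' string to the callable it names."""
    module_path, _, attr = spec.partition(":")
    if not module_path or not attr:
        raise ValueError(f"expected 'module.path:callable', got {spec!r}")
    return getattr(importlib.import_module(module_path), attr)


def run_validator(func, elem):
    """Return (violation_code, message), or None when validation passes."""
    try:
        outcome = func(elem)
    except Exception as exc:  # a broken validator must not crash the pipeline
        return ("CUSTOM_VALIDATOR_ERROR", str(exc))
    if isinstance(outcome, str) and outcome:
        return ("CUSTOM_VALIDATOR_FAILED", outcome)
    return None  # None, True, or anything else is treated as success here


basename = load_callable("os.path:basename")  # any importable callable works
print(basename("/tmp/feed.xml"))
print(run_validator(lambda e: "bad value", object()))
print(run_validator(lambda e: 1 / 0, object()))
```

Because the import happens at validation time, a typo in the schema string surfaces only when the element is first validated, so it is worth covering with a test.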
xmldev/fuzzy.py
What it does: Fuzzy string matching for tag and attribute name normalization.
Key classes:
FuzzyMatcher(profile, max_renames, name_threshold, permissive_threshold).
Profiles:
| Profile | name_threshold | permissive_threshold |
|---|---|---|
| strict | 0.95 | 0.80 |
| balanced (default) | 0.85 | 0.70 |
| permissive | 0.75 | 0.60 |
name_threshold controls tag/attribute name fuzzy matching.
permissive_threshold controls enum value fuzzy matching.
Match priority (highest to lowest):
1. Alias list match (score = 1.0, via_alias=True).
2. Exact match after normalization (lowercase, remove punctuation, collapse separators, apply synonym mappings; see _SYNONYM_MAP for built-in synonyms like fullname -> name).
3. Fuzzy score via rapidfuzz.fuzz.ratio >= name_threshold.
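The priority order can be illustrated with a standard-library sketch. Two substitutions are made here and should be read as assumptions: `difflib.SequenceMatcher` stands in for rapidfuzz so the block has no third-party dependency (its scores will differ somewhat), and the synonym-mapping step is omitted. `normalize` and `match_name` are names invented for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop separators and punctuation so Full-Name == full_name."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for the rapidfuzz score."""
    return SequenceMatcher(None, a, b).ratio()


def match_name(candidate, canonical_names, aliases, name_threshold=0.85):
    """Return (canonical, score, via_alias) or None, trying alias, exact, then fuzzy."""
    # 1. Declared alias: always wins with score 1.0.
    if candidate in aliases:
        return aliases[candidate], 1.0, True
    # 2. Exact match after normalization.
    norm = normalize(candidate)
    for name in canonical_names:
        if normalize(name) == norm:
            return name, 1.0, False
    # 3. Best fuzzy score, accepted only at or above the threshold.
    best = max(canonical_names, key=lambda n: similarity(norm, normalize(n)), default=None)
    if best is not None:
        score = similarity(norm, normalize(best))
        if score >= name_threshold:
            return best, score, False
    return None


print(match_name("fullname", ["name", "age"], {"fullname": "name"}))
print(match_name("Ag-e", ["name", "age"], {}))
print(match_name("adress", ["address", "age"], {}))
print(match_name("zzz", ["address", "age"], {}))
```

Checking aliases and exact matches first means the fuzzy scorer only runs for names that nothing cheaper could resolve.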
Rename cap: max_renames (default 10 from schema global config) is a hard cap on
total renames per document. Once hit, renames_exhausted returns True and all further
match() calls return None. Call reset() at the start of each document.
Schema threshold vs profile threshold: The schema's global.fuzzy.name_threshold
overrides the profile only when explicitly set by the user. If the schema does not
include a global.fuzzy block (or uses the default value), the profile's threshold
is used. This is handled in Xmldev.validate_and_fix by comparing against GlobalConfig()
defaults before passing to FuzzyMatcher.
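The override rule can be sketched as a comparison against the default. The constants come from the profile table and the GlobalConfig default stated above; `resolve_name_threshold` is a hypothetical helper, not the orchestrator's actual code.

```python
PROFILE_THRESHOLDS = {"strict": 0.95, "balanced": 0.85, "permissive": 0.75}
SCHEMA_DEFAULT_NAME_THRESHOLD = 0.85  # GlobalConfig() default, per the text above


def resolve_name_threshold(schema_value: float, profile: str = "balanced") -> float:
    """Use the schema value only when it differs from the GlobalConfig default."""
    if schema_value != SCHEMA_DEFAULT_NAME_THRESHOLD:
        return schema_value  # explicitly overridden in the schema
    return PROFILE_THRESHOLDS[profile]


print(resolve_name_threshold(0.85, "strict"))
print(resolve_name_threshold(0.90, "strict"))
```

One consequence of detecting overrides this way: writing 0.85 explicitly in the schema is indistinguishable from leaving it out, so the profile's threshold still applies in that case.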
xmldev/repair.py
What it does: Applies rule-based fixes to an XML tree. Does NOT validate.
Does NOT call fuzzy directly — it receives a FuzzyMatcher instance.
Key classes:
- DeterministicRepairer(schema, fuzzy, auto_apply_threshold=0.9)
- HeuristicRepairer(schema, auto_apply_threshold=0.9)
DeterministicRepairer.repair(root, violations):
1. Deep-copies the root. All mutations are on the copy.
2. Alias pre-pass (_normalize_alias_tags): walks the copy tree and renames any element whose tag is in an alias list to the canonical name. This runs before re-validation so the alias count fix in the validator is not needed for repair.
3. Re-validates the copy to get fresh violations with element references pointing into the copy (not the original).
4. Dispatches each violation to the appropriate fixer method.
5. Returns RepairResult(root=copy, patches=list[Patch], success=bool).
Why re-validate inside repair(): The caller passes violations generated against the original tree. After deep-copy, those element references are invalid. Re-validating against the copy ensures fixer methods receive element objects that are actually in the tree being modified.
Why alias normalization is also done in validate_and_fix(): Because the validator
treats aliases as valid. A document containing only <fullname> (a declared alias of
<name>) passes validation. No violations are generated, so repair() is never called.
The alias pre-pass in validate_and_fix runs unconditionally on the live tree to ensure
aliased tags are canonicalized regardless of whether there are other violations.
Fixer dispatch table:
| Violation code | Fixer method | Notes |
|---|---|---|
| UNKNOWN_TAG | _fix_unknown_tag | Fuzzy rename first; if no match, apply allow_unknown policy |
| MISSING_REQUIRED | _fix_missing_required | Inserts element with default value; skips if no default |
| MISSING_REQUIRED_ATTR | _fix_missing_attr | Inserts attribute with default; skips if no default |
| TEXT_TYPE_ERROR | _fix_text_type | Type coercion including words-to-number |
| ATTR_TYPE_ERROR | _fix_attr_type | Attribute value coercion |
| WRONG_ORDER | _fix_order | Removes and re-inserts children in schema order |
| UNKNOWN_ATTR | (inline) | Drops unknown attributes; confidence 0.95 |
Confidence scores and auto-apply: Patches with confidence >= auto_apply_threshold
(default 0.9) are marked auto_applied=True. The CLI uses this to decide the exit code:
if any patch is not auto-applied, exit code 2 is returned to signal human review is
needed.
HeuristicRepairer.repair(root, doc_root): Currently implements duplicate sibling
merging (collapses multiple occurrences of an element that has max_occurs=1 by moving
children of excess occurrences into the first). Patches are marked auto_applied=False.
Intended as the extensible hook for DP-based reconstruction in future passes.
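Duplicate sibling merging, as described above, can be sketched with `xml.etree.ElementTree`. `merge_duplicate_siblings` is a hypothetical function written for this example; it moves child elements only and does not attempt to reconcile text or attributes on the excess occurrences.

```python
import xml.etree.ElementTree as ET


def merge_duplicate_siblings(parent: ET.Element, tag: str) -> int:
    """Fold every extra <tag> child of parent into the first one; return how many were merged."""
    matches = parent.findall(tag)
    if len(matches) < 2:
        return 0
    first, extras = matches[0], matches[1:]
    for extra in extras:
        for child in list(extra):  # copy the list: `extra` is mutated while moving
            extra.remove(child)
            first.append(child)
        parent.remove(extra)
    return len(extras)


doc = ET.fromstring(
    "<person><address><city>Oslo</city></address><address><zip>0150</zip></address></person>"
)
merged = merge_duplicate_siblings(doc, "address")
print(merged, ET.tostring(doc, encoding="unicode"))
```

A merge like this can silently combine data that was meant to be separate, which is consistent with marking these patches as not auto-applied.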
Words-to-number parsing: _words_to_int(text) handles English integer words like
"twenty five" -> 25. Confidence for word-to-number coercions is 0.6 (below
auto-apply threshold, always requires review). Decimal words and ordinals are not
supported.
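For a sense of what a words-to-number parser involves, here is a small self-contained sketch. It is not xmldev's `_words_to_int`; the name `words_to_int`, the vocabulary tables, and the choice to return None on any unknown word are assumptions of this example. Like the library function, it handles integers only.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
    "thirteen": 13, "fourteen": 14, "fifteen": 15, "sixteen": 16,
    "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_int(text: str):
    """Parse English integer words such as 'twenty five'; return None if unparseable."""
    tokens = [t for t in text.lower().replace("-", " ").split() if t != "and"]
    if not tokens:
        return None
    total = current = 0
    for token in tokens:
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            return None  # unknown word (ordinal, decimal, typo): refuse rather than guess
    return total + current


print(words_to_int("twenty five"))
print(words_to_int("one hundred and two"))
print(words_to_int("twenty fifth"))
```

A parser this permissive will also accept odd inputs such as "five twenty", which is one reason a coercion of this kind deserves a confidence below the auto-apply threshold.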
xmldev/audit.py
What it does: Defines the Patch dataclass and the AuditWriter utility.
Patch fields:
@dataclass
class Patch:
    type: str          # rename | insert | delete | coerce | reorder | move
    path: str          # XPath-style: /person[1]/name[1]
    original: str
    replacement: str
    rule_id: str
    confidence: float
    source: str        # deterministic | heuristic | llm
    auto_applied: bool
    notes: str
    id: str            # UUIDv4, auto-generated
    timestamp: str     # ISO 8601 UTC, auto-generated
AuditWriter.write(patches, path): Writes a JSON file with:
- summary: total_patches, auto_applied, pending_review, by_type, by_source.
- patches: list of Patch.to_dict() for every patch.
AuditWriter.to_dict(patches): Returns the same structure as a Python dict
without writing to disk. Used by cli.py for the --audit flag.
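The summary block is a set of counts over the patch list. The sketch below computes one from plain dicts; `summarize` is a hypothetical function, and defining pending_review as "patches that were not auto-applied" is an assumption based on the field names.

```python
from collections import Counter


def summarize(patches: list[dict]) -> dict:
    """Build a summary with the keys listed above from plain patch dicts."""
    auto = sum(1 for p in patches if p["auto_applied"])
    return {
        "total_patches": len(patches),
        "auto_applied": auto,
        "pending_review": len(patches) - auto,  # assumption: everything not auto-applied
        "by_type": dict(Counter(p["type"] for p in patches)),
        "by_source": dict(Counter(p["source"] for p in patches)),
    }


patches = [
    {"type": "rename", "source": "deterministic", "auto_applied": True},
    {"type": "coerce", "source": "deterministic", "auto_applied": False},
]
summary = summarize(patches)
print(summary)
```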
xmldev/llm.py
What it does: Manages the LLM fallback. Contains config loading, PII redaction, prompt construction, and the API call.
LLMConfig.load(path):
- Returns None if the file does not exist, or if XMDEV_ALLOW_LLM != true.
- Parses KEY=VALUE lines. Strips quotes. Supports inline comments via #.
- Required fields: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL_ID.
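The parsing rules above fit in a short function. This sketch works on the file's text rather than a path so it can run without touching disk; `parse_env_text` is a hypothetical name, and raising ValueError for missing required keys is an assumption, since the text does not say how the library reports that case.

```python
from typing import Optional

REQUIRED = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL_ID")


def parse_env_text(text: str) -> Optional[dict]:
    """Parse KEY=VALUE lines; return None unless the LLM is explicitly enabled."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline comments
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")  # strip surrounding quotes
    if values.get("XMDEV_ALLOW_LLM", "").lower() != "true":
        return None  # opt-in only
    missing = [k for k in REQUIRED if not values.get(k)]
    if missing:
        raise ValueError(f"missing required keys: {', '.join(missing)}")
    return values


enabled = parse_env_text(
    "LLM_BASE_URL=https://api.openai.com/v1\n"
    "LLM_API_KEY='sk-your-key-here'  # quoted value\n"
    "LLM_MODEL_ID=gpt-4o-mini\n"
    "XMDEV_ALLOW_LLM=true\n"
)
disabled = parse_env_text("LLM_API_KEY=sk-your-key-here\n")
print(enabled["LLM_MODEL_ID"], disabled)
```

Note that splitting on the first `#` is naive: a value that legitimately contains `#` would be truncated by this sketch.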
PIIRedactor(tags, attrs): Walks the XML string with a regex substitution strategy.
Replaces text content of configured tags and values of configured attributes with
[REDACTED]. Default sensitive tags: ssn, password, credit_card, secret,
token, api_key. Override via LLM_REDACT_TAGS=tag1,tag2 in config.
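A regex-substitution redactor of the kind described can be sketched as follows. The patterns are this example's own and are deliberately simple; `redact` is a hypothetical function, and matching case-insensitively is an assumption made to err on the side of redacting more.

```python
import re

DEFAULT_TAGS = ("ssn", "password", "credit_card", "secret", "token", "api_key")


def redact(xml: str, tags=DEFAULT_TAGS, attrs=()) -> str:
    """Replace text of sensitive tags and values of sensitive attributes with [REDACTED]."""
    for tag in tags:
        t = re.escape(tag)
        # <tag ...>anything</tag>  ->  <tag ...>[REDACTED]</tag>
        # (?<!/) skips self-closing tags such as <ssn/>, which have no content.
        xml = re.sub(
            rf"(<{t}(?:\s[^>]*)?(?<!/)>).*?(</{t}\s*>)",
            r"\1[REDACTED]\2",
            xml,
            flags=re.DOTALL | re.IGNORECASE,
        )
    for attr in attrs:
        a = re.escape(attr)
        # attr="value" or attr='value'  ->  attr="[REDACTED]"
        xml = re.sub(rf"""\b({a}\s*=\s*)(["']).*?\2""", r'\1"[REDACTED]"', xml, flags=re.IGNORECASE)
    return xml


sample = '<user id="7" token="abc"><ssn>123-45-6789</ssn><name>Bob</name></user>'
print(redact(sample, attrs=("token",)))
```

Regex redaction works on the raw string, so it still applies when the XML is too broken to parse, but it can miss content in CDATA sections or nested same-name tags.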
LLMClient.repair(schema, xml_str, diagnostics=None):
1. Optionally redacts PII from xml_str.
2. Calls build_prompt(schema, xml_str, diagnostics).
3. Sends to openai.OpenAI(base_url=..., api_key=...).chat.completions.create(...).
4. Validates response: must be parseable XML and the root tag must match schema.root.name.
5. If invalid, retries once.
6. Raises LLMError (code XM05) on persistent failure or if response contains <xmldev_error>.
build_prompt(schema, xml_str, diagnostics): Returns the prompt string. The schema
is serialized to a compact but human-readable JSON snippet. The LLM is instructed to
return only XML and nothing else. The explicit instruction to use <xmldev_error> for
cases where it cannot repair is included.
xmldev/cli.py
What it does: Click-based CLI. Four commands: validate, repair, lint, serve.
validate: Runs the full validate_and_fix pipeline. Optionally writes fixed XML
and an audit JSON. Exit codes 0/1/2/3 as defined above.
repair: Same as validate but focuses on writing the fixed XML. Flags:
- --auto-apply: applies all patches regardless of confidence.
- --auto-apply-threshold: custom confidence cutoff.
- --aggressive: enables heuristic repair pass.
lint: Accepts a file or a directory. For directories, recursively finds *.xml.
Runs the pipeline on each file and reports pass/fail per file. Writes a JSON report.
serve: Starts a minimal http.server.HTTPServer on POST /repair. Accepts JSON
body {"schema": {...}, "xml": "..."}. Returns the validate_and_fix result as JSON.
Prometheus metrics exposed on port+1 if prometheus_client is importable.
Verbose mode (-v): Sets logging.getLogger("xmldev") to DEBUG level. Logs
each patch type, confidence, and path. Useful for pipeline debugging.
Writing Custom Validators
To attach domain-specific validation logic to any element, set custom_validator
in its schema ElementSpec:
{
  "name": "email",
  "text_type": "string",
  "custom_validator": "mypackage.validators:validate_email"
}
The referenced function must be importable when xmldev runs. It receives the
lxml._Element and should return:
- None or True: validation passed.
- A non-empty string: validation failed; the string becomes the violation message.
# mypackage/validators.py
import re


def validate_email(elem):
    # elem is the lxml element being validated; its text may be None.
    text = (elem.text or "").strip()
    if not re.fullmatch(r"[^@]+@[^@]+\.[^@]+", text):
        return f"Invalid email address: '{text}'"
    return None
Custom validator exceptions are caught and surfaced as CUSTOM_VALIDATOR_ERROR
violations rather than propagated. This ensures one bad validator does not crash
the entire pipeline.
Extending the Repair Pipeline
The repair pipeline is designed to be extended. The cleanest extension points are:
1. Adding a new violation code and fixer:
Add a new elif v.code == "MY_CODE": branch in
DeterministicRepairer._fix_violations(). Add the corresponding violation generation
in Validator._check_children() or Validator._validate_element(). Add a test.
2. Adding a new heuristic:
Subclass HeuristicRepairer or add a method to the existing class. Call it from
HeuristicRepairer.repair(). Mark patches with source="heuristic" and set
appropriate confidence values.
3. Replacing the LLM client:
LLMClient uses openai.OpenAI but only for the chat.completions.create call.
Any OpenAI-compatible endpoint works (Ollama, LMStudio, Mistral, Azure OpenAI, etc.)
by setting LLM_BASE_URL in the config. To replace the client entirely, subclass
LLMClient and override repair().
4. Custom fuzzy synonyms:
Edit _SYNONYM_MAP in xmldev/fuzzy.py. The map applies during normalization,
before similarity scoring. Keys and values are normalized strings.
Error Codes
| Code | When | Class |
|---|---|---|
| XM01 | Invalid or unreadable ground-truth schema | SchemaLoadError |
| XM02 | XML is so broken lxml cannot recover any tree | ParseError |
| XM03 | I/O error reading input or schema file | (OS-level, surfaced in CLI) |
| XM04 | LLM requested but not configured | (diagnostic string) |
| XM05 | LLM returned invalid or unrepairable XML | LLMError |
| XM06 | Validation still fails after all repair passes | (diagnostic strings in result) |
Exceptions carry .code, .message, .details, .suggested_action attributes.
Running Tests
pytest tests/ -v
Coverage report:
pytest tests/ --cov=xmldev --cov-report=term-missing
The test suite is entirely self-contained. No external services, no network calls.
LLM tests use mocked openai.OpenAI clients. All tests run in under 5 seconds on
a normal machine.
Test files:
| File | Coverage area |
|---|---|
| tests/test_schema.py | Schema loading, meta-schema validation, AST fields |
| tests/test_parser.py | Tolerant parse, recovery, canonicalization, XPath paths |
| tests/test_repair.py | Alias rename, fuzzy rename, type coercion, reorder, move/drop |
| tests/test_audit.py | Patch dataclass, AuditWriter, summary counts |
| tests/test_llm.py | Config loading, PII redaction, mocked LLM calls |
| tests/test_cli.py | Click test runner, exit codes, file outputs |
Linting
ruff check xmldev/ tests/
The project uses ruff for linting. Configuration is in pyproject.toml.
License
MIT. Go fix some XML.