Full-fidelity XML parser with lossless round-trip editing
xmlcst
Full-fidelity XML concrete syntax tree for Python -- parse, edit, and serialize with zero formatting loss.
Why xmlcst?
Existing Python XML libraries (ElementTree, lxml, minidom) parse XML into a semantic tree that discards lexical details: whitespace, comment placement, attribute quote styles, entity reference forms, and more. When you serialize back, the output differs from the input even if you changed nothing.
xmlcst takes a different approach. It treats XML as source text first and semantic structure second, producing a concrete syntax tree (CST) that retains every byte of the original document. When you edit a single attribute, only that attribute changes in the output -- surrounding formatting, comments, and whitespace remain untouched.
This makes xmlcst ideal for programmatic editing of XML configuration files (Maven POMs, .csproj files, Spring configs, Android manifests) where changes must produce minimal, reviewable diffs.
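One way to check that an edit really is minimal is to diff the serialized text before and after the change. The sketch below uses only the standard library's difflib; `changed_lines` is a hypothetical helper, not part of xmlcst, and the two strings stand in for `doc.to_string()` output before and after an edit.

```python
import difflib


def changed_lines(before: str, after: str) -> tuple[list[str], list[str]]:
    """Return (removed, added) lines between two serialized documents.

    Hypothetical helper for illustration; not part of xmlcst.
    """
    old = before.splitlines()
    new = after.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed: list[str] = []
    added: list[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed.extend(old[i1:i2])
            added.extend(new[j1:j2])
    return removed, added


# Stand-ins for the serialized document before and after an edit.
before = "<project>\n  <!-- release -->\n  <version>1.0</version>\n</project>"
after = before.replace("<version>1.0</version>", "<version>1.1</version>")

removed, added = changed_lines(before, after)
print(removed)
print(added)
```

A check like this could run in a test suite to confirm that a one-value edit touches exactly one line.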
How xmlcst compares
| Feature | ElementTree | lxml | minidom | xmlcst |
|---|---|---|---|---|
| Attribute order | Partial | Partial | Partial | Preserved |
| Quote style (' vs ") | No | No | No | Preserved |
| Whitespace / indentation | No | No | No | Preserved |
| Comments | No | Yes | Yes | Preserved |
| Entity reference form | No | No | No | Preserved |
| CDATA vs escaped text | No | Yes | Yes | Preserved |
| Empty-element syntax (<x/> vs <x />) | No | No | No | Preserved |
| Byte-identical round-trip | No | No | No | Yes |
The closest conceptual analogue is ruamel.yaml -- a round-trip-capable YAML library -- applied to XML.
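The ElementTree column can be checked with the standard library alone. The snippet below is a small illustration, not a benchmark: it parses a document with single-quoted attributes, a comment, and a compact empty element, then serializes it unchanged with ElementTree's default settings.

```python
import xml.etree.ElementTree as ET

# Single quotes, a double space between attributes, a comment, and <x/>.
source = "<root b='2'  a=\"1\"><!-- keep me --><x/></root>"

root = ET.fromstring(source)
output = ET.tostring(root, encoding="unicode")

print(output)
print(output == source)  # False: nothing was edited, yet the text differs

# With the default parser the comment is dropped, and the serializer
# picks its own quote style and empty-element spelling.
print("<!--" in output)  # False
```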
Installation
pip install xmlcst
Requires Python 3.12+. Pure Python -- no compiled dependencies. Ships PEP 561 type annotations for full mypy / pyright support.
Quick Start
Parse and round-trip
import xmlcst
source = '<project xmlns="http://maven.apache.org/POM/4.0.0">\n <version>1.0</version>\n</project>'
doc = xmlcst.parse(source)
assert doc.to_string() == source # byte-identical round-trip
Edit an attribute (minimal diff)
doc = xmlcst.parse('<root version="1.0" author="alice"/>')
doc.root.attributes["version"] = "2.0"
print(doc.to_string())
# <root version="2.0" author="alice"/>
# Only the value changed -- quotes, whitespace, other attributes untouched
Navigate the tree
doc = xmlcst.parse("""\
<project>
<dependencies>
<dependency>
<groupId>junit</groupId>
</dependency>
</dependencies>
</project>""")
deps = doc.root.find("dependencies")
dep = deps.find("dependency")
group = dep.find("groupId")
print(group.children[0].content) # "junit" (a Text node)
# Or search recursively
dep2 = doc.root.find_recursive("dependency")
all_deps = doc.root.findall_recursive("dependency")
Add and remove elements
doc = xmlcst.parse("<root>\n <a/>\n <b/>\n</root>")
doc.root.append(xmlcst.Element("c"))
print(doc.to_string())
# <root>
# <a/>
# <b/>
# <c/>
# </root>
Access formatting metadata
doc = xmlcst.parse('<root id = "1" name=\'foo\'/>')
attr = doc.root.attributes["id"]
print(attr.raw_value) # "1"
print(attr.quote) # '"'
print(attr.leading_whitespace) # " "
print(attr.eq_whitespace) # (" ", " ")
Work with entity references
doc = xmlcst.parse("<root>a &amp; b</root>")
text = doc.root.children[0]
print(text.content) # "a &amp; b" (raw, as in the source)
print(text.decoded_content()) # "a & b" (entities resolved)
text.set_content("x < y") # auto-escapes
print(text.content) # "x &lt; y"
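The relationship between the raw and decoded forms can be illustrated without xmlcst, using the standard library's `xml.sax.saxutils`. This is only a sketch of the escaping rules for the predefined entities; it does not show how xmlcst implements `decoded_content()` or `set_content()`.

```python
from xml.sax.saxutils import escape, unescape

# Raw text as it appears in the XML source.
raw = "a &amp; b"
decoded = unescape(raw)
print(decoded)  # a & b

# Plain text that must be escaped before it can be stored as character data.
plain = "x < y"
escaped = escape(plain)
print(escaped)  # x &lt; y

# Escaping and unescaping are inverses for &, <, and >.
print(unescape(escaped) == plain)  # True
```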
Sample Application
The samples/bump_pom_version/ directory contains a complete example: a Maven POM version bumper that reads a pom.xml, increments the patch version, and writes the file back. Only the version string changes -- all comments, whitespace, attribute quoting, and other formatting are preserved exactly.
python samples/bump_pom_version/bump_pom_version.py
# 1.2.3 -> 1.2.4
The script accepts an optional path argument to operate on any POM file:
python samples/bump_pom_version/bump_pom_version.py /path/to/your/pom.xml
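The version arithmetic itself is simple. The following is a minimal sketch of a patch increment, not the code from the sample script: `bump_patch` is a hypothetical name, and carrying a qualifier such as `-SNAPSHOT` through unchanged is an assumption about the desired behavior.

```python
import re

# major.minor.patch followed by an optional qualifier (assumed to be kept as-is).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(.*)$")


def bump_patch(version: str) -> str:
    """Increment the patch component of a major.minor.patch version string."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unsupported version format: {version!r}")
    major, minor, patch, qualifier = match.groups()
    return f"{major}.{minor}.{int(patch) + 1}{qualifier}"


print(bump_patch("1.2.3"))           # 1.2.4
print(bump_patch("1.2.9"))           # 1.2.10
print(bump_patch("1.2.3-SNAPSHOT"))  # 1.2.4-SNAPSHOT
```

With xmlcst, the result would presumably be written back through `set_content()` on the Text node of the version element, so that only that string changes in the file.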
API Overview
Parsing
| Function | Input | Returns |
|---|---|---|
| xmlcst.parse(text) | str | Document |
| xmlcst.parse_bytes(data) | bytes | Document |
| xmlcst.parse_file(path) | str \| Path | Document |
All parse functions raise xmlcst.ParseError on malformed input. The error includes message, line, column, and offset attributes.
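Those four attributes are enough to build a compiler-style diagnostic. The sketch below formats one from any object that exposes `message`, `line`, and `column`; `StandInError` and `format_error` are hypothetical names used so the example runs without xmlcst, and 1-based line and column numbers are an assumption. In real use the object would come from `except xmlcst.ParseError as err:`.

```python
class StandInError(Exception):
    """Hypothetical stand-in with the attributes listed for xmlcst.ParseError."""

    def __init__(self, message: str, line: int, column: int, offset: int) -> None:
        super().__init__(message)
        self.message = message
        self.line = line      # assumed 1-based
        self.column = column  # assumed 1-based
        self.offset = offset


def format_error(source: str, err) -> str:
    """Render the message, the offending source line, and a caret under the column."""
    lines = source.splitlines()
    text = lines[err.line - 1] if 0 < err.line <= len(lines) else ""
    caret = " " * max(err.column - 1, 0) + "^"
    return f"line {err.line}, column {err.column}: {err.message}\n{text}\n{caret}"


source = "<root>\n  <a>\n</root>"
err = StandInError("mismatched closing tag", line=3, column=1, offset=13)
print(format_error(source, err))
```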
Node Types
| Type | Description |
|---|---|
| Document | Root container; holds all top-level nodes |
| Element | An XML element with tag, attributes, and children |
| Attribute | Name-value pair with formatting metadata (quote style, whitespace) |
| AttributeList | Ordered collection with dict-like access by name |
| Text | Character data (entity references preserved in raw form) |
| Whitespace | Whitespace-only character data between markup |
| Comment | <!-- ... --> |
| ProcessingInstruction | <?target data?> |
| CData | <![CDATA[...]]> |
| Doctype | <!DOCTYPE ...> (preserved verbatim) |
| XmlDeclaration | <?xml version="1.0" ...?> |
Serialization
| Method | Description |
|---|---|
| doc.to_string() | Exact round-trip serialization (default) |
| doc.to_string(mode="normalized") | Pretty-printed with consistent formatting |
| doc.to_bytes() | UTF-8 encoded; BOM preserved if present in input |
| doc.write(path) | Write to file |
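Preserving a byte order mark means remembering whether the input started with one and writing it back on output. The sketch below shows that idea with the standard library; `split_bom` and `join_bom` are hypothetical helpers, not xmlcst functions, and they say nothing about how `to_bytes()` is actually implemented.

```python
import codecs


def split_bom(data: bytes) -> tuple[bytes, str]:
    """Separate an optional UTF-8 BOM from the decoded document text."""
    if data.startswith(codecs.BOM_UTF8):
        return codecs.BOM_UTF8, data[len(codecs.BOM_UTF8):].decode("utf-8")
    return b"", data.decode("utf-8")


def join_bom(bom: bytes, text: str) -> bytes:
    """Re-attach whatever BOM (possibly none) the input had."""
    return bom + text.encode("utf-8")


with_bom = codecs.BOM_UTF8 + "<root>é</root>".encode("utf-8")
without_bom = b"<root/>"

# Both inputs survive the round trip byte for byte.
print(join_bom(*split_bom(with_bom)) == with_bom)        # True
print(join_bom(*split_bom(without_bom)) == without_bom)  # True
```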
Design
xmlcst uses a dual-layer architecture:
- Token stream (Layer 1) -- a lossless sequence of tokens covering every byte of the input. The fundamental invariant: "".join(t.text for t in tokens) == source.
- Tree API (Layer 2) -- mutable nodes backed by the token stream. Each node tracks a token span and a dirty flag.
Unmodified nodes serialize by replaying their original tokens (byte-identical). Modified nodes rebuild from their current properties. As a result, an edit changes only the text of the nodes it touches.
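The two layers can be sketched in a few dozen lines. This is a toy model under stated assumptions, not xmlcst's implementation: the regex tokenizer only separates markup from text, and `Token`, `Node`, `tokenize`, and `set_text` are hypothetical names. It does show the lossless-join invariant and how a dirty flag decides between replaying tokens and rebuilding.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str  # "markup", "text", or "stray"
    text: str


# Every character matches one alternative, so no input is ever dropped.
_TOKEN_RE = re.compile(r"(?P<markup><[^>]*>)|(?P<text>[^<]+)|(?P<stray><)")


def tokenize(source: str) -> list[Token]:
    """Layer 1 (toy): split the source into tokens covering every character."""
    return [Token(m.lastgroup or "", m.group()) for m in _TOKEN_RE.finditer(source)]


@dataclass
class Node:
    """Layer 2 (toy): a node backed by a token span, plus a dirty flag."""

    tokens: list[Token]
    start: int
    end: int  # exclusive
    replacement: str = ""
    dirty: bool = False

    def set_text(self, text: str) -> None:
        self.replacement = text
        self.dirty = True

    def serialize(self) -> str:
        if self.dirty:
            return self.replacement  # rebuilt from current state
        # Untouched: replay the original tokens verbatim.
        return "".join(t.text for t in self.tokens[self.start:self.end])


source = "<v>\n  <!-- pinned -->\n  <version>1.0</version>\n</v>"
tokens = tokenize(source)
print("".join(t.text for t in tokens) == source)  # True: the invariant holds

# One node per token keeps the example small; index 5 is the "1.0" text token.
nodes = [Node(tokens, i, i + 1) for i in range(len(tokens))]
nodes[5].set_text("1.1")
output = "".join(n.serialize() for n in nodes)
print(output)
```

Only the dirty node contributes new text; the comment, indentation, and line breaks come straight from the original tokens.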
See SPEC.md for the full specification.
Limitations (v1)
- UTF-8 encoding only
- XML 1.0 well-formed documents only (no error recovery)
- No DTD validation or schema support
- No XPath query engine
- No streaming / SAX-style parsing
- Pure Python (no compiled acceleration)
See the future roadmap in the specification for planned enhancements.
License
MIT