Full-fidelity XML parser with lossless round-trip editing

xmlcst

Full-fidelity XML concrete syntax tree for Python -- parse, edit, and serialize with zero formatting loss.

Why xmlcst?

Existing Python XML libraries (ElementTree, lxml, minidom) parse XML into a semantic tree that discards lexical details: whitespace, comment placement, attribute quote styles, entity reference forms, and more. When you serialize back, the output differs from the input even if you changed nothing.
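
The loss is easy to observe with the standard library alone. The sketch below round-trips a small document through xml.etree.ElementTree without modifying the tree; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input with single quotes, extra spaces, and a comment.
source = "<root  b='2' a=\"1\"><!-- keep me --><x/></root>"

# Parse and serialize without touching the tree.
root = ET.fromstring(source)
output = ET.tostring(root, encoding="unicode")

print(output == source)   # False: the original text was not preserved
print("<!--" in output)   # False: the default parser drops comments
```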

xmlcst takes a different approach. It treats XML as source text first and semantic structure second, producing a concrete syntax tree (CST) that retains every byte of the original document. When you edit a single attribute, only that attribute changes in the output -- surrounding formatting, comments, and whitespace remain untouched.

This makes xmlcst ideal for programmatic editing of XML configuration files (Maven POMs, .csproj files, Spring configs, Android manifests) where changes must produce minimal, reviewable diffs.

How xmlcst compares

Feature                                ElementTree  lxml     minidom  xmlcst
Attribute order                        Partial      Partial  Partial  Preserved
Quote style (' vs ")                   No           No       No       Preserved
Whitespace / indentation               No           No       No       Preserved
Comments                               No           Yes      Yes      Preserved
Entity reference form                  No           No       No       Preserved
CDATA vs escaped text                  No           Yes      Yes      Preserved
Empty-element syntax (<x/> vs <x />)   No           No       No       Preserved
Byte-identical round-trip              No           No       No       Yes

The closest conceptual analogue is ruamel.yaml -- a round-trip-capable YAML library -- applied to XML.

Installation

pip install xmlcst

Requires Python 3.12+. Pure Python with no compiled dependencies. Ships PEP 561-compliant type annotations for mypy / pyright support.

Quick Start

Parse and round-trip

import xmlcst

source = '<project xmlns="http://maven.apache.org/POM/4.0.0">\n  <version>1.0</version>\n</project>'

doc = xmlcst.parse(source)
assert doc.to_string() == source  # byte-identical round-trip

Edit an attribute (minimal diff)

doc = xmlcst.parse('<root version="1.0" author="alice"/>')
doc.root.attributes["version"] = "2.0"
print(doc.to_string())
# <root version="2.0" author="alice"/>
# Only the value changed -- quotes, whitespace, other attributes untouched

Navigate the tree

doc = xmlcst.parse("""\
<project>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
    </dependency>
  </dependencies>
</project>""")

deps = doc.root.find("dependencies")
dep = deps.find("dependency")
group = dep.find("groupId")
print(group.children[0].content)  # "junit" (a Text node)

# Or search recursively
dep2 = doc.root.find_recursive("dependency")
all_deps = doc.root.findall_recursive("dependency")

Add and remove elements

doc = xmlcst.parse("<root>\n  <a/>\n  <b/>\n</root>")
doc.root.append(xmlcst.Element("c"))
print(doc.to_string())
# <root>
#   <a/>
#   <b/>
#   <c/>
# </root>

Access formatting metadata

doc = xmlcst.parse('<root  id = "1"  name=\'foo\'/>')
attr = doc.root.attributes["id"]
print(attr.raw_value)          # "1"
print(attr.quote)              # '"'
print(attr.leading_whitespace) # "  "
print(attr.eq_whitespace)      # (" ", " ")

Work with entity references

doc = xmlcst.parse("<root>a &amp; b</root>")
text = doc.root.children[0]
print(text.content)            # "a &amp; b"  (raw, as in the source)
print(text.decoded_content())  # "a & b"      (entities resolved)

text.set_content("x < y")     # auto-escapes
print(text.content)            # "x &lt; y"
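
The distinction between raw and decoded text can also be illustrated without xmlcst. The sketch below uses the standard library's xml.sax.saxutils helpers as an analogy only: by default they handle just &amp;amp;, &amp;lt;, and &amp;gt;, and this README does not say whether xmlcst uses them internally.

```python
from xml.sax.saxutils import escape, unescape

raw = "a &amp; b"           # text as it appears in the XML source
decoded = unescape(raw)     # resolve the predefined entities
print(decoded)              # a & b

escaped = escape("x < y")   # escape markup-significant characters
print(escaped)              # x &lt; y
```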

Sample Application

The samples/bump_pom_version/ directory contains a complete example: a Maven POM version bumper that reads a pom.xml, increments the patch version, and writes the file back. Only the version string changes -- all comments, whitespace, attribute quoting, and other formatting are preserved exactly.

python samples/bump_pom_version/bump_pom_version.py
# 1.2.3 -> 1.2.4

The script accepts an optional path argument to operate on any POM file:

python samples/bump_pom_version/bump_pom_version.py /path/to/your/pom.xml
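
The version arithmetic itself is small. The following is a minimal sketch of a patch bump, assuming a plain MAJOR.MINOR.PATCH string; bump_patch is a hypothetical helper, not the sample's actual code, and it does not handle suffixes such as -SNAPSHOT.

```python
import re

def bump_patch(version: str) -> str:
    """Increment the last component of a MAJOR.MINOR.PATCH string."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"{major}.{minor}.{patch + 1}"

print(bump_patch("1.2.3"))  # 1.2.4
```

With xmlcst, the new string would then be written back into the version element's text node, leaving every other token untouched.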

API Overview

Parsing

Function                   Input       Returns
xmlcst.parse(text)         str         Document
xmlcst.parse_bytes(data)   bytes       Document
xmlcst.parse_file(path)    str | Path  Document

All parse functions raise xmlcst.ParseError on malformed input. The error includes message, line, column, and offset attributes.
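
How an offset relates to a line and column can be sketched in a few lines. This example assumes a 0-based character offset and 1-based line and column numbers; the README does not state which convention xmlcst uses, and line_and_column is a hypothetical helper rather than part of the library.

```python
def line_and_column(source: str, offset: int) -> tuple[int, int]:
    """Map a 0-based character offset to a 1-based (line, column) pair."""
    line = source.count("\n", 0, offset) + 1
    line_start = source.rfind("\n", 0, offset) + 1  # 0 when on the first line
    return line, offset - line_start + 1

source = "<root>\n  <a>\n</root>"
print(line_and_column(source, 9))  # (2, 3): the "<" of <a>
```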

Node Types

Type                   Description
Document               Root container; holds all top-level nodes
Element                An XML element with tag, attributes, and children
Attribute              Name-value pair with formatting metadata (quote style, whitespace)
AttributeList          Ordered collection with dict-like access by name
Text                   Character data (entity references preserved in raw form)
Whitespace             Whitespace-only character data between markup
Comment                <!-- ... -->
ProcessingInstruction  <?target data?>
CData                  <![CDATA[...]]>
Doctype                <!DOCTYPE ...> (preserved verbatim)
XmlDeclaration         <?xml version="1.0" ...?>

Serialization

Method                            Description
doc.to_string()                   Exact round-trip serialization (default)
doc.to_string(mode="normalized")  Pretty-printed with consistent formatting
doc.to_bytes()                    UTF-8 encoded; BOM preserved if present in input
doc.write(path)                   Write to file

Design

xmlcst uses a dual-layer architecture:

  1. Token stream (Layer 1) -- a lossless sequence of tokens covering every byte of the input. The fundamental invariant: "".join(t.text for t in tokens) == source.
  2. Tree API (Layer 2) -- mutable nodes backed by the token stream. Each node tracks a token span and a dirty flag.
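
The token-stream invariant can be demonstrated with a toy tokenizer. This is a simplified sketch, not xmlcst's tokenizer: it only separates markup from text, the Token class and tokenize function are hypothetical, and it ignores cases such as ">" inside attribute values or CDATA sections.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    kind: str   # "markup" or "text" in this toy model
    text: str   # exact source slice, never normalized

# Comments first, then any other <...> construct. The capturing group makes
# re.split return the delimiters as well as the text between them.
_MARKUP = re.compile(r"(<!--.*?-->|<[^>]*>)", re.DOTALL)

def tokenize(source: str) -> list[Token]:
    tokens = []
    for index, piece in enumerate(_MARKUP.split(source)):
        if piece:  # skip empty strings between adjacent markup
            kind = "markup" if index % 2 else "text"
            tokens.append(Token(kind, piece))
    return tokens

source = "<root  id = '1'>\n  <!-- note -->a &amp; b\n</root>"
tokens = tokenize(source)

# The fundamental invariant: concatenating the tokens reproduces the input.
assert "".join(t.text for t in tokens) == source
```

Because every character belongs to exactly one token, nothing has to be re-derived at serialization time.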

Unmodified nodes serialize by replaying their original tokens (byte-identical). Modified nodes rebuild from their current properties. As a result, the output differs from the input only where nodes were actually modified.
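
The replay-or-rebuild rule can be sketched with a toy node type. AttributeNode below is hypothetical and much simpler than the library's nodes: it rebuilds with fixed default formatting, whereas xmlcst is described above as keeping quote style and whitespace when a value changes.

```python
from dataclasses import dataclass

@dataclass
class AttributeNode:
    """Toy node: keeps its original source text plus editable fields."""
    original: str        # exact source slice, e.g. " id = '1'"
    name: str
    value: str
    dirty: bool = False

    def set_value(self, value: str) -> None:
        self.value = value
        self.dirty = True    # serialize() must now rebuild this node

    def serialize(self) -> str:
        if not self.dirty:
            return self.original               # replay the source verbatim
        return f' {self.name}="{self.value}"'  # rebuild from current fields

nodes = [
    AttributeNode(" id = '1'", "id", "1"),
    AttributeNode("  name='foo'", "name", "foo"),
]
nodes[0].set_value("2")

# Only the edited node is rebuilt; the untouched node keeps its
# single quotes and double leading space.
print("".join(node.serialize() for node in nodes))
```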

See SPEC.md for the full specification.

Limitations (v1)

  • UTF-8 encoding only
  • XML 1.0 well-formed documents only (no error recovery)
  • No DTD validation or schema support
  • No XPath query engine
  • No streaming / SAX-style parsing
  • Pure Python (no compiled acceleration)

See the future roadmap in the specification for planned enhancements.

License

MIT
