Full-fidelity XML parser with lossless round-trip editing

Maintainers

rcook

Project description

xmlcst

Full-fidelity XML concrete syntax tree for Python -- parse, edit, and serialize with zero formatting loss.

Why xmlcst?

Existing Python XML libraries (ElementTree, lxml, minidom) parse XML into a semantic tree that discards lexical details: whitespace, comment placement, attribute quote styles, entity reference forms, and more. When you serialize back, the output differs from the input even if you changed nothing.
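The loss is easy to reproduce with the standard library alone. The following is a minimal sketch using only `xml.etree.ElementTree`; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative input: doubled space, single-quoted attribute, a comment.
source = "<root  b='2' a=\"1\"><!-- keep me --><x/></root>"

# Parse and immediately serialize without changing anything.
root = ET.fromstring(source)
output = ET.tostring(root, encoding="unicode")

print(source)
print(output)

# The default parser drops the comment and the serializer normalizes
# quoting and spacing, so the text no longer matches the input.
lossy = output != source
```

Running a diff between the two printed lines shows changes that nobody asked for, which is the problem described above.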

xmlcst takes a different approach. It treats XML as source text first and semantic structure second, producing a concrete syntax tree (CST) that retains every byte of the original document. When you edit a single attribute, only that attribute changes in the output -- surrounding formatting, comments, and whitespace remain untouched.

This makes xmlcst ideal for programmatic editing of XML configuration files (Maven POMs, .csproj files, Spring configs, Android manifests) where changes must produce minimal, reviewable diffs.

How xmlcst compares

Feature	ElementTree	lxml	minidom	xmlcst
Attribute order	Partial	Partial	Partial	Preserved
Quote style (`'` vs `"`)	No	No	No	Preserved
Whitespace / indentation	No	No	No	Preserved
Comments	No	Yes	Yes	Preserved
Entity reference form	No	No	No	Preserved
CDATA vs escaped text	No	Yes	Yes	Preserved
Empty-element syntax (`<x/>` vs `<x />`)	No	No	No	Preserved
Byte-identical round-trip	No	No	No	Yes

The closest conceptual analogue is ruamel.yaml, a round-trip-capable YAML library; xmlcst applies the same idea to XML.

Installation

pip install xmlcst

Requires Python 3.12+. Pure Python -- no compiled dependencies. Ships PEP 561 type annotations for full mypy / pyright support.

Quick Start

Parse and round-trip

import xmlcst

source = '<project xmlns="http://maven.apache.org/POM/4.0.0">\n  <version>1.0</version>\n</project>'

doc = xmlcst.parse(source)
assert doc.to_string() == source  # byte-identical round-trip

Edit an attribute (minimal diff)

doc = xmlcst.parse('<root version="1.0" author="alice"/>')
doc.root.attributes["version"] = "2.0"
print(doc.to_string())
# <root version="2.0" author="alice"/>
# Only the value changed -- quotes, whitespace, other attributes untouched

Navigate the tree

doc = xmlcst.parse("""\
<project>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
    </dependency>
  </dependencies>
</project>""")

deps = doc.root.find("dependencies")
dep = deps.find("dependency")
group = dep.find("groupId")
print(group.children[0].content)  # "junit" (a Text node)

# Or search recursively
dep2 = doc.root.find_recursive("dependency")
all_deps = doc.root.findall_recursive("dependency")
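For readers coming from ElementTree, the distinction between a direct-child search and a recursive search may be familiar. The sketch below uses only the standard library and assumes (the text above suggests it but does not state it) that xmlcst's `find` looks at direct children only, as ElementTree's plain tag lookup does.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""\
<project>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
    </dependency>
  </dependencies>
</project>""")

# A bare tag name matches direct children only.
direct = root.find("dependency")

# The ".//" prefix searches at any depth below the current element.
nested = root.find(".//dependency")
all_deps = root.findall(".//dependency")

group_text = nested.find("groupId").text
print(direct, group_text, len(all_deps))
```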

Add and remove elements

doc = xmlcst.parse("<root>\n  <a/>\n  <b/>\n</root>")
doc.root.append(xmlcst.Element("c"))
print(doc.to_string())
# <root>
#   <a/>
#   <b/>
#   <c/>
# </root>

Access formatting metadata

doc = xmlcst.parse('<root  id = "1"  name=\'foo\'/>')
attr = doc.root.attributes["id"]
print(attr.raw_value)          # "1"
print(attr.quote)              # '"'
print(attr.leading_whitespace) # "  "
print(attr.eq_whitespace)      # (" ", " ")
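To make the metadata fields concrete, here is a rough sketch of how the pieces of an attribute could be captured from source text with a regular expression. The pattern and the group names are hypothetical and are not xmlcst's actual tokenizer; the point is only that every character belongs to exactly one captured piece.

```python
import re

# Hypothetical pattern for one attribute inside a start tag.
ATTR = re.compile(
    r"""(?P<leading>\s+)(?P<name>[^\s=/>]+)(?P<before_eq>\s*)=(?P<after_eq>\s*)"""
    r"""(?P<quote>["'])(?P<raw_value>.*?)(?P=quote)""",
    re.DOTALL,
)

tag = "<root  id = \"1\"  name='foo'/>"
attrs = [m.groupdict() for m in ATTR.finditer(tag)]

for a in attrs:
    print(a)

# Re-joining the captured pieces reproduces the attribute text exactly.
rebuilt = "".join(
    a["leading"] + a["name"] + a["before_eq"] + "=" + a["after_eq"]
    + a["quote"] + a["raw_value"] + a["quote"]
    for a in attrs
)
```

Because nothing is discarded during capture, `"<root" + rebuilt + "/>"` equals the original tag.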

Work with entity references

doc = xmlcst.parse("<root>a &amp; b</root>")
text = doc.root.children[0]
print(text.content)            # "a &amp; b"  (raw, as in the source)
print(text.decoded_content())  # "a & b"      (entities resolved)

text.set_content("x < y")     # auto-escapes
print(text.content)            # "x &lt; y"
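The raw versus decoded distinction can be tried out with the standard library. This sketch uses `xml.sax.saxutils`, which handles only `&amp;`, `&lt;`, and `&gt;` by default; it is an analogy for the behaviour shown above, not a description of how xmlcst implements it.

```python
from xml.sax.saxutils import escape, unescape

raw = "a &amp; b"           # text as it appears in the source
decoded = unescape(raw)     # entity references resolved
escaped = escape("x < y")   # markup characters escaped for storage

print(decoded)
print(escaped)
```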

Sample Application

The samples/bump_pom_version/ directory contains a complete example: a Maven POM version bumper that reads a pom.xml, increments the patch version, and writes the file back. Only the version string changes -- all comments, whitespace, attribute quoting, and other formatting are preserved exactly.

python samples/bump_pom_version/bump_pom_version.py
# 1.2.3 -> 1.2.4

The script accepts an optional path argument to operate on any POM file:

python samples/bump_pom_version/bump_pom_version.py /path/to/your/pom.xml
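The version arithmetic itself is small. The helper below is a hypothetical sketch, not the code from the sample, and it assumes a plain `MAJOR.MINOR.PATCH` string with no qualifier such as `-SNAPSHOT`.

```python
def bump_patch(version: str) -> str:
    """Increment the last component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = version.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"


print(bump_patch("1.2.3"))
```

Parsing the patch component as an integer matters: a string increment would turn `1.2.9` into something other than `1.2.10`.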

API Overview

Parsing

Function	Input	Returns
`xmlcst.parse(text)`	`str`	`Document`
`xmlcst.parse_bytes(data)`	`bytes`	`Document`
`xmlcst.parse_file(path)`	`str | Path`	`Document`

All parse functions raise xmlcst.ParseError on malformed input. The error includes message, line, column, and offset attributes.
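The relationship between an offset and a line and column pair can be sketched as follows. The function name is hypothetical, and the 1-based numbering is an assumption; the text above does not say whether xmlcst counts from 0 or 1.

```python
def locate(source: str, offset: int) -> tuple[int, int]:
    """Return (line, column), both 1-based, for a character offset."""
    line = source.count("\n", 0, offset) + 1
    last_newline = source.rfind("\n", 0, offset)  # -1 when on the first line
    column = offset - last_newline
    return line, column


source = "<root>\n  <a>\n</root>"
offset = source.index("<a>")
print(locate(source, offset))
```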

Node Types

Type	Description
`Document`	Root container; holds all top-level nodes
`Element`	An XML element with tag, attributes, and children
`Attribute`	Name-value pair with formatting metadata (quote style, whitespace)
`AttributeList`	Ordered collection with dict-like access by name
`Text`	Character data (entity references preserved in raw form)
`Whitespace`	Whitespace-only character data between markup
`Comment`	`<!-- ... -->`
`ProcessingInstruction`	`<?target data?>`
`CData`	`<![CDATA[...]]>`
`Doctype`	`<!DOCTYPE ...>` (preserved verbatim)
`XmlDeclaration`	`<?xml version="1.0" ...?>`

Serialization

Method	Description
`doc.to_string()`	Exact round-trip serialization (default)
`doc.to_string(mode="normalized")`	Pretty-printed with consistent formatting
`doc.to_bytes()`	UTF-8 encoded; BOM preserved if present in input
`doc.write(path)`	Write to file
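Preserving a byte order mark amounts to remembering whether it was there. The two helpers below are hypothetical and use only the standard library; they illustrate the idea behind the `to_bytes()` row rather than xmlcst's own code.

```python
import codecs


def split_bom(data: bytes) -> tuple[bool, str]:
    """Decode UTF-8 bytes, reporting whether a BOM was present."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    return has_bom, body.decode("utf-8")


def join_bom(has_bom: bool, text: str) -> bytes:
    """Encode text as UTF-8, restoring the BOM if the input had one."""
    encoded = text.encode("utf-8")
    return codecs.BOM_UTF8 + encoded if has_bom else encoded


original = codecs.BOM_UTF8 + "<root/>".encode("utf-8")
has_bom, text = split_bom(original)
round_tripped = join_bom(has_bom, text)
```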

Design

xmlcst uses a dual-layer architecture:

- Token stream (Layer 1) -- a lossless sequence of tokens covering every byte of the input. The fundamental invariant: "".join(t.text for t in tokens) == source.
- Tree API (Layer 2) -- mutable nodes backed by the token stream. Each node tracks a token span and a dirty flag.
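The token stream invariant can be demonstrated with a toy lexer. This is a sketch under heavy simplification: it knows only comments, tags, and text runs, and the `Token` class and `tokenize` function are invented for illustration rather than taken from xmlcst.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str
    text: str


# Toy grammar; a real XML tokenizer needs many more token kinds.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)|(?P<tag><[^>]*>)|(?P<text>[^<]+)",
    re.DOTALL,
)


def tokenize(source: str) -> list[Token]:
    """Split source into tokens that together cover every character."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens


source = '<root  a="1"><!-- c -->\n  <x />\n</root>'
tokens = tokenize(source)
joined = "".join(t.text for t in tokens)
```

Since each match starts where the previous one ended, no character can be skipped, and `joined == source` holds by construction.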

Unmodified nodes serialize by replaying their original tokens, so their output is byte-identical to the input. Modified nodes are rebuilt from their current properties. As a result, an edit changes only the text of the nodes it touches.
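The replay-or-rebuild rule can be sketched with a single node type. `AttrNode` and its fields are hypothetical names chosen for this example, and the sketch does not escape the new value; it only shows how a dirty flag selects between the original text and a rebuilt one.

```python
from dataclasses import dataclass


@dataclass
class AttrNode:
    original: str   # exact source text for this attribute
    leading: str    # whitespace before the name
    name: str
    eq: str         # "=" together with any surrounding whitespace
    quote: str
    value: str
    dirty: bool = False

    def set_value(self, value: str) -> None:
        self.value = value
        self.dirty = True

    def serialize(self) -> str:
        if not self.dirty:
            return self.original  # replay the source text untouched
        # Rebuild from current properties, keeping the recorded formatting.
        return f"{self.leading}{self.name}{self.eq}{self.quote}{self.value}{self.quote}"


attrs = [
    AttrNode("  version = '1.0'", "  ", "version", " = ", "'", "1.0"),
    AttrNode(' author="alice"', " ", "author", "=", '"', "alice"),
]
attrs[0].set_value("2.0")
out = "<root" + "".join(a.serialize() for a in attrs) + "/>"
print(out)
```

Only the first attribute takes the rebuild path; the second is emitted from its stored source text.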

See SPEC.md for the full specification.

Limitations (v1)

- UTF-8 encoding only
- XML 1.0 well-formed documents only (no error recovery)
- No DTD validation or schema support
- No XPath query engine
- No streaming / SAX-style parsing
- Pure Python (no compiled acceleration)

See the future roadmap in the specification for planned enhancements.

License

MIT

This version

0.1.0

May 17, 2026

Download files

Source Distribution

xmlcst-0.1.0.tar.gz (32.1 kB)

Uploaded May 17, 2026 Source

Built Distribution

xmlcst-0.1.0-py3-none-any.whl (16.9 kB)

Uploaded May 17, 2026 Python 3

File details

Details for the file xmlcst-0.1.0.tar.gz.

File metadata

Download URL: xmlcst-0.1.0.tar.gz
Upload date: May 17, 2026
Size: 32.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xmlcst-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0da82fdf7bf92c17cff0cf18cb5dc844b55881e2be6ec4a3adf8556758d1d9ef`
MD5	`cd56db887f04e5e4ff596a1eade9d110`
BLAKE2b-256	`8bac9ec826c0d7ea85041e515f30895fef47d3e575d25e4cff3818349e789df1`

Provenance

The following attestation bundles were made for xmlcst-0.1.0.tar.gz:

Publisher: publish.yml on rcook/xmlcst

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: xmlcst-0.1.0.tar.gz
- Subject digest: 0da82fdf7bf92c17cff0cf18cb5dc844b55881e2be6ec4a3adf8556758d1d9ef
- Sigstore transparency entry: 1563949047
- Sigstore integration time: May 17, 2026
Source repository:
- Permalink: rcook/xmlcst@50400b971570d73ef017e63d2f50a0f44491272c
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/rcook
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@50400b971570d73ef017e63d2f50a0f44491272c
- Trigger Event: push

File details

Details for the file xmlcst-0.1.0-py3-none-any.whl.

File metadata

Download URL: xmlcst-0.1.0-py3-none-any.whl
Upload date: May 17, 2026
Size: 16.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xmlcst-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bd0178dc88474d6f1fd2bee3300d81ee4ca6f677f624d6f192f17853eacaf192`
MD5	`e1535dd0f889e1587e4763b079f139b2`
BLAKE2b-256	`522aff52110b65b6b070cb33e5d78ca6acd3e135491cf564df845b214122eb3b`

Provenance

The following attestation bundles were made for xmlcst-0.1.0-py3-none-any.whl:

Publisher: publish.yml on rcook/xmlcst

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: xmlcst-0.1.0-py3-none-any.whl
- Subject digest: bd0178dc88474d6f1fd2bee3300d81ee4ca6f677f624d6f192f17853eacaf192
- Sigstore transparency entry: 1563949088
- Sigstore integration time: May 17, 2026
Source repository:
- Permalink: rcook/xmlcst@50400b971570d73ef017e63d2f50a0f44491272c
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/rcook
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@50400b971570d73ef017e63d2f50a0f44491272c
- Trigger Event: push
