simdxml

SIMD-accelerated XML parser with full XPath 1.0 support for Python.

simdxml parses XML into flat arrays instead of a DOM tree, then evaluates XPath expressions against those arrays. The approach adapts simdjson's structural indexing architecture to XML: SIMD instructions classify structural characters in parallel, producing a compact index that supports all 13 XPath 1.0 axes via array operations.
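As a rough illustration of structural indexing, the scalar sketch below records the byte offsets of `<` and `>` in one pass and pairs them into tag spans. It is not simdxml's implementation: the real engine classifies many bytes per SIMD instruction, and this sketch ignores comments, CDATA, and `>` inside quoted attribute values. The name `tag_spans` is hypothetical.

```python
def tag_spans(xml: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of each `<...>` span, end exclusive."""
    spans = []
    open_at = None  # offset of the most recent unmatched '<'
    for i, byte in enumerate(xml):
        if byte == 0x3C:  # '<' opens a tag
            open_at = i
        elif byte == 0x3E and open_at is not None:  # '>' closes it
            spans.append((open_at, i + 1))
            open_at = None
    return spans


doc = b"<library><book><title>Rust</title></book></library>"
spans = tag_spans(doc)
names = [doc[start:end] for start, end in spans]
```

The result is a flat list of offsets into the original buffer rather than a tree of node objects, which is the property the rest of the design builds on.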

Installation

pip install simdxml

Pre-built wheels are available for Linux (x86_64, aarch64), macOS (arm64, x86_64), and Windows (x86_64).

Quick start

import simdxml

doc = simdxml.parse(b"<library><book><title>Rust</title></book></library>")
titles = doc.xpath_text("//title")
assert titles == ["Rust"]

API

Native API

The native API gives you direct access to the SIMD-accelerated engine:

import simdxml

# Parse bytes or str
doc = simdxml.parse(xml_bytes)

# XPath queries
doc.xpath_text("//title")          # -> list[str] (direct child text)
doc.xpath_string("//title")        # -> list[str] (all descendant text, like XPath string())
doc.xpath("//book[@lang='en']")    # -> list[Element | str]

# Element traversal
root = doc.root
root.tag                           # "library"
root.text                          # direct text content or None
root.attrib                        # {"lang": "en", ...}
root.get("lang")                   # "en"
root[0]                            # first child element
len(root)                          # number of child elements
list(root)                         # all child elements

# Navigation (lxml-compatible)
elem.getparent()                   # parent element or None
elem.getnext()                     # next sibling or None
elem.getprevious()                 # previous sibling or None

# XPath from any element
elem.xpath(".//title")             # context-node evaluation
elem.xpath_text("author")          # text extraction from context

# Batch APIs (single FFI call, interned strings)
root.child_tags()                  # -> list[str] of child tag names
root.descendant_tags("item")       # -> list[str] filtered by tag

# Compiled XPath (like re.compile)
expr = simdxml.compile("//title")
expr.eval_text(doc)                # -> list[str]
expr.eval_count(doc)               # -> int
expr.eval_exists(doc)              # -> bool
expr.eval(doc)                     # -> list[Element]

# Batch: process many documents in one call
from pathlib import Path
docs = [Path(f).read_text() for f in xml_files]  # xml_files: paths to XML files; each file is closed after reading
expr = simdxml.compile("//title")
simdxml.batch_xpath_text(docs, expr)           # bloom prefilter
simdxml.batch_xpath_text_parallel(docs, expr)  # multithreaded

ElementTree drop-in (read-only)

Full read-only drop-in replacement for xml.etree.ElementTree. Every read-only Element method and module function is supported:

from simdxml.etree import ElementTree as ET

tree = ET.parse("books.xml")
root = tree.getroot()

# All stdlib Element methods work
root.tag, root.text, root.tail, root.attrib
root.find(".//title")              # first match
root.findall(".//book[@lang]")     # all matches
root.findtext(".//title")          # text of first match
root.iterfind(".//author")         # iterator
root.iter("title")                 # descendant iterator
root.itertext()                    # text iterator
root.get("key"), root.keys(), root.items()
len(root), root[0], list(root)

# All stdlib module functions work
ET.parse(file), ET.fromstring(text), ET.tostring(element)
ET.iterparse(file, events=("start", "end"))
ET.canonicalize(xml), ET.dump(element), ET.iselement(obj)
ET.XMLPullParser(events=("end",)), ET.XMLParser(), ET.TreeBuilder()
ET.fromstringlist(seq), ET.tostringlist(elem)
ET.QName(uri, tag), ET.XMLID(text)

# Plus full XPath 1.0 (lxml-compatible extension)
root.xpath("//book[contains(title, 'XML')]")

Mutation operations (append, remove, set, SubElement, indent, etc.) raise TypeError with a helpful message pointing to stdlib.
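One way to try the drop-in module without making it a hard dependency is an import fallback. The sketch below assumes that `simdxml.etree.ElementTree` behaves like the stdlib module for the read-only calls used here, as the section above states; the sample catalog is made up for illustration.

```python
try:
    # Import path as given in the section above.
    from simdxml.etree import ElementTree as ET
except ImportError:
    # Fall back to the stdlib when simdxml is not installed.
    import xml.etree.ElementTree as ET

CATALOG = """<library>
  <book lang="en"><title>Rust</title><author>A. Writer</author></book>
  <book><title>XML</title><author>B. Writer</author></book>
</library>"""

root = ET.fromstring(CATALOG)
titles = [t.text for t in root.iter("title")]   # descendant iterator
tagged = root.findall(".//book[@lang]")         # books that have a lang attribute
first_author = root.findtext(".//author")       # text of the first match
```

Only read-only calls appear here, so the same code should run against either module.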

Read-only by design

simdxml Elements are immutable views into the structural index. Mutation operations raise TypeError with a helpful message:

root.text = "new"  # TypeError: simdxml Elements are read-only.
                    #   Use xml.etree.ElementTree for XML construction.
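When a parsed element does need editing, one option is to serialize it and re-parse it with the stdlib. The helper `to_mutable` below is hypothetical, and it assumes that `ET.tostring()` from `simdxml.etree` produces output the stdlib parser accepts. The round trip costs a full re-parse, so it suits small fragments better than whole documents.

```python
import xml.etree.ElementTree as StdET

try:
    from simdxml.etree import ElementTree as ET  # read-only, per the text above
except ImportError:
    ET = StdET  # fallback so the sketch runs without simdxml


def to_mutable(element):
    """Copy an element into a mutable stdlib tree by serializing and re-parsing."""
    return StdET.fromstring(ET.tostring(element))


root = ET.fromstring("<library><book><title>Rust</title></book></library>")
editable = to_mutable(root)
editable.find("book/title").text = "Rust, 2nd ed."  # allowed on the stdlib copy
result = StdET.tostring(editable, encoding="unicode")
```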

XPath 1.0 support

Near-complete conformance with XPath 1.0, measured against the libxml2 and pugixml test suites:

  • 327/327 libxml2 conformance tests (100%)
  • 1015/1023 pugixml conformance tests (99.2%)
  • All 13 axes: child, descendant, parent, ancestor, following-sibling, preceding-sibling, following, preceding, self, attribute, namespace, descendant-or-self, ancestor-or-self
  • All 25 functions: string(), contains(), count(), position(), last(), starts-with(), substring(), concat(), normalize-space(), etc.
  • Operators: and, or, =, !=, <, >, +, -, *, div, mod, |
  • Predicates: positional [1], [last()], boolean [@attr='val'], nested

Benchmarks

Measured on Apple Silicon with Python 3.14 and lxml 6.0, GC disabled. Each operation gets 3 warmup and 20 timed iterations; the median is reported. The input is a 100K-element catalog (5.6 MB). To reproduce: uv run python bench/bench_parse.py

Faster than lxml on every operation. Against stdlib: faster on 10 of 14, tied on 1, slower on 3.

| Operation | simdxml | lxml | stdlib | vs lxml | vs stdlib |
|---|---|---|---|---|---|
| parse() | 10 ms | 33 ms | 55 ms | 3x | 5x |
| find("item") | <1 us | 1 us | <1 us | faster | tied |
| find(".//name") | <1 us | 1 us | 1 us | faster | faster |
| findall("item") | 0.23 ms | 4.8 ms | 0.89 ms | 21x | 4x |
| findall(".//item") | 0.15 ms | 6.2 ms | 3.0 ms | 42x | 20x |
| findall(predicate) | 1.5 ms | 12 ms | 4.9 ms | 8x | 3x |
| findtext(".//name") | <1 us | 1 us | 1 us | faster | faster |
| xpath_text("//name") | 2.1 ms | 19 ms | 4.4 ms | 9x | 2x |
| iter() | 9.2 ms | 15 ms | 1.3 ms | 2x | 0.14x |
| iter("item") filtered | 4.5 ms | 5.9 ms | 1.9 ms | 1.3x | 0.4x |
| itertext() | 2.6 ms | 33 ms | 1.4 ms | 13x | 0.5x |
| child_tags() | 0.40 ms | 6.2 ms | 1.5 ms | 16x | 4x |
| iterparse() | 51 ms | 66 ms | 70 ms | 1.3x | 1.4x |
| canonicalize() | 1.8 ms | 4.7 ms | 4.6 ms | 3x | 3x |

The three operations where stdlib is faster (iter, itertext, iter filtered) involve creating per-element Python objects. The batch alternatives (child_tags(), xpath_text()) beat both lxml and stdlib for those workloads.
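The measurement method described above (GC disabled, 3 warmup runs, 20 timed runs, median reported) can be sketched with the stdlib alone. This is not the project's bench/bench_parse.py; the `bench` helper and the synthetic catalog are assumptions made for illustration, and only the stdlib parser is timed here.

```python
import gc
import statistics
import time
import xml.etree.ElementTree as ET


def bench(fn, warmup=3, iterations=20):
    """Return the median wall-clock time of fn() in seconds, with GC disabled."""
    gc_was_enabled = gc.isenabled()
    gc.disable()
    try:
        for _ in range(warmup):  # untimed runs to warm caches
            fn()
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
    finally:
        if gc_was_enabled:  # restore the caller's GC state
            gc.enable()
    return statistics.median(samples)


# Small synthetic document; the published numbers use a 5.6 MB catalog instead.
xml = "<catalog>" + "<item><name>n</name></item>" * 1000 + "</catalog>"
median_s = bench(lambda: ET.fromstring(xml))
print(f"stdlib fromstring: {median_s * 1e3:.3f} ms")
```

Reporting the median rather than the mean keeps a single slow run from skewing the result.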

Batch processing (multiple documents)

batch_xpath_text uses a bloom filter to skip non-matching documents at ~10 GiB/s. batch_xpath_text_parallel spreads parse + eval across threads. Both return all results in a single FFI call, so there is no per-document Python overhead.

| Workload | Python loop | bloom batch | parallel batch |
|---|---|---|---|
| 1K small docs | 1.1 ms | 0.37 ms (3x) | 12 ms |
| 100x 31KB docs | 7.9 ms | 8.2 ms | 2.6 ms (3x) |

Use bloom batch when many documents won't match the query (ETL filtering). Use parallel batch when documents are large (>10KB) and most will match.
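The prefilter idea can be sketched in pure Python: build a small Bloom filter over each document's tag names and parse only the documents that might contain the queried tag. How simdxml actually builds and queries its filter is not described here, so everything below (filter size, hash scheme, the regex scan, and the names `tag_bloom`, `might_contain`, `batch_titles`) is an assumption for illustration. The sketch shows the skip logic only and makes no speed claim.

```python
import hashlib
import re
import xml.etree.ElementTree as ET

BITS = 256   # filter size in bits (illustrative)
HASHES = 3   # number of hash positions per name (illustrative)


def _positions(name: bytes):
    """Derive HASHES bit positions for a tag name from one SHA-256 digest."""
    digest = hashlib.sha256(name).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % BITS for i in range(HASHES)]


def tag_bloom(doc: bytes) -> int:
    """Build a Bloom filter (stored as an int bitset) over the opening-tag names in doc."""
    bloom = 0
    for name in re.findall(rb"<([A-Za-z_][\w.-]*)", doc):
        for pos in _positions(name):
            bloom |= 1 << pos
    return bloom


def might_contain(bloom: int, name: bytes) -> bool:
    """False means definitely absent; True means possibly present."""
    return all((bloom >> pos) & 1 for pos in _positions(name))


def batch_titles(docs, tag=b"title"):
    """Return the text of every matching element per document, skipping sure misses."""
    results = []
    for doc in docs:
        if not might_contain(tag_bloom(doc), tag):
            results.append([])  # skipped without a full parse
            continue
        root = ET.fromstring(doc)
        results.append([el.text for el in root.iter(tag.decode())])
    return results


docs = [b"<a><title>Rust</title></a>", b"<a><name>no match</name></a>"]
out = batch_titles(docs)
```

A Bloom filter can report false positives but never false negatives, so a skipped document is guaranteed not to match, while a document that passes the filter still gets a full parse and query.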

How it works

Instead of building a DOM tree with heap-allocated nodes and pointer-chasing, simdxml represents XML structure as parallel arrays (struct-of-arrays layout). Each tag gets an entry in flat arrays for starts, ends, types, names, depths, and parents -- all indexed by the same position.

  • ~16 bytes per tag vs ~35 bytes per DOM node
  • O(1) ancestor/descendant checks via pre/post-order numbering
  • O(1) lookup of each node's child list via CSR (Compressed Sparse Row) indices
  • SIMD-accelerated structural parsing (NEON on ARM, AVX2 on x86)
  • Parse eagerly builds all indices (CSR, name posting, parent map) so subsequent queries pay zero index construction cost
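The layout above can be sketched in a few lines of Python. This is a toy model, not simdxml's data structures: it uses the stdlib parser to obtain a tree first (which simdxml avoids), and the array names and the `build_index` and `is_ancestor` helpers are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_index(xml: str):
    """Flatten a document into parallel arrays indexed by pre-order position."""
    names, parents, depths, post = [], [], [], []
    counter = 0  # next post-order number to hand out

    def visit(elem, parent, depth):
        nonlocal counter
        me = len(names)  # pre-order id of this element
        names.append(elem.tag)
        parents.append(parent)
        depths.append(depth)
        post.append(-1)
        for child in elem:
            visit(child, me, depth + 1)
        post[me] = counter  # assigned after all descendants
        counter += 1

    visit(ET.fromstring(xml), -1, 0)

    # CSR: children of node i are child_ids[offsets[i]:offsets[i + 1]].
    counts = [0] * len(names)
    for p in parents:
        if p >= 0:
            counts[p] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    child_ids = [0] * offsets[-1]
    cursor = offsets[:-1].copy()  # next free slot per parent
    for node, p in enumerate(parents):
        if p >= 0:
            child_ids[cursor[p]] = node
            cursor[p] += 1
    return names, parents, depths, post, offsets, child_ids


def is_ancestor(post, a, d):
    """a is a proper ancestor of d iff a is earlier in pre-order and later in post-order."""
    return a < d and post[a] > post[d]


names, parents, depths, post, offsets, child_ids = build_index(
    "<library><book><title/><author/></book><book><title/></book></library>"
)
root_children = child_ids[offsets[0]:offsets[1]]
```

The ancestor test is two integer comparisons, and finding a node's children is one slice of `child_ids`; neither follows pointers between heap-allocated nodes.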

Platform support

| Platform | SIMD Backend | Status |
|---|---|---|
| aarch64 (Apple Silicon, ARM) | NEON 128-bit | Production |
| x86_64 | AVX2 256-bit / SSE4.2 | Production |
| Other | Scalar (memchr-accelerated) | Working |

Development

git clone https://github.com/simdxml/simdxml-python
cd simdxml-python

make dev        # build extension (debug mode)
make test       # run tests
make lint       # ruff check + format
make typecheck  # pyright

Requires a Rust toolchain and Python 3.9+.

License

MIT OR Apache-2.0 (same as the simdxml Rust crate)
