SIMD-accelerated XML parser with full XPath 1.0 support
simdxml
SIMD-accelerated XML parser with full XPath 1.0 support for Python.
simdxml parses XML into flat arrays instead of a DOM tree, then evaluates
XPath expressions against those arrays. The approach adapts
simdjson's structural indexing architecture to XML:
SIMD instructions classify structural characters in parallel, producing a
compact index that supports all 13 XPath 1.0 axes via array operations.
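To make the structural-indexing idea concrete, here is a scalar Python sketch of the classification step. It is only an emulation: a real SIMD implementation compares a whole block of bytes per instruction, whereas this loop works byte by byte. The set of structural characters, the 64-byte block size, and the function names are assumptions for illustration and are not taken from simdxml's source.

```python
# Scalar emulation of SIMD-style structural classification (illustrative only).
STRUCTURAL = frozenset(b"<>/=\"'")  # assumed set of structural bytes
BLOCK = 64  # one bitmask per 64-byte block, as in simdjson-style indexers


def classify_blocks(data: bytes) -> list[int]:
    """Return one integer bitmask per 64-byte block; bit i is set when
    byte i of that block is a structural character."""
    masks = []
    for start in range(0, len(data), BLOCK):
        mask = 0
        for i, byte in enumerate(data[start:start + BLOCK]):
            if byte in STRUCTURAL:
                mask |= 1 << i
        masks.append(mask)
    return masks


def structural_positions(masks: list[int]) -> list[int]:
    """Flatten the bitmasks into a sorted array of byte offsets."""
    positions = []
    for block_index, mask in enumerate(masks):
        while mask:
            lowest = mask & -mask  # isolate the lowest set bit
            positions.append(block_index * BLOCK + lowest.bit_length() - 1)
            mask ^= lowest  # clear it and continue with the next bit
    return positions


xml = b"<a><b>hi</b></a>"
index = structural_positions(classify_blocks(xml))
```

The resulting `index` is a flat, sorted array of offsets. Later stages can work on it with array operations instead of walking a pointer-based tree.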
Installation
pip install simdxml
Pre-built wheels for Linux (x86_64, aarch64), macOS (arm64, x86_64), and Windows (x86_64).
Quick start
import simdxml
doc = simdxml.parse(b"<library><book><title>Rust</title></book></library>")
titles = doc.xpath_text("//title")
assert titles == ["Rust"]
API
Native API
The native API gives you direct access to the SIMD-accelerated engine:
import simdxml
# Parse bytes or str
doc = simdxml.parse(xml_bytes)
# XPath queries
doc.xpath_text("//title") # -> list[str] (direct child text)
doc.xpath_string("//title") # -> list[str] (all descendant text, like XPath string())
doc.xpath("//book[@lang='en']") # -> list[Element | str]
# Element traversal
root = doc.root
root.tag # "library"
root.text # direct text content or None
root.attrib # {"lang": "en", ...}
root.get("lang") # "en"
root[0] # first child element
len(root) # number of child elements
list(root) # all child elements
# Navigation (lxml-compatible)
elem.getparent() # parent element or None
elem.getnext() # next sibling or None
elem.getprevious() # previous sibling or None
# XPath from any element
elem.xpath(".//title") # context-node evaluation
elem.xpath_text("author") # text extraction from context
# Batch APIs (single FFI call, interned strings)
root.child_tags() # -> list[str] of child tag names
root.descendant_tags("item") # -> list[str] filtered by tag
# Compiled XPath (like re.compile)
expr = simdxml.compile("//title")
expr.eval_text(doc) # -> list[str]
expr.eval_count(doc) # -> int
expr.eval_exists(doc) # -> bool
expr.eval(doc) # -> list[Element]
# Batch: process many documents in one call
docs = [open(f).read() for f in xml_files]
expr = simdxml.compile("//title")
simdxml.batch_xpath_text(docs, expr) # bloom prefilter
simdxml.batch_xpath_text_parallel(docs, expr) # multithreaded
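The comments on xpath_text and xpath_string above distinguish "direct child text" from "all descendant text". The sketch below illustrates that distinction with the standard library only, so it runs without simdxml. How simdxml joins several direct text nodes in mixed content is not stated here, so the concatenation shown for `direct_text` is an assumption.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<title>Rust <em>in</em> Action</title>")

# Direct text: only the text nodes that are immediate children of <title>.
# In ElementTree these are elem.text plus the .tail of each child element.
direct_text = (elem.text or "") + "".join(child.tail or "" for child in elem)

# XPath string-value: every descendant text node, in document order.
string_value = "".join(elem.itertext())
```

For elements without mixed content, such as `<title>Rust</title>`, both values are the same string.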
ElementTree drop-in (read-only)
Full read-only drop-in replacement for xml.etree.ElementTree. Every
read-only Element method and module function is supported:
from simdxml.etree import ElementTree as ET
tree = ET.parse("books.xml")
root = tree.getroot()
# All stdlib Element methods work
root.tag, root.text, root.tail, root.attrib
root.find(".//title") # first match
root.findall(".//book[@lang]") # all matches
root.findtext(".//title") # text of first match
root.iterfind(".//author") # iterator
root.iter("title") # descendant iterator
root.itertext() # text iterator
root.get("key"), root.keys(), root.items()
len(root), root[0], list(root)
# All stdlib module functions work
ET.parse(file), ET.fromstring(text), ET.tostring(element)
ET.iterparse(file, events=("start", "end"))
ET.canonicalize(xml), ET.dump(element), ET.iselement(obj)
ET.XMLPullParser(events=("end",)), ET.XMLParser(), ET.TreeBuilder()
ET.fromstringlist(seq), ET.tostringlist(elem)
ET.QName(uri, tag), ET.XMLID(text)
# Plus full XPath 1.0 (lxml-compatible extension)
root.xpath("//book[contains(title, 'XML')]")
Mutation operations (append, remove, set, SubElement, indent, etc.)
raise TypeError with a helpful message pointing to stdlib.
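Since the module mirrors the stdlib read API, one possible pattern is to prefer simdxml when it is installed and fall back to the standard library otherwise. This is a sketch under that assumption: the import path is the one given above, and the calls used afterwards (`fromstring`, `findall`, `findtext`) are all listed above as supported.

```python
try:
    # Import path taken from this README; used when simdxml is installed.
    from simdxml.etree import ElementTree as ET
except ImportError:
    # Fall back to the standard library, which exposes the same read API.
    import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    "<book lang='en'><title>Rust</title></book>"
    "<book><title>XML</title></book>"
    "</library>"
)

titles = [t.text for t in root.findall(".//title")]
with_lang = root.findall(".//book[@lang]")
first_title = root.findtext(".//title")
```

Code written this way should stay read-only, because any mutation call would only work on the fallback path.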
Read-only by design
simdxml Elements are immutable views into the structural index. Mutation
operations raise TypeError with a helpful message:
root.text = "new" # TypeError: simdxml Elements are read-only.
# Use xml.etree.ElementTree for XML construction.
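A workflow that fits this split is to construct or modify the document with the mutable stdlib API, serialize it once, and hand the bytes to the read-only parser for querying. The sketch below runs the stdlib half only. The simdxml calls appear as comments because they need the package installed, and they use only the functions shown in the Native API section.

```python
import xml.etree.ElementTree as ET

# Build (or modify) the document with the mutable stdlib API.
library = ET.Element("library")
book = ET.SubElement(library, "book", {"lang": "en"})
ET.SubElement(book, "title").text = "Rust"

# Serialize once; the resulting bytes are what a read-only parser consumes.
xml_bytes = ET.tostring(library)

# With simdxml installed, the API shown earlier would then be used as:
#   doc = simdxml.parse(xml_bytes)
#   doc.xpath_text("//title")
```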
XPath 1.0 support
Full conformance with XPath 1.0:
- 327/327 libxml2 conformance tests (100%)
- 1015/1023 pugixml conformance tests (99.2%)
- All 13 axes: child, descendant, parent, ancestor, following-sibling, preceding-sibling, following, preceding, self, attribute, namespace, descendant-or-self, ancestor-or-self
- All 25 functions: string(), contains(), count(), position(), last(), starts-with(), substring(), concat(), normalize-space(), etc.
- Operators: and, or, =, !=, <, >, +, -, *, div, mod, |
- Predicates: positional [1], [last()], boolean [@attr='val'], nested
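The predicate forms listed above can be tried without installing anything, because the standard library's ElementTree implements a small subset of XPath that includes them. The sketch below uses that subset. It is not simdxml's engine, and the stdlib does not cover the full axis and function lists above.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    "<book lang='en'><title>A</title></book>"
    "<book lang='de'><title>B</title></book>"
    "<book lang='en'><title>C</title></book>"
    "</library>"
)


def titles(path: str) -> list[str]:
    """Return the <title> text of every book matched by `path`."""
    return [book.findtext("title") for book in root.findall(path)]


first = titles("book[1]")             # positional predicate (1-based)
last = titles("book[last()]")         # positional predicate using last()
english = titles("book[@lang='en']")  # boolean attribute predicate
```

Function calls inside predicates, such as contains(), are outside the stdlib subset. Those are the cases that root.xpath(...) is meant to cover.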
Benchmarks
Apple Silicon, Python 3.14, lxml 6.0. GC disabled, 3 warmup + 20 timed
iterations, median reported. 100K-element catalog (5.6 MB).
Run yourself: uv run python bench/bench_parse.py
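For readers who want to reproduce the methodology on their own data, here is a minimal timing harness following the description above: GC disabled, 3 warmup runs, 20 timed runs, median reported. It is a sketch, not the contents of bench/bench_parse.py. The `bench` helper and the synthetic catalog are hypothetical, and it times the stdlib parser so that it runs without extra packages.

```python
import gc
import statistics
import time
import xml.etree.ElementTree as ET


def bench(fn, warmup: int = 3, iterations: int = 20) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    gc_was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the samples
    try:
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1e3)
    finally:
        if gc_was_enabled:
            gc.enable()
    return statistics.median(samples)


# Synthetic catalog; far smaller than the 100K-element file used above.
xml = "<catalog>" + "<item><name>n</name></item>" * 1000 + "</catalog>"
parse_ms = bench(lambda: ET.fromstring(xml))
```

Swapping the lambda for another parser's entry point gives a like-for-like comparison on the same input.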
Faster than lxml on every operation. Faster than or tied with stdlib on 11 of 14.
| Operation | simdxml | lxml | stdlib | vs lxml | vs stdlib |
|---|---|---|---|---|---|
| parse() | 10 ms | 33 ms | 55 ms | 3x | 5x |
| find("item") | <1 us | 1 us | <1 us | faster | tied |
| find(".//name") | <1 us | 1 us | 1 us | faster | faster |
| findall("item") | 0.23 ms | 4.8 ms | 0.89 ms | 21x | 4x |
| findall(".//item") | 0.15 ms | 6.2 ms | 3.0 ms | 42x | 20x |
| findall(predicate) | 1.5 ms | 12 ms | 4.9 ms | 8x | 3x |
| findtext(".//name") | <1 us | 1 us | 1 us | faster | faster |
| xpath_text("//name") | 2.1 ms | 19 ms | 4.4 ms | 9x | 2x |
| iter() | 9.2 ms | 15 ms | 1.3 ms | 2x | 0.14x |
| iter("item") filtered | 4.5 ms | 5.9 ms | 1.9 ms | 1.3x | 0.4x |
| itertext() | 2.6 ms | 33 ms | 1.4 ms | 13x | 0.5x |
| child_tags() | 0.40 ms | 6.2 ms | 1.5 ms | 16x | 4x |
| iterparse() | 51 ms | 66 ms | 70 ms | 1.3x | 1.4x |
| canonicalize() | 1.8 ms | 4.7 ms | 4.6 ms | 3x | 3x |
The three operations where stdlib is faster (iter, itertext, iter filtered)
involve creating per-element Python objects. The batch alternatives
(child_tags(), xpath_text()) beat both lxml and stdlib for those workloads.
Batch processing (multiple documents)
batch_xpath_text uses a bloom filter to skip non-matching documents at
~10 GiB/s. batch_xpath_text_parallel spreads parse + eval across threads.
Both return all results in a single FFI call — zero per-document Python overhead.
| Workload | Python loop | bloom batch | parallel batch |
|---|---|---|---|
| 1K small docs | 1.1 ms | 0.37 ms (3x) | 12 ms |
| 100x 31KB docs | 7.9 ms | 8.2 ms | 2.6 ms (3x) |
Use bloom batch when many documents won't match the query (ETL filtering). Use parallel batch when documents are large (>10KB) and most will match.
How it works
Instead of building a DOM tree with heap-allocated nodes and pointer-chasing, simdxml represents XML structure as parallel arrays (struct-of-arrays layout). Each tag gets an entry in flat arrays for starts, ends, types, names, depths, and parents -- all indexed by the same position.
- ~16 bytes per tag vs ~35 bytes per DOM node
- O(1) ancestor/descendant checks via pre/post-order numbering
- O(1) child enumeration via CSR (Compressed Sparse Row) indices
- SIMD-accelerated structural parsing (NEON on ARM, AVX2 on x86)
- Parse eagerly builds all indices (CSR, name posting, parent map) so subsequent queries pay zero index construction cost
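The layout described above can be sketched in plain Python. The example builds parallel arrays indexed by pre-order position, records post-order numbers for the O(1) ancestor test, and packs child lists into CSR form. It uses the stdlib parser as a stand-in input source. The array names and the `build_index` helper are assumptions for illustration and do not reflect simdxml's internal field names.

```python
import xml.etree.ElementTree as ET


def build_index(root):
    """Flatten a tree into parallel arrays indexed by pre-order position."""
    names, parents, depths, post = [], [], [], []
    counter = 0  # post-order counter

    def visit(elem, parent, depth):
        nonlocal counter
        me = len(names)  # pre-order number doubles as the array index
        names.append(elem.tag)
        parents.append(parent)
        depths.append(depth)
        post.append(-1)  # filled in once the subtree has been visited
        for child in elem:
            visit(child, me, depth + 1)
        post[me] = counter
        counter += 1

    visit(root, -1, 0)

    # CSR layout: children of node i are child_ids[offsets[i]:offsets[i + 1]].
    n = len(names)
    offsets = [0] * (n + 1)
    for p in parents[1:]:  # count children per parent
        offsets[p + 1] += 1
    for i in range(n):  # prefix sum turns counts into offsets
        offsets[i + 1] += offsets[i]
    child_ids = [0] * (n - 1)
    cursor = offsets[:-1].copy()
    for node in range(1, n):
        p = parents[node]
        child_ids[cursor[p]] = node
        cursor[p] += 1
    return names, parents, depths, post, offsets, child_ids


def is_ancestor(a, d, post):
    """O(1): a precedes d in pre-order and follows it in post-order."""
    return a < d and post[a] > post[d]


root = ET.fromstring("<library><book><title/><author/></book><book/></library>")
names, parents, depths, post, offsets, child_ids = build_index(root)
children_of_root = child_ids[offsets[0]:offsets[1]]
```

Every per-node property lives at the same index in each array. A query such as "children of node 0" becomes a slice, and an ancestor check becomes two integer comparisons.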
Platform support
| Platform | SIMD Backend | Status |
|---|---|---|
| aarch64 (Apple Silicon, ARM) | NEON 128-bit | Production |
| x86_64 | AVX2 256-bit / SSE4.2 | Production |
| Other | Scalar (memchr-accelerated) | Working |
Development
git clone https://github.com/simdxml/simdxml-python
cd simdxml-python
make dev # build extension (debug mode)
make test # run tests
make lint # ruff check + format
make typecheck # pyright
Requires a Rust toolchain and Python 3.9+.
License
MIT OR Apache-2.0 (same as the simdxml Rust crate)