
XML parser with streaming iterator interface


Xml Iterator

An XML parser for Python with a streaming iterator interface and protection against infinite-depth attacks.

Features

  • Streaming XML parsing - processes XML without loading the entire document into memory
  • Infinite depth protection - the iterator-based approach lets the caller set depth and event limits
  • xmltodict compatibility - the xml_to_dict() function produces results identical to the xmltodict library
  • High performance - Rust implementation 1.1x to 1.2x faster than xmltodict on full parses, 734x faster when terminating early
  • Unicode support - handles UTF-8 encoding correctly
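
As an illustration of the user-controlled limits mentioned above, here is a minimal sketch of a depth guard layered on top of an event stream. It assumes events are (count, event, value) tuples with 'start' and 'end' event kinds, as in the Example Output section. The names limit_depth and DepthLimitExceeded are hypothetical helpers, not part of xml_iterator, and the sample list stands in for real parser output.

```python
from typing import Iterable, Iterator, Tuple

# (count, event, value), the tuple shape shown in the Example Output section
Event = Tuple[int, str, str]


class DepthLimitExceeded(ValueError):
    """Hypothetical error raised when nesting goes deeper than allowed."""


def limit_depth(events: Iterable[Event], max_depth: int) -> Iterator[Event]:
    """Yield events unchanged, raising as soon as nesting exceeds max_depth."""
    depth = 0
    for count, event, value in events:
        if event == "start":
            depth += 1
            if depth > max_depth:
                raise DepthLimitExceeded(
                    f"depth {depth} exceeds limit {max_depth} at event {count}"
                )
        elif event == "end":
            depth -= 1
        yield count, event, value


# Hand-written events standing in for the output of iter_xml('file.xml').
sample = [
    (0, "start", "a"),
    (1, "start", "b"),
    (2, "start", "c"),
    (3, "end", "c"),
    (4, "end", "b"),
    (5, "end", "a"),
]

# Three levels of nesting pass a limit of 3 ...
shallow_ok = list(limit_depth(sample, max_depth=3))

# ... and are rejected by a limit of 2, before the rest of the input is read.
try:
    list(limit_depth(sample, max_depth=2))
    too_deep_rejected = False
except DepthLimitExceeded:
    too_deep_rejected = True
```

Because the guard is itself a generator, it could wrap iter_xml('file.xml') directly, and parsing would stop at the first element that is nested too deeply.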

Performance

Benchmarks comparing xml_to_dict() against xmltodict.parse():

Elements  File Size  xml_iterator  xmltodict  Speedup
500       0.2 MB     0.020s        0.024s     1.2x
2,000     0.7 MB     0.095s        0.099s     1.1x
5,000     1.8 MB     0.231s        0.251s     1.1x

Streaming advantage: 734x faster when processing only the first 1,000 events of a large file.
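
The early-termination gain comes from laziness: a streaming iterator does no work for events that are never requested. A minimal sketch of that behaviour using itertools.islice follows. The generator fake_event_stream is a hypothetical stand-in for a real parser (it is not part of xml_iterator) and records how many events were actually produced.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str, str]  # (count, event, value)


def fake_event_stream(total: int, consumed: List[int]) -> Iterator[Event]:
    """Hypothetical stand-in for a streaming parser; logs each event produced."""
    for i in range(total):
        consumed.append(i)
        yield (i, "text", f"value-{i}")


def take_first(events: Iterable[Event], n: int) -> List[Event]:
    """Return the first n events without pulling any further ones."""
    return list(islice(events, n))


consumed: List[int] = []
head = take_first(fake_event_stream(1_000_000, consumed), 1000)

# Only 1,000 of the 1,000,000 potential events were ever generated.
events_produced = len(consumed)
```

The same take_first helper could be applied to iter_xml('file.xml'); how much of the file the parser reads ahead internally is not something this sketch can show.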

Run benchmarks yourself:

  • make benchmark - Synthetic data comparison vs xmltodict
  • make benchmark-real - Real-world ESMA FIRDS XML file (downloads ~100MB)

Usage

from xml_iterator.xml_iterator import iter_xml
from xml_iterator.core import xml_to_dict

# Streaming iteration
for count, event, value in iter_xml('file.xml'):
    print(f"{event}: {value}")
    if count > 1000:  # User-controlled limits
        break

# Convert to dictionary (xmltodict compatible)
data = xml_to_dict('file.xml', max_depth=100, max_events=10000)
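
The streaming loop above can feed small filters that keep only the data of interest. The sketch below collects the text of every element with a given tag. It assumes the 'start', 'text' and 'end' event kinds shown in the Example Output section; iter_text is a hypothetical helper name, and the sample list is a hand-copied excerpt of that output rather than a live call to iter_xml.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[int, str, str]  # (count, event, value)


def iter_text(events: Iterable[Event], tag: str) -> Iterator[str]:
    """Yield text events that directly follow the opening of a <tag> element."""
    inside = False
    for _count, event, value in events:
        if event == "start":
            # Entering any element: we are "inside" only if it is the wanted tag.
            inside = value == tag
        elif event == "end":
            inside = False
        elif event == "text" and inside:
            yield value


# Excerpt of the simple.xml events listed in the Example Output section.
sample = [
    (0, "start", "breakfast_menu"),
    (1, "start", "food"),
    (2, "start", "name"),
    (3, "text", "Belgian Waffles"),
    (4, "end", "name"),
    (5, "start", "price"),
    (6, "text", "$5.95"),
    (7, "end", "price"),
    (14, "end", "food"),
    (15, "start", "food"),
    (16, "start", "name"),
    (17, "text", "Strawberry Belgian Waffles"),
    (18, "end", "name"),
    (28, "end", "food"),
    (71, "end", "breakfast_menu"),
]

names = list(iter_text(sample, "name"))
prices = list(iter_text(sample, "price"))
```

This simple state flag ignores text that appears after a nested child element closes; a full implementation would track a stack of open tags instead.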

Testing

Run the test suite with pytest:

# Install test dependencies
pip install -e ".[test]"

# Run all tests
pytest

# Run specific test types
pytest tests/test_basic.py           # Core functionality
pytest tests/test_xmltodict.py       # xmltodict compatibility
pytest tests/test_performance.py     # Performance regression tests

# Run benchmarks (separate from tests)
make benchmark           # Synthetic data vs xmltodict
make benchmark-real      # Real-world ESMA FIRDS XML

The test suite includes:

  • Basic functionality tests - streaming, encoding, deep nesting
  • xmltodict compatibility tests - results must match xmltodict exactly
  • Performance regression tests - ensure no slowdowns

Example Output

In [1]: from xml_iterator.xml_iterator import get_edge_counts, iter_xml

In [2]: get_edge_counts('simple.xml')
xml_iterator::reading "simple.xml"
Out[2]: 
{('breakfast_menu', 'food', 'price'): 5,
 ('breakfast_menu', 'food', 'description'): 5,
 ('breakfast_menu', 'food'): 5,
 ('breakfast_menu', 'food', 'calories'): 5,
 ('breakfast_menu',): 1,
 ('breakfast_menu', 'food', 'name'): 5}

In [3]: for x in iter_xml('simple.xml'):
   ...:     print(x)
   ...: 
xml_iterator::reading "simple.xml"
(0, 'start', 'breakfast_menu')
(1, 'start', 'food')
(2, 'start', 'name')
(3, 'text', 'Belgian Waffles')
(4, 'end', 'name')
(5, 'start', 'price')
(6, 'text', '$5.95')
(7, 'end', 'price')
(8, 'start', 'description')
(9, 'text', 'Two of our famous Belgian Waffles with plenty of real maple syrup')
(10, 'end', 'description')
(11, 'start', 'calories')
(12, 'text', '650')
(13, 'end', 'calories')
(14, 'end', 'food')
(15, 'start', 'food')
(16, 'start', 'name')
(17, 'text', 'Strawberry Belgian Waffles')
(18, 'end', 'name')
(19, 'start', 'price')
(20, 'text', '$7.95')
(21, 'end', 'price')
(22, 'start', 'description')
(23, 'text', 'Light Belgian waffles covered with strawberries and whipped cream')
(24, 'end', 'description')
(25, 'start', 'calories')
(26, 'text', '900')
(27, 'end', 'calories')
(28, 'end', 'food')
(29, 'start', 'food')
(30, 'start', 'name')
(31, 'text', 'Berry-Berry Belgian Waffles')
(32, 'end', 'name')
(33, 'start', 'price')
(34, 'text', '$8.95')
(35, 'end', 'price')
(36, 'start', 'description')
(37, 'text', 'Light Belgian waffles covered with an assortment of fresh berries and whipped cream')
(38, 'end', 'description')
(39, 'start', 'calories')
(40, 'text', '900')
(41, 'end', 'calories')
(42, 'end', 'food')
(43, 'start', 'food')
(44, 'start', 'name')
(45, 'text', 'French Toast')
(46, 'end', 'name')
(47, 'start', 'price')
(48, 'text', '$4.50')
(49, 'end', 'price')
(50, 'start', 'description')
(51, 'text', 'Thick slices made from our homemade sourdough bread')
(52, 'end', 'description')
(53, 'start', 'calories')
(54, 'text', '600')
(55, 'end', 'calories')
(56, 'end', 'food')
(57, 'start', 'food')
(58, 'start', 'name')
(59, 'text', 'Homestyle Breakfast')
(60, 'end', 'name')
(61, 'start', 'price')
(62, 'text', '$6.95')
(63, 'end', 'price')
(64, 'start', 'description')
(65, 'text', 'Two eggs, bacon or sausage, toast, and our ever-popular hash browns')
(66, 'end', 'description')
(67, 'start', 'calories')
(68, 'text', '950')
(69, 'end', 'calories')
(70, 'end', 'food')
(71, 'end', 'breakfast_menu')
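
The path counts returned by get_edge_counts have the same shape as what one gets by tracking a stack of open elements over the event stream. The following is a pure-Python sketch of that idea, not the library's actual (Rust) implementation; count_paths is a hypothetical name and the sample events are a shortened, hand-written menu with two food entries.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (count, event, value)


def count_paths(events: Iterable[Event]) -> Dict[Tuple[str, ...], int]:
    """Count how often each root-to-element tag path is opened."""
    counts: Counter = Counter()
    path: List[str] = []  # tags of the currently open elements
    for _count, event, value in events:
        if event == "start":
            path.append(value)
            counts[tuple(path)] += 1
        elif event == "end":
            path.pop()
    return dict(counts)


# Shortened stand-in for the simple.xml event stream (two food entries).
sample = [
    (0, "start", "breakfast_menu"),
    (1, "start", "food"),
    (2, "start", "name"),
    (3, "text", "Belgian Waffles"),
    (4, "end", "name"),
    (5, "start", "price"),
    (6, "text", "$5.95"),
    (7, "end", "price"),
    (8, "end", "food"),
    (9, "start", "food"),
    (10, "start", "name"),
    (11, "text", "French Toast"),
    (12, "end", "name"),
    (13, "start", "price"),
    (14, "text", "$4.50"),
    (15, "end", "price"),
    (16, "end", "food"),
    (17, "end", "breakfast_menu"),
]

path_counts = count_paths(sample)
```

Only the tag stack and the counter are held in memory, so memory use grows with nesting depth and the number of distinct paths rather than with document size.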


Source Distributions

No source distribution files available for this release.

Built Distributions


xml_iterator-0.1.4-cp312-cp312-win_amd64.whl (290.0 kB)
CPython 3.12, Windows x86-64

xml_iterator-0.1.4-cp312-cp312-manylinux_2_34_x86_64.whl (432.4 kB)
CPython 3.12, manylinux: glibc 2.34+ x86-64
