High-performance XML to JSON streaming parser built with Rust

Project description

oxidize-xml

High-performance XML to JSON streaming parser built with Rust and PyO3. It is specialized for extracting repeated elements from large XML files such as API responses, log files, and data exports, and is aimed at engineers and analysts working in DuckDB, Polars, and Pandas.

Key Features

High Performance: 2-3x faster than lxml; built on Rust's quick-xml parser, with batches processed in parallel by Rayon
Low Memory Usage: Streaming architecture processes files larger than available RAM
Specialized Design: Opinionated API and schema design for common data engineering and data analysis workflows

Use Cases

Well suited to extracting repeated elements from XML files into newline-delimited JSON (JSON Lines):

API responses: Extract <record>, <item>, or <entry> elements from REST API responses
Log files: Parse <event> or <log> entries from XML-formatted logs
Data exports: Process <row>, <product>, or <transaction> elements from database exports
Configuration files: Extract <server>, <user>, or similar repeated configuration blocks

Installation

pip install oxidize-xml

Development Setup

# Install dependencies
poetry install

# Build the extension
./build.sh

# Run tests
pytest tests/

Usage

Extract specific elements from XML and convert them to JSON Lines:

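import oxidize_xml

# Sample input for the string-based functions
xml_content = '<catalog><book id="bk101"><title>Harry Potter</title></book></catalog>'

# File to file: returns the number of elements processed
count = oxidize_xml.parse_xml_file_to_json_file("data.xml", "book", "books.json")

# File to string: returns the JSON Lines output
json_lines = oxidize_xml.parse_xml_file_to_json_string("data.xml", "book")

# String to string
result = oxidize_xml.parse_xml_string_to_json_string(xml_content, "book")

# String to file: the output path is required
count = oxidize_xml.parse_xml_string_to_json_file(xml_content, "book", "books.json")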
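
Each line of the output is one JSON object, so any JSON Lines reader can consume it. The following is a minimal sketch using only the standard library. The sample string is hand-written to mirror the documented output shape (it was not produced by running oxidize-xml), and iter_records is a hypothetical helper name.

```python
import json

# Hand-written sample in the documented output shape: one JSON object per line.
json_lines = (
    '{"@id": "bk101", "author": ["J.K. Rowling"], "title": ["Harry Potter"]}\n'
    '{"@id": "bk102", "author": ["Frank Herbert"], "title": ["Dune"]}\n'
)


def iter_records(text):
    """Yield one dict per non-empty line of JSON Lines text."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


records = list(iter_records(json_lines))

# Child elements are arrays, so index into them to get a scalar value.
titles = [record["title"][0] for record in records]
```

Because every record shares the same array-valued fields, the same text should also load cleanly into tools that infer a schema from JSON Lines input.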

Conversion Rules

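Uniform arrays: Every child element is emitted as an array, even when it occurs only once, so that schema inference stays consistent across records. Attributes remain scalar values: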

<book id="bk101">
    <author>J.K. Rowling</author>
    <title>Harry Potter</title>
</book>

{
  "@id": "bk101",
  "author": ["J.K. Rowling"], 
  "title": ["Harry Potter"]
}

Key behaviors:

Attributes: Prefixed with @ to avoid conflicts with element names
Mixed content: Text in elements with children stored as #text entries
Empty elements: Self-closing/empty tags become null values
Structure preservation: Element order maintained via IndexMap
Namespace handling: Prefixes kept in element names, declarations treated as attributes
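
To make the rules above concrete, here is a rough pure-Python approximation built on the standard library's xml.etree.ElementTree. It is not the library's Rust implementation: it loads the whole document into memory, it does not reproduce the namespace rule, and it makes assumptions where the rules do not spell out the exact shape (#text is stored as a plain string, and an empty child appears as null inside its array). element_to_value is a hypothetical helper name.

```python
import json
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Convert one element according to the documented rules (illustrative only)."""
    text = (elem.text or "").strip()
    if not elem.attrib and len(elem) == 0:
        # Leaf element: its text, or None for an empty/self-closing tag.
        return text or None
    obj = {}
    for name, value in elem.attrib.items():
        obj["@" + name] = value  # attributes are prefixed with @
    if text:
        obj["#text"] = text  # assumption: mixed-content text kept as a string
    for child in elem:
        # Every child goes into an array, even if it occurs only once.
        obj.setdefault(child.tag, []).append(element_to_value(child))
    return obj


xml_content = """<catalog>
  <book id="bk101">
    <author>J.K. Rowling</author>
    <title>Harry Potter</title>
    <note/>
  </book>
</catalog>"""

root = ET.fromstring(xml_content)
# One JSON document per target element, i.e. JSON Lines.
lines = [json.dumps(element_to_value(book)) for book in root.iter("book")]
```

Python dictionaries preserve insertion order, which plays the role that IndexMap plays in the Rust code: keys appear in document order.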

Ignored features:

Processing instructions, DTDs, and comments (not relevant for data extraction)
Custom entity definitions (entity references are passed through as text)

Character references, by contrast, are unescaped automatically by quick-xml.

API

parse_xml_file_to_json_file(input_path, target_element, output_path, batch_size=1000) -> int
parse_xml_file_to_json_string(input_path, target_element, batch_size=1000) -> str  
parse_xml_string_to_json_file(xml_content, target_element, output_path, batch_size=1000) -> int
parse_xml_string_to_json_string(xml_content, target_element, batch_size=1000) -> str

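Parameters:

input_path / xml_content: Path to the XML file, or the XML document as a string
target_element: Name of the repeated element to extract (for example, "book")
output_path: Path of the JSON Lines file to write (the *_to_json_file functions only)
batch_size: Number of elements to process per batch (default: 1000, min: 1)

Returns:

The *_to_json_file functions return the number of elements processed (int); the *_to_json_string functions return the JSON Lines output (str). Invalid inputs raise ValueError.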
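
The idea behind batch_size can be illustrated in plain Python. This is only a sketch of generic batching with the same default and minimum; the actual batching happens inside the Rust extension and may differ. iter_batches is a hypothetical helper, not part of the oxidize_xml API.

```python
from itertools import islice


def iter_batches(items, batch_size=1000):
    """Yield lists of up to batch_size items from any iterable."""
    # Because this is a generator, the check runs when iteration starts.
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Five elements in batches of two: the last batch is short.
batches = list(iter_batches(range(5), batch_size=2))
```

Smaller batches keep fewer elements in memory at once, while larger batches give a parallel worker pool more work per dispatch.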

Testing

Run the test suite:

# All tests
pytest tests/

# Integration tests only  
pytest tests/integration/

# Performance benchmarks
pytest tests/performance/ --benchmark-only

# With coverage
pytest --cov=oxidize_xml --cov-report=html

Test coverage includes:

Core functionality validation
Error handling with malformed XML
Performance regression detection
Memory usage monitoring
Edge cases and concurrent operations

Architecture

oxidize-xml/src/io/
├── error.rs         # Centralized error handling and Python conversions
├── parser.rs        # Core XML streaming parser with security validations  
├── python_api.rs    # Clean Python function wrappers with shared logic
├── xml_utils.rs     # XML-to-JSON conversion utilities
└── mod.rs           # Module organization and exports

Security

oxidize-xml includes protections against common path-based and XML-based attacks:

File Path Security

Path sanitization: Prevents directory traversal attacks (../ sequences)
Null byte protection: Rejects paths containing null bytes
Path length limits: Maximum 4096 character paths
Canonical path validation: Uses system path normalization
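
The four checks above could look roughly like this in Python. This is a sketch under assumptions, not the library's code: the real validation lives in the Rust parser, and validate_path and MAX_PATH_LENGTH are hypothetical names chosen for illustration.

```python
import os

MAX_PATH_LENGTH = 4096  # the limit stated above; the constant name is hypothetical


def validate_path(path):
    """Return a canonical absolute path, or raise ValueError if a check fails."""
    if "\x00" in path:
        raise ValueError("path contains a null byte")
    if len(path) > MAX_PATH_LENGTH:
        raise ValueError("path is too long")
    # Reject any '..' component, accepting both separator styles.
    if ".." in path.replace("\\", "/").split("/"):
        raise ValueError("path contains a '..' component")
    # Resolve symlinks and redundant separators into a canonical form.
    return os.path.realpath(path)


safe_path = validate_path("data.xml")
```

Checking path components rather than searching for the substring ".." avoids rejecting legitimate file names such as "report..xml".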

XML Bomb Protection

Element nesting limit: Maximum 1000 levels of nesting depth
Element size limit: Maximum 10MB per element
Attribute limits: Maximum 1000 attributes per element
Attribute size limit: Maximum 64KB per attribute value

Security Limits

MAX_ELEMENT_DEPTH: 1000        // Maximum XML nesting depth
MAX_ELEMENT_SIZE: 10_000_000   // Maximum element size (10MB)
MAX_ATTRIBUTE_COUNT: 1000      // Maximum attributes per element  
MAX_ATTRIBUTE_SIZE: 65536      // Maximum attribute size (64KB)

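Together with the parser's handling of entities, these protections guard against:

Billion laughs attacks: Exponential entity expansion; custom entity definitions are ignored, so entity references are never expanded
Quadratic blowup attacks: Repeated expansion of a single large entity, which is blocked for the same reason
Excessive nesting: Deeply nested structures, capped by the depth limit
Memory exhaustion: Oversized elements or attributes
Directory traversal: Path-based security vulnerabilities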
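
As an illustration of how depth and attribute limits can be enforced while streaming, the sketch below walks a document with the standard library's xml.etree.ElementTree.iterparse. It is not the library's Rust implementation, and it omits the per-element size limit. check_limits is a hypothetical helper; the constants reuse the values listed under Security Limits.

```python
import io
import xml.etree.ElementTree as ET

MAX_ELEMENT_DEPTH = 1000
MAX_ATTRIBUTE_COUNT = 1000
MAX_ATTRIBUTE_SIZE = 65536


def check_limits(xml_bytes, max_depth=MAX_ELEMENT_DEPTH):
    """Stream through the document; raise ValueError when a limit is exceeded."""
    depth = 0
    stream = io.BytesIO(xml_bytes)
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth > max_depth:
                raise ValueError(f"nesting depth exceeds {max_depth}")
            if len(elem.attrib) > MAX_ATTRIBUTE_COUNT:
                raise ValueError("too many attributes on one element")
            for value in elem.attrib.values():
                if len(value.encode("utf-8")) > MAX_ATTRIBUTE_SIZE:
                    raise ValueError("attribute value too large")
        else:
            depth -= 1
            elem.clear()  # drop the text and children of the finished element
    return True


ok = check_limits(b"<a><b x='1'/></a>")
```

Checking on the start event means an oversized or overly deep document is rejected as soon as the offending tag is seen, before the rest of the input is read.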

Project details

Release history

This version

0.1.0

Sep 7, 2025

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

oxidize_xml-0.1.0-cp312-cp312-macosx_11_0_arm64.whl (326.9 kB)

Uploaded Sep 7, 2025 (CPython 3.12, macOS 11.0+ ARM64)

File details

Details for the file oxidize_xml-0.1.0-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

Download URL: oxidize_xml-0.1.0-cp312-cp312-macosx_11_0_arm64.whl
Upload date: Sep 7, 2025
Size: 326.9 kB
Tags: CPython 3.12, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for oxidize_xml-0.1.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`72492358c7992c30fec07b296032d9558f0b6c8206150d781098763088aa5386`
MD5	`6e9752f174cad973c8cf44203963cd36`
BLAKE2b-256	`3502f174f012778ba8c2c8d22404637b650c1f36b77bbc0ffd5538ef3c95a72e`
