Oxidize
High-performance XML to JSON streaming parser built with Rust and PyO3. Specialized for extracting repeated elements from large XML files like API responses, log files, and data exports.
Key Features
- High Performance: 2-3x faster than lxml; built on Rust's quick-xml parser, with batches processed in parallel using Rayon
- Low Memory Usage: Streaming architecture processes files larger than available RAM
- Specialized Design: Opinionated API and schema design for common data engineering and data analysis workflows
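To make the streaming idea concrete, here is a minimal pure-Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. It is not how Oxidize is implemented (that is Rust with quick-xml and Rayon); the helper name `stream_elements` is hypothetical and only illustrates handling one repeated element at a time instead of loading the whole document.

```python
import io
import xml.etree.ElementTree as ET


def stream_elements(source, target):
    """Yield each completed <target> element, then release its contents.

    `source` may be a file path or a binary file-like object. The yielded
    element is only valid until the next iteration, because it is cleared
    as soon as the consumer asks for the next one.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == target:
            yield elem
            elem.clear()  # free text, attributes, and children already handled


xml_bytes = (
    b"<catalog>"
    b"<book><title>A</title></book>"
    b"<book><title>B</title></book>"
    b"</catalog>"
)
titles = [book.findtext("title") for book in stream_elements(io.BytesIO(xml_bytes), "book")]
print(titles)
```

Because each element is cleared after use, memory grows with the size of a single target element rather than with the size of the file, which is the property the feature list describes.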
Use Cases
Perfect for extracting structured data from XML files containing repeated elements into newline-delimited JSON:
- API responses: Extract <record>, <item>, or <entry> elements from REST API responses
- Log files: Parse <event> or <log> entries from XML-formatted logs
- Data exports: Process <row>, <product>, or <transaction> elements from database exports
- Configuration files: Extract <server>, <user>, or similar repeated configuration blocks
Installation
pip install oxidize
Usage
Extract specific elements from XML and convert to JSON-Lines:
import oxidize
# File to file
count = oxidize.parse_xml_file_to_json_file("data.xml", "book", "books.json")
# File to string
json_lines = oxidize.parse_xml_file_to_json_string("data.xml", "book")
# String to string (xml_content is a str containing the XML document)
result = oxidize.parse_xml_string_to_json_string(xml_content, "book")
# Control batch size (default 1000)
oxidize.parse_xml_file_to_json_file("huge.xml", "record", "out.json", batch_size=500)
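When the output file is itself large, it can be read back one record at a time rather than all at once. The sketch below uses only the standard library; `iter_records` is a hypothetical helper, not part of the Oxidize API, and it assumes the output is UTF-8 with one JSON object per line.

```python
import json


def iter_records(path):
    """Yield one parsed record per non-empty line of a JSON-Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)


# Example usage, assuming "books.json" was written by a call like the ones above:
# for record in iter_records("books.json"):
#     print(record["title"])
```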
Conversion Rules
Uniform arrays: All child elements become arrays, even when they occur only once, so every record has the same shape for schema inference:
<book id="bk101">
<author>J.K. Rowling</author>
<title>Harry Potter</title>
</book>
{
"@id": "bk101",
"author": ["J.K. Rowling"],
"title": ["Harry Potter"]
}
Key behaviors:
- Attributes: Prefixed with @ to avoid conflicts with element names
- Mixed content: Text in elements with children stored as #text entries
- Empty elements: Self-closing/empty tags become null values
- Structure preservation: Element order maintained via IndexMap
- Namespace handling: Prefixes kept in element names, declarations treated as attributes
Ignored features:
- Processing instructions, DTDs, comments (not relevant for data extraction)
- Custom entity definitions (entity references passed through as text)
- Character references automatically unescaped by quick_xml
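The conversion rules can be approximated in a few lines of standard-library Python. This is an illustrative sketch, not Oxidize's implementation: `element_to_value` is a hypothetical name, and several details are assumptions (that `#text` holds a single string, that a text-only element with attributes becomes an object with `#text`, and that text following a child element is dropped here for brevity).

```python
import json
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Convert one element into a JSON-ready value, mimicking the rules above."""
    text = (elem.text or "").strip()
    children = list(elem)
    # Empty or self-closing element without attributes becomes null
    if not children and not elem.attrib and not text:
        return None
    # Text-only element without attributes becomes a plain string
    if not children and not elem.attrib:
        return text
    node = {}
    for name, value in elem.attrib.items():
        node["@" + name] = value  # attributes get an @ prefix
    if text:
        node["#text"] = text  # assumption: stored as a single string
    for child in children:
        # every child tag maps to an array, even if it occurs only once
        node.setdefault(child.tag, []).append(element_to_value(child))
    return node


book = ET.fromstring(
    '<book id="bk101"><author>J.K. Rowling</author><title>Harry Potter</title></book>'
)
converted = element_to_value(book)
print(json.dumps(converted))
# {"@id": "bk101", "author": ["J.K. Rowling"], "title": ["Harry Potter"]}
```

Python dictionaries keep insertion order, which plays the role that IndexMap plays on the Rust side.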
Data Processing
Pandas
import json
import pandas as pd
import oxidize
json_lines = oxidize.parse_xml_file_to_json_string("data.xml", "record")
records = [json.loads(line) for line in json_lines.strip().split('\n')]
df = pd.json_normalize(records)
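Because every element is emitted as an array, DataFrame columns will hold one-item lists for elements that never repeat. If scalar columns are preferred, the records can be post-processed before normalizing. The helper below is a hypothetical sketch, not part of Oxidize, and it deliberately gives up the uniform-array guarantee: an element that repeats in some records but not others will end up with mixed types.

```python
def unwrap_singletons(value):
    """Recursively replace one-item lists with their single item."""
    if isinstance(value, list):
        items = [unwrap_singletons(item) for item in value]
        return items[0] if len(items) == 1 else items
    if isinstance(value, dict):
        return {key: unwrap_singletons(item) for key, item in value.items()}
    return value


record = {"@id": "bk101", "author": ["J.K. Rowling"], "title": ["Harry Potter"]}
flat = unwrap_singletons(record)
print(flat)
# {'@id': 'bk101', 'author': 'J.K. Rowling', 'title': 'Harry Potter'}
```

Applied to the Pandas example, this would mean passing `[unwrap_singletons(r) for r in records]` to `pd.json_normalize`.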
Polars
import polars as pl
import oxidize
oxidize.parse_xml_file_to_json_file("data.xml", "record", "temp.json")
# Unknown schema: load eagerly and scan every row to infer it (infer_schema_length=None).
# The output is newline-delimited JSON, so use read_ndjson rather than read_json.
df = pl.read_ndjson("temp.json", infer_schema_length=None)
# Known schema: scan lazily
lf = pl.scan_ndjson("temp.json")
DuckDB
import duckdb
import oxidize
oxidize.parse_xml_file_to_json_file("data.xml", "record", "data.json")
df = duckdb.sql("SELECT * FROM read_json_auto('data.json')").df()
API
parse_xml_file_to_json_file(input_path, target_element, output_path, batch_size=1000) -> int
parse_xml_file_to_json_string(input_path, target_element, batch_size=1000) -> str
parse_xml_string_to_json_file(xml_content, target_element, output_path, batch_size=1000) -> int
parse_xml_string_to_json_string(xml_content, target_element, batch_size=1000) -> str
File details
Details for the file oxidize-0.1.1.tar.gz.
File metadata
- Download URL: oxidize-0.1.1.tar.gz
- Upload date:
- Size: 32.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e63a5fe6f26d8dd7d819c3832c79447b4dcf46f518b659194285123f72bcb6b4 |
| MD5 | 9509f3f99fef455a0d70243f03ef9635 |
| BLAKE2b-256 | 74f0989644f14481986bf40d65a943eb0c9509ae28eff16d173110dd7bb25f90 |
File details
Details for the file oxidize-0.1.1-cp312-cp312-macosx_11_0_arm64.whl.
File metadata
- Download URL: oxidize-0.1.1-cp312-cp312-macosx_11_0_arm64.whl
- Upload date:
- Size: 33.2 MB
- Tags: CPython 3.12, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 239d1b378c557e7e6f2609006e5b499c89786148b608982cfa046e88b530fead |
| MD5 | 72444921e99e47d5a28d0ba44b9e9eee |
| BLAKE2b-256 | b6f59faca01d1eb3e9c79ba0913fed8db3d9b1d633f46dd341f793e54cbc9fac |