High-performance XML to JSON streaming parser built with Rust

Oxidize

High-performance XML to JSON streaming parser built with Rust and PyO3. Specialized for extracting repeated elements from large XML files like API responses, log files, and data exports.

Key Features

  • High Performance: 2-3x faster than lxml; built on Rust's quick-xml parser, with batches processed in parallel via Rayon
  • Low Memory Usage: Streaming architecture processes files larger than available RAM
  • Specialized Design: Opinionated API and schema design for common data engineering and data analysis workflows

Use Cases

Well suited to extracting structured data from XML files that contain repeated elements and writing it out as newline-delimited JSON (JSON Lines):

  • API responses: Extract <record>, <item>, or <entry> elements from REST API responses
  • Log files: Parse <event> or <log> entries from XML-formatted logs
  • Data exports: Process <row>, <product>, or <transaction> elements from database exports
  • Configuration files: Extract <server>, <user>, or similar repeated configuration blocks

Installation

./build_rust_utils.sh

Usage

Extract specific elements from XML and convert to JSON-Lines:

import oxidize

# File to file
count = oxidize.parse_xml_file_to_json_file("data.xml", "book", "books.json")

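# File to string
json_lines = oxidize.parse_xml_file_to_json_string("data.xml", "book")

# String to string (xml_content is any XML string; a small sample is used here)
xml_content = '<catalog><book id="bk101"><title>Harry Potter</title></book></catalog>'
result = oxidize.parse_xml_string_to_json_string(xml_content, "book")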

# Control batch size (default 1000)
oxidize.parse_xml_file_to_json_file("huge.xml", "record", "out.json", batch_size=500)
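
Because each output line is a standalone JSON object, a file written by parse_xml_file_to_json_file can be read back one record at a time instead of being loaded whole. The following is a minimal sketch under that assumption. The iter_records helper and the sample lines are hypothetical stand-ins (not part of oxidize), so the example runs without the package installed:

```python
import json
import os
import tempfile


def iter_records(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical sample standing in for a file produced by oxidize.
sample = (
    '{"@id": "bk101", "title": ["Harry Potter"]}\n'
    '{"@id": "bk102", "title": ["The Hobbit"]}\n'
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

records = list(iter_records(tmp.name))
os.remove(tmp.name)
print(len(records), records[0]["title"][0])
```

Iterating with a generator keeps only one record in memory at a time, which matches the streaming intent of the file-to-file functions.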

Conversion Rules

Uniform arrays: Every child element becomes an array, even when it occurs only once, so schema inference sees the same shape in every record:

<book id="bk101">
    <author>J.K. Rowling</author>
    <title>Harry Potter</title>
</book>

{
  "@id": "bk101",
  "author": ["J.K. Rowling"],
  "title": ["Harry Potter"]
}

Key behaviors:

  • Attributes: Prefixed with @ to avoid conflicts with element names
  • Mixed content: Text in elements with children stored as #text entries
  • Empty elements: Self-closing/empty tags become null values
  • Structure preservation: Element order maintained via IndexMap
  • Namespace handling: Prefixes kept in element names, declarations treated as attributes
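
The key behaviors above can be approximated in pure Python to make the output shape concrete. This is an illustrative sketch built on the standard library's xml.etree.ElementTree, not oxidize's implementation. The to_value helper is hypothetical, and two details are assumptions rather than documented behavior: an empty child appears as [null], and #text holds a list of text fragments. Namespace prefix handling is not reproduced, because ElementTree expands prefixes into URIs.

```python
import json
import xml.etree.ElementTree as ET


def to_value(elem):
    """Convert an element to a JSON-ready value under the assumed rules."""
    # Text directly inside this element, plus text trailing each child.
    fragments = [elem.text] + [child.tail for child in elem]
    texts = [t.strip() for t in fragments if t and t.strip()]

    if len(elem) == 0 and not elem.attrib:
        # Leaf: its text, or None (JSON null) for an empty/self-closing tag.
        return texts[0] if texts else None

    obj = {}
    for name, value in elem.attrib.items():
        obj["@" + name] = value  # attributes get an "@" prefix
    if texts:
        obj["#text"] = texts  # assumption: fragments kept as a list
    for child in elem:
        # Every child name maps to a list, even when it occurs once.
        obj.setdefault(child.tag, []).append(to_value(child))
    return obj


xml = """
<book id="bk101">
    <author>J.K. Rowling</author>
    <title>Harry Potter</title>
    <note/>
</book>
""".strip()

record = to_value(ET.fromstring(xml))
print(json.dumps(record))
```

Python dicts preserve insertion order, which plays the role here that IndexMap plays in the Rust code: keys come out in document order.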

Ignored features:

  • Processing instructions, DTDs, comments (not relevant for data extraction)
  • Custom entity definitions (entity references passed through as text)
  • Character references are the exception: quick-xml unescapes them automatically

Data Processing

Pandas

import json

import oxidize
import pandas as pd

json_lines = oxidize.parse_xml_file_to_json_string("data.xml", "record")
records = [json.loads(line) for line in json_lines.strip().split('\n')]
df = pd.json_normalize(records)

Polars

import oxidize
import polars as pl

oxidize.parse_xml_file_to_json_file("data.xml", "record", "temp.json")
# The output is newline-delimited JSON, so read it with read_ndjson.
# Use infer_schema_length=None and eager loading for unknown schemas
df = pl.read_ndjson("temp.json", infer_schema_length=None)
# Use lazy loading for known schemas
lf = pl.scan_ndjson("temp.json")

DuckDB

import duckdb
import oxidize

oxidize.parse_xml_file_to_json_file("data.xml", "record", "data.json")
df = duckdb.sql("SELECT * FROM read_json_auto('data.json')").df()

API

parse_xml_file_to_json_file(input_path, target_element, output_path, batch_size=1000) -> int
parse_xml_file_to_json_string(input_path, target_element, batch_size=1000) -> str  
parse_xml_string_to_json_file(xml_content, target_element, output_path, batch_size=1000) -> int
parse_xml_string_to_json_string(xml_content, target_element, batch_size=1000) -> str
