
Fast streaming parser for Crystal Reports XML, with Rust acceleration

crxml

Fast streaming parser for Crystal Reports XML exports.

from crxml import CrystalXMLSource, to_dataframe

df = to_dataframe(CrystalXMLSource("report.xml", row_tag="Details"))
print(df.head())

Installation

Prerequisites: Python ≥3.10 and Rust.

pip install crxml

About

crxml streams through Crystal Reports XML files row by row, never loading the full document into memory. It extracts field data from nested CR field elements and yields flat dictionaries. A built-in pipeline lets you rename, cast, filter, and drop fields with | operators. The Rust backend processes 100 MB in ~0.5 seconds using <100 MB RSS.
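The row-by-row idea can be illustrated with the standard library alone. The following is a minimal sketch, not crxml's implementation (its parser is written in Rust). The sample XML layout (`Field` elements with a `Name` attribute and a `Value` child, no namespaces) and the name `iter_rows` are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: real Crystal Reports exports may use XML namespaces
# and a different field layout.
SAMPLE = b"""<Report>
  <Details>
    <Field Name="f1"><Value>A-1</Value></Field>
    <Field Name="f2"><Value>150.0</Value></Field>
  </Details>
  <Details>
    <Field Name="f1"><Value>A-2</Value></Field>
    <Field Name="f2"><Value>42</Value></Field>
  </Details>
</Report>"""


def iter_rows(source, row_tag="Details"):
    """Yield one flat dict per <row_tag> element, parsing incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != row_tag:
            continue
        # Flatten the nested field elements into {name: text}.
        row = {
            field.get("Name"): field.findtext("Value", default="")
            for field in elem.iter("Field")
        }
        yield row
        # Release the row's children once the consumer asks for the next row.
        # An empty element shell stays attached to its parent.
        elem.clear()


if __name__ == "__main__":
    for row in iter_rows(io.BytesIO(SAMPLE)):
        print(row)
```

Because the generator only holds one record's subtree at a time, memory use depends on the size of a row rather than the size of the file.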

This library is conceptually based on carlosplanchon/xmlstreamer.

API

CrystalXMLSource

CrystalXMLSource(source, row_tag="Details")

Parses a Crystal Reports XML file and yields dict[str, str] rows. The source can be a file path (str or Path) or a file-like object with a .name attribute. The row_tag parameter controls which XML element is treated as a record (default: Details).

Pipeline stages

Stages are chained with |:

from crxml.stages import RenameFields, CastTypes, DropFields, FilterRows

pipeline = (
    CrystalXMLSource("report.xml")
    | RenameFields({"f1": "invoice", "f2": "amount"})
    | CastTypes({"amount": float})
    | DropFields("tax_rate")
    | FilterRows(lambda r: r["amount"] > 100)
)
  • RenameFields(mapping): renames dict keys
  • CastTypes(types, errors="raise"): casts fields to target types
  • DropFields(*fields): removes fields from rows
  • FilterRows(predicate): keeps rows matching predicate
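For readers curious how | chaining of this kind can work in Python, here is a minimal sketch built on `__or__` and generators. It is an illustration under assumptions, not crxml's actual code: the names `Pipeline`, `rename`, `cast`, and `keep` are hypothetical stand-ins for the library's source and stage classes.

```python
class Pipeline:
    """Hypothetical lazy pipeline: a row source plus an ordered tuple of stages."""

    def __init__(self, rows, stages=()):
        self._rows = rows
        self._stages = tuple(stages)

    def __or__(self, stage):
        # `pipeline | stage` returns a new pipeline with the stage appended.
        return Pipeline(self._rows, self._stages + (stage,))

    def __iter__(self):
        # Wrap the source in each stage in order; nothing runs until iterated.
        stream = iter(self._rows)
        for stage in self._stages:
            stream = stage(stream)
        return stream


def rename(mapping):
    def stage(rows):
        for row in rows:
            yield {mapping.get(key, key): value for key, value in row.items()}
    return stage


def cast(types):
    def stage(rows):
        for row in rows:
            yield {k: types[k](v) if k in types else v for k, v in row.items()}
    return stage


def keep(predicate):
    def stage(rows):
        return (row for row in rows if predicate(row))
    return stage


raw_rows = [{"f1": "A-1", "f2": "150.0"}, {"f1": "A-2", "f2": "42"}]
pipeline = (
    Pipeline(raw_rows)
    | rename({"f1": "invoice", "f2": "amount"})
    | cast({"amount": float})
    | keep(lambda r: r["amount"] > 100)
)
```

Stage order matters in this sketch just as in the example above: the filter compares numbers only because the cast stage runs before it.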

Sinks

from crxml import to_dataframe, to_csv, collect

df = to_dataframe(pipeline)                  # → pd.DataFrame
to_csv(pipeline, "out.csv")                  # → CSV file
rows = collect(pipeline)                     # → list[dict]
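A sink is, conceptually, anything that consumes an iterable of row dicts. As a rough illustration of a CSV sink that writes one row at a time, here is a sketch using only the standard library. The function name `rows_to_csv` is hypothetical, and the sketch assumes every row has the same keys as the first one; it does not describe how to_csv is implemented.

```python
import csv
import io


def rows_to_csv(rows, out):
    """Write an iterable of dicts to a text file object; return the row count."""
    iterator = iter(rows)
    try:
        first = next(iterator)
    except StopIteration:
        return 0  # empty input: write nothing, not even a header

    # Column order is taken from the first row (an assumption of this sketch).
    writer = csv.DictWriter(out, fieldnames=list(first))
    writer.writeheader()
    writer.writerow(first)
    count = 1
    for row in iterator:
        writer.writerow(row)
        count += 1
    return count


buffer = io.StringIO()
written = rows_to_csv(
    [{"invoice": "A-1", "amount": 150.0}, {"invoice": "A-3", "amount": 310.5}],
    buffer,
)
```

Writing row by row keeps memory use flat, whereas collecting into a list or DataFrame necessarily holds every row at once.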

Parallel mode

df = to_dataframe(pipeline.parallel(workers=4))

Distributes batches across worker processes. See the docs for requirements.
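The general pattern of splitting rows into batches and handing them to a pool of workers can be sketched with `concurrent.futures`. This is an assumption-laden illustration, not crxml's parallel mode: `batched`, `cast_amounts`, and `map_batches` are hypothetical names, and the demo uses a thread pool so it runs anywhere. Passing a `ProcessPoolExecutor` instead would use worker processes, provided the batch function is defined at module level so it can be pickled.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def cast_amounts(batch):
    """Example per-batch work: convert the 'amount' field to float."""
    return [{**row, "amount": float(row["amount"])} for row in batch]


def map_batches(rows, func, executor, batch_size=1000):
    """Apply `func` to each batch on the executor; yield rows in input order.

    Note: Executor.map submits every batch up front, so this sketch reads
    the whole input before the first result is yielded.
    """
    for result in executor.map(func, batched(rows, batch_size)):
        yield from result


raw = [{"amount": str(i)} for i in range(5)]
with ThreadPoolExecutor(max_workers=2) as pool:
    converted = list(map_batches(raw, cast_amounts, pool, batch_size=2))
```

Batching amortizes the per-task overhead of sending work to a worker, which matters most with processes because each batch has to be serialized.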

Benchmarks

Test       Size    Rows    Time    Throughput    RSS
Stream     10 MB   9,010   0.058s  155 K rows/s  94 MB
Stream     50 MB   45,328  0.261s  174 K rows/s  94 MB
Stream     100 MB  90,384  0.508s  178 K rows/s  94 MB
To list    100 MB  90,384  0.574s  157 K rows/s  339 MB
Pipeline   100 MB  90,384  0.781s  116 K rows/s  327 MB
DataFrame  100 MB  90,384  0.675s  134 K rows/s  351 MB

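When streaming, RSS stays flat at ~94 MB regardless of file size. The list, pipeline, and DataFrame tests on the 100 MB file reach roughly 330 to 350 MB.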

Documentation

Full documentation is available at the project site, covering installation, usage, stages, custom stages, architecture, performance, FastAPI integration, and the Rust core.

License

MIT

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

crxml-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (236.6 kB view details)

Uploaded for CPython 3.12, manylinux (glibc 2.17+), x86-64

File hashes

Hashes for crxml-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Algorithm    Hash digest
SHA256       812b23ae07ce3ba21c476f012528d1b7f9f18cda511ba85c91e0da2fcecd290d
MD5          62c6de9691bc22c60b90ac2db32ba875
BLAKE2b-256  b2f22adb40385c70a5fde5b90373c2ebe920c8b4a0513834b3489533572669ed
