Fast streaming parser for Crystal Reports XML, with Rust acceleration

crxml

Fast streaming parser for Crystal Reports XML exports.

from crxml import CrystalXMLSource, to_dataframe

df = to_dataframe(CrystalXMLSource("report.xml", row_tag="Details"))
print(df.head())

Installation

Prerequisites: Python ≥3.10 and Rust.

pip install crxml

About

crxml streams through Crystal Reports XML files row by row, never loading the full document into memory. It extracts field data from nested CR field elements and yields flat dictionaries. A built-in pipeline lets you rename, cast, filter, and drop fields with | operators. The Rust backend processes 100 MB in ~0.42 seconds using ~75 MB RSS for streaming.
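
The row-by-row approach described above can be sketched with the standard library alone. This is not crxml's Rust implementation; it is a minimal pure-Python illustration, and the sample layout (Field elements carrying a Name attribute and a Value child) is an assumption about the export format, as is the helper name iter_rows.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator

# Hypothetical sample: element and attribute names are assumptions about the
# export layout, not taken from crxml or from a real report.
SAMPLE = b"""<Report>
  <Details>
    <Section>
      <Field Name="f1"><Value>INV-001</Value></Field>
      <Field Name="f2"><Value>120.50</Value></Field>
    </Section>
  </Details>
  <Details>
    <Section>
      <Field Name="f1"><Value>INV-002</Value></Field>
      <Field Name="f2"><Value>80.00</Value></Field>
    </Section>
  </Details>
</Report>"""


def iter_rows(stream: BinaryIO, row_tag: str = "Details") -> Iterator[dict[str, str]]:
    """Yield one flat dict per row_tag element, parsing incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != row_tag:
            continue
        # Flatten every nested Field under this row into name -> text.
        row = {
            field.get("Name", ""): field.findtext("Value", default="")
            for field in elem.iter("Field")
        }
        yield row
        # Release this row's subtree; the emptied element itself stays
        # attached to the root, so memory still grows slightly per row.
        elem.clear()


rows = list(iter_rows(io.BytesIO(SAMPLE)))
print(rows)
```

Because iter_rows is a generator, a consumer that processes rows one at a time never holds more than one row's subtree in memory.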

This library is conceptually based on carlosplanchon/xmlstreamer.

API

CrystalXMLSource

CrystalXMLSource(source, row_tag="Details")

Parses a CR XML file and yields dict[str, str] rows. Accepts a file path (string or Path), or a file-like object with a .name attribute. The row_tag parameter controls which XML element is treated as a record (default: Details).
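
To make the accepted input types concrete, here is a small sketch of how a source argument could be normalized to a path. The helper resolve_path is hypothetical and is not part of crxml; it only illustrates the rule stated above, under the assumption that the file is ultimately opened by path.

```python
import io
import os
import tempfile
from pathlib import Path
from typing import Any


def resolve_path(source: Any) -> Path:
    """Return a Path for a str, a Path, or a file-like object with a .name."""
    if isinstance(source, (str, os.PathLike)):
        return Path(source)
    name = getattr(source, "name", None)
    # Files opened from a raw descriptor expose an int .name, so require str.
    if isinstance(name, str):
        return Path(name)
    raise TypeError("expected a path or a file-like object with a .name attribute")


with tempfile.NamedTemporaryFile(suffix=".xml") as handle:
    from_handle = resolve_path(handle)
    from_string = resolve_path(handle.name)

try:
    resolve_path(io.BytesIO(b"<Report/>"))  # in-memory buffer, no .name
    rejected = False
except TypeError:
    rejected = True
```

Under this rule an in-memory buffer such as io.BytesIO would not qualify, since it has no .name to point back at a file on disk.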

Pipeline stages

Stages are chained with |:

from crxml.stages import RenameFields, CastTypes, DropFields, FilterRows

pipeline = (
    CrystalXMLSource("report.xml")
    | RenameFields({"f1": "invoice", "f2": "amount"})
    | CastTypes({"amount": float})
    | DropFields("tax_rate")
    | FilterRows(lambda r: r["amount"] > 100)
)

  • RenameFields(mapping): renames dict keys
  • CastTypes(types, errors="raise"): casts fields to target types
  • DropFields(*fields): removes fields from rows
  • FilterRows(predicate): keeps rows matching predicate
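
For readers curious how | chaining of this kind can work, the following is a toy sketch built on Python's __or__ hook. It reuses the stage names from the list above for readability, but the classes here (including the Pipeline and Source bases and the process method) are assumptions written for illustration, not crxml's actual code.

```python
from typing import Callable, Iterable, Iterator, Optional

Row = dict


class Pipeline:
    """Anything iterable over rows that can be extended with `| stage`."""

    def __or__(self, stage: "Stage") -> "Stage":
        stage.upstream = self  # bind the new stage to everything before it
        return stage


class Source(Pipeline):
    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[Row]:
        return iter(self.rows)


class Stage(Pipeline):
    upstream: Iterable[Row] = ()

    def __iter__(self) -> Iterator[Row]:
        # Rows are pulled lazily, one at a time, through the whole chain.
        for row in self.upstream:
            out = self.process(row)
            if out is not None:
                yield out

    def process(self, row: Row) -> Optional[Row]:
        raise NotImplementedError


class RenameFields(Stage):
    def __init__(self, mapping: dict) -> None:
        self.mapping = mapping

    def process(self, row: Row) -> Row:
        return {self.mapping.get(k, k): v for k, v in row.items()}


class CastTypes(Stage):
    def __init__(self, types: dict) -> None:
        self.types = types

    def process(self, row: Row) -> Row:
        return {k: self.types[k](v) if k in self.types else v for k, v in row.items()}


class DropFields(Stage):
    def __init__(self, *fields: str) -> None:
        self.fields = set(fields)

    def process(self, row: Row) -> Row:
        return {k: v for k, v in row.items() if k not in self.fields}


class FilterRows(Stage):
    def __init__(self, predicate: Callable[[Row], bool]) -> None:
        self.predicate = predicate

    def process(self, row: Row) -> Optional[Row]:
        return row if self.predicate(row) else None  # None drops the row


raw = [
    {"f1": "INV-001", "f2": "120.50", "tax_rate": "0.2"},
    {"f1": "INV-002", "f2": "80.00", "tax_rate": "0.2"},
]
toy_pipeline = (
    Source(raw)
    | RenameFields({"f1": "invoice", "f2": "amount"})
    | CastTypes({"amount": float})
    | DropFields("tax_rate")
    | FilterRows(lambda r: r["amount"] > 100)
)
result = list(toy_pipeline)
print(result)
```

Order matters in such a chain: the filter compares amount numerically, so it has to come after the cast.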

Sinks

from crxml import to_dataframe, to_csv, collect

df = to_dataframe(pipeline)                  # → pd.DataFrame
to_csv(pipeline, "out.csv")                  # → CSV file
rows = collect(pipeline)                     # → list[dict]
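
A sink is simply a consumer of the row iterator. As a rough illustration (not crxml's to_csv), a streaming CSV sink can be written with the standard csv module; the function name rows_to_csv and the choice to take the header from the first row are assumptions of this sketch.

```python
import csv
import io
from typing import Iterable, TextIO


def rows_to_csv(rows: Iterable[dict], out: TextIO) -> int:
    """Write rows as CSV, taking the header from the first row's keys."""
    writer = None
    count = 0
    for row in rows:
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)  # each row is written as soon as it arrives
        count += 1
    return count


buffer = io.StringIO()
written = rows_to_csv(
    iter([
        {"invoice": "INV-001", "amount": 120.5},
        {"invoice": "INV-002", "amount": 80.0},
    ]),
    buffer,
)
print(buffer.getvalue())
```

Writing row by row keeps memory flat, in contrast to a list-collecting sink, which has to hold every row at once.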

Parallel mode

df = to_dataframe(pipeline.parallel(workers=4))

Distributes batches across worker processes. See the docs for requirements.
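
The general idea of batch distribution can be sketched with concurrent.futures. This is not how crxml's parallel() is implemented; the helpers batched, cast_amounts, and run_batches are hypothetical, and the demo uses a thread pool as a stand-in so it runs anywhere. Passing a ProcessPoolExecutor instead would use worker processes, provided the worker function and the rows are picklable.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(rows: Iterable[dict], size: int) -> Iterator[list]:
    """Split an iterable of rows into lists of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def cast_amounts(batch: list) -> list:
    """Example per-batch work: cast the amount field to float."""
    return [{**row, "amount": float(row["amount"])} for row in batch]


def run_batches(
    rows: Iterable[dict],
    worker: Callable[[list], list],
    executor: Executor,
    batch_size: int = 2,
) -> Iterator[dict]:
    """Send batches to the executor and yield rows back in input order."""
    # Executor.map submits every batch up front, so this sketch does not
    # bound memory the way a streaming implementation would.
    for processed in executor.map(worker, batched(rows, batch_size)):
        yield from processed


raw_rows = [{"amount": str(n)} for n in range(5)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_rows = list(run_batches(raw_rows, cast_amounts, pool))
print(parallel_rows)
```

Batching amortizes the cost of handing work to a worker, which matters most with processes, where every hand-off involves serialization.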

Benchmarks

Test       Size    Rows    Time    Rows/s  MB/s  RSS
Stream     10 MB   9,010   0.043s  211 K   234   22 MB
Stream     50 MB   45,328  0.223s  203 K   224   45 MB
Stream     100 MB  90,384  0.418s  216 K   239   75 MB
To list    10 MB   9,010   0.052s  174 K   192   32 MB
To list    50 MB   45,328  0.249s  182 K   201   98 MB
To list    100 MB  90,384  0.478s  189 K   209   181 MB
Pipeline   10 MB   9,010   0.060s  150 K   166   32 MB
Pipeline   50 MB   45,328  0.295s  154 K   169   96 MB
Pipeline   100 MB  90,384  0.579s  156 K   173   176 MB
DataFrame  10 MB   9,010   0.320s  28 K    31    86 MB
DataFrame  50 MB   45,328  0.538s  84 K    93    152 MB
DataFrame  100 MB  90,384  0.829s  109 K   121   234 MB

pandas is imported lazily, so memory climbs only when to_dataframe is called.
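
The throughput columns follow from size, rows, and time. As a quick check using the 100 MB streaming figures, and assuming K means 1,000 and that the size is the nominal 100 MB:

```python
def throughput(size_mb: float, rows: int, seconds: float) -> tuple:
    """Return (rows per second, megabytes per second)."""
    return rows / seconds, size_mb / seconds


# 100 MB streaming run: 90,384 rows in 0.418 s.
rows_per_s, mb_per_s = throughput(100, 90_384, 0.418)
print(f"{rows_per_s / 1000:.0f} K rows/s, {mb_per_s:.0f} MB/s")  # 216 K rows/s, 239 MB/s
```

Runs with shorter times may not reproduce exactly from the published figures, since the times are rounded to milliseconds.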

Publishing

./upload.sh

Builds a manylinux2014 wheel + sdist and uploads to PyPI. Requires maturin and twine. The --manylinux 2014 --zig flags ensure PyPI-compatible platform tags; python -m build does not support manylinux flags via PEP 517.

Documentation

Full documentation is available at the project site, covering installation, usage, stages, custom stages, architecture, performance, FastAPI integration, and the Rust core.

License

MIT
