
Polars io-plugin for reading and writing avro files


polars-avro


A polars io plugin for reading and writing avro files.

Polars is deprecating support for reading and writing avro files, and this plugin fills in support. In the benchmarks below it is currently about 7-8x slower at reading avro files and roughly 13-22x slower at writing them.

The reason it's slower is because this uses the apache rust library, which is fully compliant, but does a lot of unnecessary memory allocation and object creation that the polars implementation avoids. However, this is likely not a bottleneck, so the benefits of the standard implementation seem to outweigh the added computation.

In exchange for speed you get:

  1. future proof - this won't get deprecated
  2. robust support - the current polars avro implementation has bugs with non-contiguous data frames
  3. better coverage - this supports reading map types as lists
  4. scan support - this can scan and push down predicates by chunk
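To make item 4 concrete, here is a minimal pure-Python sketch of what filtering "by chunk" means. It is an illustration only, not how polars-avro is implemented: `read_batches` and `scan_filtered` are hypothetical names, and rows are plain dicts standing in for decoded avro records.

```python
from typing import Callable, Iterable, Iterator

Row = dict[str, int]


def read_batches(rows: list[Row], batch_size: int) -> Iterator[list[Row]]:
    """Yield rows in fixed-size chunks, standing in for reading avro blocks."""
    for start in range(0, len(rows), batch_size):
        yield rows[start : start + batch_size]


def scan_filtered(
    batches: Iterable[list[Row]], predicate: Callable[[Row], bool]
) -> Iterator[list[Row]]:
    """Apply the predicate to each chunk as it is read, so rejected rows
    are dropped before the next chunk is loaded."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:  # skip chunks where nothing matched
            yield kept


rows = [{"id": i, "value": i * 10} for i in range(10)]
filtered = list(scan_filtered(read_batches(rows, 4), lambda r: r["value"] >= 70))
print(filtered)
```

Because the filter runs per chunk, only one chunk of unfiltered rows needs to be held in memory at a time.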

Python Usage

from polars_avro import scan_avro, read_avro, write_avro

lazy = scan_avro(path)
frame = read_avro(path)
write_avro(frame, path)

Rust Usage

There are two main objects exported in rust: AvroScanner for creating an iterator of DataFrames from polars ScanSources, and sink_avro for writing an iterator of DataFrames to a Writeable.

use polars_avro::{AvroScanner, sink_avro, WriteOptions};

let scanner = AvroScanner::new_from_sources(
    &ScanSources::Paths(...),
    false, // expand globs
    None,  // cloud options
    None,  // name for single column avros
).unwrap();

sink_avro(
    scanner.into_iter(
        1024, // batch size
        None, // columns to select
    ).map(Result::unwrap),
    ..., // impl Write
    WriteOptions::default(),
).unwrap();

ℹ️ Avro supports writing with a few compression schemes. In rust these features need to be enabled manually, e.g. apache-avro/bzip to enable bzip2 compression. Decompression is handled automatically.

Idiosyncrasies

Avro and Arrow don't align fully, and polars only supports a subset of arrow. This library tries to allow you to serialize to avro and deserialize from avro. Trying to do both means that many types will change at each pass due to the way serde works.

  1. Avro only supports time with at most microsecond resolution, while polars only supports time with nanosecond resolution, so writing time values truncates them. You must explicitly allow this behavior.
  2. Avro fixed types don't support storing null values in the individual bytes, so while a fixed type can be read into a u8 array, it must be serialized back as a list of i32s. This may be addressed with polars support for arrow fixedlengthbinary, but that seems unlikely.
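The two points above come down to simple arithmetic, sketched here in plain Python. This is an illustration under stated assumptions rather than the library's code: the helper names are hypothetical, and times are modeled as integer counts since midnight.

```python
NANOS_PER_MICRO = 1_000


def truncate_time_to_micros(nanos_since_midnight: int) -> int:
    """Drop sub-microsecond digits, as a microsecond-resolution format must."""
    return nanos_since_midnight // NANOS_PER_MICRO


def micros_to_nanos(micros_since_midnight: int) -> int:
    """Widen microseconds back to nanoseconds; the dropped digits stay lost."""
    return micros_since_midnight * NANOS_PER_MICRO


# 12:00:00.000001234 expressed as nanoseconds since midnight
original = 12 * 3_600 * 1_000_000_000 + 1_234
roundtripped = micros_to_nanos(truncate_time_to_micros(original))
lost = original - roundtripped
print(original, roundtripped, lost)

# A fixed value is raw bytes. Widening each byte to a wider integer leaves
# room for a per-element null, which a single byte cannot represent.
fixed = bytes([0, 127, 255])
as_ints = list(fixed)
print(as_ints)
```

The round trip is lossy in the first case (the trailing 234 ns are gone) and type-changing in the second (bytes in, integers out), which is why types can shift on each pass.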

Benchmarks

| Library | Read Python | Write Python | Read Rust | Write Rust |
| --- | --- | --- | --- | --- |
| polars | 6.0319 ms (1.00) | 3.0663 ms (1.00) | 41,653.91 ns (1.00) | 39,970.80 ns (1.00) |
| polars-avro | 39.9563 ms (6.62) | 67.9542 ms (22.16) | 340,622.90 ns (8.18) | 513,200.00 ns (12.84) |
| polars-fastavro | 179.0461 ms (29.68) | 246.3771 ms (80.35) | - | - |
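The figures in parentheses appear to be each timing divided by the polars baseline in the same column. The short sketch below recomputes them for the Python columns from the timings listed above; the variable names are illustrative.

```python
# (read_ms, write_ms) for the Python benchmarks, copied from the table
timings_ms = {
    "polars": (6.0319, 3.0663),
    "polars-avro": (39.9563, 67.9542),
    "polars-fastavro": (179.0461, 246.3771),
}

base_read, base_write = timings_ms["polars"]

# Slowdown relative to the built-in polars implementation, rounded to 2 places
ratios = {
    name: (round(read / base_read, 2), round(write / base_write, 2))
    for name, (read, write) in timings_ms.items()
}

for name, (read_ratio, write_ratio) in ratios.items():
    print(f"{name}: read {read_ratio:.2f}x, write {write_ratio:.2f}x")
```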

Development

Rust

Standard cargo commands will build and test the rust library.

Python

The python library is built with uv and maturin. To compile rust for use by python during local development, run

uv run maturin develop -m Cargo.toml

to build a local copy of the rust interface. Add -r to compile in release mode if you want to trust the benchmark results.

Testing

cargo fmt --check
cargo clippy --all-features
cargo test
uv run ruff format --check
uv run ruff check
uv run pyright
uv run pytest

Benchmarking

cargo +nightly bench
uv run pytest

ℹ️ For python benchmarks, make sure you've compiled in release mode: uv run maturin develop -m Cargo.toml -r

Releasing

rm -rf dist
uv build --sdist
uv run maturin build -r -o dist --target aarch64-apple-darwin
uv run maturin build -r -o dist --target aarch64-unknown-linux-gnu --zig
uv publish --username __token__

