Polars io-plugin for reading and writing avro files
polars-avro
A polars io plugin for reading and writing avro files.
Polars is deprecating support for reading and writing avro files, and this plugin fills in support. Currently it's about 7x slower at reading avro files and up to 22x slower at writing them (see Benchmarks).
The reason it's slower is because this uses the apache rust library, which is fully compliant, but does a lot of unnecessary memory allocation and object creation that the polars implementation avoids. However, this is likely not a bottleneck, so the benefits of the standard implementation seem to outweigh the added computation.
In exchange for speed you get:
- future proof - this won't get deprecated
- robust support - the current polars avro implementation has bugs with non-contiguous data frames
- better coverage - this supports reading map types as lists
- scan support - this can scan and push down predicates by chunk
Python Usage
from polars_avro import scan_avro, read_avro, write_avro
lazy = scan_avro(path)
frame = read_avro(path)
write_avro(frame, path)
Rust Usage
There are two main objects exported in rust: AvroScanner for creating an
iterator of DataFrames from polars ScanSources, and sink_avro for writing
an iterator of DataFrames to a Writeable.
use polars_avro::{AvroScanner, sink_avro, WriteOptions};
let scanner = AvroScanner::new_from_sources(
&ScanSources::Paths(...),
false, // expand globs
None, // cloud options
None, // name for single column avros
).unwrap();
sink_avro(
scanner.into_iter(
1024, // batch size
None, // columns to select
).map(Result::unwrap),
..., // impl Write
WriteOptions::default(),
).unwrap();
ℹ️ Avro supports writing with a few compression schemes. In rust these features need to be enabled manually, e.g.
apache-avro/bzip to enable bzip2 compression. Decompression is handled automatically.
Idiosyncrasies
Avro and Arrow don't align fully, and polars only supports a subset of arrow. This library tries to allow you to serialize to avro and deserialize from avro. Trying to do both means that many types will change at each pass due to the way serde works.
- Avro only supports time with at most microsecond resolution, while polars only supports time with nanosecond resolution, so writing time values truncates them. You must explicitly allow this behavior.
- Avro fixed types don't support storing null values in the individual bytes, so while a fixed type can be read into a u8 array, it must be serialized back as a list of i32s. This may be addressed with polars support for arrow fixedlengthbinary, but that seems unlikely.
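To make the time-resolution point concrete, here is a small arithmetic sketch of what truncating a nanosecond time to microseconds loses. It is plain Python, not the library's internal code; the helper name `truncate_to_micros` and the sample time are made up for illustration.

```python
NANOS_PER_MICRO = 1_000


def truncate_to_micros(nanos_since_midnight: int) -> int:
    """Drop the sub-microsecond digits, returning whole microseconds."""
    return nanos_since_midnight // NANOS_PER_MICRO


# 12:34:56.123456789 expressed as nanoseconds since midnight.
original_ns = ((12 * 60 + 34) * 60 + 56) * 1_000_000_000 + 123_456_789

# What a microsecond-resolution avro time can hold.
stored_us = truncate_to_micros(original_ns)

# What comes back after converting to nanoseconds again.
restored_ns = stored_us * NANOS_PER_MICRO

# The last three digits (789 ns) cannot be recovered.
lost_ns = original_ns - restored_ns
print(original_ns, stored_us, restored_ns, lost_ns)
```

At most 999 ns is lost per value, which is why the conversion is lossy and has to be opted into.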
Benchmarks
| Library | Read Python | Write Python | Read Rust | Write Rust |
|---|---|---|---|---|
| polars | 6.0319 ms (1.00) | 3.0663 ms (1.00) | 41,653.91 ns (1.00) | 39,970.80 ns (1.00) |
| polars-avro | 39.9563 ms (6.62) | 67.9542 ms (22.16) | 340,622.90 ns (8.18) | 513,200.00 ns (12.84) |
| polars-fastavro | 179.0461 ms (29.68) | 246.3771 ms (80.35) | - | - |
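The numbers in parentheses appear to be each timing divided by the polars timing in the same column. The sketch below recomputes them for the Python columns from the values in the table; the dictionary layout is just a convenience for the example.

```python
# (read ms, write ms) for the Python benchmarks, copied from the table.
timings_ms = {
    "polars": (6.0319, 3.0663),
    "polars-avro": (39.9563, 67.9542),
    "polars-fastavro": (179.0461, 246.3771),
}

# The built-in polars implementation is the baseline for both columns.
base_read, base_write = timings_ms["polars"]

# Slowdown relative to the baseline, rounded to two decimals like the table.
ratios = {
    name: (round(read / base_read, 2), round(write / base_write, 2))
    for name, (read, write) in timings_ms.items()
}

for name, (read_ratio, write_ratio) in ratios.items():
    print(f"{name}: read {read_ratio}x, write {write_ratio}x")
```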
Development
Rust
Standard cargo commands will build and test the rust library.
Python
The python library is built with uv and maturin. To compile the rust extension for use by python during local development, run
uv run maturin develop -m Cargo.toml
to build a local copy of the rust interface. Add -r to build in release mode if you want to trust the
benchmark results.
Testing
cargo fmt --check
cargo clippy --all-features
cargo test
uv run ruff format --check
uv run ruff check
uv run pyright
uv run pytest
Benchmarking
cargo +nightly bench
uv run pytest
ℹ️ For python benchmarks, make sure you've compiled in release mode:
uv run maturin develop -m Cargo.toml -r
Releasing
rm -rf dist
uv build --sdist
uv run maturin build -r -o dist --target aarch64-apple-darwin
uv run maturin build -r -o dist --target aarch64-unknown-linux-gnu --zig
uv publish --username __token__