Polars io-plugin for reading and writing avro files
polars-avro
A polars io plugin for reading and writing Apache Avro files, built on arrow-avro. It provides scan support with predicate pushdown, map type reading, and continued avro support as polars deprecates its built-in implementation.
Python Usage
from polars_avro import scan_avro, read_avro, write_avro
lazy = scan_avro(path)
frame = read_avro(path)
write_avro([frame], path)
Rust Usage
There are two main exports: [Reader] for iterating DataFrames from avro
sources, and [Writer] for writing DataFrames to an avro file.
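use std::fs::File;

use polars_avro::{Reader, Writer, ReadOptions};

// read: each item yielded by the reader is a Result wrapping a DataFrame
let reader = Reader::try_new(
    [File::open("data.avro")],
    ReadOptions::basic(),
).unwrap();
for batch in reader {
    let frame = batch.unwrap();
}

// write: `file` is a writable destination and `frame` is the DataFrame to serialize
let mut writer = Writer::try_new(file, frame.schema(), None).unwrap();
writer.write(&frame).unwrap();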
ℹ️ Avro supports writing with file compression schemes. In rust these need to be enabled via feature flags: deflate, snappy, bzip2, xz, zstd. Decompression is handled automatically.
Idiosyncrasies
Avro and Arrow don't align fully, and polars only supports a subset of arrow. Some types require casting before writing, and some avro types map to different polars types than you might expect when reading.
Writing
The following polars types error when writing and must be cast first:
| Polars Type | Cast To |
|---|---|
| Int8 | Int32 |
| Int16 | Int32 |
| UInt8 | Int32 |
| UInt16 | Int32 |
| UInt32 | Int64 |
| UInt64 | Int64 (lossy for values ≥ 2⁶³) |
| Time | Int64 |
| Categorical | Int32 or String |
| Enum | Int32 or String |
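As a rough illustration of the table above, the sketch below plans the required casts for a schema given as a mapping from column name to polars dtype name. The names `WRITE_CASTS`, `plan_write_casts`, and `fits_int64` are hypothetical and not part of polars-avro, and the sketch arbitrarily picks Int32 for Categorical and Enum even though String is equally valid.

```python
# Hypothetical helper, not part of polars-avro: maps polars dtype names that
# error on write to the cast targets listed in the table above.
WRITE_CASTS = {
    "Int8": "Int32",
    "Int16": "Int32",
    "UInt8": "Int32",
    "UInt16": "Int32",
    "UInt32": "Int64",
    "UInt64": "Int64",
    "Time": "Int64",
    "Categorical": "Int32",  # assumption: String is the other listed option
    "Enum": "Int32",  # assumption: String is the other listed option
}

# Largest value representable as a signed 64-bit integer.
INT64_MAX = 2**63 - 1


def plan_write_casts(schema: dict[str, str]) -> dict[str, str]:
    """Return {column: target dtype name} for columns that need a cast."""
    return {
        name: WRITE_CASTS[dtype]
        for name, dtype in schema.items()
        if dtype in WRITE_CASTS
    }


def fits_int64(values: list[int]) -> bool:
    """Check whether unsigned values fit in Int64 without loss."""
    return all(0 <= v <= INT64_MAX for v in values)


schema = {"id": "UInt32", "flag": "Int8", "name": "String"}
print(plan_write_casts(schema))
print(fits_int64([1, 2**63]))
```

Columns that do not appear in the plan (such as the String column here) can be written as they are. A check like `fits_int64` is one way to decide up front whether a UInt64 column can be cast to Int64 safely.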
Compression is supported via feature flags: deflate, snappy, bzip2, xz,
zstd.
Reading
utf8_view behavior — the utf8_view option (default false) changes how
certain types are read:
| Type | utf8_view=false (default) | utf8_view=true |
|---|---|---|
| UUID | binary (16 bytes) | formatted string |
| nullable strings | preserves nulls | replaces null with "" (lossy) |
Since polars tends to work with string views internally, utf8_view=true is
likely faster if you don't mind losing null string distinctions.
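With the default utf8_view=false, a UUID value arrives as 16 raw bytes. If the formatted string is wanted without switching the option, the conversion can be done afterwards. The sketch below uses only the Python standard library on a single value; `format_uuid` is a hypothetical helper name, and it assumes the bytes are in the standard big-endian UUID layout.

```python
import uuid


def format_uuid(raw: bytes) -> str:
    """Format 16 raw bytes as a canonical hyphenated UUID string."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return str(uuid.UUID(bytes=raw))


raw = bytes(range(16))
print(format_uuid(raw))  # 00010203-0405-0607-0809-0a0b0c0d0e0f
```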
Type mappings of note:
| Avro Type | Polars Type |
|---|---|
| Enum | Categorical (not Enum) |
| Map | List of Struct {key, value} |
| BigDecimal | Binary |
| Duration | unsupported (errors) |
| Date | Date (days since epoch) |
| TimeMillis, TimeMicros | Time (nanoseconds) |
| TimestampMillis/Micros/Nanos | Datetime with matching precision and UTC tz |
| LocalTimestampMillis/Micros/Nanos | Datetime with matching precision and no tz |
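Because an avro Map is read as a List of Struct {key, value}, a cell converted to Python objects would hold a list of dicts rather than a dict. A minimal sketch of rebuilding the mapping, assuming the struct fields are named `key` and `value` as in the table above; `entries_to_dict` is a hypothetical helper, not part of polars-avro:

```python
from typing import Any


def entries_to_dict(entries: list[dict[str, Any]]) -> dict[str, Any]:
    """Rebuild a mapping from a list of {"key": ..., "value": ...} structs."""
    return {entry["key"]: entry["value"] for entry in entries}


cell = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
print(entries_to_dict(cell))  # {'a': 1, 'b': 2}
```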
Constraints: the root avro schema must be a Record, and all files in a multi-file read must share the same schema.
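The Date and Time rows in the mapping table amount to unit conversions. As a worked illustration in plain Python (the helper names are hypothetical, and the scaling factors are the standard unit ratios rather than code taken from polars-avro):

```python
from datetime import date, timedelta

NANOS_PER_MILLI = 1_000_000
NANOS_PER_MICRO = 1_000


def avro_date_to_date(days: int) -> date:
    """Interpret an avro Date (days since the Unix epoch) as a calendar date."""
    return date(1970, 1, 1) + timedelta(days=days)


def time_millis_to_nanos(millis: int) -> int:
    """Scale an avro TimeMillis value to nanoseconds since midnight."""
    return millis * NANOS_PER_MILLI


def time_micros_to_nanos(micros: int) -> int:
    """Scale an avro TimeMicros value to nanoseconds since midnight."""
    return micros * NANOS_PER_MICRO


print(avro_date_to_date(18_993))         # 2022-01-01
print(time_millis_to_nanos(43_200_000))  # 43200000000000 (noon)
```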
Benchmarks
Python reports median (file reads, in-memory writes). Rust reports mean.
native = polars built-in avro. Ratio relative to native; bold = fastest.
Complex rows use nested/struct types.
| Benchmark | native | polars-avro | jetliner |
|---|---|---|---|
| python read 1K × 2 | 64 µs (1.00x) | 99 µs (1.54x) | 180 µs (2.79x) |
| python read 64K × 2 | 2.7 ms (1.00x) | 2.1 ms (0.78x) | 2.8 ms (1.04x) |
| python read 1K × 8 | 183 µs (1.00x) | 242 µs (1.32x) | 337 µs (1.84x) |
| python read 1M × 8 | 159 ms (1.00x) | 114 ms (0.72x) | 145 ms (0.91x) |
| python read 1M × 128 | 2.6 s (1.00x) | 1.8 s (0.69x) | 2.8 s (1.09x) |
| python read complex 1K × 8 | — | 449 µs | 592 µs |
| python read complex 1M × 8 | — | 181 ms | 260 ms |
| python read proj 1M × 128 → 8 | 1.6 s (1.00x) | 1.2 s (0.75x) | 1.2 s (0.77x) |
| python read proj 1K × 8 → 2 | 133 µs (1.00x) | 297 µs (2.24x) | 264 µs (1.99x) |
| python write 1K × 2 | 42 µs (1.00x) | 30 µs (0.72x) | — |
| python write 64K × 2 | 1.5 ms (1.00x) | 1.1 ms (0.71x) | — |
| python write 1K × 8 | 143 µs (1.00x) | 114 µs (0.80x) | — |
| python write 1M × 8 | 87 ms (1.00x) | 93 ms (1.07x) | — |
| python write 1M × 128 | 1.5 s (1.00x) | 2.2 s (1.48x) | — |
| rust read 1K × 2 | 42 µs (1.00x) | 34 µs (0.80x) | — |
| rust read 1M × 128 | 2.8 s (1.00x) | 2.0 s (0.69x) | — |
| rust read proj 1M × 128 → 8 | 1.3 s (1.00x) | 1.2 s (0.87x) | — |
| rust read proj 1K × 8 → 2 | 109 µs (1.00x) | 116 µs (1.06x) | — |
| rust write 1K × 2 | 42 µs (1.00x) | 22 µs (0.53x) | — |
| rust write 64K × 2 | 1.5 ms (1.00x) | 1.0 ms (0.67x) | — |
| rust write 1K × 8 | 135 µs (1.00x) | 93 µs (0.69x) | — |
| rust write 1M × 8 | 97 ms (1.00x) | 89 ms (0.92x) | — |
| rust write 1M × 128 | 1.6 s (1.00x) | 1.4 s (0.88x) | — |
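The ratios in parentheses are each timing divided by the native timing for the same row, so values below 1.00x are faster than native. A small sketch of that arithmetic using the python read 1M × 8 row; `ratio` is a hypothetical helper, and because the table shows rounded timings, recomputed ratios can differ from the published ones in the last digit.

```python
def ratio(candidate: float, native: float) -> float:
    """Timing relative to native, rounded to two decimals as in the table."""
    return round(candidate / native, 2)


# python read 1M × 8: native 159 ms, polars-avro 114 ms, jetliner 145 ms
print(ratio(114, 159))  # 0.72
print(ratio(145, 159))  # 0.91
```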
Development
Rust
Standard cargo commands will build and test the rust library.
Python
The python library is built with uv and maturin. The rust components should build once and otherwise allow usage and testing.
You may need to recompile the python bindings with uv run maturin develop.
Testing
cargo fmt --check
cargo clippy --all-features --tests
cargo test
uv run ruff format --check
uv run ruff check
uv run pyright
uv run pytest
Benchmarking
Python benchmarks are disabled by default. To run them:
cargo +nightly bench
uv run pytest --benchmark-only
Releasing
rm -rf dist
uv build --sdist
uv run maturin build -r -o dist --target aarch64-apple-darwin
uv run maturin build -r -o dist --target aarch64-unknown-linux-gnu --zig
uv publish --username __token__
To Do
- reimplement single column reader?
- reimplement better workarounds for types that don't exist, e.g. serialize polars cat/enum to arrow enum and vice versa