jetliner

High-performance Avro streaming reader for Polars DataFrames

Jetliner

A high-performance Polars plugin written in Rust with Python bindings for fast, memory-efficient reading of Avro files into DataFrames.

Jetliner is designed for data pipelines where Avro files live on S3 or local disk and need to land in Polars fast. It streams data block-by-block rather than loading entire files into memory, uses zero-copy techniques, and has (almost) complete support for the Avro spec.

Features

Avro Object Container Files — Reads self-contained .avro files with embedded schemas. Does not support single-object encoding (schema registry) or bare Avro encoding
High-performance streaming — Supports block-by-block processing with minimal memory footprint, ideal for large files
Query optimization — Projection pushdown (select columns) and predicate pushdown (filter rows) at the source via Polars LazyFrames
S3 and local file support — Read Avro files from Amazon S3 or local disk with the same API
All standard codecs — null, snappy, deflate, zstd, bzip2, and xz compression out of the box
(Almost) complete Avro schema support — reads almost any valid Avro file (see Known Limitations)
Flexible error handling — Optionally skip bad blocks for resilience to data corruption
Ridiculously fast reads — Check the benchmarks!

This library was created for performance-critical scenarios that involve processing large Avro files from Python. It is fast but limited to read use cases. If you also need to write Avro files from Polars, check out polars-avro.

Benchmarks

Jetliner is built for speed.

TODO: insert benchmarks plot

Installation

Install from PyPI using pip:

pip install jetliner

Or with uv:

uv add jetliner

Quick Start

Basic File Reading

Use scan_avro() to read an Avro file into a Polars LazyFrame:

import jetliner

# Read a local file
df = jetliner.scan_avro("data.avro").collect()

# Read from S3
df = jetliner.scan_avro("s3://my-bucket/data.avro").collect()

# Or use read_avro() for eager loading with column selection
df = jetliner.read_avro("data.avro", columns=["col1", "col2"])

Streaming with open()

Use open() for fine-grained control over batch processing, which is useful for progress tracking or in memory-constrained environments:

import jetliner

# Process batches one at a time (process() is a placeholder for your own handler)
with jetliner.open("large_file.avro") as reader:
    for batch in reader:
        print(f"Processing batch with {batch.height} rows")
        process(batch)

# Configure batch size and buffer settings
with jetliner.open(
    "large_file.avro",
    batch_size=50_000,
    buffer_blocks=2,
) as reader:
    for batch in reader:
        process(batch)
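
Progress tracking over a batch iterator can be sketched as a small wrapper. This is an illustration only: with_progress and its every parameter are hypothetical names, not part of Jetliner, and the only thing assumed about each batch is the height attribute used in the example above.

```python
from typing import Iterable, Iterator, Protocol, TypeVar


class HasHeight(Protocol):
    """Anything that exposes a row count, such as a Polars DataFrame."""

    height: int


B = TypeVar("B", bound=HasHeight)


def with_progress(batches: Iterable[B], every: int = 10) -> Iterator[B]:
    """Yield batches unchanged, printing a running row count.

    Hypothetical helper: a line is printed after every `every` batches,
    plus a summary line once the iterator is exhausted.
    """
    total_rows = 0
    count = 0
    for count, batch in enumerate(batches, start=1):
        total_rows += batch.height
        if count % every == 0:
            print(f"{count} batches, {total_rows} rows so far")
        yield batch
    print(f"done: {count} batches, {total_rows} rows")
```

Inside a with jetliner.open(...) as reader: block, the loop would then read for batch in with_progress(reader): process(batch). Because the wrapper is a generator, it still holds only one batch at a time.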

S3 Access

Jetliner reads from S3 using default AWS credentials (environment variables, IAM roles, or AWS config):

import jetliner

# Uses AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY from environment
df = jetliner.scan_avro("s3://my-bucket/data.avro").collect()

# Or pass credentials explicitly
df = jetliner.scan_avro(
    "s3://my-bucket/data.avro",
    storage_options={
        "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
        "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "region": "us-east-1",
    }
).collect()

# S3-compatible services (MinIO, LocalStack, R2)
df = jetliner.scan_avro(
    "s3://my-bucket/data.avro",
    storage_options={
        "endpoint": "http://localhost:9000",
        "aws_access_key_id": "minioadmin",
        "aws_secret_access_key": "minioadmin",
    }
).collect()
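
To keep secrets out of source code when credentials must be passed explicitly, the storage_options dict can be built from the environment. The sketch below is a hypothetical helper, not part of Jetliner. It reuses the dict keys shown above; reading the region from AWS_REGION and the endpoint from AWS_ENDPOINT_URL is an assumption made for illustration.

```python
import os
from typing import Mapping, Optional


def storage_options_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build a storage_options dict from environment variables.

    Hypothetical helper. Keys mirror the examples above; variables that are
    unset or empty are left out of the result.
    """
    env = os.environ if env is None else env
    # storage_options key -> environment variable to read it from
    mapping = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "region": "AWS_REGION",  # assumption: variable name chosen for this sketch
        "endpoint": "AWS_ENDPOINT_URL",  # assumption: variable name chosen for this sketch
    }
    return {key: env[var] for key, var in mapping.items() if env.get(var)}
```

The result could then be passed as storage_options=storage_options_from_env() to scan_avro(). Accepting an explicit env mapping keeps the helper easy to test without touching the real environment.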

Query Optimization

The scan_avro() API enables Polars query optimizations — projection pushdown, predicate pushdown, and early stopping:

import jetliner
import polars as pl

# Only reads the columns you select (projection pushdown)
# Filters during read, not after (predicate pushdown)
# Stops reading after 1000 rows (early stopping)
result = (
    jetliner.scan_avro("s3://bucket/large_file.avro")
    .select(["user_id", "amount", "status"])
    .filter(pl.col("status") == "active")
    .filter(pl.col("amount") > 100)
    .head(1000)
    .collect()
)
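
The three optimizations can be pictured with a plain-Python model. This is a conceptual sketch only, not how Jetliner or Polars implement them: blocks are modeled as lists of dict rows, and scan_blocks is a hypothetical name.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Sequence


def scan_blocks(
    blocks: Iterable[Sequence[dict]],
    columns: Sequence[str],
    predicate: Callable[[dict], bool],
    limit: int,
) -> list:
    """Conceptual model of projection, predicate, and early-stop pushdown."""

    def rows() -> Iterator[dict]:
        for block in blocks:  # only one block is held at a time
            for row in block:
                if predicate(row):  # predicate pushdown: drop rows at the source
                    # projection pushdown: keep only the requested columns
                    yield {name: row[name] for name in columns}

    # early stopping: once `limit` rows are found, no further block is pulled
    return list(islice(rows(), limit))
```

If blocks is a lazy iterator, blocks after the one that satisfies the limit are never requested, which is the behavior .head(1000) relies on in the query above.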

scan_avro() vs read_avro() vs open()

Feature	`scan_avro()`	`read_avro()`	`open()`
Returns	LazyFrame	DataFrame	Iterator of DataFrames
Query optimization	✅ Projection, predicate, early stopping	✅ Column selection	❌ Manual
Batch control	Automatic	Automatic	Full control
Best for	Most queries	Eager loading with columns	Custom streaming, progress tracking

Development

The project uses spec-driven development via kiro. See ./.kiro for the specs and related documentation.

Project tasks

This project uses poethepoet for task management.

# Install poe globally with homebrew
brew tap nat-n/poethepoet
brew install nat-n/poethepoet/poethepoet
# Or with uv/pip/pipx
uv tool install poethepoet
# run poe without arguments to list available tasks, defined in pyproject.toml
poe

There are tasks available for formatting, linting, building, and testing. The check task orchestrates all tasks that must complete successfully for a change to be accepted.

Running tests

poe test-rust # run rust unit tests
poe test-property # run rust property tests
poe test-schema # run rust schema tests

Feature flags control codec support: snappy, deflate, zstd, bzip2, xz. Disable what you don't need with --no-default-features --features "snappy,zstd" to optimize build times.

Known Limitations

Read-Only

Jetliner is a read-only library. It does not support writing Avro files.

Avro Object Container Files Only

Jetliner reads Avro Object Container Files (.avro) — self-contained files where the schema is embedded in the file header. It does not support:

Single-object encoding — Used with schema registries (e.g., Confluent Schema Registry, Kafka). These encode objects with a schema fingerprint that requires external lookup.
Bare Avro encoding — Raw Avro binary without any schema information.
Standalone schema files (.avsc) — Schema JSON files are not read directly; schemas are extracted from .avro file headers.
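
Before handing a file to a reader, the leading bytes can tell the first two formats apart. The marker values below come from the Avro specification: a container file starts with the ASCII bytes Obj followed by the byte 1, and a single-object encoded message starts with the two bytes C3 01. The function names are hypothetical, and this check is not part of Jetliner. Bare Avro encoding has no marker, so it cannot be identified this way.

```python
OCF_MAGIC = b"Obj\x01"  # Avro Object Container File header magic
SINGLE_OBJECT_MARKER = b"\xc3\x01"  # Avro single-object encoding marker


def sniff_avro(prefix: bytes) -> str:
    """Classify data by its leading bytes (hypothetical helper)."""
    if prefix.startswith(OCF_MAGIC):
        return "container"
    if prefix.startswith(SINGLE_OBJECT_MARKER):
        return "single-object"
    return "unknown"  # bare encoding carries no marker, so it lands here


def sniff_avro_file(path: str) -> str:
    """Read the first four bytes of a local file and classify them."""
    with open(path, "rb") as handle:
        return sniff_avro(handle.read(4))
```

Only a result of "container" indicates a file in the format Jetliner reads.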

Recursive Types

Avro supports recursive types (e.g., linked lists, trees) where a record can contain references to itself. Since Arrow and Polars don't natively support recursive data structures, Jetliner serializes recursive fields to JSON strings. This preserves data integrity while maintaining compatibility with the Polars DataFrame model.

Example: A binary tree node with left and right children will have those fields serialized as JSON strings that can be parsed if needed after reading.
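
Parsing such a field after reading might look like the sketch below. The JSON layout is an assumption made for illustration (nested objects with value, left, and right keys, and null for a missing child); the exact shape Jetliner emits is not specified here, so inspect a real value before relying on it.

```python
import json
from typing import Optional


def tree_depth(node: Optional[dict]) -> int:
    """Depth of a tree parsed from JSON; a missing child (None) counts as 0."""
    if node is None:
        return 0
    return 1 + max(tree_depth(node.get("left")), tree_depth(node.get("right")))


# Hypothetical content of a recursive "left" column for one row
left_json = (
    '{"value": 2, '
    '"left": {"value": 4, "left": null, "right": null}, '
    '"right": null}'
)

left = json.loads(left_json)  # JSON string -> nested dicts
print(tree_depth(left))  # prints 2
```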

Complex Top-Level Schemas

Avro is usually used as a table format, with a Record as the top-level type. However, any other type may also appear at the top level.

Jetliner supports primitive top-level schemas (int, long, string, bytes), which are treated in the resulting Polars DataFrame as a Record with a single 'value' key. Complex types, however, have the following limitations:

Arrays as top-level schema: Not yet supported (Polars list builder constraints)
Maps as top-level schema: Not yet supported (struct handling in list builder)

Empty Schemas

An Avro schema may consist of a Record with zero fields. Since Polars cannot represent a DataFrame with zero columns, such Avro files are not compatible with Jetliner.
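
The top-level schema rules from this section and the previous one can be summarized as a small check over a schema's JSON text. This is a sketch of the rules as stated in this document, not Jetliner's own validation: top_level_support is a hypothetical name, and any top-level type the text does not mention is reported as "not stated" rather than guessed.

```python
import json

# Primitives this document names as supported at the top level
PRIMITIVES_AS_VALUE = {"int", "long", "string", "bytes"}


def top_level_support(schema_json: str) -> str:
    """Classify a top-level Avro schema against the limitations listed here."""
    schema = json.loads(schema_json)
    if isinstance(schema, str):
        kind = schema  # shorthand form, e.g. "long"
    elif isinstance(schema, dict):
        kind = schema.get("type")  # object form, e.g. {"type": "record", ...}
    else:
        return "not stated"  # e.g. a top-level union written as a JSON array
    if kind == "record" and isinstance(schema, dict):
        if schema.get("fields"):
            return "supported"
        return "unsupported: zero-field record"
    if kind in PRIMITIVES_AS_VALUE:
        return "supported: single 'value' column"
    if kind in ("array", "map"):
        return "unsupported: top-level array or map"
    return "not stated"
```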

Trivia

The Avro Canada C102 Jetliner was the world's second purpose-built jet-powered airliner.

Contributing

If you encounter an issue or have an idea for how to make jetliner more awesome, do come say hi in the issues 👋

If you discover an Avro file that other libraries can read but Jetliner cannot (for reasons other than those listed under Known Limitations), please share it.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Release history

0.2.0

Feb 17, 2026

This version

0.1.0

Feb 7, 2026
