Automatically generate optimized Parquet/Arrow schemas for better compression.

AutoParquet

AutoParquet is a Python package that wraps Parquet/Arrow to automatically generate optimized schemas for your data. It focuses on better compression through automatic bit-packing, int-packing, and dictionary encoding, while providing a convenient "header" system for storing custom metadata.

Quick Start

Installation

git clone https://github.com/edawson/autoparquet
cd autoparquet
pip install -e .

Or with the dev and Polars extras:

pip install -e ".[dev,polars]"

Basic Example

import pandas as pd
import autoparquet

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "category": ["A", "B", "A", "B", "A"],
    "value": [1.1, 2.2, 3.3, 4.4, 5.5]
})

# Write with automatic schema optimization and a custom header
autoparquet.write_parquet(
    df,
    "data.parquet",
    header={"version": "1.0", "author": "Eric T. Dawson"},
)

# Read back the data and the header
df_read, header = autoparquet.read_parquet("data.parquet")
print(header)  # {'version': '1.0', 'author': 'Eric T. Dawson'}
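
Parquet files can carry key-value metadata, which Arrow exposes as a map from bytes keys to bytes values. The sketch below shows one way a header dict could round-trip through such a map. It is an illustration only: the key name `autoparquet_header` and the JSON encoding are assumptions, not necessarily what AutoParquet does internally.

```python
import json

# Hypothetical metadata key, chosen for this sketch only.
HEADER_KEY = b"autoparquet_header"


def encode_header(header, existing=None):
    """Return a bytes-to-bytes metadata map with the header stored as JSON."""
    metadata = dict(existing or {})
    metadata[HEADER_KEY] = json.dumps(header, sort_keys=True).encode("utf-8")
    return metadata


def decode_header(metadata):
    """Recover the header dict, or an empty dict if no header was stored."""
    raw = metadata.get(HEADER_KEY)
    return json.loads(raw.decode("utf-8")) if raw is not None else {}


metadata = encode_header({"version": "1.0", "author": "Eric T. Dawson"})
header = decode_header(metadata)
print(header)
```

Keeping the header under a single key leaves any other metadata entries in the file untouched.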

Command Line

# Convert CSV to optimized Parquet (default: zstd compression)
autoparquet csv_to_parquet data.csv

# Convert CSV to Feather with custom options
autoparquet csv_to_feather data.csv -o out.feather -c snappy

# Tab-separated input with reduced float precision
autoparquet csv_to_parquet data.tsv -d $'\t' -f float32
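
Storing floats as float32 halves their width but rounds each value to roughly 7 significant decimal digits. The standard-library sketch below (independent of AutoParquet; `to_float32` is a helper named here for illustration) shows the rounding error this introduces for the values used in the Basic Example, which can help decide whether `-f float32` is acceptable for a given dataset.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


values = [1.1, 2.2, 3.3, 4.4, 5.5]
rounded = [to_float32(v) for v in values]

# Largest absolute difference introduced by the narrower type.
max_abs_error = max(abs(a - b) for a, b in zip(values, rounded))
print(rounded)
print(max_abs_error)
```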

Features

  • Automatic Schema Inference: Downcasts integers to the smallest type that fits and optionally reduces float precision.
  • Optimized Compression: Dictionary-encodes low-cardinality strings with the smallest index type; converts uniform-length strings to FixedSizeBinary.
  • Custom Headers: Easily add and retrieve custom metadata (versioning, key-value pairs) in Parquet files.
  • Multi-Framework Support: Works with Pandas, Polars, and cuDF.
  • CLI: Convert CSV files to optimized Parquet or Feather from the command line.
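
The type-selection rules described in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not AutoParquet's implementation: the helper names are hypothetical, unsigned types are assumed to be preferred for non-negative data, and dictionary indices are assumed to be signed.

```python
def smallest_int_type(values):
    """Name the narrowest integer type that can hold every value."""
    if not values:
        raise ValueError("cannot infer a type from an empty column")
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        if lo >= 0:
            # Non-negative data: try unsigned types (an assumption of this sketch).
            if hi < 2 ** bits:
                return f"uint{bits}"
        elif lo >= -(2 ** (bits - 1)) and hi < 2 ** (bits - 1):
            return f"int{bits}"
    raise OverflowError("values do not fit in 64 bits")


def dictionary_index_type(n_unique):
    """Name the narrowest signed index type for n_unique dictionary entries."""
    for bits in (8, 16, 32, 64):
        # A signed type with this many bits can index 2**(bits - 1) entries.
        if n_unique <= 2 ** (bits - 1):
            return f"int{bits}"
    raise OverflowError("too many distinct values")


def fixed_byte_width(strings):
    """Return the shared UTF-8 byte length, or None if the lengths differ."""
    widths = {len(s.encode("utf-8")) for s in strings}
    return widths.pop() if len(widths) == 1 else None


ids = [1, 2, 3, 4, 5]
categories = ["A", "B", "A", "B", "A"]
print(smallest_int_type(ids))
print(dictionary_index_type(len(set(categories))))
print(fixed_byte_width(categories))
```

A column such as `category` qualifies for both dictionary encoding and a fixed-width representation; which one a real optimizer prefers is a design choice this sketch does not settle.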

Documentation

  • Usage Guide - Detailed examples, API reference, and advanced features
  • Contributing - Development setup and contribution guidelines

Development

# Run tests
pytest

# Lint and format
ruff check .
ruff format .

# Type checking
mypy src/autoparquet tests

# Or use make
make test
make lint
make check

Requirements

  • Python 3.9+
  • PyArrow 14.0.0+
  • Pandas 2.0.0+ (optional)
  • Polars 0.20.0+ (optional)

License

MIT License

Citation

@software{autoparquet2026,
  author = {Dawson, Eric T.},
  title = {AutoParquet: Automatic Schema Optimization for Parquet Files},
  year = {2026},
  url = {https://github.com/erictdawson/autoparquet}
}

This repository was generated using Gemini 3 Flash, based on a specification written by the author. The code was reviewed by the author for correctness and tested locally.
