Automatically generate optimized Parquet/Arrow schemas for better compression.

AutoParquet

AutoParquet is a Python package that wraps Parquet/Arrow to automatically generate optimized schemas for your data. It focuses on better compression through automatic bit-packing, int-packing, and dictionary encoding, while providing a convenient "header" system for storing custom metadata.

Quick Start

Installation

git clone https://github.com/edawson/autoparquet
cd autoparquet
pip install -e .

Or with the development and Polars extras:

pip install -e ".[dev,polars]"

Basic Example

import pandas as pd
import autoparquet

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "category": ["A", "B", "A", "B", "A"],
    "value": [1.1, 2.2, 3.3, 4.4, 5.5]
})

# Write with automatic schema optimization and a custom header
autoparquet.write_parquet(
    df,
    "data.parquet",
    header={"version": "1.0", "author": "Eric T. Dawson"}
)

# Read back the data and the header
df_read, header = autoparquet.read_parquet("data.parquet")
print(header)  # {'version': '1.0', 'author': 'Eric T. Dawson'}
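
The README does not describe how the header is stored. As a rough sketch of one plausible approach, the header dict could be serialized to JSON and placed in the file's key-value metadata, which PyArrow exposes as a mapping from byte strings to byte strings. The helper names `encode_header` and `decode_header` and the key `autoparquet_header` are hypothetical and chosen only for illustration; they are not AutoParquet's actual API.

```python
import json

# Hypothetical metadata key, chosen for illustration only.
HEADER_KEY = b"autoparquet_header"


def encode_header(header: dict) -> dict:
    """Serialize a header dict into a bytes-to-bytes metadata mapping."""
    payload = json.dumps(header, sort_keys=True).encode("utf-8")
    return {HEADER_KEY: payload}


def decode_header(metadata: dict) -> dict:
    """Recover the header dict; return an empty dict if no header is stored."""
    raw = metadata.get(HEADER_KEY)
    return json.loads(raw.decode("utf-8")) if raw is not None else {}


metadata = encode_header({"version": "1.0", "author": "Eric T. Dawson"})
print(decode_header(metadata))
```

Keeping the whole header under a single key avoids collisions with metadata that other tools write into the same file.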

Command Line

# Convert CSV to optimized Parquet (default: zstd compression)
autoparquet csv_to_parquet data.csv

# Convert CSV to Feather with custom options
autoparquet csv_to_feather data.csv -o out.feather -c snappy

# Tab-separated input with reduced float precision
autoparquet csv_to_parquet data.tsv -d $'\t' -f float32
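
Assuming `-f float32` stores floating-point columns as 32-bit values, the trade-off is half the storage per value in exchange for roughly seven significant decimal digits. The sketch below uses only the standard library to show the rounding that a 64-bit to 32-bit conversion introduces; `to_float32` is a hypothetical helper, not part of AutoParquet.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


for value in [1.1, 2.2, 0.5]:
    rounded = to_float32(value)
    # Values exactly representable in binary (such as 0.5) survive unchanged;
    # others pick up a small rounding error.
    print(value, rounded, abs(value - rounded))
```

If downstream computations depend on more than about seven significant digits, keeping the default precision is the safer choice.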

Features

  • Automatic Schema Inference: Downcasts integers to the smallest type that fits and optionally reduces float precision.
  • Optimized Compression: Dictionary-encodes low-cardinality strings with the smallest index type; converts uniform-length strings to FixedSizeBinary.
  • Custom Headers: Easily add and retrieve custom metadata (versioning, key-value pairs) in Parquet files.
  • Multi-Framework Support: Works with Pandas, Polars, and cuDF.
  • CLI: Convert CSV files to optimized Parquet or Feather from the command line.
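
The type-selection rules in the first two features can be sketched in plain Python. This is an illustration under stated assumptions, not AutoParquet's implementation: the function names, the use of unsigned types for non-negative integer columns, the signed dictionary index types, and the 0.5 cardinality threshold are all hypothetical.

```python
def smallest_int_type(values):
    """Return the name of the narrowest integer type that holds every value.

    Assumes a non-empty sequence of Python ints.
    """
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        if lo >= 0:
            # Non-negative data: try unsigned types (an assumption).
            if hi <= 2**bits - 1:
                return f"uint{bits}"
        elif -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return f"int{bits}"
    raise OverflowError("values do not fit in 64 bits")


def string_encoding(values, max_cardinality_ratio=0.5):
    """Pick an encoding for a string column (the threshold is hypothetical)."""
    if not values:
        return "string"
    unique = set(values)
    if len(unique) <= max_cardinality_ratio * len(values):
        # Low cardinality: dictionary-encode with the narrowest signed index.
        for bits in (8, 16, 32):
            if len(unique) <= 2 ** (bits - 1):
                return f"dictionary<int{bits}>"
    widths = {len(v.encode("utf-8")) for v in values}
    if len(widths) == 1:
        # Every value has the same byte length: a fixed-width layout fits.
        return f"fixed_size_binary[{widths.pop()}]"
    return "string"


print(smallest_int_type([1, 2, 3, 4, 5]))
print(smallest_int_type([-1, 300]))
print(string_encoding(["A", "B", "A", "B", "A"]))
print(string_encoding(["ab12", "cd34", "ef56"]))
```

Applied to the Basic Example, the `id` column fits in 8 bits and the `category` column, with two distinct values across five rows, qualifies for dictionary encoding under these assumed rules.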

Documentation

  • Usage Guide - Detailed examples, API reference, and advanced features
  • Contributing - Development setup and contribution guidelines

Development

# Run tests
pytest

# Lint and format
ruff check .
ruff format .

# Type checking
mypy src/autoparquet tests

# Or use make
make test
make lint
make check

Requirements

  • Python 3.9+
  • PyArrow 14.0.0+
  • Pandas 2.0.0+ (optional)
  • Polars 0.20.0+ (optional)

License

MIT License

Citation

@software{autoparquet2026,
  author = {Dawson, Eric T.},
  title = {AutoParquet: Automatic Schema Optimization for Parquet Files},
  year = {2026},
  url = {https://github.com/edawson/autoparquet}
}

This repository was generated using Gemini 3 Flash, based on a specification written by the author. The code was reviewed by the author for correctness and tested locally.

Download files

Source Distribution

autoparquet-0.1.5.tar.gz (26.1 kB)

Built Distribution

autoparquet-0.1.5-py3-none-any.whl (19.7 kB)

File details

Details for the file autoparquet-0.1.5.tar.gz.

File metadata

  • Download URL: autoparquet-0.1.5.tar.gz
  • Size: 26.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for autoparquet-0.1.5.tar.gz
Algorithm Hash digest
SHA256 0402f4e0473bd2995c665c889053055fdd0aef8f16bdb03e09b3d534fe569d77
MD5 08cea6f4965e44b401e4b94b4b41dce9
BLAKE2b-256 a3b24f0f78fd9ebfb01cf0a6982fbdb836cbdc76514239d3677c4547e4f7a294

File details

Details for the file autoparquet-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: autoparquet-0.1.5-py3-none-any.whl
  • Size: 19.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for autoparquet-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 8153e413bfa03bdb80d2b8fe3060245e9efaf116380a47d3c6ec7bb008dea86a
MD5 ed34873c044eaa191970d872d576c279
BLAKE2b-256 2aa82b687e38a73dc55f74b767d9064e878c1ef6ebae74c4070bfe500875245c
