Automatically generate optimized Parquet/Arrow schemas for better compression.
AutoParquet
AutoParquet is a Python package that wraps Parquet/Arrow to automatically generate optimized schemas for your data. It focuses on better compression through automatic bit-packing, int-packing, and dictionary encoding, while providing a convenient "header" system for storing custom metadata.
Quick Start
Installation
```shell
git clone https://github.com/edawson/autoparquet
cd autoparquet
pip install -e .
```
Or with the dev and Polars extras:
```shell
pip install -e ".[dev,polars]"
```
Basic Example
```python
import pandas as pd
import autoparquet

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "category": ["A", "B", "A", "B", "A"],
    "value": [1.1, 2.2, 3.3, 4.4, 5.5],
})

# Write with automatic schema optimization and a custom header
autoparquet.write_parquet(
    df,
    "data.parquet",
    header={"version": "1.0", "author": "Eric T. Dawson"},
)

# Read back the data and the header
df_read, header = autoparquet.read_parquet("data.parquet")
print(header)  # {'version': '1.0', 'author': 'Eric T. Dawson'}
```
Command Line
```shell
# Convert CSV to optimized Parquet (default: zstd compression)
autoparquet csv_to_parquet data.csv

# Convert CSV to Feather with custom options
autoparquet csv_to_feather data.csv -o out.feather -c snappy

# Tab-separated input with reduced float precision
autoparquet csv_to_parquet data.tsv -d $'\t' -f float32
```
Features
- Automatic Schema Inference: Downcasts integers to the smallest type that fits and optionally reduces float precision.
- Optimized Compression: Dictionary-encodes low-cardinality strings with the smallest index type; converts uniform-length strings to FixedSizeBinary.
- Custom Headers: Easily add and retrieve custom metadata (versioning, key-value pairs) in Parquet files.
- Multi-Framework Support: Works with Pandas, Polars, and cuDF.
- CLI: Convert CSV files to optimized Parquet or Feather from the command line.
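The type-selection ideas in the first two features can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not AutoParquet's code: the function names, the preference for unsigned types on non-negative data, and the `max_ratio` cardinality threshold are all assumptions.

```python
from typing import Optional, Sequence

# (type name, min, max), ordered from narrowest to widest.
_SIGNED = [(f"int{b}", -(2 ** (b - 1)), 2 ** (b - 1) - 1) for b in (8, 16, 32, 64)]
_UNSIGNED = [(f"uint{b}", 0, 2 ** b - 1) for b in (8, 16, 32, 64)]


def smallest_int_type(values: Sequence[int], allow_unsigned: bool = True) -> str:
    """Return the narrowest integer type name covering all values (non-empty input)."""
    lo, hi = min(values), max(values)
    candidates = _UNSIGNED if (allow_unsigned and lo >= 0) else _SIGNED
    for name, type_min, type_max in candidates:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in a 64-bit integer")


def dictionary_index_type(values: Sequence[str], max_ratio: float = 0.5) -> Optional[str]:
    """Return a signed index type for dictionary encoding, or None if cardinality is high.

    max_ratio is a hypothetical threshold on unique values / total values.
    """
    unique = len(set(values))
    if unique > max_ratio * len(values):
        return None
    # The largest index that must be representable is unique - 1.
    return smallest_int_type([0, unique - 1], allow_unsigned=False)


def fixed_width(values: Sequence[str]) -> Optional[int]:
    """Return the shared UTF-8 byte length if every string has the same one."""
    lengths = {len(v.encode("utf-8")) for v in values}
    return lengths.pop() if len(lengths) == 1 else None


print(smallest_int_type([1, 2, 3, 4, 5]))
print(dictionary_index_type(["A", "B", "A", "B", "A"]))
print(fixed_width(["AAA", "BBB"]))
```

A column whose strings all share one byte length is a candidate for a fixed-size binary type, and a low-cardinality column only needs an index wide enough to address its distinct values.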
Documentation
- Usage Guide - Detailed examples, API reference, and advanced features
- Contributing - Development setup and contribution guidelines
Development
```shell
# Run tests
pytest

# Lint and format
ruff check .
ruff format .

# Type checking
mypy src/autoparquet tests

# Or use make
make test
make lint
make check
```
Requirements
- Python 3.9+
- PyArrow 14.0.0+
- Pandas 2.0.0+ (optional)
- Polars 0.20.0+ (optional)
License
MIT License
Citation
```bibtex
@software{autoparquet2026,
  author = {Dawson, Eric T.},
  title = {AutoParquet: Automatic Schema Optimization for Parquet Files},
  year = {2026},
  url = {https://github.com/edawson/autoparquet}
}
```
This repository was generated using Gemini 3 Flash, based on a specification written by the author. The code was reviewed by the author for correctness and tested locally.
File details
Details for the file autoparquet-0.1.5.tar.gz.
File metadata
- Download URL: autoparquet-0.1.5.tar.gz
- Upload date:
- Size: 26.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0402f4e0473bd2995c665c889053055fdd0aef8f16bdb03e09b3d534fe569d77 |
| MD5 | 08cea6f4965e44b401e4b94b4b41dce9 |
| BLAKE2b-256 | a3b24f0f78fd9ebfb01cf0a6982fbdb836cbdc76514239d3677c4547e4f7a294 |
File details
Details for the file autoparquet-0.1.5-py3-none-any.whl.
File metadata
- Download URL: autoparquet-0.1.5-py3-none-any.whl
- Upload date:
- Size: 19.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8153e413bfa03bdb80d2b8fe3060245e9efaf116380a47d3c6ec7bb008dea86a |
| MD5 | ed34873c044eaa191970d872d576c279 |
| BLAKE2b-256 | 2aa82b687e38a73dc55f74b767d9064e878c1ef6ebae74c4070bfe500875245c |