Pure-Python Parquet parser, for education
Por Qué: Python Parquet Parser
¿Por qué? ¿Por qué no? (Why? Why not?)
Sí, ¿pero por qué? ¡Porque, parquet, python! (Yes, but why? Because, parquet, python!)
But seriously, why "Por Qué"?
Because asking "why" leads to understanding! This project exists to answer "why does Parquet work the way it does?" by implementing it from first principles.
[!WARNING] This is an educational project. It is NOT suitable for production use.
Overview
Por Qué is an Apache Parquet parser written from scratch in Python for educational purposes. It implements Parquet's binary format in highly readable Python to show how Parquet files work internally.
Features
- Complete reader stack - Parse files, row groups, column chunks, and pages
- Metadata inspection - Parse and display Parquet file metadata
- Schema analysis - View detailed schema structure with logical types
- Row group information - Inspect row group statistics and column metadata
- Compression analysis - Calculate compression ratios and storage efficiency
- HTTP support - Read Parquet files from URLs using range requests
- Async parallelism - Read from async sources that allow parallel requests, such as files served over HTTP
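To make the last two features concrete, here is a minimal sketch of concurrent byte-range reads. `InMemoryRangeSource`, `read_range`, and `fetch_ranges` are hypothetical names invented for this illustration and are not part of the por_que API. The in-memory source stands in for an HTTP server so the example runs offline.

```python
import asyncio


class InMemoryRangeSource:
    """Hypothetical stand-in for a remote file that serves byte ranges."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.requests: list[str] = []

    async def read_range(self, start: int, length: int) -> bytes:
        # An HTTP client would send this value in a Range header;
        # the end offset of an HTTP byte range is inclusive.
        self.requests.append(f"bytes={start}-{start + length - 1}")
        await asyncio.sleep(0)  # yield control, as real network I/O would
        return self._data[start:start + length]


async def fetch_ranges(source, ranges):
    """Fetch several (start, length) ranges concurrently, preserving order."""
    return await asyncio.gather(
        *(source.read_range(start, length) for start, length in ranges)
    )


# A fake 24-byte "file": magic, 16 bytes of payload, magic.
source = InMemoryRangeSource(b"PAR1" + bytes(range(16)) + b"PAR1")
chunks = asyncio.run(fetch_ranges(source, [(0, 4), (20, 4), (4, 2)]))
print(chunks)
```

Because each read only needs an offset and a length, independent parts of a file (for example, separate column chunks) can be requested at the same time instead of one after another.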
Installation
With pip:
pip install 'por-que'
Usage
Python API
import asyncio

from por_que import AsyncHttpFile, ParquetFile


async def main():
    # Read from local file
    with open("data.parquet", "rb") as f:
        parquet_file = await ParquetFile.from_reader(f, "data.parquet")

        # Access file-level metadata
        print(f"Total rows: {parquet_file.metadata.metadata.row_count}")
        print(f"Columns: {parquet_file.metadata.metadata.column_count}")
        print(f"Row groups: {parquet_file.metadata.metadata.row_group_count}")
        print(f"Parquet version: {parquet_file.metadata.metadata.version}")

        # Access schema information
        schema = parquet_file.metadata.metadata.schema_root
        print(f"Schema: {schema}")

        # Access column chunks and parse data (the file must still be open)
        for column_chunk in parquet_file.column_chunks:
            print(f"Column: {column_chunk.path_in_schema}")
            print(f"  Compression: {column_chunk.codec}")
            print(f"  Values: {column_chunk.num_values}")
            # Parse all data from the column
            data = column_chunk.parse_all_data_pages(f)
            print(f"  First values: {data[:5]}")

    # Read from URL
    async with AsyncHttpFile("https://example.com/data.parquet") as f:
        parquet_file = await ParquetFile.from_reader(f, "https://example.com/data.parquet")

        # Access pages within a column chunk
        column_chunk = parquet_file.column_chunks[0]
        for page in column_chunk.data_pages:
            print(f"Page at offset {page.start_offset}")
            print(f"  Type: {page.page_type}")
            print(f"  Values: {page.num_values}")
            print(f"  Encoding: {page.encoding}")

    # Serialize to JSON or dict
    json_output = parquet_file.to_json(indent=2)
    dict_output = parquet_file.to_dict()

    # Deserialize from JSON or dict
    restored = ParquetFile.from_json(json_output)
    restored = ParquetFile.from_dict(dict_output)


# `await` is only valid inside a coroutine, so run everything through asyncio
asyncio.run(main())
[!TIP] Exported JSON files can be used with ver-por-que, an experimental 100% client-side web UI for visualization.
What You'll Learn
By exploring this codebase, you can learn about:
- Parquet file format - Binary structure, magic bytes, footer layout
- Thrift protocol - Binary serialization format used by Parquet
- Schema representation - How nested and complex data types are encoded
- Compression techniques - Various compression algorithms and their efficiency
- Column storage - Columnar storage benefits and trade-offs
- Metadata organization - How Parquet organizes file and column statistics
- Lazy loading patterns - Efficient data access without loading entire files
- Binary parsing - Low-level byte manipulation and struct unpacking
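As a small taste of the file-format and binary-parsing items above: a Parquet file with a plaintext footer starts and ends with the 4-byte magic `PAR1`, and the 4 bytes just before the trailing magic hold the little-endian length of the Thrift-encoded file metadata. The sketch below locates that metadata in a byte string. `locate_footer` is a hypothetical helper written for this illustration, not por_que's own function, and the fake file contains placeholder bytes rather than real metadata.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the Thrift-encoded file metadata."""
    # Smallest possible file: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic bytes")
    # The footer length is a 4-byte little-endian unsigned integer.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


# Fake file: magic, 11 bytes of "column data", 3 placeholder footer bytes,
# the footer length (3), and the trailing magic.
fake_file = MAGIC + b"column data" + b"\x15\x00\x00" + struct.pack("<I", 3) + MAGIC
footer_start, footer_len = locate_footer(fake_file)
print(footer_start, footer_len)  # 15 3
```

Reading from the end of the file first is what lets a reader fetch the metadata with a couple of small reads before touching any column data.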
Educational Focus
This implementation prioritizes readability and understanding over performance:
- Explicit parsing logic instead of generated Thrift code
- Comprehensive comments explaining binary format details
- Step-by-step Thrift deserialization
- Clear separation of concerns between parsing and data structures
- Educational debug logging (enable with logging.basicConfig(level=logging.DEBUG))
- Structured architecture mirroring Parquet's physical layout
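To illustrate what step-by-step Thrift deserialization involves at the lowest level: Parquet metadata is serialized with Thrift's compact protocol, which stores integers as variable-length ULEB128 values and zigzag-encodes signed integers first. The functions below are a minimal sketch of those two primitives. Their names are assumptions made for this example and do not necessarily match the ones in por_que's Thrift parser.

```python
def read_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an unsigned ULEB128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data, least-significant group first.
        result |= (byte & 0x7F) << shift
        # A clear high bit marks the final byte of the varint.
        if not byte & 0x80:
            return result, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Map 0, 1, 2, 3, ... back to 0, -1, 1, -2, ..."""
    return (n >> 1) ^ -(n & 1)


def read_zigzag_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a signed integer as the compact protocol stores it."""
    raw, pos = read_varint(buf, pos)
    return zigzag_decode(raw), pos


print(read_varint(b"\xac\x02"))         # (300, 2)
print(read_zigzag_varint(b"\xd8\x04"))  # (300, 2)
print(zigzag_decode(3))                 # -2
```

Zigzag encoding keeps small negative numbers short: without it, a value like -1 would need the maximum number of varint bytes.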
Architecture
src/por_que/
├── parsers/ # Low-level binary parsers
│ ├── parquet/ # Parquet format parsers
│ │ ├── metadata.py # File metadata parser
│ │ ├── page.py # Page header parser
│ │ ├── page_index.py # Page index structures
│ │ ├── schema.py # Schema tree parser
│ │ ├── statistics.py # Statistics parser
│ │ ├── row_group.py # Row group metadata
│ │ ├── column.py # Column chunk metadata
│ │ └── ... # Other metadata parsers
│ ├── page_content/ # Page data decoding
│ │ ├── data.py # Data page decoder
│ │ ├── dictionary.py # Dictionary page decoder
│ │ └── compressors.py # Compression codecs
│ ├── thrift/ # Thrift protocol implementation
│ │ ├── parser.py # Core Thrift parser
│ │ └── enums.py # Thrift type definitions
│ ├── logical_types.py # Logical type converters
│ └── physical_types.py # Physical type parsers
├── physical.py # Main ParquetFile class
├── file_metadata.py # Metadata data structures
├── pages.py # Page data structures
├── protocols.py # Type protocols
├── enums.py # Parquet format enums
├── constants.py # Format constants
└── exceptions.py # Exception classes
Current Capabilities
Implemented Features
- Complete metadata parsing - All Parquet metadata structures
- Schema parsing - Full schema tree with logical types
- Page parsing - All page types (DATA_PAGE, DATA_PAGE_V2, DICTIONARY_PAGE, INDEX_PAGE)
- Data decoding - Convert raw page data to Python values
- Compression support - Snappy, GZIP, Brotli, LZ4, LZO, Zstd decompression
- Encoding support - PLAIN, DICTIONARY, RLE, DELTA (all variants), BYTE_STREAM_SPLIT
- Nested data - Definition and repetition level handling
- Statistics parsing - Min/max values, null counts, and distinct counts
- Page indexes - Column and offset index structures
- HTTP support - Range requests for remote file reading
- Serialization - Export to JSON/dict and restore from serialized formats
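As an example of one encoding from the list above, the following is a minimal sketch of Parquet's RLE/bit-packing hybrid, which is used for definition and repetition levels and for dictionary indices. `decode_rle_bitpacked` and `_read_varint` are hypothetical names for this illustration, not por_que's implementation. The sketch assumes any length prefix has already been stripped from the buffer.

```python
def _read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode an unsigned ULEB128 varint; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_rle_bitpacked(buf: bytes, bit_width: int, count: int) -> list[int]:
    """Decode up to `count` values from a hybrid RLE/bit-packed buffer."""
    values: list[int] = []
    pos = 0
    mask = (1 << bit_width) - 1
    while len(values) < count and pos < len(buf):
        header, pos = _read_varint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each,
            # packed starting from the least-significant bit of each byte.
            groups = header >> 1
            nbytes = groups * bit_width
            packed = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                values.append((packed >> (i * bit_width)) & mask)
        else:
            # RLE run: one value repeated (header >> 1) times, stored in
            # the smallest whole number of bytes that fits bit_width.
            run_length = header >> 1
            nbytes = (bit_width + 7) // 8
            value = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            values.extend([value] * run_length)
    # Bit-packed runs are padded to a multiple of 8, so trim the excess.
    return values[:count]


# An RLE run of four 5s, then one bit-packed group holding 0 through 7.
encoded = b"\x08\x05" + b"\x03\x88\xc6\xfa"
print(decode_rle_bitpacked(encoded, bit_width=3, count=12))
# [5, 5, 5, 5, 0, 1, 2, 3, 4, 5, 6, 7]
```

The low bit of each run header selects the mode, so a single stream can switch between long repeated runs and densely packed distinct values.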
Future Development
- Performance optimizations
- Additional test coverage for edge cases
- Refactoring and code organization improvements
Not Planned
- Write support (creating Parquet files)
Contributing
This is primarily an educational project. Feel free to:
- Report bugs or parsing issues
- Suggest improvements for educational value
- Add more comprehensive test cases
- Improve documentation and comments
License
Apache License 2.0
File details
Details for the file por_que-0.2.2.tar.gz.
File metadata
- Download URL: por_que-0.2.2.tar.gz
- Upload date:
- Size: 574.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d7b441581cf6c171cd26678b9739994c0a5a32bafcfd338a4b3956706f1681e5 |
| MD5 | 9695096fadd8ed9f235ea79c8d2a76d7 |
| BLAKE2b-256 | 183e959a63abae9dd65646ede877c5614320e003b3a0a41463a81010aed4755b |
Provenance
The following attestation bundles were made for por_que-0.2.2.tar.gz:
Publisher: release.yml on jkeifer/por-que
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: por_que-0.2.2.tar.gz
- Subject digest: d7b441581cf6c171cd26678b9739994c0a5a32bafcfd338a4b3956706f1681e5
- Sigstore transparency entry: 651324979
- Sigstore integration time:
- Permalink: jkeifer/por-que@c60690a190f8c41ba7db7ffffbab471370c016c3
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/jkeifer
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c60690a190f8c41ba7db7ffffbab471370c016c3
- Trigger Event: release
File details
Details for the file por_que-0.2.2-py3-none-any.whl.
File metadata
- Download URL: por_que-0.2.2-py3-none-any.whl
- Upload date:
- Size: 77.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b5abbaa97073ca90865208a828b53888f9fe8396fa8b2589456eb84b67a54de8 |
| MD5 | ed5732bc6a5e3d1a32e4aaa926915f94 |
| BLAKE2b-256 | ab5ab4bec2b0e330c23d3c269fa9134d9d9f95d9f616ab87fe0b754f859a2a5f |
Provenance
The following attestation bundles were made for por_que-0.2.2-py3-none-any.whl:
Publisher: release.yml on jkeifer/por-que
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: por_que-0.2.2-py3-none-any.whl
- Subject digest: b5abbaa97073ca90865208a828b53888f9fe8396fa8b2589456eb84b67a54de8
- Sigstore transparency entry: 651324980
- Sigstore integration time:
- Permalink: jkeifer/por-que@c60690a190f8c41ba7db7ffffbab471370c016c3
- Branch / Tag: refs/tags/v0.2.2
- Owner: https://github.com/jkeifer
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c60690a190f8c41ba7db7ffffbab471370c016c3
- Trigger Event: release