Pure-Python Parquet parser, for education

Maintainers

jkeifer

Project description

Por Qué: Python Parquet Parser

¿Por qué? ¿Por qué no? (Why? Why not?)

Sí, ¿pero por qué? ¡Porque, parquet, python! (Yes, but why? Because, parquet, python!)

But seriously, why "Por Qué"?

Because asking "why" leads to understanding! This project exists to answer "why does Parquet work the way it does?" by implementing it from first principles.

Warning: This is an educational project. It is NOT suitable for any production use.

Overview

Por Qué is a Python parser for Apache Parquet, built from scratch for educational purposes. It implements Parquet's binary format in highly readable Python so that it is easier to see how Parquet files work internally.

Features

Complete reader stack - Parse files, row groups, column chunks, and pages
Metadata inspection - Parse and display Parquet file metadata
Schema analysis - View detailed schema structure with logical types
Row group information - Inspect row group statistics and column metadata
Compression analysis - Calculate compression ratios and storage efficiency
HTTP support - Read Parquet files from URLs using range requests
Async parallelism - Read from async sources that support parallel requests, such as files over HTTP
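
To make the async parallelism idea concrete, here is a minimal sketch of issuing several byte-range reads concurrently with asyncio.gather. The InMemoryAsyncSource class and its read_range method are hypothetical stand-ins invented for this illustration. They are not part of the por-que API, and a real HTTP source would send range requests instead of slicing bytes.

```python
import asyncio


class InMemoryAsyncSource:
    """Hypothetical async byte source; stands in for a remote file."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    async def read_range(self, start: int, length: int) -> bytes:
        # Yield to the event loop, as a real network read would while waiting.
        await asyncio.sleep(0)
        return self._data[start:start + length]


async def read_ranges(source: InMemoryAsyncSource, ranges: list[tuple[int, int]]) -> list[bytes]:
    # Schedule every read at once; gather returns results in request order.
    return await asyncio.gather(*(source.read_range(s, n) for s, n in ranges))


data = bytes(range(32))
chunks = asyncio.run(read_ranges(InMemoryAsyncSource(data), [(0, 4), (8, 4), (28, 4)]))
```

Because column chunks and pages live at known offsets, independent reads like these do not have to wait on one another.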

Installation

With pip:

pip install 'por-que'

Usage

Python API

import asyncio

from por_que import AsyncHttpFile, ParquetFile


async def main():
    # Read from local file
    with open("data.parquet", "rb") as f:
        parquet_file = await ParquetFile.from_reader(f, "data.parquet")

        # Access file-level metadata
        print(f"Total rows: {parquet_file.metadata.metadata.row_count}")
        print(f"Columns: {parquet_file.metadata.metadata.column_count}")
        print(f"Row groups: {parquet_file.metadata.metadata.row_group_count}")
        print(f"Parquet version: {parquet_file.metadata.metadata.version}")

        # Access schema information
        schema = parquet_file.metadata.metadata.schema_root
        print(f"Schema: {schema}")

        # Access column chunks and parse data
        for column_chunk in parquet_file.column_chunks:
            print(f"Column: {column_chunk.path_in_schema}")
            print(f"  Compression: {column_chunk.codec}")
            print(f"  Values: {column_chunk.num_values}")

            # Parse all data from the column
            data = column_chunk.parse_all_data_pages(f)
            print(f"  First values: {data[:5]}")

    # Read from URL
    async with AsyncHttpFile("https://example.com/data.parquet") as f:
        parquet_file = await ParquetFile.from_reader(f, "https://example.com/data.parquet")

        # Access pages within a column chunk
        column_chunk = parquet_file.column_chunks[0]
        for page in column_chunk.data_pages:
            print(f"Page at offset {page.start_offset}")
            print(f"  Type: {page.page_type}")
            print(f"  Values: {page.num_values}")
            print(f"  Encoding: {page.encoding}")

    # Serialize to JSON or dict
    json_output = parquet_file.to_json(indent=2)
    dict_output = parquet_file.to_dict()

    # Deserialize from JSON or dict
    restored = ParquetFile.from_json(json_output)
    restored = ParquetFile.from_dict(dict_output)


# `await` and `async with` only work inside a coroutine, so run main() in an event loop
asyncio.run(main())

What You'll Learn

By exploring this codebase, you can learn about:

Parquet file format - Binary structure, magic bytes, footer layout
Thrift protocol - Binary serialization format used by Parquet
Schema representation - How nested and complex data types are encoded
Compression techniques - Various compression algorithms and their efficiency
Column storage - Columnar storage benefits and trade-offs
Metadata organization - How Parquet organizes file and column statistics
Lazy loading patterns - Efficient data access without loading entire files
Binary parsing - Low-level byte manipulation and struct unpacking
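
As a taste of the file format and binary parsing topics above, the following sketch locates the footer metadata in an unencrypted Parquet file. It relies only on the documented layout: the file starts and ends with the 4-byte magic PAR1, and the 4 bytes before the trailing magic hold the metadata length as a little-endian unsigned integer. The locate_footer function and the synthetic fake_file are assumptions made for illustration and are not taken from the por-que source.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(file_bytes: bytes) -> tuple[int, int]:
    """Return (metadata_start, metadata_length) for an unencrypted Parquet file."""
    # Smallest possible file: leading magic + footer length + trailing magic.
    if len(file_bytes) < 12:
        raise ValueError("too small to be a Parquet file")
    if file_bytes[:4] != MAGIC or file_bytes[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")

    # The 4 bytes before the trailing magic store the metadata length.
    (metadata_length,) = struct.unpack("<I", file_bytes[-8:-4])
    metadata_start = len(file_bytes) - 8 - metadata_length
    if metadata_start < 4:
        raise ValueError("footer length points before the start of the file")
    return metadata_start, metadata_length


# Synthetic bytes shaped like a Parquet file; the metadata here is placeholder zeros.
fake_metadata = b"\x00" * 10
fake_file = MAGIC + b"column data" + fake_metadata + struct.pack("<I", len(fake_metadata)) + MAGIC
footer_location = locate_footer(fake_file)
```

Reading from the end first is what lets a reader fetch only the tail of a large or remote file before deciding which column chunks it needs.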

Educational Focus

This implementation prioritizes readability and understanding over performance:

Explicit parsing logic instead of generated Thrift code
Comprehensive comments explaining binary format details
Step-by-step Thrift deserialization
Clear separation of concerns between parsing and data structures
Educational debug logging (enable with logging.basicConfig(level=logging.DEBUG))
Structured architecture mirroring Parquet's physical layout
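
To illustrate what step-by-step Thrift deserialization involves, here is a small sketch of two building blocks of the Thrift compact protocol, which Parquet uses for its metadata: unsigned varints and zigzag-encoded integers. The function names are hypothetical and the sketch handles only short-form field headers for integer fields. It is not the parser in src/por_que/parsers/thrift/.

```python
def read_uvarint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry data
        if byte & 0x80 == 0:  # high bit clear marks the last byte
            return result, pos
        shift += 7


def zigzag_decode(n: int) -> int:
    """Map 0, 1, 2, 3, ... back to 0, -1, 1, -2, ..."""
    return (n >> 1) ^ -(n & 1)


def read_short_int_field(buf: bytes, pos: int, last_field_id: int) -> tuple[int, int, int, int]:
    """Read one short-form integer field; return (field_id, type_id, value, next_position)."""
    header = buf[pos]
    delta = header >> 4      # field id delta relative to the previous field
    type_id = header & 0x0F  # compact type: 4 = i16, 5 = i32, 6 = i64
    if delta == 0 or type_id not in (4, 5, 6):
        raise NotImplementedError("this sketch only handles short-form integer fields")
    raw, pos = read_uvarint(buf, pos + 1)
    return last_field_id + delta, type_id, zigzag_decode(raw), pos


# 0x15 = delta 1, type i32; 0x02 is zigzag for 1.
field = read_short_int_field(b"\x15\x02", 0, 0)
```

The two bytes in the example are what a first field with id 1, type i32, and value 1 looks like on the wire.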

Architecture

src/por_que/
├── parsers/                # Low-level binary parsers
│   ├── parquet/            # Parquet format parsers
│   │   ├── metadata.py     # File metadata parser
│   │   ├── page.py         # Page header parser
│   │   ├── page_index.py   # Page index structures
│   │   ├── schema.py       # Schema tree parser
│   │   ├── statistics.py   # Statistics parser
│   │   ├── row_group.py    # Row group metadata
│   │   ├── column.py       # Column chunk metadata
│   │   └── ...             # Other metadata parsers
│   ├── page_content/       # Page data decoding
│   │   ├── data.py         # Data page decoder
│   │   ├── dictionary.py   # Dictionary page decoder
│   │   └── compressors.py  # Compression codecs
│   ├── thrift/             # Thrift protocol implementation
│   │   ├── parser.py       # Core Thrift parser
│   │   └── enums.py        # Thrift type definitions
│   ├── logical_types.py    # Logical type converters
│   └── physical_types.py   # Physical type parsers
├── physical.py             # Main ParquetFile class
├── file_metadata.py        # Metadata data structures
├── pages.py                # Page data structures
├── protocols.py            # Type protocols
├── enums.py                # Parquet format enums
├── constants.py            # Format constants
└── exceptions.py           # Exception classes

Current Capabilities

Implemented Features

Complete metadata parsing - All Parquet metadata structures
Schema parsing - Full schema tree with logical types
Page parsing - All page types (DATA_PAGE, DATA_PAGE_V2, DICTIONARY_PAGE, INDEX_PAGE)
Data decoding - Convert raw page data to Python values
Compression support - Snappy, GZIP, Brotli, LZ4, LZO, Zstd decompression
Encoding support - PLAIN, DICTIONARY, RLE, DELTA (all variants), BYTE_STREAM_SPLIT
Nested data - Definition and repetition level handling
Statistics parsing - Min/max values, null counts, and distinct counts
Page indexes - Column and offset index structures
HTTP support - Range requests for remote file reading
Serialization - Export to JSON/dict and restore from serialized formats
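
For a sense of what the RLE encoding in the list above involves, here is a minimal sketch of decoding Parquet's RLE/bit-packing hybrid runs, following the public format specification. The function names are hypothetical and this is not the decoder shipped in por-que. The sketch also assumes the caller has already stripped any length prefix and knows the bit width.

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode an unsigned LEB128 varint; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def decode_rle_bit_packed_hybrid(buf: bytes, bit_width: int, num_values: int) -> list[int]:
    values: list[int] = []
    pos = 0
    mask = (1 << bit_width) - 1
    byte_width = (bit_width + 7) // 8  # RLE values are padded to whole bytes
    while len(values) < num_values and pos < len(buf):
        header, pos = read_uvarint(buf, pos)
        if header & 1:
            # Bit-packed run: header >> 1 counts groups of 8 values.
            groups = header >> 1
            nbytes = groups * bit_width
            # Values are packed starting from the least significant bit.
            packed = int.from_bytes(buf[pos:pos + nbytes], "little")
            pos += nbytes
            for i in range(groups * 8):
                values.append((packed >> (i * bit_width)) & mask)
        else:
            # RLE run: header >> 1 is the repeat count, then one fixed-width value.
            run_length = header >> 1
            value = int.from_bytes(buf[pos:pos + byte_width], "little")
            pos += byte_width
            values.extend([value] * run_length)
    return values[:num_values]  # a bit-packed run may be padded past num_values


# An RLE run of five 2s, then one bit-packed group holding 0..7 at bit width 3.
encoded = bytes([0x0A, 0x02, 0x03, 0x88, 0xC6, 0xFA])
decoded = decode_rle_bit_packed_hybrid(encoded, bit_width=3, num_values=13)
```

The same hybrid scheme is used for definition and repetition levels and for dictionary indices, which is why it shows up across most Parquet files.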

Future Development

Performance optimizations
Additional test coverage for edge cases
Refactoring and code organization improvements

Not Planned

Write support (creating Parquet files)

Contributing

This is primarily an educational project. Feel free to:

Report bugs or parsing issues
Suggest improvements for educational value
Add more comprehensive test cases
Improve documentation and comments

License

Apache License 2.0

Release history

0.2.4 - Nov 17, 2025
0.2.3 - Oct 29, 2025
0.2.2 - Oct 29, 2025
0.2.1 - Oct 21, 2025
0.2.0 - Oct 21, 2025 (this version)
0.1.0 - Oct 14, 2025
0.0.4 - Sep 4, 2025
0.0.3 - Sep 3, 2025
0.0.2 - Aug 28, 2025
0.0.1 - Aug 27, 2025

Download files

Source Distribution

por_que-0.2.0.tar.gz (569.7 kB view details)

Uploaded Oct 21, 2025 Source

Built Distribution

por_que-0.2.0-py3-none-any.whl (77.7 kB view details)

Uploaded Oct 21, 2025 Python 3

File details

Details for the file por_que-0.2.0.tar.gz.

File metadata

Download URL: por_que-0.2.0.tar.gz
Upload date: Oct 21, 2025
Size: 569.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for por_que-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`24cc82071ae0ec08b3071c0af4bcbbe76ddd629cfad75572fed2f32f8b866ec4`
MD5	`7fa7a5dcd7c07bc89cffbf45ae5bc4ad`
BLAKE2b-256	`05a62756c3055d183f80ea79fa2f44f9bac511050a6b88c548cfdc125d5466ef`

Provenance

The following attestation bundles were made for por_que-0.2.0.tar.gz:

Publisher: release.yml on jkeifer/por-que

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: por_que-0.2.0.tar.gz
- Subject digest: 24cc82071ae0ec08b3071c0af4bcbbe76ddd629cfad75572fed2f32f8b866ec4
- Sigstore transparency entry: 626175832
- Sigstore integration time: Oct 21, 2025
Source repository:
- Permalink: jkeifer/por-que@81194a79ecaf0725595fc7d193976c436556b66e
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/jkeifer
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@81194a79ecaf0725595fc7d193976c436556b66e
- Trigger Event: release

File details

Details for the file por_que-0.2.0-py3-none-any.whl.

File metadata

Download URL: por_que-0.2.0-py3-none-any.whl
Upload date: Oct 21, 2025
Size: 77.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for por_que-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8fed7de1a094c298cb0b99f521d401e20607a51b92c337dd496319eae732bf65`
MD5	`866f82ffc48d8a8c25c635891ae8768d`
BLAKE2b-256	`13bfda4570ce34350a84adbd9bea79efe99519db8a56837f755e4f0be56c34c2`

Provenance

The following attestation bundles were made for por_que-0.2.0-py3-none-any.whl:

Publisher: release.yml on jkeifer/por-que

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: por_que-0.2.0-py3-none-any.whl
- Subject digest: 8fed7de1a094c298cb0b99f521d401e20607a51b92c337dd496319eae732bf65
- Sigstore transparency entry: 626175839
- Sigstore integration time: Oct 21, 2025
Source repository:
- Permalink: jkeifer/por-que@81194a79ecaf0725595fc7d193976c436556b66e
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/jkeifer
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@81194a79ecaf0725595fc7d193976c436556b66e
- Trigger Event: release
