Inspect the on-disk layout and metadata of Parquet files.
Parquet Analyzer
A Python tool for deep inspection and analysis of Apache Parquet files, providing detailed insights into file structure, metadata, and binary layout.
Installation
pip install parquet-analyzer
To work from a local clone instead, install in editable mode:
pip install -e .
Requirements
- Python 3.11+
- thrift>=0.16 (installed automatically)
Usage
Basic usage
# Analyze a Parquet file and emit the JSON summary/footer/pages bundle
parquet-analyzer example.parquet
# Show raw segment structures (offsets, lengths, thrift payloads)
parquet-analyzer --output-mode segments example.parquet
# Generate an interactive HTML report and save it to disk
parquet-analyzer --output-mode html -o report.html example.parquet
# Enable debug logging while running any mode
parquet-analyzer --log-level DEBUG example.parquet
# Run via python -m if the console script is unavailable
python -m parquet_analyzer example.parquet
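The commands above can also be driven from a script. The following is a minimal sketch, not part of the package: build_command and analyze are hypothetical helper names, and the sketch assumes the JSON payload is written to stdout while log messages go elsewhere. Only the flags shown in this README are used.

```python
import json
import subprocess
import sys


def build_command(path, output_mode=None, log_level=None, use_module=False):
    """Build the argument list for one parquet-analyzer run (hypothetical helper)."""
    # Fall back to `python -m parquet_analyzer` when the console script is unavailable.
    cmd = [sys.executable, "-m", "parquet_analyzer"] if use_module else ["parquet-analyzer"]
    if output_mode is not None:
        cmd += ["--output-mode", output_mode]
    if log_level is not None:
        cmd += ["--log-level", log_level]
    cmd.append(path)
    return cmd


def analyze(path, **options):
    """Run the tool and parse its output.

    Assumption: the JSON document is the only thing printed on stdout.
    """
    result = subprocess.run(
        build_command(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

With the package installed, analyze("example.parquet")["summary"] would then return the summary section described under Output Formats, provided the stdout assumption holds.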
Output Formats
Standard output (--output-mode default)
The default output provides a structured JSON payload with three main sections:
Summary statistics
{
"summary": {
"num_rows": 10,
"num_row_groups": 1,
"num_columns": 2,
"num_pages": 2,
"num_data_pages": 2,
"num_v1_data_pages": 2,
"num_v2_data_pages": 0,
"num_dict_pages": 0,
"page_header_size": 47,
"uncompressed_page_data_size": 130,
"compressed_page_data_size": 96,
"uncompressed_page_size": 177,
"compressed_page_size": 143,
"column_index_size": 48,
"offset_index_size": 23,
"bloom_filter_size": 0,
"footer_size": 527,
"file_size": 753
}
}
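The size fields in this sample are related by simple arithmetic, which makes the summary handy for sanity checks. The sketch below tests a few identities that hold for the numbers shown above. They are inferred from this one sample rather than documented guarantees, so a file with padding or unrecognized regions may not satisfy them. The two 4-byte constants come from the Parquet format itself: the "PAR1" magic at both ends of the file and the footer length stored before the trailing magic.

```python
SUMMARY = {
    "num_pages": 2,
    "num_data_pages": 2,
    "num_v1_data_pages": 2,
    "num_v2_data_pages": 0,
    "num_dict_pages": 0,
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "bloom_filter_size": 0,
    "footer_size": 527,
    "file_size": 753,
}

MAGIC_SIZE = 4          # "PAR1" at the start and again at the end of the file
FOOTER_LENGTH_SIZE = 4  # footer length stored just before the trailing magic


def check_summary(s):
    """Return a dict of named consistency checks (inferred from the sample)."""
    accounted = (
        MAGIC_SIZE
        + s["compressed_page_size"]
        + s["column_index_size"]
        + s["offset_index_size"]
        + s["bloom_filter_size"]
        + s["footer_size"]
        + FOOTER_LENGTH_SIZE
        + MAGIC_SIZE
    )
    return {
        # Page size = header bytes + page data bytes.
        "uncompressed_page_size": s["page_header_size"] + s["uncompressed_page_data_size"]
        == s["uncompressed_page_size"],
        "compressed_page_size": s["page_header_size"] + s["compressed_page_data_size"]
        == s["compressed_page_size"],
        # Page counts split into data and dictionary pages, and v1/v2 data pages.
        "page_counts": s["num_data_pages"] + s["num_dict_pages"] == s["num_pages"],
        "data_page_versions": s["num_v1_data_pages"] + s["num_v2_data_pages"]
        == s["num_data_pages"],
        # Every byte of the file is accounted for by a known region.
        "file_size": accounted == s["file_size"],
    }


checks = check_summary(SUMMARY)
ratio = SUMMARY["compressed_page_data_size"] / SUMMARY["uncompressed_page_data_size"]

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'MISMATCH'}")
print(f"compression ratio: {ratio:.3f}")
```

A failing file_size check is not necessarily an error. It only means some bytes fall outside the regions the summary counts, which is a cue to look at the segments output.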
Footer metadata
Complete Parquet file metadata including:
- Schema definition with column types and repetition levels
- Row group information
- Column chunk metadata
- Encoding and compression details
Page information
Detailed breakdown of all pages organized by column:
- Data pages with encoding and statistics
- Dictionary pages
- Column indexes
- Offset indexes
- Bloom filters
Detailed segments (--output-mode segments)
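With --output-mode segments, the tool outputs a segment-by-segment breakdown of the file. Each segment records its byte offset, length, name, and decoded value, and nested structures appear as a list of child segments under "value":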
[
{
"offset": 0,
"length": 4,
"name": "magic_number",
"value": "PAR1"
},
{
"offset": 4,
"length": 24,
"name": "page",
"value": [
{
"offset": 5,
"length": 1,
"name": "type",
"value": 0,
"metadata": {
"type": "i32",
"enum_type": "PageType",
"enum_name": "DATA_PAGE"
}
}
]
}
]
This mode is useful for:
- Debugging Parquet file corruption
- Understanding exact binary layout
- Analyzing file format compliance
- Optimizing file structure
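For tasks like these it often helps to flatten the nested segment tree into rows. The sketch below does that for the sample shown earlier. It assumes that child segments always appear as a list of objects under "value" and that enum names live under "metadata", as in that sample. The flatten function is a hypothetical helper, not part of the tool.

```python
import json

# The sample from above, as the tool would print it in segments mode.
SAMPLE = json.loads("""
[
  {"offset": 0, "length": 4, "name": "magic_number", "value": "PAR1"},
  {"offset": 4, "length": 24, "name": "page", "value": [
    {"offset": 5, "length": 1, "name": "type", "value": 0,
     "metadata": {"type": "i32", "enum_type": "PageType", "enum_name": "DATA_PAGE"}}
  ]}
]
""")


def flatten(segments, depth=0):
    """Yield (depth, start, end, name, display) for every segment, depth-first."""
    for seg in segments:
        value = seg.get("value")
        # Assumption: a list of dicts under "value" means nested child segments.
        is_container = isinstance(value, list) and all(isinstance(v, dict) for v in value)
        if is_container:
            display = None
        else:
            # Prefer the symbolic enum name when the tool provides one.
            display = seg.get("metadata", {}).get("enum_name", value)
        yield depth, seg["offset"], seg["offset"] + seg["length"], seg["name"], display
        if is_container:
            yield from flatten(value, depth + 1)


rows = list(flatten(SAMPLE))
for depth, start, end, name, display in rows:
    suffix = "" if display is None else f" = {display}"
    print(f"{'  ' * depth}[{start}, {end}) {name}{suffix}")
```

Printing half-open byte ranges makes gaps and overlaps between neighboring segments easy to spot when checking a suspect file.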
HTML report (--output-mode html)
Emits a standalone HTML document with collapsible sections for summary statistics, schema, key-value metadata, row groups, aggregated column statistics, segments, and (optionally) the raw footer. Use the --html-sections flag to control which sections are rendered:
parquet-analyzer --output-mode html \
--html-sections summary schema columns \
-o report.html \
example.parquet
Technical details
The tool uses a custom Thrift protocol implementation (OffsetRecordingProtocol) that wraps the standard Thrift compact protocol to track byte offsets and lengths of all decoded structures. This enables precise mapping of logical Parquet structures to their binary representation.
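The idea of recording offsets while decoding can be illustrated without Thrift. The sketch below is not the tool's OffsetRecordingProtocol. It is a hypothetical, much smaller reader that decodes one zigzag varint the way the Thrift compact protocol encodes an i32, and notes where the value started and how many bytes it took. The two input bytes are an assumed beginning of a page header placed at file offset 4, chosen to reproduce the "type" segment in the sample above.

```python
class OffsetRecordingReader:
    """Decode compact-protocol integers and remember where each one lived."""

    def __init__(self, data, base_offset=0):
        self.data = data
        self.pos = 0
        self.base = base_offset  # file offset of data[0]
        self.segments = []

    def read_byte(self):
        byte = self.data[self.pos]
        self.pos += 1
        return byte

    def read_varint(self):
        # ULEB128: 7 payload bits per byte, high bit set on all but the last byte.
        result = shift = 0
        while True:
            byte = self.read_byte()
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result
            shift += 7

    def read_i32(self, name):
        start = self.pos
        raw = self.read_varint()
        value = (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding
        self.segments.append(
            {
                "offset": self.base + start,
                "length": self.pos - start,
                "name": name,
                "value": value,
            }
        )
        return value


# 0x15 is a compact field header: field id delta 1, type 5 (i32).
# 0x00 is the zigzag varint for 0, the enum value of DATA_PAGE.
reader = OffsetRecordingReader(bytes([0x15, 0x00]), base_offset=4)
header = reader.read_byte()
field_delta, type_id = header >> 4, header & 0x0F
page_type = reader.read_i32("type")
print(reader.segments)
```

Wrapping every read in the same start/end bookkeeping is what allows each decoded field to be mapped back to an exact byte range in the file.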
Development
Environment setup
# Quote the extra so shells such as zsh do not expand the brackets
pip install -e ".[dev]"
hatch run dev:lint
hatch run dev:test
hatch run dev:test-cov
# Or run everything at once
hatch run dev:check
The development extra pulls in tooling (hatch, ruff, pytest) and pyarrow so tests can generate Parquet fixtures on the fly.
Regenerating Thrift bindings
The Python modules in src/parquet are generated from parquet.thrift.
- Install the Apache Thrift compiler (brew install thrift on macOS, or download a release from the Apache Thrift project).
- From the repository root, regenerate everything in one step:
hatch run dev:update-thrift
This refreshes parquet.thrift, runs the compiler, and removes any stray src/__init__.py the compiler may create.
Contributing
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
License
Released under the MIT License.