Inspect the on-disk layout and metadata of Parquet files.

Parquet Analyzer

A Python tool for deep inspection and analysis of Apache Parquet files, providing detailed insights into file structure, metadata, and binary layout.

Features

File Structure Analysis: Parse and visualize the complete binary structure of Parquet files
Metadata Inspection: Extract and display schema, row group, and column metadata
Page-Level Details: Analyze data pages, dictionary pages, and their headers
Offset Tracking: Show exact byte offsets and lengths of all file components
Statistics Summary: Generate comprehensive file statistics and size breakdowns
Thrift Protocol Support: Deep dive into Thrift-encoded metadata structures

Installation

pip install parquet-analyzer

To work from a local clone instead, install in editable mode:

pip install -e .

Requirements

Python 3.8+
thrift>=0.16 (installed automatically)

Usage

Basic Usage

# Analyze a Parquet file and get summary information
parquet-analyzer example.parquet

# Show detailed offset and Thrift structure information
parquet-analyzer -s example.parquet

# Enable debug logging
parquet-analyzer --log-level DEBUG example.parquet

# Run via python -m if the console script is unavailable
python -m parquet_analyzer example.parquet
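
To drive the tool from a script, one option is to shell out to the console script and parse what it prints. The sketch below uses only the flags documented here; the helper names `build_command` and `analyze` are hypothetical, and `analyze` assumes that stdout contains nothing but the JSON document (logging is assumed to go elsewhere).

```python
import json
import subprocess
from typing import List, Optional


def build_command(path: str, show_offsets: bool = False,
                  log_level: Optional[str] = None) -> List[str]:
    """Assemble the argv list from the documented command line options."""
    cmd = ["parquet-analyzer"]
    if show_offsets:
        cmd.append("-s")
    if log_level is not None:
        cmd += ["--log-level", log_level]
    cmd.append(path)
    return cmd


def analyze(path: str, **kwargs):
    """Run the CLI and parse stdout as JSON.

    Assumption: stdout is a single JSON document. Raises
    subprocess.CalledProcessError if the tool exits with a non-zero status.
    """
    result = subprocess.run(build_command(path, **kwargs),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

For example, `analyze("example.parquet", show_offsets=True)` would run the same command as the `-s` example above.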

Command Line Options

parquet_file: Path to the Parquet file to analyze (required)
-s, --show-offsets-and-thrift-details: Show detailed byte offsets and Thrift structure information
--log-level LOG_LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)

Output Formats

Standard Output (Default)

The default output provides a structured JSON view with three main sections:

1. Summary Statistics

{
  "summary": {
    "num_rows": 10,
    "num_row_groups": 1,
    "num_columns": 2,
    "num_pages": 2,
    "num_data_pages": 2,
    "num_v1_data_pages": 2,
    "num_v2_data_pages": 0,
    "num_dict_pages": 0,
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "bloom_filter_size": 0,
    "footer_size": 527,
    "file_size": 753
  }
}

2. Footer Metadata

Complete Parquet file metadata including:

Schema definition with column types and repetition levels
Row group information
Column chunk metadata
Encoding and compression details

3. Page Information

Detailed breakdown of all pages organized by column and row group:

Data pages with encoding and statistics
Dictionary pages
Column indexes
Offset indexes
Bloom filters

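Detailed Output (-s flag)

With the -s flag, the tool outputs a segment-by-segment breakdown of the file. Each segment records its byte offset, length, name, and value, and composite segments nest their fields under "value":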

[
  {
    "offset": 0,
    "length": 4,
    "name": "magic_number",
    "value": "PAR1"
  },
  {
    "offset": 4,
    "length": 24,
    "name": "page",
    "value": [
      {
        "offset": 5,
        "length": 1,
        "name": "type",
        "value": 0,
        "metadata": {
          "type": "i32",
          "enum_type": "PageType",
          "enum_name": "DATA_PAGE"
        }
      }
    ]
  }
]

This mode is useful for:

Debugging Parquet file corruption
Understanding exact binary layout
Analyzing file format compliance
Optimizing file structure
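
The nested segment list can be flattened for quick inspection. The following is a minimal sketch that assumes every segment has the "offset", "length", "name", and "value" keys shown in the example, and that nested segments appear as a list of such objects under "value". The function name `walk_segments` is hypothetical and not part of the tool.

```python
from typing import Any, Dict, Iterator, List, Tuple

# The example segments shown above, as loaded by json.loads
SAMPLE = [
    {"offset": 0, "length": 4, "name": "magic_number", "value": "PAR1"},
    {"offset": 4, "length": 24, "name": "page", "value": [
        {"offset": 5, "length": 1, "name": "type", "value": 0,
         "metadata": {"type": "i32", "enum_type": "PageType",
                      "enum_name": "DATA_PAGE"}},
    ]},
]


def walk_segments(segments: List[Dict[str, Any]],
                  depth: int = 0) -> Iterator[Tuple[int, int, int, str]]:
    """Yield (depth, offset, length, name) for every segment, depth-first."""
    for seg in segments:
        yield depth, seg["offset"], seg["length"], seg["name"]
        value = seg.get("value")
        # Recurse only when "value" is a list of segment-like objects
        if isinstance(value, list) and all(
                isinstance(v, dict) and "offset" in v for v in value):
            yield from walk_segments(value, depth + 1)


for depth, offset, length, name in walk_segments(SAMPLE):
    print(f"{'  ' * depth}{offset:>6} {length:>6}  {name}")
```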

Understanding the Output

File Structure Components

Magic Numbers: PAR1 headers at file start and end
Page Headers: Thrift-encoded metadata for each data/dictionary page
Page Data: Compressed/uncompressed column data
Column Indexes: Statistics for data pages (optional)
Offset Indexes: Byte offsets for data pages (optional)
Bloom Filters: Bloom filter data for columns (optional)
Footer: File metadata including schema and row group information
Footer Length: 4-byte little-endian footer size
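
The last three components can be located without any Thrift decoding, because a Parquet file ends with the footer, the 4-byte little-endian footer length, and the trailing PAR1 magic. The sketch below is a standalone illustration using only the standard library; `read_footer_location` is a hypothetical helper, not a function exported by the tool, and it does not handle files with an encrypted footer.

```python
import io
import struct

MAGIC = b"PAR1"


def read_footer_location(f):
    """Return (footer_offset, footer_length) for a seekable binary file object."""
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    # Smallest possible layout: magic (4) + footer length (4) + magic (4)
    if file_size < 12:
        raise ValueError("too small to be a Parquet file")
    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing leading PAR1 magic")
    f.seek(file_size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    # The footer length is an unsigned 32-bit little-endian integer
    (footer_length,) = struct.unpack("<I", tail[:4])
    footer_offset = file_size - 8 - footer_length
    if footer_offset < 4:
        raise ValueError("footer length exceeds file size")
    return footer_offset, footer_length
```

Applied to the sample summary shown earlier (file_size 753, footer_size 527), the same arithmetic places the footer at offset 753 - 8 - 527 = 218.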

Statistics Explained

num_rows: Total number of rows across all row groups
num_row_groups: Number of row groups in the file
num_columns: Number of columns in the schema
num_pages: Total pages (data + dictionary)
num_v1_data_pages: Data pages using format v1
num_v2_data_pages: Data pages using format v2
page_header_size: Total bytes used by page headers
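compressed_page_size: Total size of all pages as stored: page headers plus compressed page data (143 = 47 + 96 in the sample summary)
uncompressed_page_size: Total size of all pages after decompression: page headers plus uncompressed page data (177 = 47 + 130 in the sample summary)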
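
The sample summary shown earlier is internally consistent, and the relationships between its fields can be checked mechanically. The identities below are inferred from that one sample rather than documented guarantees of the tool; in particular, the file size identity assumes there are no unaccounted gaps or padding between components. `check_summary` is a hypothetical helper.

```python
# Values copied from the sample summary (fields not used by the checks are omitted)
SUMMARY = {
    "num_pages": 2,
    "num_data_pages": 2,
    "num_v1_data_pages": 2,
    "num_v2_data_pages": 0,
    "num_dict_pages": 0,
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "footer_size": 527,
    "file_size": 753,
}


def check_summary(s):
    """Return a dict mapping each inferred identity to True or False."""
    # Leading PAR1 (4) + trailing PAR1 (4) + 4-byte footer length
    fixed_bytes = 4 + 4 + 4
    # Sum any Bloom filter size fields, matched by key prefix
    bloom = sum(v for k, v in s.items() if k.startswith("bloom_"))
    return {
        "pages": s["num_pages"] == s["num_data_pages"] + s["num_dict_pages"],
        "data_pages": s["num_data_pages"]
        == s["num_v1_data_pages"] + s["num_v2_data_pages"],
        "compressed": s["compressed_page_size"]
        == s["page_header_size"] + s["compressed_page_data_size"],
        "uncompressed": s["uncompressed_page_size"]
        == s["page_header_size"] + s["uncompressed_page_data_size"],
        "file_size": s["file_size"]
        == s["compressed_page_size"] + s["column_index_size"]
        + s["offset_index_size"] + bloom + s["footer_size"] + fixed_bytes,
    }


print(check_summary(SUMMARY))
```

A failing file_size check on a real file would not necessarily indicate corruption; it may simply mean the file contains bytes that fall outside these categories.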

Technical Details

Architecture

The tool uses a custom Thrift protocol implementation (OffsetRecordingProtocol) that wraps the standard Thrift compact protocol to track byte offsets and lengths of all decoded structures. This enables precise mapping of logical Parquet structures to their binary representation.
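
The idea of recording offsets while decoding can be illustrated without the Thrift library. The sketch below is not the tool's OffsetRecordingProtocol; `OffsetRecordingReader` is a hypothetical, heavily simplified stand-in that decodes only compact-protocol i32 values (zigzag-encoded varints) and notes where each one started and how many bytes it used.

```python
import io
from typing import List, Tuple


class OffsetRecordingReader:
    """Decode compact-protocol i32 values and record their byte positions."""

    def __init__(self, data: bytes, base_offset: int = 0):
        self._buf = io.BytesIO(data)
        self._base = base_offset  # file offset of data[0]
        self.records: List[Tuple[str, int, int, int]] = []

    def skip_field_header(self) -> None:
        """Skip a short-form field header (one byte: id delta and type)."""
        self._buf.read(1)

    def _read_varint(self) -> int:
        result, shift = 0, 0
        while True:
            raw = self._buf.read(1)
            if not raw:
                raise EOFError("truncated varint")
            result |= (raw[0] & 0x7F) << shift
            if not raw[0] & 0x80:  # high bit clear marks the last byte
                return result
            shift += 7

    def read_i32(self, name: str) -> int:
        start = self._buf.tell()
        raw = self._read_varint()
        value = (raw >> 1) ^ -(raw & 1)  # zigzag decode
        length = self._buf.tell() - start
        self.records.append((name, self._base + start, length, value))
        return value


# 0x15 is a field header (id delta 1, type i32); 0x00 is zigzag-encoded 0
reader = OffsetRecordingReader(bytes([0x15, 0x00]), base_offset=4)
reader.skip_field_header()
reader.read_i32("type")
print(reader.records)
```

With a base offset of 4, the recorded position of the value is offset 5 with length 1, which is consistent with the "type" segment in the detailed output example.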

Key Components

OffsetRecordingProtocol: Tracks byte positions during Thrift deserialization
TFileTransport: File-based transport supporting seeking and offset tracking
Segment Creation: Converts offset information into structured output
Gap Filling: Identifies unknown or unaccounted byte ranges
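
Gap filling can be sketched as a single pass over the known segments sorted by offset. This is an illustration of the general technique, not the tool's implementation; `find_gaps` is a hypothetical name, and segments are simplified to (offset, length) pairs.

```python
from typing import List, Tuple


def find_gaps(segments: List[Tuple[int, int]],
              file_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges not covered by any known segment."""
    gaps = []
    cursor = 0  # first byte not yet accounted for
    for offset, length in sorted(segments):
        if offset > cursor:
            gaps.append((cursor, offset - cursor))
        # max() keeps the cursor correct if segments overlap or nest
        cursor = max(cursor, offset + length)
    if cursor < file_size:
        gaps.append((cursor, file_size - cursor))
    return gaps


print(find_gaps([(0, 4), (4, 24), (40, 10)], file_size=60))
```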

Supported Parquet Features

All Parquet data types (primitive and logical)
Compression codecs
Encoding types
Page formats (v1 and v2)
Column indexes and offset indexes
Bloom filters
Nested schemas

Use Cases

Performance Analysis

Identify compression efficiency across columns
Analyze page sizes and distribution
Understand storage overhead from metadata
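
At the file level, two of these measurements can be derived directly from the summary section. The sketch below uses the field names from the sample summary; `storage_metrics` is a hypothetical helper, and what counts as "metadata" here (page headers, column and offset indexes, footer) is an assumption made for illustration. Per-column figures would require the footer metadata and page sections, whose exact layout is not shown in this document.

```python
def storage_metrics(s):
    """Compute file-level compression ratio and metadata share from a summary dict."""
    compressed = s["compressed_page_data_size"]
    ratio = s["uncompressed_page_data_size"] / compressed if compressed else None
    metadata_bytes = (s["page_header_size"] + s["column_index_size"]
                      + s["offset_index_size"] + s["footer_size"])
    return {
        "compression_ratio": ratio,
        "metadata_fraction": metadata_bytes / s["file_size"],
    }


# Values taken from the sample summary
sample = {
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "column_index_size": 48,
    "offset_index_size": 23,
    "footer_size": 527,
    "file_size": 753,
}
print(storage_metrics(sample))
```

For the 10-row sample, metadata accounts for 645 of the 753 bytes, which illustrates how fixed overhead dominates very small files.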

File Debugging

Locate corrupted segments
Verify file format compliance
Analyze encoding choices

Schema Evolution

Compare file structures across versions
Understand metadata changes
Analyze backward compatibility

Storage Optimization

Identify opportunities for better compression
Analyze row group sizing
Optimize column ordering

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

Released under the MIT License.

Related Projects

Apache Parquet - The Apache Parquet file format
parquet-python - Python Parquet libraries
parquet-tools - Official Parquet command-line tools

Release history

0.3.0: Nov 5, 2025
0.2.1: Nov 4, 2025
0.2.0: Nov 4, 2025
0.2.0.dev0 (pre-release): Nov 4, 2025
0.1.0: Nov 1, 2025 (this version)

Download files

Source Distribution

parquet_analyzer-0.1.0.tar.gz (52.1 kB)
Upload date: Nov 1, 2025
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

Algorithm	Hash digest
SHA256	`b3536bb3c0f2d6654100a84f586c4891837d096c07a8278840fc7dd0ca272c91`
MD5	`ae20fbd52ab7b065f7c4671d544576fb`
BLAKE2b-256	`ba0db5744cd0fa108c616b1f76ba02ce976b6c4593f22d22899db3388ae284cd`

Built Distribution

parquet_analyzer-0.1.0-py3-none-any.whl (38.1 kB)
Upload date: Nov 1, 2025
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

Algorithm	Hash digest
SHA256	`976e00dd76a2ec5d07fe6718290d370dd28333ac0e22283f4fe8cf242c563694`
MD5	`1857427aa998f7835b9327e5f6cf7ecd`
BLAKE2b-256	`9f2fb79ec381024466e41b821e034031cb74f68152672da27729f6e9f9518bf5`
