
Inspect the on-disk layout and metadata of Parquet files.


Parquet Analyzer

A Python tool for deep inspection and analysis of Apache Parquet files, providing detailed insights into file structure, metadata, and binary layout.

For an example interactive HTML report generated by this tool, see: https://clee704.github.io/parquet-analyzer/examples/example.html.

Installation

pip install parquet-analyzer

Requirements

  • Python 3.11+

Usage

Basic usage

# Analyze a Parquet file and emit the JSON summary/footer/pages bundle
parquet-analyzer example.parquet

# Show raw segment structures (offsets, lengths, thrift payloads)
parquet-analyzer --output-mode segments example.parquet

# Generate an interactive HTML report and save it to disk
parquet-analyzer --output-mode html -o report.html example.parquet

# Generate an HTML report with selected sections only
parquet-analyzer --output-mode html \
  --html-sections summary schema key-value-metadata row-groups columns segments \
  -o report.html example.parquet

# Enable debug logging while running any mode
parquet-analyzer --log-level DEBUG example.parquet

# Run via python -m if the console script is unavailable
python -m parquet_analyzer example.parquet
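
The same invocations can be driven from a Python script. The following is a minimal sketch, not part of the package: `build_command` and `analyze` are hypothetical helper names, and the sketch assumes the default mode writes its JSON payload to standard output. Only flags shown above are used.

```python
import json
import subprocess
import sys


def build_command(path, output_mode=None, log_level=None):
    """Build the argv list for `python -m parquet_analyzer` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "parquet_analyzer"]
    if output_mode is not None:
        cmd += ["--output-mode", output_mode]
    if log_level is not None:
        cmd += ["--log-level", log_level]
    cmd.append(path)
    return cmd


def analyze(path):
    """Run the analyzer in default mode and parse its output.

    Assumption: the JSON bundle is printed to stdout when -o is not given.
    """
    result = subprocess.run(
        build_command(path), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Show the command that would be run for segments mode.
    print(build_command("example.parquet", output_mode="segments"))
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.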

Output Formats

Standard output (--output-mode default)

The default output provides a structured JSON payload with three main sections:

Summary statistics

{
  "summary": {
    "num_rows": 10,
    "num_row_groups": 1,
    "num_columns": 2,
    "num_pages": 2,
    "num_data_pages": 2,
    "num_v1_data_pages": 2,
    "num_v2_data_pages": 0,
    "num_dict_pages": 0,
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "bloom_filter_size": 0,
    "footer_size": 527,
    "file_size": 753
  }
}
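
The size fields in this example are consistent with one another, which gives a quick sanity check when reading a summary. The sketch below works through the arithmetic in Python. It assumes the file consists only of pages, indexes, Bloom filters, the footer, and the fixed Parquet framing (two 4-byte `PAR1` magic numbers plus a 4-byte footer length). That holds for the numbers above, but it is an assumption about how the tool accounts for bytes, not documented behavior.

```python
# Values copied from the example summary above.
summary = {
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "bloom_filter_size": 0,
    "footer_size": 527,
    "file_size": 753,
}

# Leading "PAR1" (4 bytes) + footer length (4 bytes) + trailing "PAR1" (4 bytes).
FRAMING_BYTES = 4 + 4 + 4


def accounted_bytes(s):
    """Sum every region the summary reports, plus the fixed framing (assumed model)."""
    return (
        s["compressed_page_size"]
        + s["column_index_size"]
        + s["offset_index_size"]
        + s["bloom_filter_size"]
        + s["footer_size"]
        + FRAMING_BYTES
    )


# Page sizes are header bytes plus page data bytes.
assert summary["page_header_size"] + summary["compressed_page_data_size"] == summary["compressed_page_size"]
assert summary["page_header_size"] + summary["uncompressed_page_data_size"] == summary["uncompressed_page_size"]

# 143 + 48 + 23 + 0 + 527 + 12 = 753, matching file_size in this example.
assert accounted_bytes(summary) == summary["file_size"]

# Share of the file taken by the footer.
print(f"footer share: {summary['footer_size'] / summary['file_size']:.1%}")
```

In a file this small the footer dominates. For larger files the same ratios show whether page data, indexes, or metadata account for most of the bytes.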

Footer metadata

Complete Parquet file metadata including:

  • Schema definition with column types and repetition levels
  • Row group information
  • Column chunk metadata
  • Encoding and compression details

Page information

Detailed breakdown of all pages organized by column:

  • Data pages with encoding and statistics
  • Dictionary pages
  • Column indexes
  • Offset indexes
  • Bloom filters

Segments (--output-mode segments)

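When using --output-mode segments, the tool outputs a segment-by-segment breakdown of the file. Each segment records its byte offset, length, name, and decoded value, and a segment's value may itself be a list of nested segments: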

[
  {
    "offset": 0,
    "length": 4,
    "name": "magic_number",
    "value": "PAR1"
  },
  {
    "offset": 4,
    "length": 24,
    "name": "page",
    "value": [
      {
        "offset": 5,
        "length": 1,
        "name": "type",
        "value": 0,
        "metadata": {
          "type": "i32",
          "enum_type": "PageType",
          "enum_name": "DATA_PAGE"
        }
      }
    ]
  }
]

This mode is useful for:

  • Understanding exact binary layout
  • Analyzing file format compliance
  • Optimizing file structure
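
Because segments nest, a small recursive walk is a convenient way to turn the output into a flat offset table. The sketch below runs on the abbreviated example above. The helper name `walk` is hypothetical, and the sketch assumes that a `value` holding a list of objects always means nested segments.

```python
import json

# The abbreviated segments example from above.
segments_json = """
[
  {"offset": 0, "length": 4, "name": "magic_number", "value": "PAR1"},
  {"offset": 4, "length": 24, "name": "page", "value": [
    {"offset": 5, "length": 1, "name": "type", "value": 0,
     "metadata": {"type": "i32", "enum_type": "PageType", "enum_name": "DATA_PAGE"}}
  ]}
]
"""


def walk(segments, depth=0):
    """Yield (depth, offset, length, name) for every segment, parents first."""
    for seg in segments:
        yield depth, seg["offset"], seg["length"], seg["name"]
        value = seg.get("value")
        # Assumption: a list of dicts is a list of child segments;
        # anything else (string, number, ...) is a leaf value.
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            yield from walk(value, depth + 1)


rows = list(walk(json.loads(segments_json)))
for depth, offset, length, name in rows:
    # Indent by nesting depth so the hierarchy stays visible.
    print(f"{offset:>8} {length:>8}  {'  ' * depth}{name}")
```

The flattened rows can be sorted or summed by name to see which structures occupy which byte ranges.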

HTML report (--output-mode html)

Emits a standalone HTML document with collapsible sections for summary statistics, schema, key-value metadata, row groups, aggregated column statistics, segments, and the raw footer. Use the --html-sections flag to control which sections are rendered:

parquet-analyzer --output-mode html \
  --html-sections summary schema key-value-metadata row-groups columns segments \
  -o report.html \
  example.parquet

Example: https://clee704.github.io/parquet-analyzer/examples/example.html

Technical details

The tool uses a custom Thrift protocol implementation (OffsetRecordingProtocol) that wraps the standard Thrift compact protocol to track byte offsets and lengths of all decoded structures. This enables precise mapping of logical Parquet structures to their binary representation.
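
To illustrate the idea of offset recording, here is a toy sketch that is not the tool's `OffsetRecordingProtocol`. It decodes zigzag varints, the integer encoding used by the Thrift compact protocol, and logs the byte offset and length of each decoded value. The class and field names (`OffsetRecordingReader`, `Recorded`) are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Recorded:
    name: str
    offset: int  # byte position where the value starts
    length: int  # number of bytes the encoding occupied
    value: int


class OffsetRecordingReader:
    """Toy reader: decodes integers and remembers where each one came from."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0
        self.records: list[Recorded] = []

    def _read_varint(self) -> int:
        # ULEB128: 7 payload bits per byte, high bit set means "more bytes follow".
        result = 0
        shift = 0
        while True:
            byte = self.data[self.pos]
            self.pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result
            shift += 7

    def read_zigzag_int(self, name: str) -> int:
        start = self.pos
        raw = self._read_varint()
        # Zigzag decoding maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
        value = (raw >> 1) ^ -(raw & 1)
        self.records.append(Recorded(name, start, self.pos - start, value))
        return value


reader = OffsetRecordingReader(bytes([0x02, 0xAC, 0x02, 0x01]))
reader.read_zigzag_int("a")
reader.read_zigzag_int("b")
reader.read_zigzag_int("c")
for rec in reader.records:
    print(rec)
```

Wrapping the read path in this way means the decoding logic does not change. The wrapper only notes the position before and after each read, which is what makes the offset and length fields in the segments output possible.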

Development

Environment setup

pip install -e ".[dev]"  # quoted so shells such as zsh do not expand the brackets
hatch run dev:check  # formats, lints, type-checks, and runs tests with coverage

The development extra pulls in tooling (hatch, ruff, pytest) and pyarrow so tests can generate Parquet fixtures on the fly.

Regenerating Thrift bindings

The Python modules in src/parquet are generated from parquet.thrift.

  1. Install the Apache Thrift compiler (brew install thrift on macOS, or download a release from the Apache Thrift project).

  2. From the repository root, regenerate everything in one step:

    hatch run dev:update-thrift

    This refreshes parquet.thrift, runs the compiler, and removes any stray src/__init__.py the compiler may create.

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the Apache License 2.0.

© 2025 Chungmin Lee
