Inspect the on-disk layout and metadata of Parquet files.

Parquet Analyzer

A Python tool for deep inspection and analysis of Apache Parquet files, providing detailed insights into file structure, metadata, and binary layout.

Installation

pip install parquet-analyzer

To work from a local clone instead, install in editable mode:

pip install -e .

Requirements

  • Python 3.11+
  • thrift>=0.16 (installed automatically)

Usage

Basic usage

# Analyze a Parquet file and get summary information
parquet-analyzer example.parquet

# Divide the file into segments and show detailed offset and Thrift structure information for each segment
parquet-analyzer -s example.parquet

# Enable debug logging
parquet-analyzer --log-level DEBUG example.parquet

# Run via python -m if the console script is unavailable
python -m parquet_analyzer example.parquet

Output Formats

Standard output (default)

The default output provides a structured JSON view with three main sections:

Summary statistics

{
  "summary": {
    "num_rows": 10,
    "num_row_groups": 1,
    "num_columns": 2,
    "num_pages": 2,
    "num_data_pages": 2,
    "num_v1_data_pages": 2,
    "num_v2_data_pages": 0,
    "num_dict_pages": 0,
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "bloom_fitler_size": 0,
    "footer_size": 527,
    "file_size": 753
  }
}
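The fields in the sample summary can be cross-checked against each other. The sketch below is not part of the tool. It assumes, based only on the sample values above, that each page size is the header size plus the page data size, and that footer_size excludes the two 4-byte "PAR1" magic markers and the 4-byte footer length that the Parquet format stores at the end of the file. The helper names are hypothetical, and the bloom filter key is spelled as it appears in the sample output.

```python
# Values copied from the sample summary above.
summary = {
    "page_header_size": 47,
    "uncompressed_page_data_size": 130,
    "compressed_page_data_size": 96,
    "uncompressed_page_size": 177,
    "compressed_page_size": 143,
    "column_index_size": 48,
    "offset_index_size": 23,
    "bloom_fitler_size": 0,  # key spelled as in the sample output
    "footer_size": 527,
    "file_size": 753,
}

MAGIC_SIZE = 4          # "PAR1" at the start and again at the end of the file
FOOTER_LENGTH_SIZE = 4  # little-endian footer length stored before the trailing magic


def page_sizes_consistent(s):
    """Check that page size == header size + page data size (assumed relationship)."""
    header = s["page_header_size"]
    return (
        header + s["uncompressed_page_data_size"] == s["uncompressed_page_size"]
        and header + s["compressed_page_data_size"] == s["compressed_page_size"]
    )


def accounted_bytes(s):
    """Sum every region the summary reports, plus the fixed framing bytes."""
    return (
        MAGIC_SIZE
        + s["compressed_page_size"]
        + s["column_index_size"]
        + s["offset_index_size"]
        + s["bloom_fitler_size"]
        + s["footer_size"]
        + FOOTER_LENGTH_SIZE
        + MAGIC_SIZE
    )


print(page_sizes_consistent(summary))
print(accounted_bytes(summary) == summary["file_size"])
```

If the accounted total differs from file_size for a given file, the difference points at bytes that no reported structure covers, which may be worth inspecting with the -s flag.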

Footer metadata

Complete Parquet file metadata including:

  • Schema definition with column types and repetition levels
  • Row group information
  • Column chunk metadata
  • Encoding and compression details
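For orientation, the footer can be located without any Parquet library. In a Parquet file with an unencrypted footer, the last 8 bytes are a 4-byte little-endian footer length followed by the "PAR1" magic. The following is a minimal sketch of that lookup, not the tool's own code; read_footer_location is a hypothetical helper name.

```python
import io
import struct

MAGIC = b"PAR1"


def read_footer_location(f):
    """Return (footer_offset, footer_length) for a seekable binary file object."""
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    # Smallest possible file: leading magic + footer length + trailing magic.
    if file_size < 12:
        raise ValueError("too small to be a Parquet file")

    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing leading magic number")

    f.seek(file_size - 8)
    tail = f.read(8)
    (footer_length,) = struct.unpack("<I", tail[:4])
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing magic number")

    footer_offset = file_size - 8 - footer_length
    if footer_offset < 4:
        raise ValueError("footer length exceeds file size")
    return footer_offset, footer_length


# Synthetic stand-in for a file: magic, 10 payload bytes, a 5-byte fake footer.
fake = MAGIC + b"x" * 10 + b"F" * 5 + struct.pack("<I", 5) + MAGIC
print(read_footer_location(io.BytesIO(fake)))
```

With a real file, open it in binary mode and pass the file object instead of the BytesIO stand-in.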

Page information

Detailed breakdown of all pages organized by column:

  • Data pages with encoding and statistics
  • Dictionary pages
  • Column indexes
  • Offset indexes
  • Bloom filters

Detailed output (-s flag)

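With the -s flag, the tool outputs a segment-by-segment breakdown of the file. Each segment records its byte offset, length, name, and decoded value; a value can itself be a list of nested segments, and Thrift fields carry additional type metadata. For example: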

[
  {
    "offset": 0,
    "length": 4,
    "name": "magic_number",
    "value": "PAR1"
  },
  {
    "offset": 4,
    "length": 24,
    "name": "page",
    "value": [
      {
        "offset": 5,
        "length": 1,
        "name": "type",
        "value": 0,
        "metadata": {
          "type": "i32",
          "enum_type": "PageType",
          "enum_name": "DATA_PAGE"
        }
      }
    ]
  }
]

This mode is useful for:

  • Debugging Parquet file corruption
  • Understanding exact binary layout
  • Analyzing file format compliance
  • Optimizing file structure
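Because segments nest, a small recursive walk is often enough to turn the -s output into an indented offset listing. The sketch below assumes the output has the shape of the sample above: a list of objects with "offset", "length", and "name" keys, where nested segments appear as a list of objects under "value". The walk function is a hypothetical helper, not part of the tool.

```python
import json

# The sample -s output from above, embedded so the example is self-contained.
segments = json.loads("""
[
  {"offset": 0, "length": 4, "name": "magic_number", "value": "PAR1"},
  {"offset": 4, "length": 24, "name": "page", "value": [
    {"offset": 5, "length": 1, "name": "type", "value": 0,
     "metadata": {"type": "i32", "enum_type": "PageType", "enum_name": "DATA_PAGE"}}
  ]}
]
""")


def walk(segs, depth=0):
    """Yield (depth, offset, length, name) for every segment, parents first."""
    for seg in segs:
        yield depth, seg["offset"], seg["length"], seg["name"]
        value = seg.get("value")
        # Assumption: nested segments are a list of dicts; scalars are leaf values.
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            yield from walk(value, depth + 1)


for depth, offset, length, name in walk(segments):
    print(f"{'  ' * depth}{offset:>6} +{length:<4} {name}")
```

To use it on real output, replace the embedded string with the JSON captured from parquet-analyzer -s.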

Technical details

The tool uses a custom Thrift protocol implementation (OffsetRecordingProtocol) that wraps the standard Thrift compact protocol to track byte offsets and lengths of all decoded structures. This enables precise mapping of logical Parquet structures to their binary representation.
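The idea of recording offsets while decoding can be illustrated with a toy reader. This is not the tool's OffsetRecordingProtocol and does not use the Thrift library; it is a simplified sketch that decodes unsigned varints (the variable-length integer encoding the Thrift compact protocol builds on) and notes where each value started and how many bytes it consumed. The class and method names are hypothetical.

```python
import io


class RecordingReader:
    """Wrap a binary stream and record (name, offset, length) for each decoded value."""

    def __init__(self, stream):
        self._stream = stream
        self.records = []

    def read_varint(self, name):
        """Decode one unsigned LEB128 varint and record the bytes it occupied."""
        start = self._stream.tell()
        result = 0
        shift = 0
        while True:
            chunk = self._stream.read(1)
            if not chunk:
                raise EOFError("stream ended inside a varint")
            byte = chunk[0]
            result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
            if not byte & 0x80:               # high bit clear marks the last byte
                break
            shift += 7
        self.records.append((name, start, self._stream.tell() - start))
        return result


# Two varints back to back: 0x05 encodes 5, and 0xAC 0x02 encodes 300.
reader = RecordingReader(io.BytesIO(bytes([0x05, 0xAC, 0x02])))
first = reader.read_varint("first")
second = reader.read_varint("second")
print(first, second, reader.records)
```

A real protocol wrapper would apply the same bookkeeping around every field and struct read, so that nested structures produce nested offset records like those in the -s output.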

Development

Environment setup

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install -e ".[dev]"
hatch run dev:lint
hatch run dev:test
hatch run dev:test-cov
# Or run everything at once
hatch run dev:check

The development extra pulls in tooling (hatch, ruff, pytest) and pyarrow so tests can generate Parquet fixtures on the fly.

Regenerating Thrift bindings

The Python modules in src/parquet are generated from parquet.thrift.

  1. Install the Apache Thrift compiler (brew install thrift on macOS, or download a release from the Apache Thrift project).

  2. From the repository root, regenerate everything in one step:

    hatch run dev:update-thrift

    This refreshes parquet.thrift, runs the compiler, and removes any stray src/__init__.py the compiler may create.

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

License

Released under the MIT License.
