Display version, compression, and bloom filter information about a Parquet file

iparq

After reading this blog, I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there is no straightforward way to determine this. That curiosity, and the difficulty of quickly discovering such details, motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features (such as newer encodings or particular compression algorithms) the writer of a file is using.

New: bloom filter information. iparq now shows whether each column in each row group has a bloom filter. Read more about bloom filters in this great article.

Installation

Zero installation - Recommended

  1. Make sure Astral’s uv is installed by following the steps here:

    https://docs.astral.sh/uv/getting-started/installation/

  2. Execute the following command:

    uvx iparq yourparquet.parquet
    

Using pip

  1. Install the package using pip:

    pip install iparq
    
  2. Verify the installation by running:

    iparq --help
    

Using uv

  1. Make sure Astral’s uv is installed by following the steps here:

    https://docs.astral.sh/uv/getting-started/installation/

  2. Execute the following command:

    uv pip install iparq
    
  3. Verify the installation by running:

    iparq --help
    

Using Homebrew on a Mac

  1. Run the following:

    brew tap MiguelElGallo/tap https://github.com/MiguelElGallo/homebrew-iparq.git
    brew install MiguelElGallo/tap/iparq
    iparq --help
    

Usage

Run

iparq <filename>

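Replace <filename> with the path to your .parquet file. iparq reads the file's metadata and prints the file-level details (writer, Parquet format version, row, column, and row group counts), the compression codec of each column in each row group, and, where present, the bloom filters.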
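To give a feel for where that metadata lives, here is a minimal standard-library sketch. It is not how iparq itself is implemented. It assumes a regular Parquet file that starts and ends with the 4-byte magic PAR1, with a 4-byte little-endian footer length stored just before the closing magic. The function name footer_metadata_size is hypothetical. The value it returns is probably what the examples below report as serialized_size, but treat that as an assumption.

```python
import os
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of a regular Parquet file


def footer_metadata_size(path):
    """Return the byte length of the footer metadata of a Parquet file.

    Hypothetical helper for illustration only: it reads the file tail
    and does not decode the Thrift-encoded metadata itself.
    """
    # Smallest possible layout: magic + 4-byte length + magic = 12 bytes.
    if os.path.getsize(path) < 12:
        raise ValueError(f"{path} is too small to be a Parquet file")
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError(f"{path} does not start with PAR1")
        # The last 8 bytes hold the footer length followed by the magic.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError(f"{path} does not end with PAR1")
    (size,) = struct.unpack("<I", tail[:4])
    return size
```

Calling footer_metadata_size("yourparquet.parquet") returns only the footer length. Decoding the footer to get codecs and versions needs a Parquet library.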

Example output - Bloom Filters

ParquetMetaModel(
    created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
    num_columns=1,
    num_rows=100000000,
    num_row_groups=10,
    format_version='1.0',
    serialized_size=1196
)
Column Compression Info:
Row Group 0:
  Column 'r' (Index 0): SNAPPY
Row Group 1:
  Column 'r' (Index 0): SNAPPY
Row Group 2:
  Column 'r' (Index 0): SNAPPY
Row Group 3:
  Column 'r' (Index 0): SNAPPY
Row Group 4:
  Column 'r' (Index 0): SNAPPY
Row Group 5:
  Column 'r' (Index 0): SNAPPY
Row Group 6:
  Column 'r' (Index 0): SNAPPY
Row Group 7:
  Column 'r' (Index 0): SNAPPY
Row Group 8:
  Column 'r' (Index 0): SNAPPY
Row Group 9:
  Column 'r' (Index 0): SNAPPY
Bloom Filter Info:
Row Group 0:
  Column 'r' (Index 0): Has bloom filter
Row Group 1:
  Column 'r' (Index 0): Has bloom filter
Row Group 2:
  Column 'r' (Index 0): Has bloom filter
Row Group 3:
  Column 'r' (Index 0): Has bloom filter
Row Group 4:
  Column 'r' (Index 0): Has bloom filter
Row Group 5:
  Column 'r' (Index 0): Has bloom filter
Row Group 6:
  Column 'r' (Index 0): Has bloom filter
Row Group 7:
  Column 'r' (Index 0): Has bloom filter
Row Group 8:
  Column 'r' (Index 0): Has bloom filter
Row Group 9:
  Column 'r' (Index 0): Has bloom filter
Compression codecs: {'SNAPPY'}

Example output

ParquetMetaModel(
    created_by='parquet-cpp-arrow version 14.0.2',
    num_columns=19,
    num_rows=2964624,
    num_row_groups=3,
    format_version='2.6',
    serialized_size=6357
)
Column Compression Info:
Row Group 0:
  Column 'VendorID' (Index 0): ZSTD
  Column 'tpep_pickup_datetime' (Index 1): ZSTD
  Column 'tpep_dropoff_datetime' (Index 2): ZSTD
  Column 'passenger_count' (Index 3): ZSTD
  Column 'trip_distance' (Index 4): ZSTD
  Column 'RatecodeID' (Index 5): ZSTD
  Column 'store_and_fwd_flag' (Index 6): ZSTD
  Column 'PULocationID' (Index 7): ZSTD
  Column 'DOLocationID' (Index 8): ZSTD
  Column 'payment_type' (Index 9): ZSTD
  Column 'fare_amount' (Index 10): ZSTD
  Column 'extra' (Index 11): ZSTD
  Column 'mta_tax' (Index 12): ZSTD
  Column 'tip_amount' (Index 13): ZSTD
  Column 'tolls_amount' (Index 14): ZSTD
  Column 'improvement_surcharge' (Index 15): ZSTD
  Column 'total_amount' (Index 16): ZSTD
  Column 'congestion_surcharge' (Index 17): ZSTD
  Column 'Airport_fee' (Index 18): ZSTD
Row Group 1:
  Column 'VendorID' (Index 0): ZSTD
  Column 'tpep_pickup_datetime' (Index 1): ZSTD
  Column 'tpep_dropoff_datetime' (Index 2): ZSTD
  Column 'passenger_count' (Index 3): ZSTD
  Column 'trip_distance' (Index 4): ZSTD
  Column 'RatecodeID' (Index 5): ZSTD
  Column 'store_and_fwd_flag' (Index 6): ZSTD
  Column 'PULocationID' (Index 7): ZSTD
  Column 'DOLocationID' (Index 8): ZSTD
  Column 'payment_type' (Index 9): ZSTD
  Column 'fare_amount' (Index 10): ZSTD
  Column 'extra' (Index 11): ZSTD
  Column 'mta_tax' (Index 12): ZSTD
  Column 'tip_amount' (Index 13): ZSTD
  Column 'tolls_amount' (Index 14): ZSTD
  Column 'improvement_surcharge' (Index 15): ZSTD
  Column 'total_amount' (Index 16): ZSTD
  Column 'congestion_surcharge' (Index 17): ZSTD
  Column 'Airport_fee' (Index 18): ZSTD
Row Group 2:
  Column 'VendorID' (Index 0): ZSTD
  Column 'tpep_pickup_datetime' (Index 1): ZSTD
  Column 'tpep_dropoff_datetime' (Index 2): ZSTD
  Column 'passenger_count' (Index 3): ZSTD
  Column 'trip_distance' (Index 4): ZSTD
  Column 'RatecodeID' (Index 5): ZSTD
  Column 'store_and_fwd_flag' (Index 6): ZSTD
  Column 'PULocationID' (Index 7): ZSTD
  Column 'DOLocationID' (Index 8): ZSTD
  Column 'payment_type' (Index 9): ZSTD
  Column 'fare_amount' (Index 10): ZSTD
  Column 'extra' (Index 11): ZSTD
  Column 'mta_tax' (Index 12): ZSTD
  Column 'tip_amount' (Index 13): ZSTD
  Column 'tolls_amount' (Index 14): ZSTD
  Column 'improvement_surcharge' (Index 15): ZSTD
  Column 'total_amount' (Index 16): ZSTD
  Column 'congestion_surcharge' (Index 17): ZSTD
  Column 'Airport_fee' (Index 18): ZSTD
Compression codecs: {'ZSTD'}
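If you want to post-process this output, for example to compare files written by different engines, the text can be parsed with a few regular expressions. The following is only a sketch: it assumes the output keeps exactly the layout shown in the two examples above (which may change between iparq versions), and that a bloom filter is always reported with the text "Has bloom filter". The function name parse_iparq_output is hypothetical and not part of iparq.

```python
import re
from collections import defaultdict

# Patterns derived from the example outputs above (an assumption, not a spec).
COLUMN_LINE = re.compile(r"^\s*Column '(?P<name>.+)' \(Index (?P<index>\d+)\): (?P<value>.+)$")
ROW_GROUP_LINE = re.compile(r"^Row Group (\d+):$")
SECTIONS = {
    "Column Compression Info:": "compression",
    "Bloom Filter Info:": "bloom",
}


def parse_iparq_output(text):
    """Summarize iparq's text output per column.

    Returns {"compression": {column: set of codecs},
             "bloom": {column: set of row group numbers with a bloom filter}}.
    """
    result = {"compression": defaultdict(set), "bloom": defaultdict(set)}
    section = None
    row_group = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in SECTIONS:
            section = SECTIONS[stripped]
            continue
        group_match = ROW_GROUP_LINE.match(stripped)
        if group_match:
            row_group = int(group_match.group(1))
            continue
        column_match = COLUMN_LINE.match(line)
        if not column_match:
            continue  # header, ParquetMetaModel fields, summary line, etc.
        if section == "compression":
            result["compression"][column_match["name"]].add(column_match["value"])
        elif section == "bloom" and column_match["value"] == "Has bloom filter":
            result["bloom"][column_match["name"]].add(row_group)
    return {key: dict(value) for key, value in result.items()}
```

For the first example this would give one column, 'r', compressed with SNAPPY and carrying a bloom filter in row groups 0 through 9. For the second it would give 19 columns, all ZSTD, and an empty bloom mapping.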
