CLI tool for inspecting parquet files.

Parquet-Inspector

A command-line tool for inspecting Parquet files with PyArrow.

Installation

pip install parquet-inspector

Usage

parquet-inspector: cli tool for inspecting parquet files.

positional arguments:
  {metadata,schema,head,tail,count,validate,to-jsonl,to-parquet}
    metadata            print file metadata
    schema              print data schema
    head                print first n rows (default is 10)
    tail                print last n rows (default is 10)
    count               print number of rows
    validate            validate file
    to-jsonl            convert parquet file to jsonl
    to-parquet          convert jsonl file to parquet

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --threads, -t         use threads for reading
  --mmap, -m            use memory mapping for reading

Examples

# Print the metadata of a parquet file
$ pqi metadata my_file.parquet
created_by: parquet-cpp-arrow version 6.0.1
num_columns: 3
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 818
# Print the schema of a parquet file
$ pqi schema my_file.parquet
a: list<item: int64>
  child 0, item: int64
b: struct<c: bool, d: timestamp[ms]>
  child 0, c: bool
  child 1, d: timestamp[ms]
# Print the first 5 rows of a parquet file (default is 10)
$ pqi head -n 5 my_file.parquet
{"a": 1, "b": {"c": true, "d": "1991-02-03 00:00:00"}}
{"a": 2, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 3, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
{"a": 4, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 5, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
# Print the last 5 rows of a parquet file
$ pqi tail -n 5 my_file.parquet
{"a": 3, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
{"a": 4, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 5, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
{"a": 6, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 7, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
# Print the first 5 rows of a parquet file, only reading the column a
$ pqi head -n 5 -c a my_file.parquet
{"a": 1}
{"a": 2}
{"a": 3}
{"a": 4}
{"a": 5}
# Print the first 3 rows that satisfy the condition a > 3
# (filters are defined in disjunctive normal form)
$ pqi head -n 3 -f "[('a', '>', 3)]" my_file.parquet
{"a": 4, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 5, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
{"a": 6, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
# Print the number of rows in a parquet file
$ pqi count my_file.parquet
7
# Validate a parquet file
$ pqi validate my_file.parquet
OK
# Convert a parquet file to jsonl
$ pqi to-jsonl my_file.parquet
$ cat my_file.jsonl
{"a": 1, "b": {"c": true, "d": "1991-02-03 00:00:00"}}
{"a": 2, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 3, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
{"a": 4, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 5, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
{"a": 6, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 7, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
# Convert a jsonl file to parquet
$ pqi to-parquet my_file.jsonl
$ pqi head my_file.parquet
{"a": 1, "b": {"c": true, "d": "1991-02-03 00:00:00"}}
{"a": 2, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 3, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
{"a": 4, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 5, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
{"a": 6, "b": {"c": false, "d": "2019-04-01 00:00:00"}}
{"a": 7, "b": {"c": true, "d": "2019-04-01 00:00:00"}}
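
The filter example above says that filters are defined in disjunctive normal form (DNF). As a rough illustration of what that means, the pure-Python sketch below evaluates such a filter against plain dictionaries: a flat list of `(column, operator, value)` tuples is treated as one AND group, and a list of lists is an OR over AND groups. The names `OPS`, `parse_filter`, and `matches` are hypothetical helpers written for this sketch; they are not part of parquet-inspector or PyArrow, and parsing the `-f` string with `ast.literal_eval` is an assumption about how such a string could be read safely, not a description of the tool's internals.

```python
import ast
import operator

# Comparison operators handled by this sketch (illustrative subset).
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def parse_filter(text):
    """Turn a string such as "[('a', '>', 3)]" into a Python list of tuples."""
    # literal_eval only accepts Python literals, so no arbitrary code is run.
    return ast.literal_eval(text)


def matches(row, filters):
    """Return True if the dict `row` satisfies `filters` given in DNF."""
    # A flat list of tuples is shorthand for a single AND group.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    # OR over the groups, AND over the predicates inside each group.
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in filters
    )


rows = [{"a": n} for n in range(1, 8)]

# Single predicate: a > 3, keeping the first three matches.
simple = parse_filter("[('a', '>', 3)]")
print([r["a"] for r in rows if matches(r, simple)][:3])  # [4, 5, 6]

# Two groups: (a < 2) OR (a > 5 AND a != 7).
dnf = [[("a", "<", 2)], [("a", ">", 5), ("a", "!=", 7)]]
print([r["a"] for r in rows if matches(r, dnf)])  # [1, 6]
```

Writing the condition as an OR of AND groups keeps the command-line argument a plain nested list, which is why a filter with several alternatives needs one inner list per alternative.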

License

MIT
