A powerful command-line tool for inspecting tabular files like Parquet, CSV, and XLSX

parq-cli

A command-line tool for inspecting, transforming, and comparing tabular files.

Overview

parq focuses on the workflows that come up most often when working with .parquet, .csv, .tsv, and .xlsx files:

  • inspect metadata and schema
  • preview the first or last rows
  • count rows
  • split large files
  • compute lightweight column stats (with cardinality and top-values for string columns)
  • convert between supported formats
  • diff two datasets by key
  • merge compatible files

The CLI keeps startup light with lazy imports, preserves plain and json output modes for automation, and avoids unnecessary full-table materialization for large CSV/XLSX workflows where possible.

Installation

pip install parq-cli

Enable .xlsx support with the optional dependency:

pip install "parq-cli[xlsx]"

Quick Start

# Inspect metadata
parq meta data.parquet
parq meta --fast data.csv

# Show schema
parq schema data.xlsx

# Preview rows
parq head data.parquet
parq head -n 10 --columns id,name data.csv
parq tail -n 20 data.csv

# Count rows
parq count data.parquet

# Split files
parq split data.csv --record-count 100000 -n "chunks/part-%03d.csv"
parq split data.parquet --file-count 4 -n "chunks/part-%02d.parquet"
parq split data.csv --record-count 100000 -n "out/part-%03d.csv" --force   # overwrite existing

# Column statistics (string columns include cardinality and top values)
parq stats sales.parquet --columns amount,category --limit 10
parq stats sales.parquet --columns category --top-n 10    # show top 10 most frequent values

# Format conversion (with live progress bar)
parq convert raw.xlsx cleaned.parquet
parq convert source.parquet export.csv --columns id,name,status
parq convert source.parquet export.csv --force             # overwrite if exists

# Read TSV files or use a custom delimiter
parq head data.tsv
parq head --delimiter ";" data.csv

# Read a specific XLSX sheet
parq head --sheet Sheet2 report.xlsx
parq head --sheet 1 report.xlsx                            # 0-based index

# Dataset diff
parq diff old.parquet new.parquet --key id --columns status,amount
parq diff left.csv right.csv --key id --summary-only

# Merge compatible inputs (with live progress bar)
parq merge part-001.parquet part-002.parquet merged.parquet
parq merge chunks/*.parquet merged.parquet --force         # overwrite if exists

Supported Formats

Command      Parquet  CSV  TSV  XLSX
meta         yes      yes  yes  yes
schema       yes      yes  yes  yes
head / tail  yes      yes  yes  yes
count        yes      yes  yes  yes
split        yes      yes  yes  yes
stats        yes      yes  yes  yes
convert      yes      yes  yes  yes
diff         yes      yes  yes  no, convert first
merge        yes      yes  yes  yes

XLSX support requires openpyxl. TSV files are auto-detected by the .tsv extension; a custom delimiter can be supplied with --delimiter.
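
The delimiter rule above can be pictured with a small Python sketch. This is not parq's own code: `pick_delimiter` and `read_rows` are hypothetical helpers, and the sketch only assumes that an explicit delimiter takes precedence over the extension-based default.

```python
import csv
import io
from pathlib import Path
from typing import Optional


def pick_delimiter(path: str, override: Optional[str] = None) -> str:
    """Return the field delimiter for a delimited text file.

    Assumption: an explicit override wins; otherwise a .tsv suffix
    maps to a tab and everything else falls back to a comma.
    """
    if override is not None:
        return override
    return "\t" if Path(path).suffix.lower() == ".tsv" else ","


def read_rows(text: str, delimiter: str) -> list:
    """Parse delimited text held in memory into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```

With these helpers, `pick_delimiter("data.tsv")` selects a tab, while `pick_delimiter("data.csv", ";")` mirrors passing --delimiter ";" on the command line.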

Command Reference

meta

parq meta FILE
parq meta --fast FILE

Shows file-level metadata such as path, format, column count, file size, and row-group count, plus row count and Parquet-specific metadata when available.

Use --fast when you want a cheap metadata pass on CSV/XLSX files. In fast mode, expensive fields such as full row counts are skipped.

schema

parq schema FILE

Shows column names, types, and nullable information.

head and tail

parq head FILE
parq head -n 20 FILE
parq head -n 20 --columns id,name FILE

parq tail FILE
parq tail -n 20 FILE
parq tail -n 20 --columns id,name FILE

Notes:

  • default preview size is 5
  • --columns accepts a comma-separated list
  • missing files return a friendly error with exit code 1
  • empty header-only CSV/XLSX files return an empty preview with detected columns
  • an empty CSV file with no header raises a friendly "Empty CSV file" error

count

parq count FILE

Returns the total row count.

split

parq split FILE --file-count N
parq split FILE --record-count N
parq split FILE --record-count 100000 -n "chunks/part-%03d.parquet"
parq split FILE --record-count 100000 -n "chunks/part-%03d.csv" --force

Splits one input file into multiple output files.

Rules:

  • specify exactly one of --file-count or --record-count
  • output format is inferred from the file suffix in the --name-format pattern (for example, part-%03d.csv produces CSV chunks)
  • by default, existing target files raise an error; use --force / -F to overwrite
  • in --record-count mode, CSV/XLSX inputs are streamed in a single pass instead of pre-counting the entire file
  • a live progress bar is shown during the split
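
To illustrate the single-pass idea behind --record-count, here is a minimal Python sketch for CSV input. It is not parq's implementation: `split_csv` is a hypothetical function, the chunk index is assumed to start at 0, and the header is assumed to be repeated in every chunk.

```python
import csv
from pathlib import Path


def split_csv(source, record_count, name_format, force=False):
    """Split a CSV into chunks of at most record_count data rows in one pass.

    name_format is a printf-style pattern such as "chunks/part-%03d.csv".
    Returns the list of files written.
    """
    if record_count < 1:
        raise ValueError("record_count must be positive")
    outputs = []
    out_file = None
    writer = None
    rows_in_chunk = 0
    with open(source, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            raise ValueError("Empty CSV file")
        try:
            for row in reader:
                # Open a new chunk lazily, so the row total is never needed up front.
                if writer is None or rows_in_chunk == record_count:
                    if out_file is not None:
                        out_file.close()
                    target = Path(name_format % len(outputs))  # assumed 0-based index
                    if target.exists() and not force:
                        raise FileExistsError(f"{target} exists; pass force=True to overwrite")
                    target.parent.mkdir(parents=True, exist_ok=True)
                    out_file = open(target, "w", newline="")
                    writer = csv.writer(out_file)
                    writer.writerow(header)
                    outputs.append(str(target))
                    rows_in_chunk = 0
                writer.writerow(row)
                rows_in_chunk += 1
        finally:
            if out_file is not None:
                out_file.close()
    return outputs
```

Because each chunk is opened only when the previous one fills up, memory use stays flat regardless of input size. A --file-count style split, by contrast, needs the total row count before it can size the chunks.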

stats

parq stats FILE
parq stats FILE --columns amount,category
parq stats FILE --limit 20
parq stats FILE --columns category --top-n 10

Computes simple per-column statistics.

  • numeric columns include count, null_count, min, max, mean
  • string, boolean, and date columns additionally include cardinality and top_values (top N most frequent values with their occurrence counts)
  • default --top-n is 5; set to 0 to suppress top-values output entirely
  • default --limit is 50 to avoid flooding the terminal on very wide tables
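
As a rough picture of what cardinality and top_values mean for a string column, the following Python sketch computes them with `collections.Counter`. It is an illustration only: `string_column_stats` is a hypothetical name, and treating `count` as the number of non-null values is an assumption rather than documented parq behavior.

```python
from collections import Counter


def string_column_stats(values, top_n=5):
    """Summarize a list of string values that may contain None."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    stats = {
        "count": len(non_null),  # assumption: nulls are excluded from count
        "null_count": len(values) - len(non_null),
        "cardinality": len(counts),  # number of distinct non-null values
    }
    if top_n > 0:
        # most_common returns (value, occurrences) pairs, most frequent first
        stats["top_values"] = counts.most_common(top_n)
    return stats
```

Passing `top_n=0` omits the `top_values` entry, which matches the idea of suppressing top-values output entirely.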

convert

parq convert SOURCE OUTPUT
parq convert SOURCE OUTPUT --columns id,name,status
parq convert SOURCE OUTPUT --force

Converts a supported input file to another supported output format. The output format is determined by the OUTPUT suffix.

Notes:

  • current targets are .parquet, .csv, .tsv, and .xlsx
  • conversion is streaming-based where possible
  • a live progress bar is shown during the conversion
  • by default, existing output files raise an error; use --force / -F to overwrite

diff

parq diff LEFT RIGHT --key id
parq diff LEFT RIGHT --key id1,id2 --columns status,amount
parq diff LEFT RIGHT --key id --summary-only

Compares two datasets by key and reports:

  • row count delta
  • rows only present on the left
  • rows only present on the right
  • changed rows for the selected columns
  • schema-only columns and same-name type mismatches

Notes:

  • --key is required
  • diff currently supports Parquet, CSV, and TSV inputs
  • XLSX files should be converted first
  • duplicate keys on either side are treated as an error
  • --summary-only keeps the counts and omits sample payloads
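
The keyed comparison described above can be sketched in a few lines of Python over in-memory rows. This is a simplified illustration, not parq's code: `diff_by_key` and the shape of its result are hypothetical, and schema comparison is left out.

```python
def diff_by_key(left, right, key, columns):
    """Compare two lists of dict rows by a (possibly composite) key."""

    def index(rows, side):
        indexed = {}
        for row in rows:
            k = tuple(row[name] for name in key)
            if k in indexed:
                # Duplicate keys make the comparison ambiguous, so fail early.
                raise ValueError(f"duplicate key {k!r} on {side} side")
            indexed[k] = row
        return indexed

    left_idx = index(left, "left")
    right_idx = index(right, "right")
    shared = left_idx.keys() & right_idx.keys()
    return {
        "row_count_delta": len(right) - len(left),
        "left_only": sorted(left_idx.keys() - right_idx.keys()),
        "right_only": sorted(right_idx.keys() - left_idx.keys()),
        # A shared row counts as changed if any selected column differs.
        "changed": sorted(
            k for k in shared
            if any(left_idx[k][c] != right_idx[k][c] for c in columns)
        ),
    }
```

Here `key=["id"]` corresponds to --key id and `columns=["status", "amount"]` to --columns status,amount. A summary-only view would report just the lengths of the three key lists.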

merge

parq merge INPUT1 INPUT2 OUTPUT
parq merge chunks/*.parquet merged.parquet
parq merge chunks/*.parquet merged.parquet --force

Merges multiple compatible input files into a single output file. The last positional argument is the output path.

Notes:

  • schemas must be identical or safely unifiable by Arrow
  • by default, existing output files raise an error; use --force / -F to overwrite
  • output format is inferred from the output suffix
  • a live progress bar is shown during the merge
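
For intuition about the compatibility check, here is a minimal Python sketch that merges CSV files only when their headers are identical. It covers just the "identical" case and does not attempt Arrow-style schema unification. `merge_csv` is a hypothetical helper, not part of parq.

```python
import csv


def merge_csv(inputs, output):
    """Concatenate CSV files that share an identical header.

    Returns the number of data rows written to output.
    """
    expected = None
    written = 0
    with open(output, "w", newline="") as dst:
        writer = csv.writer(dst)
        for path in inputs:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    raise ValueError(f"{path}: empty CSV file")
                if expected is None:
                    # The first input defines the schema and the output header.
                    expected = header
                    writer.writerow(header)
                elif header != expected:
                    raise ValueError(f"{path}: header {header} does not match {expected}")
                for row in reader:
                    writer.writerow(row)
                    written += 1
    return written
```

Rows are copied one at a time, so the merged output never has to fit in memory.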

Output Modes

Global options:

  • --version, -v: show version information
  • --output, -o: select output format (rich | plain | json)
  • --delimiter, -d: field delimiter for CSV/TSV input (default: ,); .tsv files default to \t automatically
  • --sheet: XLSX sheet name or 0-based index to read (default: active sheet)
  • --help: show command help

Available output modes:

  • rich: human-friendly terminal rendering
  • plain: low-overhead tabular output for shell pipelines
  • json: machine-readable structured output

Examples:

parq meta data.parquet --output json
parq --output plain stats data.csv
parq --delimiter ";" head semicolon_data.csv
parq --sheet "Sales" head report.xlsx
parq diff left.parquet right.parquet --key id --summary-only --output json

On Windows terminals that cannot safely render emoji or extended characters, Rich headings automatically fall back to a safe plain style instead of crashing.

Large File Notes

  • Parquet metadata, row counts, and previews use Arrow metadata and row-group shortcuts where available.
  • CSV tail uses a fixed-size column window instead of materializing every row as Python dicts.
  • CSV/XLSX split --record-count streams in one pass.
  • meta --fast is the best option when you need quick metadata from large CSV/XLSX inputs.
  • XLSX schema inference samples the first 1000 rows instead of scanning the entire sheet up front.

For repeated heavy workflows, converting large CSV/XLSX files to Parquet is still the best path for throughput.
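
The bounded-memory idea behind a tail preview can be sketched in Python with a fixed-size `collections.deque`. This is an illustration under assumptions, not parq's implementation: it keeps a window of the last n rows, whereas the note above describes a column-oriented window. `tail_rows` is a hypothetical name.

```python
import csv
from collections import deque


def tail_rows(lines, n=5):
    """Return (header, last n data rows) from an iterable of CSV lines.

    Only n rows are held in memory at any time, however long the input is.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        raise ValueError("Empty CSV file")
    # A deque with maxlen discards the oldest row as each new one arrives.
    window = deque(reader, maxlen=n)
    return header, list(window)
```

An open file object can be passed directly as `lines`. A header-only input yields the detected columns and an empty preview.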

Development

Install development dependencies:

uv sync --extra dev

or:

pip install -e ".[dev]"

Useful commands:

python -m parq --help
pytest -m "not performance"
pytest tests/test_performance.py -m performance -q -s
ruff check parq tests
ruff check --fix parq tests
pytest --cov=parq --cov-report=html

Status

Implemented:

  • metadata and schema inspection
  • head and tail preview
  • row counting
  • file splitting (with progress bar, --force overwrite)
  • column statistics (numeric + string cardinality/top-values, --top-n)
  • format conversion (with progress bar, --force overwrite)
  • keyed dataset diff
  • compatible file merge (with progress bar, --force overwrite)
  • TSV auto-detection and custom delimiter support (--delimiter)
  • XLSX multi-sheet selection (--sheet)

With the core commands in place, planned improvements focus on deeper performance tuning, richer diff workflows, and broader reporting capabilities.

License

MIT
