A powerful command-line tool for inspecting tabular files like Parquet, CSV, and XLSX

parq-cli

A powerful command-line tool for tabular files (.parquet, .csv, .xlsx) 🚀

English | 简体中文 (Simplified Chinese)

✨ Features

  • 📊 Metadata Viewing: Quickly view file metadata (row count, column count, file size, etc.)
  • 📋 Schema Display: Display a file's column structure and data types in a readable table
  • 👀 Data Preview: View the first or last N rows of a file
  • 🔢 Row Count: Quickly get the total number of rows in a file
  • ✂️ File Splitting: Split large files into multiple smaller files
  • 🗜️ Compression Info: Display file compression type and file size
  • 🎨 Beautiful Output: Uses the Rich library for colorful, formatted terminal output
  • 📦 Smart Display: Automatically detects nested structures and shows both logical and physical column counts

📦 Installation

pip install parq-cli

# Optional: enable .xlsx support
pip install "parq-cli[xlsx]"
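
Because .xlsx support is optional, a script that wraps parq may want to check for openpyxl before passing it a spreadsheet. The following is a minimal sketch, assuming the xlsx extra installs openpyxl into the same Python environment; the helper name xlsx_support_available is made up for illustration and is not part of parq.

```python
from importlib.util import find_spec


def xlsx_support_available() -> bool:
    """Return True if openpyxl can be imported in this environment."""
    # find_spec returns None when the top-level module is not installed.
    return find_spec("openpyxl") is not None


if __name__ == "__main__":
    if xlsx_support_available():
        print("openpyxl found: .xlsx inputs should work")
    else:
        print('openpyxl missing: run pip install "parq-cli[xlsx]"')
```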

🚀 Quick Start

Basic Usage

# View file metadata
parq meta data.parquet
parq meta data.csv
parq meta data.xlsx

# Display schema information
parq schema data.parquet

# Display first 5 rows (default)
parq head data.parquet

# Display first 10 rows
parq head -n 10 data.parquet

# Display last 5 rows (default)
parq tail data.parquet

# Display last 20 rows
parq tail -n 20 data.parquet

# Display total row count
parq count data.parquet

# Split file into 3 parts
parq split data.parquet --file-count 3

# Split file with 1000 records per file
parq split data.parquet --record-count 1000

📖 Command Reference

View Metadata

parq meta FILE

Display file metadata (row count, column count, file size, etc.). Supported input formats: .parquet, .csv, .xlsx (xlsx requires openpyxl).

View Schema

parq schema FILE

Display the column structure and data types of a file. Supported input formats: .parquet, .csv, .xlsx (xlsx requires openpyxl).

Preview Data

# Display first N rows (default 5)
parq head FILE
parq head -n N FILE

# Display last N rows (default 5)
parq tail FILE
parq tail -n N FILE

Notes:

  • N must be a non-negative integer.
  • If the input file does not exist, parq exits with code 1 and prints a friendly error message.
  • Header-only CSV/XLSX files return an empty preview with the detected columns; an empty CSV with no header returns a friendly Empty CSV file error.
  • Supported input formats: .parquet, .csv, .xlsx (xlsx requires openpyxl).
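
When calling the preview commands from a script, the notes above translate into two checks: validate N before the call and inspect the exit code after it. This is a rough sketch, not part of parq; the function names are hypothetical, and because the notes do not say which stream carries the error message, the wrapper reads both.

```python
import subprocess


def build_preview_command(mode: str, path: str, n: int = 5) -> list[str]:
    """Build the argv for `parq head` or `parq tail`."""
    if mode not in ("head", "tail"):
        raise ValueError("mode must be 'head' or 'tail'")
    # The notes above require N to be a non-negative integer.
    if isinstance(n, bool) or not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    return ["parq", mode, "-n", str(n), path]


def preview(mode: str, path: str, n: int = 5) -> str:
    """Run the preview and return stdout; raise if parq reports failure."""
    result = subprocess.run(
        build_preview_command(mode, path, n),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # A missing input file is documented to exit with code 1.
        # Assumption: the message may be on stderr or stdout.
        raise RuntimeError((result.stderr or result.stdout).strip())
    return result.stdout
```

For example, preview("tail", "data.parquet", 20) corresponds to parq tail -n 20 data.parquet.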

Statistics

# Display total row count
parq count FILE

Split Files

# Split into N files
parq split FILE --file-count N

# Split with M records per file
parq split FILE --record-count M

# Custom output format
parq split FILE -f N -n "output-%03d.parquet"

# Split into subdirectory
parq split FILE -f 3 -n "output/part-%02d.parquet"

Split a source file into multiple smaller files. Specify either the number of output files (--file-count) or the number of records per file (--record-count). Output file names follow the --name-format pattern (default: result-%06d.parquet); in the examples above, -f and -n are the short forms of --file-count and --name-format.
The output format is inferred from the file extension in --name-format (for example .parquet, .csv, .xlsx). When using --file-count, N must be a positive integer and cannot exceed the total number of rows in the source file.
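
To make the naming and sizing rules concrete, here is a small planning sketch. It is not parq's implementation, and several details are assumptions: the index passed to the pattern starts at 0, --file-count spreads rows as evenly as possible, and --record-count leaves the remainder in the last file. The names plan_split and infer_output_format are hypothetical.

```python
from pathlib import PurePath


def infer_output_format(name_format: str) -> str:
    """Return the lower-cased extension of the name pattern, e.g. '.csv'."""
    return PurePath(name_format).suffix.lower()


def plan_split(total_rows, *, file_count=None, record_count=None,
               name_format="result-%06d.parquet", start_index=0):
    """Return a list of (file_name, row_count) pairs for a split."""
    if (file_count is None) == (record_count is None):
        raise ValueError("pass exactly one of file_count or record_count")
    if file_count is not None:
        # Documented rule: positive and not larger than the row count.
        if not 0 < file_count <= total_rows:
            raise ValueError("file_count must be in 1..total_rows")
        # Assumption: rows are spread evenly, extras go to the first files.
        base, extra = divmod(total_rows, file_count)
        sizes = [base + (1 if i < extra else 0) for i in range(file_count)]
    else:
        if record_count <= 0:
            raise ValueError("record_count must be positive")
        # Assumption: the last file holds whatever rows remain.
        full, rest = divmod(total_rows, record_count)
        sizes = [record_count] * full + ([rest] if rest else [])
    # printf-style patterns such as %06d expand with the % operator.
    return [(name_format % (start_index + i), size)
            for i, size in enumerate(sizes)]
```

Under these assumptions, plan_split(10, file_count=3) yields files of 4, 3, and 3 rows named result-000000.parquet through result-000002.parquet.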

Global Options

  • --version, -v: Display version information
  • --output, -o: Output format (rich, plain, json)
  • --help: Display help information

Output mode notes:

  • rich is optimized for human-readable terminal inspection.
  • plain is optimized for shell pipelines and escapes embedded tabs/newlines as \t and \n.
  • json is optimized for machine-readable integrations and preserves row values structurally.
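
The difference between plain and json is easiest to see with a value that contains a tab or a newline. The sketch below only illustrates the escaping rule described above. It assumes that plain rows are tab-separated, which the notes do not state, and the function names are hypothetical.

```python
import json


def escape_plain_cell(value) -> str:
    """Escape embedded tabs and newlines the way the plain mode is described."""
    text = str(value)
    return text.replace("\t", "\\t").replace("\n", "\\n")


def format_plain_row(values) -> str:
    """Join escaped cells with tabs (assumed delimiter) into one line."""
    return "\t".join(escape_plain_cell(v) for v in values)


row = ["a\tb", "line1\nline2", 3]

# plain: one physical line per row, safe for cut/awk style pipelines.
plain_line = format_plain_row(row)

# json: values keep their structure and round-trip unchanged.
json_line = json.dumps(row)
```

With this escaping, a pipeline can split each plain line on tabs and always get one field per column, while a JSON consumer recovers the original values, including the number 3 as a number rather than text.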

Large File Notes

  • Parquet metadata, head, and tail use PyArrow metadata and row-group optimizations where possible.
  • CSV preview, count, and split operations stream through the input in batches instead of materializing the full file up front.
  • XLSX preview, count, and split operations process rows incrementally, so memory usage scales with the requested preview or chunk size rather than with the file size.
  • Converting very large tabular files to Parquet still gives the best overall throughput, but large CSV/XLSX files no longer require full-table materialization for preview or split workflows.
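
The batch-streaming idea for CSV can be sketched with the standard library alone. This is not parq's code (the notes do not describe its internals beyond batching); it only shows how a row count and a tail preview can be computed in one pass while holding at most one batch plus the last N rows in memory. The names iter_batches and count_and_tail are hypothetical.

```python
import csv
import io
from collections import deque
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size rows from an iterator."""
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


def count_and_tail(handle, n=5, batch_size=1000):
    """Return (header, total_rows, last_n_rows) in a single pass."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        # Mirrors the documented behaviour for a CSV with no header.
        raise ValueError("Empty CSV file")
    total = 0
    last_rows = deque(maxlen=n)  # older rows fall off automatically
    for batch in iter_batches(reader, batch_size):
        total += len(batch)
        last_rows.extend(batch)
    return header, total, list(last_rows)


sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
header, total, tail = count_and_tail(sample, n=2, batch_size=2)
```

A header-only input returns the header with a count of 0 and an empty preview, which matches the behaviour described in the preview notes.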

🎨 Output Examples

Metadata Display

Regular File (No Nested Structure):

$ parq meta data.parquet
╭─────────────────────── 📊 Parquet File Metadata ───────────────────────╮
│ file_path: data.parquet                                                │
│ num_rows: 1000                                                         │
│ num_columns: 5 (logical)                                               │
│ file_size: 123.45 KB                                                   │
│ compression: SNAPPY                                                    │
│ num_row_groups: 1                                                      │
│ format_version: 2.6                                                    │
│ serialized_size: 126412                                                │
│ created_by: parquet-cpp-arrow version 18.0.0                           │
╰────────────────────────────────────────────────────────────────────────╯

Nested Structure File (Shows Physical Column Count):

$ parq meta nested.parquet
╭─────────────────────── 📊 Parquet File Metadata ───────────────────────╮
│ file_path: nested.parquet                                              │
│ num_rows: 500                                                          │
│ num_columns: 3 (logical)                                               │
│ num_physical_columns: 8 (storage)                                      │
│ file_size: 2.34 MB                                                     │
│ compression: ZSTD                                                      │
│ num_row_groups: 2                                                      │
│ format_version: 2.6                                                    │
│ serialized_size: 2451789                                               │
│ created_by: parquet-cpp-arrow version 21.0.0                           │
╰────────────────────────────────────────────────────────────────────────╯

Notes:

  • compression may show one codec (for example SNAPPY) or multiple codecs joined by commas when mixed compression exists.
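
The relationship between the logical and physical counts can be illustrated without reading a real file. In the sketch below a schema is modelled as a plain dictionary in which a nested dictionary stands for a struct; logical columns are the top-level fields and physical columns are the leaf fields. This is a simplification (lists and maps are left out), the example schema is invented to match the 3 and 8 shown above, and the function names are hypothetical. The codec separator ", " and the sorted order are also assumptions.

```python
def count_leaves(field) -> int:
    """Count primitive leaf fields below one schema entry."""
    if isinstance(field, dict):
        return sum(count_leaves(child) for child in field.values())
    return 1


def summarize_columns(schema: dict) -> dict:
    """Report the logical count, plus the physical count when it differs."""
    logical = len(schema)
    physical = sum(count_leaves(field) for field in schema.values())
    summary = {"num_columns": logical}
    if physical != logical:
        summary["num_physical_columns"] = physical
    return summary


def summarize_compression(codecs) -> str:
    """Collapse per-column codecs into one comma-joined label."""
    return ", ".join(sorted(set(codecs)))


# Hypothetical nested schema: 3 top-level fields, 8 leaf fields.
nested_schema = {
    "id": "int64",
    "user": {
        "name": "string",
        "email": "string",
        "address": {"city": "string", "zip": "string"},
    },
    "metrics": {"clicks": "int64", "views": "int64", "score": "double"},
}

nested_summary = summarize_columns(nested_schema)
flat_summary = summarize_columns({"id": "int64", "name": "string"})
mixed_codecs = summarize_compression(["SNAPPY", "ZSTD", "SNAPPY"])
```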

Schema Display

$ parq schema data.parquet
                    📋 Schema Information
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Column Name ┃ Data Type     ┃ Nullable ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ id          │ int64         │ ✗        │
│ name        │ string        │ ✓        │
│ age         │ int64         │ ✓        │
│ city        │ string        │ ✓        │
│ salary      │ double        │ ✓        │
└─────────────┴───────────────┴──────────┘

🛠️ Tech Stack

  • PyArrow: High-performance Parquet reading engine
  • Typer: Modern CLI framework
  • Rich: Beautiful terminal output

🧪 Development

Install Development Dependencies

# Recommended with uv
uv sync --extra dev

# Or with pip
pip install -e ".[dev]"

Run Tests

pytest

Run Tests (With Coverage)

pytest --cov=parq --cov-report=html

Code Formatting and Checking

# Check and auto-fix with Ruff
ruff check --fix parq tests

# Find dead code
vulture parq tests scripts

🗺️ Roadmap

  • Basic metadata viewing
  • Schema display
  • Data preview (head/tail)
  • Row count statistics
  • File size and compression information display
  • Nested structure smart detection (logical vs physical column count)
  • Add split command to split a Parquet file into multiple Parquet files
  • Data statistical analysis
  • Add convert command to convert a Parquet file to other formats (CSV, JSON, Excel)
  • Add diff command to compare two Parquet files
  • Add merge command to merge multiple Parquet files into one

📦 Release Process (for maintainers)

We use automated scripts to manage versions and releases:

# Bump version and create tag
python scripts/bump_version.py patch  # 0.1.0 -> 0.1.1 (bug fixes)
python scripts/bump_version.py minor  # 0.1.0 -> 0.2.0 (new features)
python scripts/bump_version.py major  # 0.1.0 -> 1.0.0 (breaking changes)

# Push to trigger GitHub Actions
git push origin main
git push origin v0.1.1  # Replace with actual version

GitHub Actions will automatically:

  • ✅ Run tests on Linux/macOS/Windows before publishing
  • ✅ Check for version conflicts
  • ✅ Fail fast on network errors while checking PyPI versions
  • ✅ Build the package
  • ✅ Publish to PyPI
  • ✅ Create GitHub Release

See scripts/README.md for detailed documentation.
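
For readers unfamiliar with the three bump levels, the arithmetic behind the comments above can be sketched as follows. This is an illustration of semantic-version bumping only, not the contents of scripts/bump_version.py; the function names and the "v" tag prefix handling are assumptions based on the examples above.

```python
def bump_version(version: str, part: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' bump."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Breaking changes: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # New features: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Bug fixes only.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError("part must be 'major', 'minor', or 'patch'")


def tag_for(version: str) -> str:
    """Build the git tag name used in the push example (v-prefixed)."""
    return f"v{version}"
```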

🤝 Contributing

Issues and Pull Requests are welcome!

  1. Fork this repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

  • Inspired by parquet-cli
  • Thanks to the Apache Arrow team for powerful Parquet support
  • Thanks to the Rich library for adding color to terminal output

📮 Contact

⭐ If this project helps you, please give it a Star!
