parq-cli
A powerful command-line tool for tabular files (.parquet, .csv, .xlsx)
English | 简体中文
Features
- Metadata Viewing: Quickly view file metadata (row count, column count, file size, etc.)
- Schema Display: Beautifully display file column structure and data types
- Data Preview: View the first N rows or last N rows of a file
- Row Count: Quickly get the total number of rows in a file
- File Splitting: Split large files into multiple smaller files
- Compression Info: Display file compression type and file size
- Beautiful Output: Use the Rich library for colorful, formatted terminal output
- Smart Display: Automatically detect nested structures, showing logical and physical column counts
Installation
pip install parq-cli
# Optional: enable .xlsx support
pip install "parq-cli[xlsx]"
Quick Start
Basic Usage
# View file metadata
parq meta data.parquet
parq meta data.csv
parq meta data.xlsx
# Display schema information
parq schema data.parquet
# Display first 5 rows (default)
parq head data.parquet
# Display first 10 rows
parq head -n 10 data.parquet
# Display last 5 rows (default)
parq tail data.parquet
# Display last 20 rows
parq tail -n 20 data.parquet
# Display total row count
parq count data.parquet
# Split file into 3 parts
parq split data.parquet --file-count 3
# Split file with 1000 records per file
parq split data.parquet --record-count 1000
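If you need a small file to try these commands on, one can be generated with Python's standard csv module (the file name data.csv and its columns are just an example, not anything parq ships with):

```python
import csv

# Write a tiny sample CSV to experiment with, e.g. `parq meta data.csv`
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Carol", "age": 35},
]

with open("data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "age"])
    writer.writeheader()    # the header row becomes the column names
    writer.writerows(rows)  # three data rows follow
```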
Command Reference
View Metadata
parq meta FILE
Display file metadata (row count, column count, file size, etc.).
Supported input formats: .parquet, .csv, .xlsx (xlsx requires openpyxl).
View Schema
parq schema FILE
Display the column structure and data types of a file.
Supported input formats: .parquet, .csv, .xlsx (xlsx requires openpyxl).
Preview Data
# Display first N rows (default 5)
parq head FILE
parq head -n N FILE
# Display last N rows (default 5)
parq tail FILE
parq tail -n N FILE
Notes:
- N must be a non-negative integer.
- If the input file does not exist, parq exits with code 1 and prints a friendly error message.
- Supported input formats: .parquet, .csv, .xlsx (xlsx requires openpyxl).
Statistics
# Display total row count
parq count FILE
Split Files
# Split into N files
parq split FILE --file-count N
# Split with M records per file
parq split FILE --record-count M
# Custom output format
parq split FILE -f N -n "output-%03d.parquet"
# Split into subdirectory
parq split FILE -f 3 -n "output/part-%02d.parquet"
Split a source file into multiple smaller files. You can specify either the number of output files (--file-count) or the number of records per file (--record-count). The output file names are formatted according to the --name-format pattern (default: result-%06d.parquet).
The output format is inferred from the file extension in --name-format (for example .parquet, .csv, .xlsx).
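The --name-format pattern uses ordinary printf-style formatting; a quick sketch of how the documented default pattern expands per output index (the variable names here are illustrative):

```python
# printf-style expansion of the --name-format pattern;
# "result-%06d.parquet" is the documented default, `i` is the chunk index.
pattern = "result-%06d.parquet"
names = [pattern % i for i in range(3)]
print(names)
# ['result-000000.parquet', 'result-000001.parquet', 'result-000002.parquet']
```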
When using --file-count, N must be a positive integer and cannot exceed the total rows of the source file.
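As a back-of-the-envelope illustration of that constraint (this mirrors the documented behavior via ceiling division; it is not parq's actual implementation, and the helper name is made up):

```python
import math

def rows_per_chunk(total_rows: int, file_count: int) -> list[int]:
    """Sketch: distribute total_rows across file_count output files."""
    if file_count <= 0 or file_count > total_rows:
        # matches the documented rule: N must be positive and <= total rows
        raise ValueError("file_count must be in 1..total_rows")
    chunk = math.ceil(total_rows / file_count)
    sizes = []
    remaining = total_rows
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= chunk
    return sizes

print(rows_per_chunk(1000, 3))  # [334, 334, 332]
```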
Global Options
- --version, -v: Display version information
- --help: Display help information
Output Examples
Metadata Display
Regular File (No Nested Structure):
$ parq meta data.parquet
╭──────────── Parquet File Metadata ───────────╮
│ file_path: data.parquet                      │
│ num_rows: 1000                               │
│ num_columns: 5 (logical)                     │
│ file_size: 123.45 KB                         │
│ compression: SNAPPY                          │
│ num_row_groups: 1                            │
│ format_version: 2.6                          │
│ serialized_size: 126412                      │
│ created_by: parquet-cpp-arrow version 18.0.0 │
╰──────────────────────────────────────────────╯
Nested Structure File (Shows Physical Column Count):
$ parq meta nested.parquet
╭──────────── Parquet File Metadata ───────────╮
│ file_path: nested.parquet                    │
│ num_rows: 500                                │
│ num_columns: 3 (logical)                     │
│ num_physical_columns: 8 (storage)            │
│ file_size: 2.34 MB                           │
│ compression: ZSTD                            │
│ num_row_groups: 2                            │
│ format_version: 2.6                          │
│ serialized_size: 2451789                     │
│ created_by: parquet-cpp-arrow version 21.0.0 │
╰──────────────────────────────────────────────╯
Notes:
compression may show one codec (for example SNAPPY) or multiple codecs joined by commas when mixed compression exists.
Schema Display
$ parq schema data.parquet
Schema Information
┏━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Column Name ┃ Data Type ┃ Nullable ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
│ id          │ int64     │ ✓        │
│ name        │ string    │ ✓        │
│ age         │ int64     │ ✓        │
│ city        │ string    │ ✓        │
│ salary      │ double    │ ✓        │
└─────────────┴───────────┴──────────┘
Tech Stack
- PyArrow: High-performance Parquet reading engine
- Typer: Modern CLI framework
- Rich: Beautiful terminal output
Development
Install Development Dependencies
# Recommended with uv
uv sync --extra dev
# Or with pip
pip install -e ".[dev]"
Run Tests
pytest
Run Tests (With Coverage)
pytest --cov=parq --cov-report=html
Code Formatting and Checking
# Check and auto-fix with Ruff
ruff check --fix parq tests
# Find dead code
vulture parq tests scripts
Roadmap
- Basic metadata viewing
- Schema display
- Data preview (head/tail)
- Row count statistics
- File size and compression information display
- Nested structure smart detection (logical vs physical column count)
- Add split command to split a parquet file into multiple parquet files
- Data statistical analysis
- Add convert command to convert a parquet file to other formats (CSV, JSON, Excel)
- Add diff command to compare the differences between two parquet files
- Add merge command to merge multiple parquet files into one parquet file
Release Process (for maintainers)
We use automated scripts to manage versions and releases:
# Bump version and create tag
python scripts/bump_version.py patch # 0.1.0 -> 0.1.1 (bug fixes)
python scripts/bump_version.py minor # 0.1.0 -> 0.2.0 (new features)
python scripts/bump_version.py major # 0.1.0 -> 1.0.0 (breaking changes)
# Push to trigger GitHub Actions
git push origin main
git push origin v0.1.1 # Replace with actual version
GitHub Actions will automatically:
- Run tests on Linux/macOS/Windows before publishing
- Check for version conflicts
- Fail fast on network errors while checking PyPI versions
- Build the package
- Publish to PyPI
- Create a GitHub Release
See scripts/README.md for detailed documentation.
Contributing
Issues and Pull Requests are welcome!
- Fork this repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details
Acknowledgments
- Inspired by parquet-cli
- Thanks to the Apache Arrow team for powerful Parquet support
- Thanks to the Rich library for adding color to terminal output
Contact
- Author: SimonSun
- Project URL: https://github.com/Tendo33/parq-cli
⭐ If this project helps you, please give it a Star!