A universal wrapper for working with dataframes in Python, seamlessly switching between pandas and Dask based on file size
ParquetFrame
A universal wrapper for working with dataframes in Python, seamlessly switching between pandas and Dask based on file size or manual control.
Features
🚀 Intelligent Backend Selection: Memory-aware automatic switching between pandas and Dask based on file size, system resources, and file characteristics
📁 Smart File Handling: Reads parquet files without requiring file extensions (.parquet, .pqt)
🔄 Seamless Switching: Convert between pandas and Dask with simple methods
⚡ Full API Compatibility: All pandas/Dask operations work transparently
🗃️ SQL Support: Execute SQL queries on DataFrames using DuckDB with automatic JOIN capabilities
🧬 BioFrame Integration: Genomic interval operations with parallel Dask implementations
🖥️ Powerful CLI: Command-line interface for data exploration, SQL queries, and batch processing
📝 Script Generation: Automatic Python script generation from CLI sessions
⚡ Performance Optimization: Built-in benchmarking tools and intelligent threshold detection
📋 YAML Workflows: Define complex data processing pipelines in YAML with declarative syntax
🎯 Zero Configuration: Works out of the box with sensible defaults
Quick Start
Installation
# Basic installation
pip install parquetframe
# With CLI support
pip install parquetframe[cli]
# With SQL support (includes DuckDB)
pip install parquetframe[sql]
# With genomics support (includes bioframe)
pip install parquetframe[bio]
# All features
pip install parquetframe[all]
# Development installation
pip install parquetframe[dev,all]
Basic Usage
import parquetframe as pf
# Read a file - automatically chooses pandas or Dask based on size
df = pf.read("my_data") # Handles .parquet/.pqt extensions automatically
# All standard DataFrame operations work
result = df.groupby("column").sum()
# Save without worrying about extensions
df.save("output") # Saves as output.parquet
# Manual control
df.to_dask() # Convert to Dask
df.to_pandas() # Convert to pandas
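The extension handling mentioned above can be pictured with a small standard-library sketch. The function name `resolve_parquet_path` and the lookup order are assumptions made for illustration; this is not ParquetFrame's actual implementation.

```python
from pathlib import Path
from typing import Optional

# Extensions named in this README; trying them in this order is an assumption.
CANDIDATE_SUFFIXES = (".parquet", ".pqt")


def resolve_parquet_path(name: str) -> Optional[Path]:
    """Return the first existing file among name, name.parquet and name.pqt."""
    base = Path(name)
    if base.is_file():
        return base
    for suffix in CANDIDATE_SUFFIXES:
        # Append the suffix to the file name instead of replacing an existing one.
        candidate = base.with_name(base.name + suffix)
        if candidate.is_file():
            return candidate
    return None
```

With a file `my_data.parquet` on disk, `resolve_parquet_path("my_data")` would return its path, and a name with no matching file returns `None`.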
Advanced Usage
import parquetframe as pf
# Custom threshold
df = pf.read("data", threshold_mb=50) # Use Dask for files >50MB
# Force backend
df = pf.read("data", islazy=True) # Force Dask
df = pf.read("data", islazy=False) # Force pandas
# Check current backend
print(df.islazy) # True for Dask, False for pandas
# Chain operations
result = (
    pf.read("input")
    .groupby("category")
    .sum()
    .save("result")
)
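As a rough picture of how a size threshold and a forced backend could interact, here is a minimal sketch. The helper `choose_backend` is hypothetical and only looks at file size; the library's real selection is described as also weighing system resources and file characteristics, which this sketch leaves out.

```python
import os
from typing import Optional


def choose_backend(path: str, threshold_mb: float, islazy: Optional[bool] = None) -> str:
    """Return "dask" or "pandas" for the file at `path`.

    An explicit `islazy` wins; otherwise files larger than `threshold_mb`
    go to Dask. This mirrors the README's description, not the library's code.
    """
    if islazy is not None:
        return "dask" if islazy else "pandas"
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return "dask" if size_mb > threshold_mb else "pandas"
```

Passing `islazy` skips the size check entirely, which matches the "Force backend" lines above.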
SQL Operations
import parquetframe as pf
# Read data
customers = pf.read("customers.parquet")
orders = pf.read("orders.parquet")
# Execute SQL queries with automatic JOIN
result = customers.sql("""
SELECT c.name, c.age, SUM(o.amount) as total_spent
FROM df c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.age > 25
GROUP BY c.name, c.age
ORDER BY total_spent DESC
""", orders=orders)
# Works with both pandas and Dask backends
print(result.head())
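To see what the query above computes without installing anything, the same SQL can be run against SQLite from the standard library. SQLite is a stand-in here, not the DuckDB engine ParquetFrame uses, and the sample rows are invented. The table `df` plays the role of the calling frame and `orders` the role of the keyword argument.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (customer_id INTEGER, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

# Invented sample data, for illustration only.
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [(1, "Ana", 31), (2, "Ben", 22), (3, "Cho", 40)],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 99.0), (3, 10.0)],
)

QUERY = """
SELECT c.name, c.age, SUM(o.amount) AS total_spent
FROM df c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.age > 25
GROUP BY c.name, c.age
ORDER BY total_spent DESC
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('Ana', 31, 25.0), ('Cho', 40, 10.0)]
```

Ben is filtered out by `c.age > 25`, and the remaining customers are ordered by their summed order amounts.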
Genomic Data Analysis
import parquetframe as pf
# Read genomic interval data
genes = pf.read("genes.parquet")
peaks = pf.read("chip_seq_peaks.parquet")
# Find overlapping intervals with parallel processing
overlaps = genes.bio.overlap(peaks, broadcast=True)
# Cluster nearby genomic features
clustered = genes.bio.cluster(min_dist=1000)
# Works efficiently with both small and large datasets
print(f"Found {len(overlaps)} gene-peak overlaps")
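For readers unfamiliar with interval operations, the following pure-Python sketch shows the idea behind overlap and clustering. It is a naive illustration and not bioframe's or ParquetFrame's implementation; half-open coordinates and the rule "a gap of at most `min_dist` joins a cluster" are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), assumed half-open


def overlaps(a: List[Interval], b: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """All pairs (x, y) on the same chromosome whose ranges intersect (O(n*m))."""
    return [
        (x, y)
        for x in a
        for y in b
        if x[0] == y[0] and x[1] < y[2] and y[1] < x[2]
    ]


def cluster(intervals: List[Interval], min_dist: int = 0) -> List[int]:
    """Give each interval a cluster id, in input order.

    Intervals on the same chromosome share an id when the gap to the
    running cluster end is at most `min_dist`.
    """
    order = sorted(range(len(intervals)), key=lambda i: intervals[i])
    ids = [0] * len(intervals)
    current, prev_chrom, prev_end = -1, None, 0
    for i in order:
        chrom, start, end = intervals[i]
        if chrom != prev_chrom or start - prev_end > min_dist:
            current += 1          # start a new cluster
            prev_end = end
        else:
            prev_end = max(prev_end, end)  # extend the current cluster
        prev_chrom = chrom
        ids[i] = current
    return ids
```

A production implementation would avoid the quadratic pair scan, for example by sorting or by partitioning per chromosome, which is where a parallel backend can help.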
CLI Usage
ParquetFrame includes a powerful command-line interface for data exploration and processing:
Basic Commands
# Get file information
pframe info data.parquet
# Quick data preview
pframe run data.parquet
# Interactive mode
pframe interactive data.parquet
# SQL queries on parquet files
pframe sql "SELECT * FROM df WHERE age > 30" --file data.parquet
pframe sql --interactive --file data.parquet
Data Processing
# Filter and transform data
pframe run data.parquet \
--query "age > 30" \
--columns "name,age,city" \
--head 10
# Save processed data with script generation
pframe run data.parquet \
--query "status == 'active'" \
--output "filtered.parquet" \
--save-script "my_analysis.py"
# Force specific backends
pframe run data.parquet --force-dask --describe
pframe run data.parquet --force-pandas --info
# SQL operations with JOINs
pframe sql "SELECT * FROM df JOIN customers ON df.id = customers.id" \
--file orders.parquet \
--join "customers=customers.parquet" \
--output results.parquet
Interactive Mode
# Start interactive session
pframe interactive data.parquet
# In the interactive session:
>>> pf.query("age > 25").groupby("city").size()
>>> pf.save("result.parquet", save_script="session.py")
>>> exit()
Performance Benchmarking
# Run comprehensive performance benchmarks
pframe benchmark
# Benchmark specific operations
pframe benchmark --operations "groupby,filter,sort"
# Test with custom file sizes
pframe benchmark --file-sizes "1000,10000,100000"
# Save benchmark results
pframe benchmark --output results.json --quiet
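The shape of such a benchmark, operations crossed with data sizes, can be sketched with the standard library alone. Everything below is hypothetical: it times plain Python lists of dicts rather than pandas or Dask frames, and the function names are not part of ParquetFrame.

```python
import random
import time
from typing import Callable, Dict, List


def make_rows(n: int, seed: int = 0) -> List[dict]:
    """Generate n synthetic rows with a small-cardinality key and a float value."""
    rng = random.Random(seed)
    return [{"key": rng.randrange(10), "value": rng.random()} for _ in range(n)]


def op_filter(rows: List[dict]) -> List[dict]:
    return [r for r in rows if r["value"] > 0.5]


def op_sort(rows: List[dict]) -> List[dict]:
    return sorted(rows, key=lambda r: r["value"])


def op_groupby(rows: List[dict]) -> Dict[int, float]:
    totals: Dict[int, float] = {}
    for r in rows:
        totals[r["key"]] = totals.get(r["key"], 0.0) + r["value"]
    return totals


OPERATIONS: Dict[str, Callable] = {
    "groupby": op_groupby,
    "filter": op_filter,
    "sort": op_sort,
}


def run_benchmark(operations: str, file_sizes: str) -> Dict[str, Dict[int, float]]:
    """Time each named operation at each row count; returns seconds per run."""
    sizes = [int(s) for s in file_sizes.split(",")]
    results: Dict[str, Dict[int, float]] = {}
    for name in (n.strip() for n in operations.split(",")):
        op = OPERATIONS[name]
        results[name] = {}
        for size in sizes:
            rows = make_rows(size)
            start = time.perf_counter()
            op(rows)
            results[name][size] = time.perf_counter() - start
    return results
```

Calling `run_benchmark("groupby,filter,sort", "1000,10000,100000")` mirrors the comma-separated arguments shown above and returns a nested dict that could be dumped to JSON.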
YAML Workflows
# Create an example workflow
pframe workflow --create-example my_pipeline.yml
# List available workflow step types
pframe workflow --list-steps
# Execute a workflow
pframe workflow my_pipeline.yml
# Execute with custom variables
pframe workflow my_pipeline.yml --variables "input_dir=data,min_age=21"
# Validate workflow without executing
pframe workflow --validate my_pipeline.yml
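The `--variables` value is a comma-separated list of `key=value` pairs. A minimal parser for that format might look like the sketch below; the function name `parse_variables` is hypothetical, and whether ParquetFrame converts values such as `21` to numbers is not stated here, so values stay strings.

```python
from typing import Dict


def parse_variables(spec: str) -> Dict[str, str]:
    """Parse "k1=v1,k2=v2" into a dict of strings.

    Raises ValueError for entries that are not of the form key=value.
    """
    variables: Dict[str, str] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        variables[key.strip()] = value.strip()
    return variables


print(parse_variables("input_dir=data,min_age=21"))
# {'input_dir': 'data', 'min_age': '21'}
```

Note that this simple split cannot represent values that themselves contain commas.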
Key Benefits
- Intelligent Performance: Memory-aware backend selection considering file size, system resources, and file characteristics
- Built-in Benchmarking: Comprehensive performance analysis tools to optimize your data processing workflows
- Simplicity: One consistent API regardless of backend
- Flexibility: Override automatic decisions when needed
- Compatibility: Drop-in replacement for pandas.read_parquet()
- CLI Power: Full command-line interface for data exploration, batch processing, and performance benchmarking
- Reproducibility: Automatic Python script generation from CLI sessions
- Zero-Configuration Optimization: Automatic performance improvements with intelligent defaults
Requirements
- Python 3.9+
- pandas >= 2.0.0
- dask[dataframe] >= 2023.1.0
- pyarrow >= 10.0.0
Optional Dependencies
CLI Features ([cli])
- click >= 8.0 (for CLI interface)
- rich >= 13.0 (for enhanced terminal output)
- psutil >= 5.8.0 (for performance monitoring and memory-aware backend selection)
- pyyaml >= 6.0 (for YAML workflow support)
SQL Features ([sql])
- duckdb >= 0.9.0 (for SQL query functionality)
Genomics Features ([bio])
- bioframe >= 0.4.0 (for genomic interval operations)
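To check which optional features are usable in the current environment, one option is to probe for the importable module names with `importlib`. The mapping below is derived from the dependency lists above (note that pyyaml is imported as `yaml`); the helper names are hypothetical and not part of ParquetFrame.

```python
from importlib.util import find_spec
from typing import Dict, Iterable, List

# Extra name -> top-level modules it needs, taken from the lists above.
EXTRAS: Dict[str, tuple] = {
    "cli": ("click", "rich", "psutil", "yaml"),
    "sql": ("duckdb",),
    "bio": ("bioframe",),
}


def missing_modules(modules: Iterable[str]) -> List[str]:
    """Return the modules that cannot be found on the import path."""
    return [m for m in modules if find_spec(m) is None]


def extras_status(extras: Dict[str, tuple] = EXTRAS) -> Dict[str, List[str]]:
    """Map each extra to the list of its modules that are not installed."""
    return {extra: missing_modules(mods) for extra, mods in extras.items()}


for extra, missing in extras_status().items():
    print(extra, "ok" if not missing else f"missing: {', '.join(missing)}")
```

This checks only that the modules are present, not that their versions satisfy the minimums listed above.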
Development Status
✅ Stable & Production Ready: All 203 tests passing with 65% test coverage
🔄 Active Development: Regular updates and improvements
🐛 Bug-Free Core: Recently resolved all critical issues and test failures
📦 Latest Release: v0.1.1 with enhanced stability and bug fixes
CLI Reference
Commands
- pframe info <file>: Display file information and schema
- pframe run <file> [options]: Process data with various options
- pframe interactive [file]: Start interactive Python session
- pframe sql <query> [options]: Execute SQL queries on parquet files
- pframe benchmark [options]: Run performance benchmarks and analysis
- pframe workflow [file] [options]: Execute or manage YAML workflow files
Options for pframe run
- --query, -q: Filter data (e.g., "age > 30")
- --columns, -c: Select columns (e.g., "name,age,city")
- --head, -h N: Show first N rows
- --tail, -t N: Show last N rows
- --sample, -s N: Show N random rows
- --describe: Statistical description
- --info: Data types and info
- --output, -o: Save to file
- --save-script, -S: Generate Python script
- --threshold: Size threshold for backend selection (MB)
- --force-pandas: Force pandas backend
- --force-dask: Force Dask backend
Options for pframe sql
- --file, -f: Main parquet file to query (available as 'df')
- --join, -j: Additional files for JOINs in format 'name=path'
- --output, -o: Save query results to file
- --interactive, -i: Start interactive SQL mode
- --explain: Show query execution plan
- --validate: Validate SQL query syntax
Options for pframe benchmark
- --output, -o: Save benchmark results to JSON file
- --quiet, -q: Run in quiet mode (minimal output)
- --operations: Comma-separated operations to benchmark (groupby,filter,sort,aggregation,join)
- --file-sizes: Comma-separated test file sizes in rows (e.g., '1000,10000,100000')
Options for pframe workflow
- --validate, -v: Validate workflow file without executing
- --variables, -V: Set workflow variables as key=value pairs
- --list-steps: List all available workflow step types
- --create-example PATH: Create an example workflow file
- --quiet, -q: Run in quiet mode (minimal output)
Documentation
Full documentation is available at https://leechristophermurray.github.io/parquetframe/
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.