A universal data processing framework with multi-format support (CSV, JSON, Parquet, ORC) and intelligent pandas/Dask backend selection

These details have not been verified by PyPI

Project description

ParquetFrame

ParquetFrame Logo

The ultimate Python data processing framework combining intelligent pandas/Dask switching with AI-powered exploration, genomic computing support, and advanced workflow orchestration.

🏆 Production-Ready: Successfully published to PyPI with 334 passing tests, 54% coverage, and comprehensive CI/CD pipeline 🤖 AI-First: Pioneering local LLM integration for privacy-preserving natural language data queries ⚡ Performance-Optimized: Shows 7-90% speed improvements with intelligent memory-aware backend selection

Features

🚀 Intelligent Backend Selection: Memory-aware automatic switching between pandas and Dask based on file size, system resources, and file characteristics

📁 Multi-Format Support: Seamlessly work with CSV, JSON, ORC, and Parquet files with automatic format detection

📁 Smart File Handling: Reads files without requiring extensions - supports .parquet, .pqt, .csv, .tsv, .json, .jsonl, .ndjson, .orc

🔄 Seamless Switching: Convert between pandas and Dask with simple methods

⚡ Full API Compatibility: All pandas/Dask operations work transparently

🗃️ SQL Support: Execute SQL queries on DataFrames using DuckDB with automatic JOIN capabilities

🧬 BioFrame Integration: Genomic interval operations with parallel Dask implementations

🖥️ Powerful CLI: Command-line interface for data exploration, SQL queries, and batch processing

📝 Script Generation: Automatic Python script generation from CLI sessions

⚡ Performance Optimization: Built-in benchmarking tools and intelligent threshold detection

📋 YAML Workflows: Define complex data processing pipelines in YAML with declarative syntax

🤖 AI-Powered Queries: Natural language to SQL conversion using local LLM models (Ollama)

📋 Interactive Terminal: Rich CLI with command history, autocomplete, and natural language support

🎯 Zero Configuration: Works out of the box with sensible defaults

Quick Start

Installation

# Basic installation
pip install parquetframe

# With CLI support
pip install parquetframe[cli]

# With SQL support (includes DuckDB)
pip install parquetframe[sql]

# With genomics support (includes bioframe)
pip install parquetframe[bio]

# With AI support (includes ollama)
pip install parquetframe[ai]

# All features
pip install parquetframe[all]

# Development installation
pip install parquetframe[dev,all]

Basic Usage

import parquetframe as pf

# Read a file - automatically chooses pandas or Dask based on size
df = pf.read("my_data")  # Handles .parquet/.pqt extensions automatically

# All standard DataFrame operations work
result = df.groupby("column").sum()

# Save without worrying about extensions
df.save("output")  # Saves as output.parquet

# Manual control
df.to_dask()    # Convert to Dask
df.to_pandas()  # Convert to pandas

Multi-Format Support

import parquetframe as pf

# Automatic format detection - works with all supported formats
csv_data = pf.read("sales.csv")        # CSV with automatic delimiter detection
json_data = pf.read("events.json")     # JSON with nested data support
parquet_data = pf.read("users.pqt")    # Parquet for optimal performance
orc_data = pf.read("logs.orc")         # ORC for big data ecosystems

# JSON Lines for streaming data
stream_data = pf.read("events.jsonl")  # Newline-delimited JSON

# TSV files with automatic tab detection
tsv_data = pf.read("data.tsv")         # Tab-separated values

# Manual format override when needed
text_as_csv = pf.read("data.txt", format="csv")

# All formats work with the same API
result = (csv_data
          .query("amount > 100")
          .groupby("region")
          .sum()
          .save("summary.parquet"))  # Convert to optimal format

# Intelligent backend selection works for all formats
large_csv = pf.read("huge_dataset.csv")  # Automatically uses Dask if >100MB
small_json = pf.read("config.json")     # Uses pandas for small files

Advanced Usage

import parquetframe as pf

# Custom threshold
df = pf.read("data", threshold_mb=50)  # Use Dask for files >50MB

# Force backend
df = pf.read("data", islazy=True)   # Force Dask
df = pf.read("data", islazy=False)  # Force pandas

# Check current backend
print(df.islazy)  # True for Dask, False for pandas

# Chain operations
result = (pf.read("input")
          .groupby("category")
          .sum()
          .save("result"))

SQL Operations

import parquetframe as pf

# Read data
customers = pf.read("customers.parquet")
orders = pf.read("orders.parquet")

# Execute SQL queries with automatic JOIN
result = customers.sql("""
    SELECT c.name, c.age, SUM(o.amount) as total_spent
    FROM df c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE c.age > 25
    GROUP BY c.name, c.age
    ORDER BY total_spent DESC
""", orders=orders)

# Works with both pandas and Dask backends
print(result.head())

AI-Powered Natural Language Queries

import parquetframe as pf
from parquetframe.ai import LLMAgent

# Set up AI agent (requires ollama to be installed)
agent = LLMAgent(model_name="llama3.2")

# Read your data
df = pf.read("sales_data.parquet")

# Ask questions in natural language
result = await agent.generate_query(
    "Show me the top 5 customers by total sales this year",
    df
)

if result.success:
    print(f"Generated SQL: {result.query}")
    print(result.result.head())
else:
    print(f"Query failed: {result.error}")

# More complex queries
result = await agent.generate_query(
    "What is the average order value by region, sorted by highest first?",
    df
)

Genomic Data Analysis

import parquetframe as pf

# Read genomic interval data
genes = pf.read("genes.parquet")
peaks = pf.read("chip_seq_peaks.parquet")

# Find overlapping intervals with parallel processing
overlaps = genes.bio.overlap(peaks, broadcast=True)

# Cluster nearby genomic features
clustered = genes.bio.cluster(min_dist=1000)

# Works efficiently with both small and large datasets
print(f"Found {len(overlaps)} gene-peak overlaps")

CLI Usage

ParquetFrame includes a powerful command-line interface for data exploration and processing:

Basic Commands

# Get file information - works with any supported format
pframe info data.parquet    # Parquet files
pframe info sales.csv       # CSV files
pframe info events.json     # JSON files
pframe info logs.orc        # ORC files

# Quick data preview with auto-format detection
pframe run data.csv         # Automatically detects CSV
pframe run events.jsonl     # JSON Lines format
pframe run users.tsv        # Tab-separated values

# Interactive mode with any format
pframe interactive data.csv

# Interactive mode with AI support
pframe interactive data.parquet --ai

# SQL queries on parquet files
pframe sql "SELECT * FROM df WHERE age > 30" --file data.parquet
pframe sql --interactive --file data.parquet

# AI-powered natural language queries
pframe query "show me users older than 30" --file data.parquet --ai
pframe query "what is the average age by city?" --file data.parquet --ai

Data Processing

# Filter and transform data
pframe run data.parquet \
  --query "age > 30" \
  --columns "name,age,city" \
  --head 10

# Save processed data with script generation
pframe run data.parquet \
  --query "status == 'active'" \
  --output "filtered.parquet" \
  --save-script "my_analysis.py"

# Force specific backends
pframe run data.parquet --force-dask --describe
pframe run data.parquet --force-pandas --info

# SQL operations with JOINs
pframe sql "SELECT * FROM df JOIN customers ON df.id = customers.id" \
  --file orders.parquet \
  --join "customers=customers.parquet" \
  --output results.parquet

Interactive Mode

# Start interactive session
pframe interactive data.parquet

# In the interactive session:
>>> pf.query("age > 25").groupby("city").size()
>>> pf.save("result.parquet", save_script="session.py")

# With AI enabled:
>>> show me all users from New York
>>> what is the average income by department?
>>> \\deps  # Check AI dependencies
>>> \\quit

Performance Benchmarking

# Run comprehensive performance benchmarks
pframe benchmark

# Benchmark specific operations
pframe benchmark --operations "groupby,filter,sort"

# Test with custom file sizes
pframe benchmark --file-sizes "1000,10000,100000"

# Save benchmark results
pframe benchmark --output results.json --quiet

YAML Workflows

# Create an example workflow
pframe workflow --create-example my_pipeline.yml

# List available workflow step types
pframe workflow --list-steps

# Execute a workflow
pframe workflow my_pipeline.yml

# Execute with custom variables
pframe workflow my_pipeline.yml --variables "input_dir=data,min_age=21"

# Validate workflow without executing
pframe workflow --validate my_pipeline.yml

Key Benefits

Intelligent Performance: Memory-aware backend selection considering file size, system resources, and file characteristics
Built-in Benchmarking: Comprehensive performance analysis tools to optimize your data processing workflows
Simplicity: One consistent API regardless of backend
Flexibility: Override automatic decisions when needed
Compatibility: Drop-in replacement for pandas.read_parquet()
CLI Power: Full command-line interface for data exploration, batch processing, and performance benchmarking
Reproducibility: Automatic Python script generation from CLI sessions
Zero-Configuration Optimization: Automatic performance improvements with intelligent defaults

Requirements

Python 3.9+
pandas >= 2.0.0
dask[dataframe] >= 2023.1.0
pyarrow >= 10.0.0

Optional Dependencies

CLI Features ([cli])

click >= 8.0 (for CLI interface)
rich >= 13.0 (for enhanced terminal output)
psutil >= 5.8.0 (for performance monitoring and memory-aware backend selection)
pyyaml >= 6.0 (for YAML workflow support)

SQL Features ([sql])

duckdb >= 0.9.0 (for SQL query functionality)

Genomics Features ([bio])

bioframe >= 0.4.0 (for genomic interval operations)

AI Features ([ai])

ollama >= 0.1.0 (for natural language to SQL conversion)
prompt-toolkit >= 3.0.0 (for enhanced interactive CLI)

Development Status

✅ Production Ready (v0.3.0): Multi-format support with comprehensive testing across CSV, JSON, Parquet, and ORC formats 🧪 Robust Testing: Complete test suite for AI, CLI, SQL, bioframe, and workflow functionality 🔄 Active Development: Regular updates with cutting-edge AI and performance optimization features 🏆 Quality Excellence: 9.2/10 assessment score with professional CI/CD pipeline 🤖 AI-Powered: First DataFrame library with local LLM integration for natural language queries ⚡ Performance Leader: Consistent speed improvements over direct pandas usage 📦 Feature Complete: 83% of advanced features fully implemented (29 of 35)

CLI Reference

Commands

pframe info <file> - Display file information and schema
pframe run <file> [options] - Process data with various options
pframe interactive [file] - Start interactive Python session with optional AI support
pframe query <question> [options] - Ask natural language questions about your data
pframe sql <query> [options] - Execute SQL queries on parquet files
pframe deps - Check and display dependency status
pframe benchmark [options] - Run performance benchmarks and analysis
pframe workflow [file] [options] - Execute or manage YAML workflow files

Options for `pframe run`

--query, -q - Filter data (e.g., "age > 30")
--columns, -c - Select columns (e.g., "name,age,city")
--head, -h N - Show first N rows
--tail, -t N - Show last N rows
--sample, -s N - Show N random rows
--describe - Statistical description
--info - Data types and info
--output, -o - Save to file
--save-script, -S - Generate Python script
--threshold - Size threshold for backend selection (MB)
--force-pandas - Force pandas backend
--force-dask - Force Dask backend

Options for `pframe query`

--file, -f - Parquet file to query
--db-uri - Database URI to connect to
--ai - Enable AI-powered natural language processing
--model - LLM model to use (default: llama3.2)

Options for `pframe interactive`

--ai - Enable AI-powered natural language queries
--no-ai - Disable AI features (default if ollama not available)

Options for `pframe sql`

--file, -f - Main parquet file to query (available as 'df')
--join, -j - Additional files for JOINs in format 'name=path'
--output, -o - Save query results to file
--interactive, -i - Start interactive SQL mode
--explain - Show query execution plan
--validate - Validate SQL query syntax

Options for `pframe benchmark`

--output, -o - Save benchmark results to JSON file
--quiet, -q - Run in quiet mode (minimal output)
--operations - Comma-separated operations to benchmark (groupby,filter,sort,aggregation,join)
--file-sizes - Comma-separated test file sizes in rows (e.g., '1000,10000,100000')

Options for `pframe workflow`

--validate, -v - Validate workflow file without executing
--variables, -V - Set workflow variables as key=value pairs
--list-steps - List all available workflow step types
--create-example PATH - Create an example workflow file
--quiet, -q - Run in quiet mode (minimal output)

Documentation

Full documentation is available at https://leechristophermurray.github.io/parquetframe/

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

2.0.0

Dec 8, 2025

2.0.0b0 pre-release

Oct 22, 2025

2.0.0a7 pre-release

Dec 7, 2025

2.0.0a5 pre-release

Oct 21, 2025

2.0.0a0 pre-release

Oct 21, 2025

1.0.1

Oct 19, 2025

0.5.3

Oct 15, 2025

0.4.2

Oct 15, 2025

This version

0.4.0

Oct 15, 2025

0.3.2

Oct 14, 2025

0.3.1

Oct 14, 2025

0.2.3.2

Sep 27, 2025

0.2.3.1

Sep 27, 2025

0.2.3

Sep 26, 2025

0.2.1

Sep 25, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parquetframe-0.4.0.tar.gz (133.6 kB view details)

Uploaded Oct 15, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

parquetframe-0.4.0-py3-none-any.whl (89.4 kB view details)

Uploaded Oct 15, 2025 Python 3

File details

Details for the file parquetframe-0.4.0.tar.gz.

File metadata

Download URL: parquetframe-0.4.0.tar.gz
Upload date: Oct 15, 2025
Size: 133.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parquetframe-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`c1afbe646b8f8c18645b142e84365938c3e468e6b40dd708c1baea22ce9789a2`
MD5	`e0edca7cf5269b6b5539150cc088317d`
BLAKE2b-256	`8531d36f829ba36b35d3121a7f406588c496e3fec2fd3233c0e49a3903d12ebe`

See more details on using hashes here.

Provenance

The following attestation bundles were made for parquetframe-0.4.0.tar.gz:

Publisher: release.yml on leechristophermurray/parquetframe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parquetframe-0.4.0.tar.gz
- Subject digest: c1afbe646b8f8c18645b142e84365938c3e468e6b40dd708c1baea22ce9789a2
- Sigstore transparency entry: 606960299
- Sigstore integration time: Oct 15, 2025
Source repository:
- Permalink: leechristophermurray/parquetframe@a386e12b4b4fe65d2a3d19dec891d8ce166aaa0f
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/leechristophermurray
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a386e12b4b4fe65d2a3d19dec891d8ce166aaa0f
- Trigger Event: push

File details

Details for the file parquetframe-0.4.0-py3-none-any.whl.

File metadata

Download URL: parquetframe-0.4.0-py3-none-any.whl
Upload date: Oct 15, 2025
Size: 89.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parquetframe-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d65d6b301b955e5b0d69375cd6aa7b6311baf95a15a8e710641948f2ac392ebb`
MD5	`e4800ac4b708a097737ffe5b268ecf54`
BLAKE2b-256	`72f9232c82f9c00dc57f180cd1a254a9453aa50b4bfd0e5d91c2f5c7d73c012a`

See more details on using hashes here.

Provenance

The following attestation bundles were made for parquetframe-0.4.0-py3-none-any.whl:

Publisher: release.yml on leechristophermurray/parquetframe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parquetframe-0.4.0-py3-none-any.whl
- Subject digest: d65d6b301b955e5b0d69375cd6aa7b6311baf95a15a8e710641948f2ac392ebb
- Sigstore transparency entry: 606960305
- Sigstore integration time: Oct 15, 2025
Source repository:
- Permalink: leechristophermurray/parquetframe@a386e12b4b4fe65d2a3d19dec891d8ce166aaa0f
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/leechristophermurray
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@a386e12b4b4fe65d2a3d19dec891d8ce166aaa0f
- Trigger Event: push

parquetframe 0.4.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

ParquetFrame

Features

Quick Start

Installation

Basic Usage

Multi-Format Support

Advanced Usage

SQL Operations

AI-Powered Natural Language Queries

Genomic Data Analysis

CLI Usage

Basic Commands

Data Processing

Interactive Mode

Performance Benchmarking

YAML Workflows

Key Benefits

Requirements

Optional Dependencies

Development Status

CLI Reference

Commands

Options for pframe run

Options for pframe query

Options for pframe interactive

Options for pframe sql

Options for pframe benchmark

Options for pframe workflow

Documentation

Contributing

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

Options for `pframe run`

Options for `pframe query`

Options for `pframe interactive`

Options for `pframe sql`

Options for `pframe benchmark`

Options for `pframe workflow`