Pipe your data naturally - intuitive data manipulation with readable syntax
PipeFrame 🚀
Pipe Your Data Naturally
A modern, intuitive data manipulation library for Python that makes your data workflows read like natural language. Built on pandas' robust foundation with a clean, pipe-based syntax inspired by R's dplyr and tidyverse.
from pipeframe import *
# Your data pipeline reads like a story
result = (df
    >> filter('age > 21')
    >> group_by('city')
    >> summarize(avg_income='mean(income)', count='count()')
    >> arrange('-avg_income')
)
💡 How to read
`>>`: read the `>>` operator as "pipe to" or "then". For example, the code above reads as: "Take df, then filter for age > 21, then group by city, then summarize..."
🚀 Why PipeFrame?
Readability First
# ❌ Traditional pandas: Hard to read
df[df['age'] > 30].groupby('dept')['salary'].mean().sort_values(ascending=False)

# ✅ PipeFrame: Clear and intuitive
df >> filter('age > 30') >> group_by('dept') >> summarize(avg='mean(salary)') >> arrange('-avg')
Key Features
- 🔗 Pipe Operator `>>` - Natural method chaining without nested parentheses
- 📝 String Expressions - Write conditions as readable strings: 'age > 30 & salary > 50000'
- 🔒 Security Hardened - Built-in expression validation prevents code injection
- 🐼 Pandas Compatible - Works seamlessly with existing pandas DataFrames
- 🎯 Type Safe - Full type hints for excellent IDE support and autocomplete
- ⚡ Performance - Only ~5-15% overhead vs raw pandas
- 📊 Rich I/O - Read/write CSV, Excel, JSON, Parquet, SQL, and more
- 🔄 Powerful Reshaping - Tidyr-style pivoting, melting, and transformations
- 🛡️ Production Ready - Comprehensive error handling and validation
🚀 Quick Start
Installation
# Basic installation
pip install pipeframe
# With all optional dependencies
pip install pipeframe[all]
# Specific features
pip install pipeframe[excel] # Excel support
pip install pipeframe[parquet] # Parquet files
pip install pipeframe[sql] # SQL databases
pip install pipeframe[plot] # Visualization
Hello PipeFrame!
from pipeframe import *
# Create a DataFrame
df = DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 32, 37, 29],
'salary': [50000, 65000, 72000, 58000],
'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales']
})
# Transform with intuitive verbs
result = (df
    >> filter('age > 30')
    >> define(
        bonus='salary * 0.1',
        total='salary + bonus'
    )
    >> select('name', 'dept', 'total')
    >> arrange('-total')
)
print(result)
# name dept total
# 0 Charlie Engineering 79200.0
# 1 Bob Marketing 71500.0
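For orientation, here is the same transformation written in raw pandas (PipeFrame is built on pandas, so the results should match):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 32, 37, 29],
    'salary': [50000, 65000, 72000, 58000],
    'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales'],
})

# filter -> boolean mask; define -> assign; select -> column subset; arrange -> sort_values
result = (df[df['age'] > 30]
          .assign(bonus=lambda d: d['salary'] * 0.1,
                  total=lambda d: d['salary'] + d['bonus'])
          [['name', 'dept', 'total']]
          .sort_values('total', ascending=False)
          .reset_index(drop=True))
# Charlie (79200.0) first, then Bob (71500.0)
```

The pipe version above expresses the same steps top-to-bottom instead of inside-out.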
📖 Core Concepts
The Pipe Operator >>
Chain operations naturally without nested function calls:
# Traditional approach (hard to read)
result = arrange(
    select(
        define(
            filter(df, 'age > 25'),
            experience='2024 - start_year'
        ),
        'name', 'experience', 'salary'
    ),
    '-salary'
)

# PipeFrame approach (reads like a recipe)
result = (df
    >> filter('age > 25')
    >> define(experience='2024 - start_year')
    >> select('name', 'experience', 'salary')
    >> arrange('-salary')
)
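A pipe operator like this is typically built on Python's right-shift protocol: the pipeline steps return objects whose `__rrshift__` is invoked by `data >> step`. A minimal illustrative sketch (the `Pipeable` class and `keep_over` helper are hypothetical, not PipeFrame's actual internals):

```python
import pandas as pd

class Pipeable:
    """Wraps a function so that `df >> Pipeable(fn)` applies fn to df."""
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python falls back to this because DataFrame does not define
        # a __rshift__ that handles Pipeable operands.
        return self.func(data)

def keep_over(threshold):
    # A toy "verb": keep rows whose 'value' exceeds threshold
    return Pipeable(lambda df: df[df['value'] > threshold])

df = pd.DataFrame({'value': [1, 5, 10]})
out = df >> keep_over(4)   # rows with value 5 and 10
```

This is the same trick used by other pandas piping libraries; the verb factories return lightweight objects rather than executing immediately.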
Core Verbs
| Verb | Purpose | Example |
|---|---|---|
| `define()` | Create/modify columns | `>> define(total='price * quantity')` |
| `filter()` | Filter rows | `>> filter('age > 30 & city == "NYC"')` |
| `select()` | Choose columns | `>> select('name', 'age', 'salary')` |
| `arrange()` | Sort data | `>> arrange('-salary', 'name')` |
| `group_by()` | Group data | `>> group_by('category', 'region')` |
| `summarize()` | Aggregate | `>> summarize(total='sum(sales)', avg='mean(price)')` |
| `rename()` | Rename columns | `>> rename(customer_id='cid')` |
| `distinct()` | Unique rows | `>> distinct('product', 'store')` |
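Each verb maps onto a familiar pandas operation. For example, the filter/select/arrange trio corresponds to `query`, column subsetting, and `sort_values` (raw pandas shown for orientation; PipeFrame's internals may differ):

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'age': [25, 35, 45],
                   'salary': [40, 60, 80]})

filtered = df.query('age > 30')                          # filter('age > 30')
chosen = filtered[['name', 'salary']]                    # select('name', 'salary')
ordered = chosen.sort_values('salary', ascending=False)  # arrange('-salary')
```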
🔥 Advanced Features
Conditional Logic
# if_else for binary conditions
df >> define(
    status=if_else('salary > 60000', 'High', 'Standard'),
    category=if_else('age >= 30', 'Senior', 'Junior')
)

# case_when for multiple conditions
df >> define(
    grade=case_when(
        ('score >= 90', 'A'),
        ('score >= 80', 'B'),
        ('score >= 70', 'C'),
        ('score >= 60', 'D'),
        default='F'
    )
)
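In raw pandas, the usual equivalents are `np.where` for binary conditions and `np.select` for ordered multi-way conditions (plain NumPy/pandas, not PipeFrame code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [95, 83, 71, 64, 50]})

# if_else -> np.where (vectorized ternary)
df['passed'] = np.where(df['score'] >= 60, 'Yes', 'No')

# case_when -> np.select: conditions are checked in order, first match wins
df['grade'] = np.select(
    [df['score'] >= 90, df['score'] >= 80, df['score'] >= 70, df['score'] >= 60],
    ['A', 'B', 'C', 'D'],
    default='F',
)
```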
GroupBy Operations
# Summary by group
summary = (df
    >> group_by('department', 'location')
    >> summarize(
        headcount='count()',
        avg_salary='mean(salary)',
        total_sales='sum(sales)',
        top_performer='max(performance_score)'
    )
    >> arrange('-avg_salary')
)

# Multiple aggregations
analysis = (df
    >> group_by('product_category')
    >> summarize(
        units_sold='sum(quantity)',
        revenue='sum(price * quantity)',
        avg_price='mean(price)',
        num_transactions='count()'
    )
    >> define(
        avg_transaction_value='revenue / num_transactions'
    )
)
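The group_by/summarize pattern corresponds to pandas' named aggregation. A small self-contained pandas sketch of the second pipeline (toy data, not from the source):

```python
import pandas as pd

df = pd.DataFrame({
    'product_category': ['toys', 'toys', 'books'],
    'quantity': [2, 3, 1],
    'price': [10.0, 10.0, 20.0],
})

# group_by + summarize -> groupby + named aggregation,
# then define -> assign for the derived ratio
analysis = (df.assign(line_revenue=df['price'] * df['quantity'])
              .groupby('product_category', as_index=False)
              .agg(units_sold=('quantity', 'sum'),
                   revenue=('line_revenue', 'sum'),
                   num_transactions=('quantity', 'count'))
              .assign(avg_transaction_value=lambda d: d['revenue'] / d['num_transactions']))
```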
Data Reshaping
# Pivot wider (long to wide)
wide = (df
    >> pivot_wider(
        id_cols='student',
        names_from='subject',
        values_from='grade'
    )
)

# Pivot longer (wide to long)
long = (df
    >> pivot_longer(
        cols=['Q1_sales', 'Q2_sales', 'Q3_sales', 'Q4_sales'],
        names_to='quarter',
        values_to='sales'
    )
)
# Separate columns
separated = df >> separate('full_name', into=['first', 'last'], sep=' ')
# Unite columns
united = df >> unite('full_date', ['year', 'month', 'day'], sep='-')
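These reshaping verbs map onto pandas' `pivot` and `melt`. A round-trip sketch in plain pandas (toy data for illustration):

```python
import pandas as pd

long_df = pd.DataFrame({'student': ['Ann', 'Ann', 'Bo', 'Bo'],
                        'subject': ['math', 'art', 'math', 'art'],
                        'grade': [90, 85, 70, 95]})

# pivot_wider -> pivot: one row per student, one column per subject
wide = long_df.pivot(index='student', columns='subject', values='grade').reset_index()

# pivot_longer -> melt: stack the subject columns back into rows
back = wide.melt(id_vars='student', var_name='subject', value_name='grade')
```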
Column Selection Helpers
# Select by pattern
df >> select(
    'id',
    starts_with('date_'),   # All columns starting with 'date_'
    ends_with('_amount'),   # All columns ending with '_amount'
    contains('price'),      # All columns containing 'price'
    matches(r'Q\d_sales')   # Regex pattern matching
)
# Column ranges
df >> select('id', 'name:salary') # Select from 'name' to 'salary'
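In raw pandas, pattern-based selection is usually done with `DataFrame.filter`, where `starts_with`/`ends_with` become regex anchors and `contains` becomes `like=` (plain pandas, not PipeFrame's helpers):

```python
import pandas as pd

df = pd.DataFrame(columns=['id', 'date_start', 'date_end', 'tax_amount', 'unit_price'])

starts = df.filter(regex=r'^date_').columns.tolist()   # starts_with('date_')
ends = df.filter(regex=r'_amount$').columns.tolist()   # ends_with('_amount')
has = df.filter(like='price').columns.tolist()         # contains('price')
```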
I/O Operations
# Read from various sources
df = read_csv('data.csv')
df = read_excel('data.xlsx', sheet_name='Sales')
df = read_json('data.json', orient='records')
df = read_parquet('data.parquet')
df = read_sql('SELECT * FROM users', connection)
df = read_clipboard() # Paste from spreadsheet!
# Write to different formats
df.to_csv('output.csv', index=False)
df.to_excel('report.xlsx', sheet_name='Results')
df.to_parquet('data.parquet', compression='gzip')
df.to_json('data.json', orient='records', lines=True)
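Since the writers mirror pandas' own I/O methods, a quick round trip can confirm data survives serialization. A sketch using an in-memory buffer (the same pattern works with file paths):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Write to CSV and read it back without touching the filesystem
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf)
```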
🎯 Real-World Examples
Sales Analysis Pipeline
from pipeframe import *
# Load and analyze sales data
analysis = (
    read_csv('sales_data.csv')
    >> filter('date >= "2024-01-01" & revenue > 0')
    >> define(
        profit='revenue - cost',
        margin='profit / revenue * 100',
        quarter='pd.to_datetime(date).dt.quarter'
    )
    >> group_by('product_category', 'quarter')
    >> summarize(
        total_revenue='sum(revenue)',
        total_profit='sum(profit)',
        avg_margin='mean(margin)',
        num_sales='count()'
    )
    >> define(
        profit_per_sale='total_profit / num_sales'
    )
    >> arrange('-total_revenue')
)
# Export results
analysis.to_excel('quarterly_analysis.xlsx', sheet_name='Summary')
Customer Segmentation
# Segment customers by behavior
segments = (df
    >> filter('total_purchases > 0')
    >> define(
        avg_order_value='total_spent / total_purchases',
        recency_days='(pd.Timestamp.now() - last_purchase_date).dt.days',
        segment=case_when(
            ('avg_order_value > 100 & recency_days < 30', 'Premium Active'),
            ('avg_order_value > 100 & recency_days >= 30', 'Premium At Risk'),
            ('recency_days < 30', 'Standard Active'),
            ('recency_days < 90', 'At Risk'),
            default='Churned'
        )
    )
    >> group_by('segment')
    >> summarize(
        customers='count()',
        total_value='sum(total_spent)',
        avg_value='mean(total_spent)'
    )
)
Data Cleaning Pipeline
# Clean and standardize data
clean_data = (
    read_excel('messy_data.xlsx')
    >> filter('id.notna()')  # Remove rows without ID
    >> define(
        # Standardize text fields
        name='name.str.title().str.strip()',
        email='email.str.lower().str.strip()',
        # Parse dates
        signup_date='pd.to_datetime(signup_date)',
        # Fill missing values
        phone='phone.fillna("Not Provided")',
        # Create derived fields
        account_age_days='(pd.Timestamp.now() - signup_date).dt.days'
    )
    >> distinct('email', keep='first')  # Deduplicate by email
    >> arrange('signup_date')
)
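The text-standardization steps above are thin wrappers around pandas' vectorized string methods. A small pandas sketch of the same cleaning pattern (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['  alice SMITH ', 'BOB jones'],
                   'email': ['Alice@Example.COM', None]})

# Trim and title-case names; lowercase emails and fill missing values
clean = df.assign(
    name=df['name'].str.strip().str.title(),
    email=df['email'].str.lower().fillna('not provided'),
)
```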
🔒 Security Features
PipeFrame includes built-in security features to prevent code injection:
# ✅ Safe expressions are allowed
df >> define(total='price * quantity')
df >> filter('age > 30 & city == "NYC"')

# ❌ Dangerous expressions are blocked
df >> define(bad="__import__('os').system('rm -rf /')")
# PipeFrameExpressionError: Expression contains dangerous pattern
# All string expressions are validated before execution
# - Blocks: __import__, exec(), eval(), compile(), open(), file()
# - Validates expression syntax
# - Uses pandas' restricted eval environment
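PipeFrame's exact validation code isn't shown here, but a common way to implement this kind of check is to parse the expression and walk its AST, rejecting any blocked names. A minimal illustrative sketch (hypothetical `is_safe` helper, not PipeFrame's implementation):

```python
import ast

# Names from the blocklist described above
BLOCKED = {'__import__', 'exec', 'eval', 'compile', 'open', 'file'}

def is_safe(expr: str) -> bool:
    """Return True if expr parses as an expression and uses no blocked names."""
    try:
        tree = ast.parse(expr, mode='eval')  # also validates syntax
    except SyntaxError:
        return False
    return not any(isinstance(node, ast.Name) and node.id in BLOCKED
                   for node in ast.walk(tree))
```

Because the check runs on the parsed tree rather than the raw string, obfuscations like extra whitespace or nesting the call inside other expressions are still caught.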
📊 Performance
PipeFrame adds minimal overhead while dramatically improving code readability:
Benchmarks (1M rows):
- Filter operation: ~8% overhead
- GroupBy aggregation: ~12% overhead
- Complex pipeline (5 operations): ~10% overhead
Why the overhead is worth it:
- 🧠 Reduced cognitive load
- 🐛 Fewer bugs from clearer intent
- ⚡ Faster development time
- 👥 Easier code review
- 📈 Better maintainability
📚 Learning Resources
- Tutorial Notebook - Complete walkthrough
- API Reference - Detailed documentation
- Examples - Real-world use cases
- Contributing Guide - How to contribute
🤝 Contributing
We welcome contributions! Here's how you can help:
- 🐛 Report bugs - Open an issue
- 💡 Suggest features - Share your ideas
- 📝 Improve docs - Help others learn
- 🔧 Submit PRs - Fix bugs or add features
See CONTRIBUTING.md for guidelines.
📄 License
MIT License - see LICENSE file for details.
👨‍💻 Author
Dr. Yasser Mustafa
AI & Data Science Specialist | Theoretical Physics PhD
- 🎓 PhD in Theoretical Nuclear Physics
- 💼 10+ years in production AI/ML systems
- 🔬 48+ research publications
- 🏢 Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
- 📍 Based in Newcastle upon Tyne, UK
- ✉️ yasser.mustafan@gmail.com
- 🔗 LinkedIn | GitHub
PipeFrame was born from years of working with data pipelines in production environments, combining the elegance of R's tidyverse with Python's practicality.
🌟 Star History
If PipeFrame helps your work, please consider giving it a star! ⭐
🗺️ Roadmap
Current (v0.2.0)
- ✅ Core verbs and operators
- ✅ Security hardening
- ✅ Comprehensive I/O
- ✅ Reshape operations
- ✅ Type hints
Upcoming (v0.3.0)
- Join operations (left_join, inner_join, etc.)
- Window functions
- Time series helpers
- Enhanced plotting integration
- Performance optimizations
Future (v1.0.0)
- Lazy evaluation engine
- Alternative backends (Polars, DuckDB)
- Distributed computing support
- Interactive data exploration tools
- SQL generation from pipes
💬 Community
- Issues: Report bugs or request features
- Discussions: Ask questions, share use cases
Built with ❤️ for data scientists who value readability
Make your data speak naturally with PipeFrame 🚀
File details
Details for the file pipeframe-0.2.1.tar.gz.
File metadata
- Download URL: pipeframe-0.2.1.tar.gz
- Upload date:
- Size: 91.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `47e99c471d343b19db46a57a985d48ad4e7e05dfb647d99779e29f28e65b7ec2` |
| MD5 | `88459b58092380ca58c2ef1f7e7ba3da` |
| BLAKE2b-256 | `052b3357e79266df4527ca71ac695ee7c7cd540615e3048beabb69a109b199b5` |
Provenance
The following attestation bundles were made for pipeframe-0.2.1.tar.gz:
Publisher: publish.yml on Yasser03/pipeframe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pipeframe-0.2.1.tar.gz
- Subject digest: 47e99c471d343b19db46a57a985d48ad4e7e05dfb647d99779e29f28e65b7ec2
- Sigstore transparency entry: 1003808495
- Sigstore integration time:
- Permalink: Yasser03/pipeframe@c508a56f59d11cb518347b4f6373cbd891260c98
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Yasser03
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c508a56f59d11cb518347b4f6373cbd891260c98
- Trigger Event: workflow_dispatch
File details
Details for the file pipeframe-0.2.1-py3-none-any.whl.
File metadata
- Download URL: pipeframe-0.2.1-py3-none-any.whl
- Upload date:
- Size: 42.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `10a416249c2a6c8af5a87b76e8464659957968ac9e8d8e6379e654d189c1b806` |
| MD5 | `4613d6fd6480b6d5cbe292232b75597a` |
| BLAKE2b-256 | `2fccd9588be59c31e857cdb3454bd25cded141171241102c61465943a20349e0` |
Provenance
The following attestation bundles were made for pipeframe-0.2.1-py3-none-any.whl:
Publisher: publish.yml on Yasser03/pipeframe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pipeframe-0.2.1-py3-none-any.whl
- Subject digest: 10a416249c2a6c8af5a87b76e8464659957968ac9e8d8e6379e654d189c1b806
- Sigstore transparency entry: 1003808497
- Sigstore integration time:
- Permalink: Yasser03/pipeframe@c508a56f59d11cb518347b4f6373cbd891260c98
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Yasser03
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@c508a56f59d11cb518347b4f6373cbd891260c98
- Trigger Event: workflow_dispatch