Pipe your data naturally - intuitive data manipulation with readable syntax

PipeFrame 🔄

Pipe Your Data Naturally

Python 3.8+ | License: MIT | Code style: black

A modern, intuitive data manipulation library for Python that makes your data workflows read like natural language. Built on pandas' robust foundation with a clean, pipe-based syntax inspired by R's dplyr and tidyverse.

from pipeframe import *

# Your data pipeline reads like a story
result = (df
    >> filter('age > 21')
    >> group_by('city')  
    >> summarize(avg_income='mean(income)', count='count()')
    >> arrange('-avg_income')
)

💡 How to read >>: Read the >> operator as "pipe to" or "then". For example, the code above reads as: "Take df, then filter for age > 21, then group by city, then summarize..."
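For readers curious how an expression like `data >> verb(...)` can work at all, here is a minimal plain-Python sketch of the general mechanism, based on Python's reflected right-shift hook `__rrshift__`. It is not PipeFrame's actual implementation: the `Verb` class and the `keep_if` and `sort_by` helpers are hypothetical names made up for this illustration, and a list of dicts stands in for a DataFrame.

```python
from functools import wraps


class Verb:
    """Hypothetical wrapper: stores a function and its arguments until data is piped in."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # Python falls back to this for `data >> verb` when `data`
        # does not implement `>>` itself (plain lists do not).
        return self.func(data, *self.args, **self.kwargs)


def verb(func):
    """Turn `func(data, ...)` into a factory that returns a pipeable Verb."""
    @wraps(func)
    def make(*args, **kwargs):
        return Verb(func, *args, **kwargs)
    return make


@verb
def keep_if(rows, predicate):
    return [row for row in rows if predicate(row)]


@verb
def sort_by(rows, key, descending=False):
    return sorted(rows, key=lambda row: row[key], reverse=descending)


people = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 32},
    {"name": "Charlie", "age": 37},
]

# Reads left to right: take people, then keep age > 30, then sort by age.
result = people >> keep_if(lambda row: row["age"] > 30) >> sort_by("age", descending=True)
names = [row["name"] for row in result]
print(names)
```

Because `>>` is left-associative, each step receives the output of the previous one, which is what makes the chain read as "take, then, then".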


🌟 Why PipeFrame?

Readability First

# ❌ Traditional pandas: Hard to read
df[df['age'] > 30].groupby('dept')['salary'].mean().sort_values(ascending=False)

# ✅ PipeFrame: Clear and intuitive
df >> filter('age > 30') >> group_by('dept') >> summarize(avg='mean(salary)') >> arrange('-avg')

Key Features

  • 🔗 Pipe Operator >> - Natural method chaining without nested parentheses
  • 📝 String Expressions - Write conditions as readable strings: 'age > 30 & salary > 50000'
  • 🔒 Security Hardened - Built-in expression validation prevents code injection
  • 🐼 Pandas Compatible - Works seamlessly with existing pandas DataFrames
  • 🎯 Type Safe - Full type hints for excellent IDE support and autocomplete
  • ⚡ Performance - Only ~5-15% overhead vs raw pandas
  • 📊 Rich I/O - Read/write CSV, Excel, JSON, Parquet, SQL, and more
  • 🔄 Powerful Reshaping - Tidyr-style pivoting, melting, and transformations
  • 🛡️ Production Ready - Comprehensive error handling and validation

🚀 Quick Start

Installation

# Basic installation
pip install pipeframe

# With all optional dependencies
pip install pipeframe[all]

# Specific features
pip install pipeframe[excel]      # Excel support
pip install pipeframe[parquet]    # Parquet files
pip install pipeframe[sql]        # SQL databases
pip install pipeframe[plot]       # Visualization

Hello PipeFrame!

from pipeframe import *

# Create a DataFrame
df = DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 32, 37, 29],
    'salary': [50000, 65000, 72000, 58000],
    'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales']
})

# Transform with intuitive verbs
result = (df
    >> filter('age > 30')
    >> define(
        bonus='salary * 0.1',
        total='salary + bonus'
    )
    >> select('name', 'dept', 'total')
    >> arrange('-total')
)

print(result)
#       name          dept    total
# 0  Charlie  Engineering  79200.0
# 1      Bob     Marketing  71500.0

📚 Core Concepts

The Pipe Operator >>

Chain operations naturally without nested function calls:

# Traditional approach (hard to read)
result = arrange(
    select(
        define(
            filter(df, 'age > 25'),
            experience='2024 - start_year'
        ),
        'name', 'experience', 'salary'
    ),
    '-salary'
)

# PipeFrame approach (reads like a recipe)
result = (df
    >> filter('age > 25')
    >> define(experience='2024 - start_year')
    >> select('name', 'experience', 'salary')
    >> arrange('-salary')
)

Core Verbs

Verb         Purpose                Example
define()     Create/modify columns  >> define(total='price * quantity')
filter()     Filter rows            >> filter('age > 30 & city == "NYC"')
select()     Choose columns         >> select('name', 'age', 'salary')
arrange()    Sort data              >> arrange('-salary', 'name')
group_by()   Group data             >> group_by('category', 'region')
summarize()  Aggregate              >> summarize(total='sum(sales)', avg='mean(price)')
rename()     Rename columns         >> rename(customer_id='cid')
distinct()   Unique rows            >> distinct('product', 'store')
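The `arrange('-salary', 'name')` example mixes a descending key with an ascending one. As a rough illustration of that ordering, here is a plain-Python sketch that treats a leading `-` as "descending", which is how the examples in this README behave. The `sort_rows` helper is a hypothetical stand-in written for this sketch, not PipeFrame code.

```python
def sort_rows(rows, *keys):
    """Sort dicts by several keys; a leading '-' marks a key as descending."""
    result = list(rows)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last produces the usual multi-key ordering.
    for key in reversed(keys):
        descending = key.startswith("-")
        column = key[1:] if descending else key
        result.sort(key=lambda row: row[column], reverse=descending)
    return result


staff = [
    {"name": "Bob", "salary": 65000},
    {"name": "Charlie", "salary": 72000},
    {"name": "Alice", "salary": 72000},
    {"name": "Diana", "salary": 58000},
]

ordered = sort_rows(staff, "-salary", "name")
ordered_names = [row["name"] for row in ordered]
print(ordered_names)
```

Rows with equal salaries fall back to alphabetical order by name, so the two 72000 earners come out as Alice, then Charlie.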

🔥 Advanced Features

Conditional Logic

# if_else for binary conditions
df >> define(
    status=if_else('salary > 60000', 'High', 'Standard'),
    category=if_else('age >= 30', 'Senior', 'Junior')
)

# case_when for multiple conditions
df >> define(
    grade=case_when(
        ('score >= 90', 'A'),
        ('score >= 80', 'B'),
        ('score >= 70', 'C'),
        ('score >= 60', 'D'),
        default='F'
    )
)
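The grade example only gives sensible results if conditions are checked in order and the first match wins, as in dplyr's `case_when`. The README does not state this explicitly, so treat it as an assumption. A plain-Python sketch of that evaluation rule, using a hypothetical `first_match` helper:

```python
def first_match(row, cases, default):
    """Return the label of the first condition that holds for this row."""
    for condition, label in cases:
        if condition(row):
            return label
    return default


# Ordered from the strictest threshold to the loosest.
grade_cases = [
    (lambda row: row["score"] >= 90, "A"),
    (lambda row: row["score"] >= 80, "B"),
    (lambda row: row["score"] >= 70, "C"),
    (lambda row: row["score"] >= 60, "D"),
]

scores = [95, 83, 71, 64, 40]
grades = [first_match({"score": s}, grade_cases, default="F") for s in scores]
print(grades)
```

Under this rule the order of the cases matters: if `score >= 60` were listed first, every passing score would be labelled 'D'.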

GroupBy Operations

# Summary by group
summary = (df
    >> group_by('department', 'location')
    >> summarize(
        headcount='count()',
        avg_salary='mean(salary)',
        total_sales='sum(sales)',
        top_performer='max(performance_score)'
    )
    >> arrange('-avg_salary')
)

# Multiple aggregations
analysis = (df
    >> group_by('product_category')
    >> summarize(
        units_sold='sum(quantity)',
        revenue='sum(price * quantity)',
        avg_price='mean(price)',
        num_transactions='count()'
    )
    >> define(
        avg_transaction_value='revenue / num_transactions'
    )
)
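To make the semantics of the second pipeline concrete, here is the same calculation written in plain Python on a tiny made-up dataset. It is only a sketch of what the group-then-summarize steps compute (the sample rows are invented for illustration), not a description of how PipeFrame executes them.

```python
from collections import defaultdict
from statistics import mean

# Invented sample data for illustration.
sales = [
    {"product_category": "Books", "price": 10.0, "quantity": 3},
    {"product_category": "Books", "price": 20.0, "quantity": 1},
    {"product_category": "Games", "price": 50.0, "quantity": 2},
]

# group_by step: bucket rows by category.
groups = defaultdict(list)
for row in sales:
    groups[row["product_category"]].append(row)

# summarize step, followed by the derived column from define.
analysis = {}
for category, rows in groups.items():
    revenue = sum(r["price"] * r["quantity"] for r in rows)
    num_transactions = len(rows)
    analysis[category] = {
        "units_sold": sum(r["quantity"] for r in rows),
        "revenue": revenue,
        "avg_price": mean(r["price"] for r in rows),
        "num_transactions": num_transactions,
        "avg_transaction_value": revenue / num_transactions,
    }

print(analysis["Books"])
```

Note that `revenue` is a sum over per-row products, so it must be computed before the rows are collapsed into one row per group.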

Data Reshaping

# Pivot wider (long to wide)
wide = (df
    >> pivot_wider(
        id_cols='student',
        names_from='subject',
        values_from='grade'
    )
)

# Pivot longer (wide to long)
long = (df
    >> pivot_longer(
        cols=['Q1_sales', 'Q2_sales', 'Q3_sales', 'Q4_sales'],
        names_to='quarter',
        values_to='sales'
    )
)

# Separate columns
separated = df >> separate('full_name', into=['first', 'last'], sep=' ')

# Unite columns
united = df >> unite('full_date', ['year', 'month', 'day'], sep='-')
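If the wide and long layouts are unfamiliar, the following plain-Python sketch shows what the two pivots do to the shape of the data. The `store` column and the sample values are invented for illustration, and a list of dicts stands in for a DataFrame.

```python
# Invented sample data: one row per store, one column per quarter.
wide_rows = [
    {"store": "North", "Q1_sales": 100, "Q2_sales": 120},
    {"store": "South", "Q1_sales": 80, "Q2_sales": 95},
]
value_cols = ["Q1_sales", "Q2_sales"]

# Wide to long: one output row per (input row, value column) pair.
long_rows = [
    {"store": row["store"], "quarter": col, "sales": row[col]}
    for row in wide_rows
    for col in value_cols
]

# Long to wide: gather the rows back into one dict per store.
rebuilt = {}
for row in long_rows:
    target = rebuilt.setdefault(row["store"], {"store": row["store"]})
    target[row["quarter"]] = row["sales"]
wide_again = list(rebuilt.values())

print(len(long_rows), long_rows[0])
```

Two stores with two quarter columns become four long rows, and pivoting back recovers the original two wide rows.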

Column Selection Helpers

# Select by pattern
df >> select(
    'id',
    starts_with('date_'),      # All columns starting with 'date_'
    ends_with('_amount'),      # All columns ending with '_amount'
    contains('price'),         # All columns containing 'price'
    matches(r'Q\d_sales')      # Regex pattern matching
)

# Column ranges
df >> select('id', 'name:salary')  # Select from 'name' to 'salary'
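The selection helpers can be thought of as predicates over column names. The sketch below re-implements them locally in plain Python to show the idea. The functions only mirror the helper names used above and are not PipeFrame's code. The `pick` and `column_range` helpers are hypothetical, and two behaviours are assumptions: `matches` searches anywhere in the name, and a `'name:salary'` range includes both endpoints.

```python
import re


def starts_with(prefix):
    return lambda name: name.startswith(prefix)


def ends_with(suffix):
    return lambda name: name.endswith(suffix)


def contains(text):
    return lambda name: text in name


def matches(pattern):
    # Assumption: the pattern may match anywhere in the name (re.search).
    return lambda name: re.search(pattern, name) is not None


def pick(columns, *selectors):
    """Collect matching column names in selector order, without duplicates."""
    picked = []
    for selector in selectors:
        if callable(selector):
            test = selector
        else:
            test = lambda name, wanted=selector: name == wanted
        for name in columns:
            if test(name) and name not in picked:
                picked.append(name)
    return picked


def column_range(columns, spec):
    """Expand 'first:last' into every column from first to last, inclusive."""
    first, last = spec.split(":")
    return columns[columns.index(first):columns.index(last) + 1]


columns = ["id", "name", "age", "salary", "date_joined", "date_left",
           "bonus_amount", "list_price", "Q1_sales", "Q2_sales"]

selected = pick(columns, "id", starts_with("date_"), ends_with("_amount"),
                contains("price"), matches(r"Q\d_sales"))
ranged = column_range(columns, "name:salary")
print(selected)
print(ranged)
```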

I/O Operations

# Read from various sources
df = read_csv('data.csv')
df = read_excel('data.xlsx', sheet_name='Sales')
df = read_json('data.json', orient='records')
df = read_parquet('data.parquet')
df = read_sql('SELECT * FROM users', connection)
df = read_clipboard()  # Paste from spreadsheet!

# Write to different formats
df.to_csv('output.csv', index=False)
df.to_excel('report.xlsx', sheet_name='Results')
df.to_parquet('data.parquet', compression='gzip')
df.to_json('data.json', orient='records', lines=True)

🎯 Real-World Examples

Sales Analysis Pipeline

from pipeframe import *

# Load and analyze sales data
analysis = (
    read_csv('sales_data.csv')
    >> filter('date >= "2024-01-01" & revenue > 0')
    >> define(
        profit='revenue - cost',
        margin='profit / revenue * 100',
        quarter='pd.to_datetime(date).dt.quarter'
    )
    >> group_by('product_category', 'quarter')
    >> summarize(
        total_revenue='sum(revenue)',
        total_profit='sum(profit)',
        avg_margin='mean(margin)',
        num_sales='count()'
    )
    >> define(
        profit_per_sale='total_profit / num_sales'
    )
    >> arrange('-total_revenue')
)

# Export results
analysis.to_excel('quarterly_analysis.xlsx', sheet_name='Summary')

Customer Segmentation

# Segment customers by behavior
segments = (df
    >> filter('total_purchases > 0')
    >> define(
        avg_order_value='total_spent / total_purchases',
        recency_days='(pd.Timestamp.now() - last_purchase_date).dt.days',
        segment=case_when(
            ('avg_order_value > 100 & recency_days < 30', 'Premium Active'),
            ('avg_order_value > 100 & recency_days >= 30', 'Premium At Risk'),
            ('recency_days < 30', 'Standard Active'),
            ('recency_days < 90', 'At Risk'),
            default='Churned'
        )
    )
    >> group_by('segment')
    >> summarize(
        customers='count()',
        total_value='sum(total_spent)',
        avg_value='mean(total_spent)'
    )
)

Data Cleaning Pipeline

# Clean and standardize data
clean_data = (
    read_excel('messy_data.xlsx')
    >> filter('id.notna()')  # Remove rows without ID
    >> define(
        # Standardize text fields
        name='name.str.title().str.strip()',
        email='email.str.lower().str.strip()',
        # Parse dates
        signup_date='pd.to_datetime(signup_date)',
        # Fill missing values
        phone='phone.fillna("Not Provided")',
        # Create derived fields
        account_age_days='(pd.Timestamp.now() - signup_date).dt.days'
    )
    >> distinct('email', keep='first')  # Deduplicate by email
    >> arrange('signup_date')
)
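One detail of this pipeline is worth spelling out: the email column is normalized before deduplication, so addresses that differ only in case or surrounding whitespace collapse into one row. A plain-Python sketch of that ordering, on invented sample rows and with no PipeFrame calls:

```python
# Invented sample data for illustration.
raw = [
    {"id": 1, "name": "  alice smith ", "email": " Alice@Example.com "},
    {"id": None, "name": "ghost", "email": "ghost@example.com"},
    {"id": 2, "name": "ALICE SMITH", "email": "alice@example.com"},
    {"id": 3, "name": "bob jones", "email": "BOB@example.com"},
]

cleaned = []
seen_emails = set()
for row in raw:
    if row["id"] is None:
        continue  # drop rows without an ID
    email = row["email"].lower().strip()
    if email in seen_emails:
        continue  # keep only the first row per normalized email
    seen_emails.add(email)
    cleaned.append({
        "id": row["id"],
        "name": row["name"].title().strip(),
        "email": email,
    })

print(cleaned)
```

If deduplication ran on the raw values instead, rows 1 and 2 would both survive, because their email strings are not identical until they are lowercased and stripped.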

🔒 Security Features

PipeFrame includes built-in security features to prevent code injection:

# ✅ Safe expressions are allowed
df >> define(total='price * quantity')
df >> filter('age > 30 & city == "NYC"')

# ❌ Dangerous expressions are blocked
df >> define(bad="__import__('os').system('rm -rf /')")
# PipeFrameExpressionError: Expression contains dangerous pattern

# All string expressions are validated before execution
# - Blocks: __import__, exec(), eval(), compile(), open(), file()
# - Validates expression syntax
# - Uses pandas' restricted eval environment
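The README does not show how the validation is implemented. As a rough idea of what a check of this kind can look like, here is a sketch built on Python's standard `ast` module. The `check_expression` function, the `UnsafeExpressionError` class, and the exact rules are assumptions made for this illustration. They are not PipeFrame's validator, and a blocklist like this is only one layer of defence.

```python
import ast

# Names taken from the blocklist described above.
BLOCKED_NAMES = {"__import__", "exec", "eval", "compile", "open", "file"}


class UnsafeExpressionError(ValueError):
    """Hypothetical error type for this sketch."""


def check_expression(expr):
    """Reject expressions that fail to parse or that reference blocked names."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError as exc:
        raise UnsafeExpressionError(f"invalid syntax: {expr!r}") from exc
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            raise UnsafeExpressionError(f"blocked name: {node.id}")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise UnsafeExpressionError(f"blocked attribute: {node.attr}")
    return expr


def is_safe(expr):
    try:
        check_expression(expr)
    except UnsafeExpressionError:
        return False
    return True


checks = {
    "price * quantity": is_safe("price * quantity"),
    "bad import": is_safe("__import__('os').system('rm -rf /')"),
    "dunder attribute": is_safe("x.__class__"),
    "syntax error": is_safe("price *"),
}
print(checks)
```

Walking the parsed tree rather than searching the raw string avoids false positives on column names or string literals that merely contain a blocked word.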

📊 Performance

PipeFrame adds minimal overhead while dramatically improving code readability:

Benchmarks (1M rows):

  • Filter operation: ~8% overhead
  • GroupBy aggregation: ~12% overhead
  • Complex pipeline (5 operations): ~10% overhead
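The README does not describe how these figures were measured. For readers who want to check overhead on their own workloads, the sketch below shows one common way to compute a relative overhead percentage with the standard `timeit` module. The two functions are stand-ins invented for this example (a direct list comprehension versus the same work routed through extra function calls), so the number it prints says nothing about PipeFrame itself.

```python
import timeit

data = list(range(10_000))


def baseline():
    # Direct implementation.
    return [x for x in data if x % 2 == 0]


def keep_even(x):
    return x % 2 == 0


def layered():
    # Same result, but with an extra function call per element.
    return list(filter(keep_even, data))


def best_time(func, repeat=5, number=20):
    # The minimum over several repeats is the least noisy estimate.
    return min(timeit.repeat(func, repeat=repeat, number=number))


base = best_time(baseline)
wrapped = best_time(layered)
overhead_pct = (wrapped - base) / base * 100
print(f"overhead: {overhead_pct:.1f}%")
```

To measure a real pipeline, replace `baseline` with the raw pandas code and `layered` with the equivalent PipeFrame chain, and run both on the same data.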

Why the overhead is worth it:

  • 🧠 Reduced cognitive load
  • 🐛 Fewer bugs from clearer intent
  • ⚡ Faster development time
  • 👥 Easier code review
  • 📚 Better maintainability

🎓 Learning Resources


🤝 Contributing

We welcome contributions! Here's how you can help:

  1. 🐛 Report bugs - Open an issue
  2. 💡 Suggest features - Share your ideas
  3. 📝 Improve docs - Help others learn
  4. 🔧 Submit PRs - Fix bugs or add features

See CONTRIBUTING.md for guidelines.


📜 License

MIT License - see LICENSE file for details.


👨‍💻 Author

Dr. Yasser Mustafa

AI & Data Science Specialist | Theoretical Physics PhD

  • 🎓 PhD in Theoretical Nuclear Physics
  • 💼 10+ years in production AI/ML systems
  • 🔬 48+ research publications
  • 🏢 Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
  • 📍 Based in Newcastle Upon Tyne, UK
  • ✉️ yasser.mustafan@gmail.com
  • 🔗 LinkedIn | GitHub

PipeFrame was born from years of working with data pipelines in production environments, combining the elegance of R's tidyverse with Python's practicality.


🌟 Star History

If PipeFrame helps your work, please consider giving it a star! ⭐


📈 Roadmap

Current (v0.2.0)

  • ✅ Core verbs and operators
  • ✅ Security hardening
  • ✅ Comprehensive I/O
  • ✅ Reshape operations
  • ✅ Type hints

Upcoming (v0.3.0)

  • Join operations (left_join, inner_join, etc.)
  • Window functions
  • Time series helpers
  • Enhanced plotting integration
  • Performance optimizations

Future (v1.0.0)

  • Lazy evaluation engine
  • Alternative backends (Polars, DuckDB)
  • Distributed computing support
  • Interactive data exploration tools
  • SQL generation from pipes

💬 Community

  • Issues: Report bugs or request features
  • Discussions: Ask questions, share use cases

Built with ❤️ for data scientists who value readability

Make your data speak naturally with PipeFrame 🔄
