
PipeFrame 🔄

Pipe Your Data Naturally

Python 3.8+ | License: MIT | Code style: black

A modern, intuitive data manipulation library for Python that makes your data workflows read like natural language. Built on pandas' robust foundation with a clean, pipe-based syntax inspired by R's dplyr and tidyverse.

from pipeframe import *

# Your data pipeline reads like a story
result = (df
    >> filter('age > 21')
    >> group_by('city')  
    >> summarize(avg_income='mean(income)', count='count()')
    >> arrange('-avg_income')
)

💡 How to read >>: Read the >> operator as "pipe to" or "then". For example, the code above reads as: "Take df, then filter for age > 21, then group by city, then summarize..."


🌟 Why PipeFrame?

Readability First

# โŒ Traditional pandas: Hard to read
df[df['age'] > 30].groupby('dept')['salary'].mean().sort_values(ascending=False)

# ✅ PipeFrame: Clear and intuitive
df >> filter('age > 30') >> group_by('dept') >> summarize(avg='mean(salary)') >> arrange('-avg')

Key Features

  • 🔗 Pipe Operator >> - Natural method chaining without nested parentheses
  • 📝 String Expressions - Write conditions as readable strings: 'age > 30 & salary > 50000'
  • 🔒 Security Hardened - Built-in expression validation prevents code injection
  • 🐼 Pandas Compatible - Works seamlessly with existing pandas DataFrames
  • 🎯 Type Safe - Full type hints for excellent IDE support and autocomplete
  • ⚡ Performance - Only ~5-15% overhead vs raw pandas
  • 📊 Rich I/O - Read/write CSV, Excel, JSON, Parquet, SQL, and more
  • 🔄 Powerful Reshaping - Tidyr-style pivoting, melting, and transformations
  • 🛡️ Production Ready - Comprehensive error handling and validation

🚀 Quick Start

Installation

# Basic installation
pip install pipeframe

# With all optional dependencies
pip install pipeframe[all]

# Specific features
pip install pipeframe[excel]      # Excel support
pip install pipeframe[parquet]    # Parquet files
pip install pipeframe[sql]        # SQL databases
pip install pipeframe[plot]       # Visualization

Hello PipeFrame!

from pipeframe import *

# Create a DataFrame
df = DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 32, 37, 29],
    'salary': [50000, 65000, 72000, 58000],
    'dept': ['Engineering', 'Marketing', 'Engineering', 'Sales']
})

# Transform with intuitive verbs
result = (df
    >> filter('age > 30')
    >> define(
        bonus='salary * 0.1',
        total='salary + bonus'
    )
    >> select('name', 'dept', 'total')
    >> arrange('-total')
)

print(result)
#       name          dept    total
# 0  Charlie  Engineering  79200.0
# 1      Bob     Marketing  71500.0

📚 Core Concepts

The Pipe Operator >>

Chain operations naturally without nested function calls:

# Traditional approach (hard to read)
result = arrange(
    select(
        define(
            filter(df, 'age > 25'),
            experience='2024 - start_year'
        ),
        'name', 'experience', 'salary'
    ),
    '-salary'
)

# PipeFrame approach (reads like a recipe)
result = (df
    >> filter('age > 25')
    >> define(experience='2024 - start_year')
    >> select('name', 'experience', 'salary')
    >> arrange('-salary')
)
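The >> chaining above is ordinary Python operator overloading. A minimal sketch of how such a verb could work, using a hypothetical `Verb` wrapper and stand-in `filter_`/`select_` helpers (not PipeFrame's actual internals), relies on `__rrshift__` being called when the left operand is a plain pandas DataFrame:

```python
import pandas as pd

class Verb:
    """A deferred operation: `df >> verb` invokes verb.__rrshift__(df)."""
    def __init__(self, fn):
        self.fn = fn
    def __rrshift__(self, df):
        return self.fn(df)

def filter_(expr):
    # Hypothetical stand-in for a pipe verb; delegates to DataFrame.query
    return Verb(lambda df: df.query(expr))

def select_(*cols):
    # Hypothetical stand-in: keep only the named columns
    return Verb(lambda df: df[list(cols)])

df = pd.DataFrame({'age': [25, 32, 37], 'name': ['Alice', 'Bob', 'Charlie']})
out = df >> filter_('age > 30') >> select_('name')
print(list(out['name']))  # ['Bob', 'Charlie']
```

This works because pandas DataFrames do not define `__rshift__`, so Python falls back to the right operand's `__rrshift__`.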

Core Verbs

| Verb | Purpose | Example |
| --- | --- | --- |
| define() | Create/modify columns | >> define(total='price * quantity') |
| filter() | Filter rows | >> filter('age > 30 & city == "NYC"') |
| select() | Choose columns | >> select('name', 'age', 'salary') |
| arrange() | Sort data | >> arrange('-salary', 'name') |
| group_by() | Group data | >> group_by('category', 'region') |
| summarize() | Aggregate | >> summarize(total='sum(sales)', avg='mean(price)') |
| rename() | Rename columns | >> rename(customer_id='cid') |
| distinct() | Unique rows | >> distinct('product', 'store') |

🔥 Advanced Features

Conditional Logic

# if_else for binary conditions
df >> define(
    status=if_else('salary > 60000', 'High', 'Standard'),
    category=if_else('age >= 30', 'Senior', 'Junior')
)

# case_when for multiple conditions
df >> define(
    grade=case_when(
        ('score >= 90', 'A'),
        ('score >= 80', 'B'),
        ('score >= 70', 'C'),
        ('score >= 60', 'D'),
        default='F'
    )
)
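case_when's first-match-wins semantics correspond to ordinary conditional selection in pandas/NumPy. A rough equivalent of the grading example (an illustration, not PipeFrame's implementation) using numpy.select:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [95, 84, 72, 65, 40]})

# Conditions are checked in order; the first match wins, like case_when
conditions = [df['score'] >= 90, df['score'] >= 80,
              df['score'] >= 70, df['score'] >= 60]
choices = ['A', 'B', 'C', 'D']
df['grade'] = np.select(conditions, choices, default='F')
print(list(df['grade']))  # ['A', 'B', 'C', 'D', 'F']
```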

GroupBy Operations

# Summary by group
summary = (df
    >> group_by('department', 'location')
    >> summarize(
        headcount='count()',
        avg_salary='mean(salary)',
        total_sales='sum(sales)',
        top_performer='max(performance_score)'
    )
    >> arrange('-avg_salary')
)

# Multiple aggregations
analysis = (df
    >> group_by('product_category')
    >> summarize(
        units_sold='sum(quantity)',
        revenue='sum(price * quantity)',
        avg_price='mean(price)',
        num_transactions='count()'
    )
    >> define(
        avg_transaction_value='revenue / num_transactions'
    )
)

Data Reshaping

# Pivot wider (long to wide)
wide = (df
    >> pivot_wider(
        id_cols='student',
        names_from='subject',
        values_from='grade'
    )
)

# Pivot longer (wide to long)
long = (df
    >> pivot_longer(
        cols=['Q1_sales', 'Q2_sales', 'Q3_sales', 'Q4_sales'],
        names_to='quarter',
        values_to='sales'
    )
)

# Separate columns
separated = df >> separate('full_name', into=['first', 'last'], sep=' ')

# Unite columns
united = df >> unite('full_date', ['year', 'month', 'day'], sep='-')
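These reshaping verbs map closely onto pandas' own pivot and melt. A sketch of the underlying calls for the student-grades example (illustrative data, not from the docs):

```python
import pandas as pd

long_df = pd.DataFrame({
    'student': ['Ann', 'Ann', 'Bo', 'Bo'],
    'subject': ['math', 'art', 'math', 'art'],
    'grade': [90, 85, 78, 92],
})

# Roughly pivot_wider(id_cols='student', names_from='subject', values_from='grade')
wide = long_df.pivot(index='student', columns='subject', values='grade').reset_index()

# Roughly pivot_longer(cols=..., names_to='subject', values_to='grade')
back = wide.melt(id_vars='student', var_name='subject', value_name='grade')
print(wide.shape, back.shape)  # (2, 3) (4, 3)
```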

Column Selection Helpers

# Select by pattern
df >> select(
    'id',
    starts_with('date_'),      # All columns starting with 'date_'
    ends_with('_amount'),      # All columns ending with '_amount'
    contains('price'),         # All columns containing 'price'
    matches(r'Q\d_sales')      # Regex pattern matching
)

# Column ranges
df >> select('id', 'name:salary')  # Select from 'name' to 'salary'
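Selection helpers like these are essentially predicates over df.columns. A minimal stand-in for starts_with (hypothetical helper, shown here taking the frame explicitly rather than being resolved inside select()):

```python
import pandas as pd

def starts_with(prefix, df):
    # Hypothetical helper: column names beginning with the given prefix
    return [c for c in df.columns if c.startswith(prefix)]

df = pd.DataFrame({'id': [1], 'date_start': ['a'], 'date_end': ['b'], 'price': [9.5]})
cols = ['id'] + starts_with('date_', df)
print(df[cols].columns.tolist())  # ['id', 'date_start', 'date_end']
```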

I/O Operations

# Read from various sources
df = read_csv('data.csv')
df = read_excel('data.xlsx', sheet_name='Sales')
df = read_json('data.json', orient='records')
df = read_parquet('data.parquet')
df = read_sql('SELECT * FROM users', connection)
df = read_clipboard()  # Paste from spreadsheet!

# Write to different formats
df.to_csv('output.csv', index=False)
df.to_excel('report.xlsx', sheet_name='Results')
df.to_parquet('data.parquet', compression='gzip')
df.to_json('data.json', orient='records', lines=True)

🎯 Real-World Examples

Sales Analysis Pipeline

from pipeframe import *

# Load and analyze sales data
analysis = (
    read_csv('sales_data.csv')
    >> filter('date >= "2024-01-01" & revenue > 0')
    >> define(
        profit='revenue - cost',
        margin='profit / revenue * 100',
        quarter='pd.to_datetime(date).dt.quarter'
    )
    >> group_by('product_category', 'quarter')
    >> summarize(
        total_revenue='sum(revenue)',
        total_profit='sum(profit)',
        avg_margin='mean(margin)',
        num_sales='count()'
    )
    >> define(
        profit_per_sale='total_profit / num_sales'
    )
    >> arrange('-total_revenue')
)

# Export results
analysis.to_excel('quarterly_analysis.xlsx', sheet_name='Summary')

Customer Segmentation

# Segment customers by behavior
segments = (df
    >> filter('total_purchases > 0')
    >> define(
        avg_order_value='total_spent / total_purchases',
        recency_days='(pd.Timestamp.now() - last_purchase_date).dt.days',
        segment=case_when(
            ('avg_order_value > 100 & recency_days < 30', 'Premium Active'),
            ('avg_order_value > 100 & recency_days >= 30', 'Premium At Risk'),
            ('recency_days < 30', 'Standard Active'),
            ('recency_days < 90', 'At Risk'),
            default='Churned'
        )
    )
    >> group_by('segment')
    >> summarize(
        customers='count()',
        total_value='sum(total_spent)',
        avg_value='mean(total_spent)'
    )
)

Data Cleaning Pipeline

# Clean and standardize data
clean_data = (
    read_excel('messy_data.xlsx')
    >> filter('id.notna()')  # Remove rows without ID
    >> define(
        # Standardize text fields
        name='name.str.title().str.strip()',
        email='email.str.lower().str.strip()',
        # Parse dates
        signup_date='pd.to_datetime(signup_date)',
        # Fill missing values
        phone='phone.fillna("Not Provided")',
        # Create derived fields
        account_age_days='(pd.Timestamp.now() - signup_date).dt.days'
    )
    >> distinct('email', keep='first')  # Deduplicate by email
    >> arrange('signup_date')
)

🔒 Security Features

PipeFrame includes built-in security features to prevent code injection:

# ✅ Safe expressions are allowed
df >> define(total='price * quantity')
df >> filter('age > 30 & city == "NYC"')

# โŒ Dangerous expressions are blocked
df >> define(bad="__import__('os').system('rm -rf /')")
# PipeFrameExpressionError: Expression contains dangerous pattern

# All string expressions are validated before execution
# - Blocks: __import__, exec(), eval(), compile(), open(), file()
# - Validates expression syntax
# - Uses pandas' restricted eval environment
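A validation pass of this kind can be sketched as a simple pattern blocklist applied before a string ever reaches pandas' eval machinery (a simplified illustration under assumed names, not PipeFrame's actual validator):

```python
BLOCKED = ('__', 'import', 'exec(', 'eval(', 'compile(', 'open(', 'file(')

def validate_expression(expr: str) -> str:
    """Reject expressions containing obviously dangerous patterns."""
    lowered = expr.lower()
    for pattern in BLOCKED:
        if pattern in lowered:
            raise ValueError(f'Expression contains dangerous pattern: {pattern!r}')
    return expr

validate_expression('price * quantity')  # passes through unchanged
try:
    validate_expression("__import__('os').system('rm -rf /')")
except ValueError as e:
    print(e)  # Expression contains dangerous pattern: '__'
```

A production validator would also parse the expression (e.g. with the ast module) rather than rely on substring matching alone.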

📊 Performance

PipeFrame adds minimal overhead while dramatically improving code readability:

Benchmarks (1M rows):

  • Filter operation: ~8% overhead
  • GroupBy aggregation: ~12% overhead
  • Complex pipeline (5 operations): ~10% overhead
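Numbers like these can be checked locally with timeit. A sketch comparing a raw pandas boolean mask against the string-expression route (DataFrame.query, used here as a stand-in for a pipe verb); the measured overhead will vary by machine and data:

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': np.random.randint(18, 80, 100_000)})

def raw():
    return df[df['age'] > 30]

def wrapped():
    # String expression parsed and evaluated by pandas, as a pipe verb would
    return df.query('age > 30')

t_raw = timeit.timeit(raw, number=50)
t_wrapped = timeit.timeit(wrapped, number=50)
print(f'relative overhead: {t_wrapped / t_raw - 1:.0%}')
```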

Why the overhead is worth it:

  • 🧠 Reduced cognitive load
  • 🐛 Fewer bugs from clearer intent
  • ⚡ Faster development time
  • 👥 Easier code review
  • 📚 Better maintainability

๐Ÿค Contributing

We welcome contributions! Here's how you can help:

  1. ๐Ÿ› Report bugs - Open an issue
  2. ๐Ÿ’ก Suggest features - Share your ideas
  3. ๐Ÿ“ Improve docs - Help others learn
  4. ๐Ÿ”ง Submit PRs - Fix bugs or add features

See CONTRIBUTING.md for guidelines.


📜 License

MIT License - see LICENSE file for details.


👨‍💻 Author

Dr. Yasser Mustafa

AI & Data Science Specialist | Theoretical Physics PhD

  • 🎓 PhD in Theoretical Nuclear Physics
  • 💼 10+ years in production AI/ML systems
  • 🔬 48+ research publications
  • 🏢 Experience: Government (Abu Dhabi), Media (Track24), Recruitment (Reed), Energy (ADNOC)
  • 📍 Based in Newcastle upon Tyne, UK
  • ✉️ yasser.mustafan@gmail.com
  • 🔗 LinkedIn | GitHub

PipeFrame was born from years of working with data pipelines in production environments, combining the elegance of R's tidyverse with Python's practicality.


🌟 Star History

If PipeFrame helps your work, please consider giving it a star! ⭐


📈 Roadmap

Current (v0.2.0)

  • ✅ Core verbs and operators
  • ✅ Security hardening
  • ✅ Comprehensive I/O
  • ✅ Reshape operations
  • ✅ Type hints

Upcoming (v0.3.0)

  • Join operations (left_join, inner_join, etc.)
  • Window functions
  • Time series helpers
  • Enhanced plotting integration
  • Performance optimizations
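Until the join verbs land, plain pandas merge works on the underlying DataFrames, since PipeFrame is pandas-compatible. A sketch with illustrative data:

```python
import pandas as pd

employees = pd.DataFrame({'name': ['Alice', 'Bob'], 'dept_id': [1, 2]})
depts = pd.DataFrame({'dept_id': [1, 2], 'dept': ['Engineering', 'Sales']})

# Roughly what a future left_join(depts, by='dept_id') would do
joined = employees.merge(depts, on='dept_id', how='left')
print(joined['dept'].tolist())  # ['Engineering', 'Sales']
```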

Future (v1.0.0)

  • Lazy evaluation engine
  • Alternative backends (Polars, DuckDB)
  • Distributed computing support
  • Interactive data exploration tools
  • SQL generation from pipes

💬 Community

  • Issues: Report bugs or request features
  • Discussions: Ask questions, share use cases

Built with ❤️ for data scientists who value readability

Make your data speak naturally with PipeFrame 🔄
