Skip to main content

Next-generation native DataFrame for Python - Simple like Excel, Powerful like SQL, Smart like AI

Project description

๐Ÿš€ PyFrameX

Next-Generation Native DataFrame for Python

PyPI version Python 3.8+ License: MIT

Simple like Excel, Powerful like SQL, Smart like AI

PyFrameX is a revolutionary DataFrame engine built from scratch in pure Python. It combines the simplicity of Excel, the power of SQL, and the intelligence of machine learning into one intuitive package.


๐ŸŒŸ What Makes PyFrameX Different?

โŒ The Problem

  • Pandas: Powerful but complicated (.loc, .iloc, .apply confusion)
  • Polars: Fast but too technical for beginners
  • Excel: Simple but limited in scale and automation

โœ… The Solution: PyFrameX

from pyframex import Frame

# Load data - just like Excel
df = Frame("sales.csv")

# Excel-style operations
df["profit"] = df["revenue"] - df["cost"]

# SQL-style queries
df.sql("SELECT region, SUM(revenue) FROM df GROUP BY region")

# AI-powered automation
df.auto_predict(target="sales")

๐ŸŽฏ Key Features

1๏ธโƒฃ Pure Python Native Engine

  • Zero dependencies for core functionality
  • Custom column store implementation
  • Type-aware operations (Int, Float, String, Date, Bool)
  • Automatic type inference

2๏ธโƒฃ Excel-Like Simplicity

# Simple, intuitive operations
df["ratio"] = df["sales"] / df["visits"]
df["status"] = "active"

# No confusing .loc or .iloc needed!

3๏ธโƒฃ Built-in SQL Engine

# Execute SQL queries directly on DataFrames
result = df.sql("""
    SELECT 
        region, 
        SUM(revenue) as total_revenue,
        AVG(profit) as avg_profit
    FROM df 
    WHERE year = 2024 
    GROUP BY region
    ORDER BY total_revenue DESC
    LIMIT 10
""")

4๏ธโƒฃ AI-Powered Automation

# Automatic data cleaning
clean_df = df.auto_clean()

# Automatic predictive modeling
results = df.auto_predict(target="price")
print(f"Accuracy: {results['metrics']['accuracy']}")

# Automatic clustering
clustered = df.auto_cluster(n_clusters=3)

# Automatic feature engineering
enriched = df.auto_feature_engineering()

5๏ธโƒฃ Optimized Performance

  • Lazy evaluation
  • Column-oriented storage
  • Cached statistics
  • Query optimization
  • Filter pushdown

๐Ÿ“ฆ Installation

# Basic installation
pip install pyframex

# With ML capabilities
pip install pyframex[ml]

# Install all features
pip install pyframex[all]

๐Ÿš€ Quick Start

Loading Data

from pyframex import Frame

# From CSV
df = Frame("data.csv")

# From JSON
df = Frame("data.json")

# From dictionary
df = Frame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 70000]
})

# From list of dictionaries
df = Frame([
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 70000}
])

Basic Operations

# View data
print(df)
print(df.head(10))
print(df.tail(5))

# Get info
print(df.summary())
print(df.shape())  # (rows, columns)
print(df.dtypes())  # Column types

# Select columns
names = df["name"]
subset = df[["name", "salary"]]

# Add/modify columns
df["bonus"] = df["salary"] * 0.1
df["total"] = df["salary"] + df["bonus"]

Filtering

# Excel-style filtering
high_earners = df.filter("salary > 60000")
young_staff = df.filter("age < 30")

# Combined conditions
filtered = df.filter("age > 25 and salary < 70000")

# Using column comparisons
mask = df["age"] > 30
filtered = df.filter(mask)

Sorting & Grouping

# Sort
sorted_df = df.sort("salary", ascending=False)

# Group by
by_region = df.groupby("region").agg({
    "revenue": "sum",
    "orders": "count"
})

# Multiple aggregations
summary = df.groupby(["region", "category"]).agg({
    "revenue": "sum",
    "profit": "mean",
    "orders": "count"
})

SQL Queries

# Simple query
result = df.sql("SELECT name, salary FROM df WHERE age > 30")

# With aggregation
result = df.sql("""
    SELECT 
        region, 
        SUM(revenue) as total,
        AVG(profit) as avg_profit
    FROM df 
    GROUP BY region
""")

# With ordering and limit
result = df.sql("""
    SELECT * FROM df 
    WHERE status = 'active' 
    ORDER BY created_date DESC 
    LIMIT 100
""")

# Explain query plan
from pyframex.query import QueryPlanner
planner = QueryPlanner()
print(planner.explain("SELECT * FROM df WHERE revenue > 1000"))

๐Ÿค– Machine Learning Integration

Auto Clean

# Automatically:
# - Remove duplicates
# - Handle missing values (median/mode imputation)
# - Remove outliers
# - Fix data types
clean_df = df.auto_clean()

Auto Predict

# Automatic model training
results = df.auto_predict(
    target="price",
    test_size=0.2
)

# Results include:
print(results['metrics'])  # Performance metrics
print(results['model'])  # Trained model
print(results['predictions'])  # Test predictions

# Feature importance
for feature, importance in results['metrics']['feature_importance'].items():
    print(f"{feature}: {importance:.4f}")

Auto Cluster

# Automatic clustering
clustered = df.auto_cluster(n_clusters=3)
print(clustered["cluster"].value_counts())

Feature Engineering

# Automatically create:
# - Polynomial features
# - Interaction terms
# - Date extractions
enriched = df.auto_feature_engineering()

Smart Suggestions

# Get transformation suggestions
suggestions = df._ml_engine.suggest_transformations(df)
for suggestion in suggestions:
    print(f"๐Ÿ’ก {suggestion}")

๐Ÿ”ง Advanced Features

Column Operations

# Numeric columns
df["price"].sum()
df["price"].mean()
df["price"].median()
df["price"].min()
df["price"].max()
df["price"].std()  # Standard deviation

# String columns
df["name"].lower()
df["name"].upper()
df["name"].strip()
df["name"].contains("alice")
df["name"].replace("old", "new")
df["name"].len()  # String lengths

# Date columns
df["date"].year()
df["date"].month()
df["date"].day()
df["date"].weekday()

Mathematical Operations

# Column arithmetic
df["total"] = df["price"] * df["quantity"]
df["discount_price"] = df["price"] * 0.9
df["profit"] = df["revenue"] - df["cost"]

# Column-to-column operations
df["ratio"] = df["sales"] / df["visits"]
df["growth"] = df["current"] - df["previous"]

Data Export

# Save to CSV
df.to_csv("output.csv")

# Save to JSON
df.to_json("output.json")

# Convert to dictionary
data_dict = df.to_dict()

๐Ÿ“Š Real-World Examples

Example 1: Sales Analysis

from pyframex import Frame

# Load sales data
df = Frame("sales.csv")

# Calculate profit
df["profit"] = df["revenue"] - df["cost"]
df["margin"] = df["profit"] / df["revenue"]

# Find top performing regions
top_regions = df.sql("""
    SELECT 
        region,
        SUM(revenue) as total_revenue,
        AVG(margin) as avg_margin
    FROM df
    GROUP BY region
    ORDER BY total_revenue DESC
    LIMIT 5
""")

print(top_regions)

Example 2: Customer Segmentation

# Load customer data
customers = Frame("customers.csv")

# Auto-clean data
customers = customers.auto_clean()

# Perform clustering
segmented = customers.auto_cluster(n_clusters=4)

# Analyze clusters
cluster_summary = segmented.groupby("cluster").agg({
    "age": "mean",
    "purchases": "sum",
    "lifetime_value": "mean"
})

print(cluster_summary)

Example 3: Predictive Modeling

# Load historical data
data = Frame("historical_sales.csv")

# Engineer features
data = data.auto_feature_engineering()

# Train model
results = data.auto_predict(target="next_month_sales")

print(f"Model Rยฒ: {results['metrics']['r2']:.4f}")
print(f"RMSE: {results['metrics']['rmse']:.2f}")

# Feature importance
for feature, importance in results['metrics']['feature_importance'].items():
    if importance > 0.05:
        print(f"  {feature}: {importance:.2%}")

๐ŸŽฏ Use Cases

Perfect For:

โœ… Data Analysts - Excel-like simplicity with SQL power
โœ… Data Scientists - Built-in ML with no setup
โœ… Python Beginners - Intuitive, no steep learning curve
โœ… Rapid Prototyping - Fast iteration with auto features
โœ… Educational Projects - Learn data science easily
โœ… Small to Medium Data - Pure Python, no heavy dependencies

Not Ideal For:

โŒ Massive datasets (100M+ rows) - Use Polars/DuckDB
โŒ Distributed computing - Use Spark/Dask
โŒ Production big data pipelines - Use enterprise solutions


๐Ÿ—๏ธ Architecture

PyFrameX consists of 6 core components:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚           Frame (Main API)              โ”‚
โ”‚   Simple like Excel, Powerful like SQL  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
        โ”‚                   โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Column Engine  โ”‚  โ”‚  Query Planner โ”‚
โ”‚ - IntColumn    โ”‚  โ”‚  - SQL Parser  โ”‚
โ”‚ - FloatColumn  โ”‚  โ”‚  - Optimizer   โ”‚
โ”‚ - StringColumn โ”‚  โ”‚  - Executor    โ”‚
โ”‚ - DateColumn   โ”‚  โ”‚  - Cache       โ”‚
โ”‚ - BoolColumn   โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
        โ”‚
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   AutoML       โ”‚  โ”‚  Visualizer    โ”‚
โ”‚ - auto_clean   โ”‚  โ”‚  - Charts      โ”‚
โ”‚ - auto_predict โ”‚  โ”‚  - Summaries   โ”‚
โ”‚ - auto_cluster โ”‚  โ”‚  - Reports     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ“ˆ Performance

PyFrameX is optimized for clarity and moderate-sized datasets:

  • Column-oriented storage for efficient operations
  • Lazy evaluation where possible
  • Cached statistics to avoid recomputation
  • Type-specific optimizations for each column type
  • Query optimization with filter pushdown

Benchmark (1M rows):

  • Loading CSV: ~2-3 seconds
  • Filtering: ~0.1-0.5 seconds
  • Grouping: ~0.5-1 second
  • SQL query: ~0.5-2 seconds

๐Ÿ› ๏ธ CLI Usage

# Show DataFrame info
pyframex info data.csv

# Show first 10 rows
pyframex head data.csv -n 10

# Execute SQL query
pyframex query data.csv "SELECT * FROM df WHERE age > 30"

# Auto-clean data
pyframex clean data.csv cleaned_data.csv

# Show version
pyframex version

๐Ÿค Contributing

Contributions are welcome! Here's how you can help:

  1. Report bugs - Open an issue on GitHub
  2. Suggest features - Describe your use case
  3. Submit PRs - Fix bugs or add features
  4. Write docs - Improve documentation
  5. Share examples - Show how you use PyFrameX

๐Ÿ“ License

MIT License - see LICENSE file for details.


๐Ÿ™ Acknowledgments

PyFrameX is inspired by:

  • Pandas - The gold standard for DataFrame operations
  • Polars - Modern columnar data processing
  • DuckDB - Fast in-process SQL
  • Excel - Universal data manipulation tool

๐Ÿ“ง Contact & Support


๐ŸŽ“ Citation

If you use PyFrameX in your research, please cite:

@software{pyframex2024,
  author = {Bado, Idriss},
  title = {PyFrameX: Next-Generation Native DataFrame for Python},
  year = {2024},
  url = {https://github.com/idrissbado/PyFrameX}
}

โญ Star History

If you find PyFrameX useful, please give it a star on GitHub! โญ


Made with โค๏ธ by Idriss Bado

Simple like Excel, Powerful like SQL, Smart like AI

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyframex-0.1.0.tar.gz (27.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyframex-0.1.0-py3-none-any.whl (22.4 kB view details)

Uploaded Python 3

File details

Details for the file pyframex-0.1.0.tar.gz.

File metadata

  • Download URL: pyframex-0.1.0.tar.gz
  • Upload date:
  • Size: 27.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for pyframex-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e141c4e37c684356f1280d1ba80d685dd3dd7a9ca2b7a3aa1978d8fc0c442592
MD5 9854214fcd0184a2fb676aec149a7180
BLAKE2b-256 513f24067d13bc94dec56c87d8655566b2a5bf291b018c5eb2d33fb11b5a01bb

See more details on using hashes here.

File details

Details for the file pyframex-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pyframex-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 22.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.0

File hashes

Hashes for pyframex-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d3a52aad2c9c272b80dae5f2600477e34065decee7391f1469321edca8017cc5
MD5 4de1e176347e3c68885f91ace9ac690a
BLAKE2b-256 dff17fabe9fd4e27e934e705c5fbac650ed6a8e280c799dd7a5285ee067285e4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page