Next-generation native DataFrame for Python - Simple like Excel, Powerful like SQL, Smart like AI
PyFrameX
PyFrameX is a revolutionary DataFrame engine built from scratch in pure Python. It combines the simplicity of Excel, the power of SQL, and the intelligence of machine learning into one intuitive package.
What Makes PyFrameX Different?
The Problem
- Pandas: Powerful but complicated (.loc, .iloc, .apply confusion)
- Polars: Fast but too technical for beginners
- Excel: Simple but limited in scale and automation
The Solution: PyFrameX
from pyframex import Frame
# Load data - just like Excel
df = Frame("sales.csv")
# Excel-style operations
df["profit"] = df["revenue"] - df["cost"]
# SQL-style queries
df.sql("SELECT region, SUM(revenue) FROM df GROUP BY region")
# AI-powered automation
df.auto_predict(target="sales")
Key Features
1. Pure Python Native Engine
- Zero dependencies for core functionality
- Custom column store implementation
- Type-aware operations (Int, Float, String, Date, Bool)
- Automatic type inference
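As a rough illustration of what automatic type inference over the five listed types can look like, here is a minimal sketch. It is not PyFrameX's implementation: the function name `infer_type` and the rule of trying the most specific type first are assumptions made for this example.

```python
from datetime import date


def infer_type(values):
    """Return the narrowest type name that fits every non-empty string in values."""

    def parses_with(parser):
        # Build a checker that reports whether parser accepts a string.
        def check(text):
            try:
                parser(text)
                return True
            except ValueError:
                return False
        return check

    def is_bool(text):
        return text.lower() in {"true", "false"}

    present = [v.strip() for v in values if v.strip() != ""]
    if not present:
        return "String"
    # Order matters: every Int is also a valid Float, so Int is tried first.
    candidates = [
        ("Bool", is_bool),
        ("Int", parses_with(int)),
        ("Float", parses_with(float)),
        ("Date", parses_with(date.fromisoformat)),
    ]
    for name, check in candidates:
        if all(check(v) for v in present):
            return name
    return "String"


print(infer_type(["25", "30", "35"]))
print(infer_type(["1.5", "2"]))
print(infer_type(["2024-01-31", "2023-12-01"]))
```

Empty strings are treated as missing values here, so a column such as `["", "4"]` still infers as `Int`.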
2. Excel-Like Simplicity
# Simple, intuitive operations
df["ratio"] = df["sales"] / df["visits"]
df["status"] = "active"
# No confusing .loc or .iloc needed!
3. Built-in SQL Engine
# Execute SQL queries directly on DataFrames
result = df.sql("""
SELECT
region,
SUM(revenue) as total_revenue,
AVG(profit) as avg_profit
FROM df
WHERE year = 2024
GROUP BY region
ORDER BY total_revenue DESC
LIMIT 10
""")
4. AI-Powered Automation
# Automatic data cleaning
clean_df = df.auto_clean()
# Automatic predictive modeling
results = df.auto_predict(target="price")
print(f"Accuracy: {results['metrics']['accuracy']}")
# Automatic clustering
clustered = df.auto_cluster(n_clusters=3)
# Automatic feature engineering
enriched = df.auto_feature_engineering()
5. Optimized Performance
- Lazy evaluation
- Column-oriented storage
- Cached statistics
- Query optimization
- Filter pushdown
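To make "lazy evaluation" and "filter pushdown" concrete, the sketch below records steps instead of running them, then moves filters ahead of derived-column steps they do not depend on before executing. It is a row-based toy, not PyFrameX's engine; `LazyRows`, `with_column`, `optimized_steps`, and `collect` are hypothetical names invented for this example.

```python
class LazyRows:
    """Records steps instead of running them; nothing executes until collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def with_column(self, name, func):
        # func receives the whole row (a dict) and returns the new column's value.
        return LazyRows(self._rows, self._steps + (("with_column", name, func),))

    def filter(self, column, predicate):
        # predicate receives the value of a single column.
        return LazyRows(self._rows, self._steps + (("filter", column, predicate),))

    def optimized_steps(self):
        """Move each filter ahead of derived-column steps it does not depend on."""
        steps = list(self._steps)
        for i in range(len(steps)):
            j = i
            while (j > 0 and steps[j][0] == "filter"
                   and steps[j - 1][0] == "with_column"
                   and steps[j - 1][1] != steps[j][1]):
                steps[j - 1], steps[j] = steps[j], steps[j - 1]
                j -= 1
        return steps

    def collect(self):
        steps = self.optimized_steps()
        result = []
        for source_row in self._rows:
            row = dict(source_row)  # copy so the input rows stay untouched
            keep = True
            for kind, column, func in steps:
                if kind == "filter":
                    if not func(row[column]):
                        keep = False
                        break
                else:
                    row[column] = func(row)
            if keep:
                result.append(row)
        return result


calls = []


def profit(row):
    calls.append(row["region"])  # track how often the derived column is computed
    return row["revenue"] - row["cost"]


rows = [
    {"region": "north", "revenue": 100, "cost": 60},
    {"region": "south", "revenue": 50, "cost": 40},
    {"region": "north", "revenue": 25, "cost": 5},
]
plan = LazyRows(rows).with_column("profit", profit).filter("region", lambda v: v == "north")
north = plan.collect()
print(north)
print(calls)
```

Because the `region` filter is pushed ahead of the `profit` step, `profit` is only computed for the rows that survive the filter. The reordering is only safe when the column functions have no side effects that other steps rely on.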
Installation
# Basic installation
pip install pyframex
# With ML capabilities (quotes keep shells such as zsh from expanding the brackets)
pip install "pyframex[ml]"
# Install all features
pip install "pyframex[all]"
Quick Start
Loading Data
from pyframex import Frame
# From CSV
df = Frame("data.csv")
# From JSON
df = Frame("data.json")
# From dictionary
df = Frame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 70000]
})
# From list of dictionaries
df = Frame([
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 70000}
])
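The dictionary form and the list-of-dictionaries form above describe the same table in two layouts. A column-oriented store would presumably pivot the second into the first; the sketch below shows one way to do that. The helper `rows_to_columns` is hypothetical and is not part of the PyFrameX API.

```python
def rows_to_columns(rows):
    """Pivot a list of row dictionaries into a dict of equal-length column lists."""
    columns = {}
    for index, row in enumerate(rows):
        for name, value in row.items():
            if name not in columns:
                # Back-fill None for rows seen before this column first appeared.
                columns[name] = [None] * index
            columns[name].append(value)
        # Pad columns that this row did not mention.
        for values in columns.values():
            if len(values) <= index:
                values.append(None)
    return columns


rows = [
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 70000},
]
print(rows_to_columns(rows))
```

Keys missing from some rows become `None` in the resulting columns, so every column ends up with one entry per row.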
Basic Operations
# View data
print(df)
print(df.head(10))
print(df.tail(5))
# Get info
print(df.summary())
print(df.shape()) # (rows, columns)
print(df.dtypes()) # Column types
# Select columns
names = df["name"]
subset = df[["name", "salary"]]
# Add/modify columns
df["bonus"] = df["salary"] * 0.1
df["total"] = df["salary"] + df["bonus"]
Filtering
# Excel-style filtering
high_earners = df.filter("salary > 60000")
young_staff = df.filter("age < 30")
# Combined conditions
filtered = df.filter("age > 25 and salary < 70000")
# Using column comparisons
mask = df["age"] > 30
filtered = df.filter(mask)
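String filters such as "age > 25 and salary < 70000" have to be parsed before they can be applied. One possible approach, sketched here with the standard `ast` module, walks the parsed expression and only allows comparisons, `and`/`or`, column names, and constants. How PyFrameX actually evaluates filter strings is not stated in this document; `filter_rows` and `_evaluate` are hypothetical names.

```python
import ast
import operator

# Comparison operators this sketch accepts.
_COMPARE = {
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def _evaluate(node, row):
    """Evaluate one AST node against a single row (a dict of column values)."""
    if isinstance(node, ast.BoolOp):
        results = (_evaluate(value, row) for value in node.values)
        return all(results) if isinstance(node.op, ast.And) else any(results)
    if isinstance(node, ast.Compare):
        left = _evaluate(node.left, row)
        for op, right_node in zip(node.ops, node.comparators):
            if type(op) not in _COMPARE:
                raise ValueError(f"unsupported operator: {type(op).__name__}")
            right = _evaluate(right_node, row)
            if not _COMPARE[type(op)](left, right):
                return False
            left = right  # supports chained comparisons like 20 < age < 40
        return True
    if isinstance(node, ast.Name):
        return row[node.id]  # a bare name refers to a column
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def filter_rows(rows, expression):
    """Keep the rows for which the expression is true."""
    tree = ast.parse(expression, mode="eval").body
    return [row for row in rows if _evaluate(tree, row)]


rows = [
    {"name": "Alice", "age": 25, "salary": 50000},
    {"name": "Bob", "age": 30, "salary": 60000},
    {"name": "Charlie", "age": 35, "salary": 70000},
]
print(filter_rows(rows, "age > 25 and salary < 70000"))
```

Rejecting every node type that is not explicitly handled avoids the risks of calling `eval` on user-supplied strings.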
Sorting & Grouping
# Sort
sorted_df = df.sort("salary", ascending=False)
# Group by
by_region = df.groupby("region").agg({
    "revenue": "sum",
    "orders": "count"
})
# Multiple aggregations
summary = df.groupby(["region", "category"]).agg({
    "revenue": "sum",
    "profit": "mean",
    "orders": "count"
})
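For readers who want to see what a group-by with an aggregation spec computes, here is a minimal pure-Python sketch over row dictionaries. It mirrors the shape of the `agg` dictionaries above but is not PyFrameX code; `group_and_aggregate` and the `AGGREGATIONS` table are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Aggregation names accepted by this sketch, mapped to plain Python callables.
AGGREGATIONS = {"sum": sum, "mean": mean, "count": len, "min": min, "max": max}


def group_and_aggregate(rows, keys, spec):
    """Group rows by the key columns, then reduce each listed column per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    result = []
    for key_values, members in groups.items():
        out = dict(zip(keys, key_values))
        for column, how in spec.items():
            out[column] = AGGREGATIONS[how]([m[column] for m in members])
        result.append(out)
    return result


rows = [
    {"region": "north", "revenue": 100, "orders": 1},
    {"region": "south", "revenue": 50, "orders": 2},
    {"region": "north", "revenue": 25, "orders": 3},
]
print(group_and_aggregate(rows, ["region"], {"revenue": "sum", "orders": "count"}))
```

Groups come out in the order their keys first appear, because dictionaries preserve insertion order.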
SQL Queries
# Simple query
result = df.sql("SELECT name, salary FROM df WHERE age > 30")
# With aggregation
result = df.sql("""
SELECT
region,
SUM(revenue) as total,
AVG(profit) as avg_profit
FROM df
GROUP BY region
""")
# With ordering and limit
result = df.sql("""
SELECT * FROM df
WHERE status = 'active'
ORDER BY created_date DESC
LIMIT 100
""")
# Explain query plan
from pyframex.query import QueryPlanner
planner = QueryPlanner()
print(planner.explain("SELECT * FROM df WHERE revenue > 1000"))
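When checking what an aggregation query like the ones in this section should return, it can help to run the same SQL against a reference engine. The sketch below uses Python's built-in `sqlite3` module with a table named `df`; this is a stand-in for comparison, not how PyFrameX executes SQL, and the sample rows are made up.

```python
import sqlite3

# Made-up sample data: (region, year, revenue, profit)
sample = [
    ("north", 2024, 100.0, 20.0),
    ("south", 2024, 50.0, 5.0),
    ("north", 2024, 25.0, 10.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (region TEXT, year INTEGER, revenue REAL, profit REAL)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?, ?)", sample)

query = """
    SELECT region, SUM(revenue) AS total, AVG(profit) AS avg_profit
    FROM df
    GROUP BY region
    ORDER BY total DESC
"""
result = conn.execute(query).fetchall()
conn.close()

for region, total, avg_profit in result:
    print(region, total, avg_profit)
```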
Machine Learning Integration
Auto Clean
# Automatically:
# - Remove duplicates
# - Handle missing values (median/mode imputation)
# - Remove outliers
# - Fix data types
clean_df = df.auto_clean()
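The cleaning steps listed above can each be written in a few lines of standard-library Python. The sketch below is an illustration under assumptions, not the logic behind `auto_clean`: the document does not say how outliers are detected, so the 1.5 x IQR rule used here is a guess, and the helper names are hypothetical.

```python
from statistics import median, mode, quantiles


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row (values must be hashable)."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute(values):
    """Fill None with the median (numeric columns) or the mode (anything else).

    Assumes at least one value in the column is present.
    """
    present = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


def drop_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, see lead-in)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


print(impute([10, None, 30, 20]))
print(impute(["a", None, "b", "a"]))
print(drop_outliers_iqr([10, 12, 11, 13, 12, 500]))
```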
Auto Predict
# Automatic model training
results = df.auto_predict(
    target="price",
    test_size=0.2
)
# Results include:
print(results['metrics']) # Performance metrics
print(results['model']) # Trained model
print(results['predictions']) # Test predictions
# Feature importance
for feature, importance in results['metrics']['feature_importance'].items():
    print(f"{feature}: {importance:.4f}")
Auto Cluster
# Automatic clustering
clustered = df.auto_cluster(n_clusters=3)
print(clustered["cluster"].value_counts())
Feature Engineering
# Automatically create:
# - Polynomial features
# - Interaction terms
# - Date extractions
enriched = df.auto_feature_engineering()
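The three kinds of generated features named above (polynomial features, interaction terms, date extractions) are simple to derive by hand. The following sketch shows one plausible version over a dict of column lists; the output column names such as `price_squared` and `price_x_quantity` are invented here and may not match what `auto_feature_engineering` produces.

```python
from datetime import date
from itertools import combinations


def engineer_features(columns):
    """Return a copy of columns with squared, interaction, and date-part features."""
    out = {name: list(values) for name, values in columns.items()}

    def is_number(x):
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(x, (int, float)) and not isinstance(x, bool)

    numeric = [n for n, v in columns.items() if v and all(is_number(x) for x in v)]
    dates = [n for n, v in columns.items() if v and all(isinstance(x, date) for x in v)]

    # Polynomial features (degree 2 only in this sketch).
    for name in numeric:
        out[f"{name}_squared"] = [x * x for x in columns[name]]

    # Pairwise interaction terms between numeric columns.
    for a, b in combinations(numeric, 2):
        out[f"{a}_x_{b}"] = [x * y for x, y in zip(columns[a], columns[b])]

    # Date extractions.
    for name in dates:
        out[f"{name}_year"] = [d.year for d in columns[name]]
        out[f"{name}_month"] = [d.month for d in columns[name]]
        out[f"{name}_weekday"] = [d.weekday() for d in columns[name]]  # Monday is 0

    return out


columns = {
    "price": [2.0, 3.0],
    "quantity": [5, 4],
    "date": [date(2024, 1, 15), date(2024, 2, 1)],
}
enriched = engineer_features(columns)
print(sorted(enriched))
```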
Smart Suggestions
# Get transformation suggestions
suggestions = df._ml_engine.suggest_transformations(df)
for suggestion in suggestions:
    print(f"💡 {suggestion}")
Advanced Features
Column Operations
# Numeric columns
df["price"].sum()
df["price"].mean()
df["price"].median()
df["price"].min()
df["price"].max()
df["price"].std() # Standard deviation
# String columns
df["name"].lower()
df["name"].upper()
df["name"].strip()
df["name"].contains("alice")
df["name"].replace("old", "new")
df["name"].len() # String lengths
# Date columns
df["date"].year()
df["date"].month()
df["date"].day()
df["date"].weekday()
Mathematical Operations
# Column arithmetic
df["total"] = df["price"] * df["quantity"]
df["discount_price"] = df["price"] * 0.9
df["profit"] = df["revenue"] - df["cost"]
# Column-to-column operations
df["ratio"] = df["sales"] / df["visits"]
df["growth"] = df["current"] - df["previous"]
Data Export
# Save to CSV
df.to_csv("output.csv")
# Save to JSON
df.to_json("output.json")
# Convert to dictionary
data_dict = df.to_dict()
Real-World Examples
Example 1: Sales Analysis
from pyframex import Frame
# Load sales data
df = Frame("sales.csv")
# Calculate profit
df["profit"] = df["revenue"] - df["cost"]
df["margin"] = df["profit"] / df["revenue"]
# Find top performing regions
top_regions = df.sql("""
SELECT
region,
SUM(revenue) as total_revenue,
AVG(margin) as avg_margin
FROM df
GROUP BY region
ORDER BY total_revenue DESC
LIMIT 5
""")
print(top_regions)
Example 2: Customer Segmentation
# Load customer data
customers = Frame("customers.csv")
# Auto-clean data
customers = customers.auto_clean()
# Perform clustering
segmented = customers.auto_cluster(n_clusters=4)
# Analyze clusters
cluster_summary = segmented.groupby("cluster").agg({
    "age": "mean",
    "purchases": "sum",
    "lifetime_value": "mean"
})
print(cluster_summary)
Example 3: Predictive Modeling
# Load historical data
data = Frame("historical_sales.csv")
# Engineer features
data = data.auto_feature_engineering()
# Train model
results = data.auto_predict(target="next_month_sales")
print(f"Model R²: {results['metrics']['r2']:.4f}")
print(f"RMSE: {results['metrics']['rmse']:.2f}")
# Feature importance
for feature, importance in results['metrics']['feature_importance'].items():
    if importance > 0.05:
        print(f" {feature}: {importance:.2%}")
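The R² and RMSE values printed in this example are standard regression metrics. For reference, here is how they are conventionally computed from actual and predicted values; this follows the textbook definitions and makes no claim about how PyFrameX calculates them internally. The sample numbers are made up.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Share of the variance in actual that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
print(f"R²: {r_squared(actual, predicted):.4f}")
print(f"RMSE: {rmse(actual, predicted):.2f}")
```

`r_squared` divides by the total variance of `actual`, so it is undefined when every actual value is identical.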
Use Cases
Perfect For:
- Data Analysts - Excel-like simplicity with SQL power
- Data Scientists - Built-in ML with no setup
- Python Beginners - Intuitive, no steep learning curve
- Rapid Prototyping - Fast iteration with auto features
- Educational Projects - Learn data science easily
- Small to Medium Data - Pure Python, no heavy dependencies
Not Ideal For:
- Massive datasets (100M+ rows) - Use Polars/DuckDB
- Distributed computing - Use Spark/Dask
- Production big data pipelines - Use enterprise solutions
Architecture
PyFrameX consists of 6 core components:
+-----------------------------------------+
|            Frame (Main API)             |
|  Simple like Excel, Powerful like SQL   |
+-----------------------------------------+
                    |
          +---------+---------+
          |                   |
+-----------------+  +-----------------+
| Column Engine   |  | Query Planner   |
| - IntColumn     |  | - SQL Parser    |
| - FloatColumn   |  | - Optimizer     |
| - StringColumn  |  | - Executor      |
| - DateColumn    |  | - Cache         |
| - BoolColumn    |  +-----------------+
+-----------------+
          |
+-----------------+  +-----------------+
| AutoML          |  | Visualizer      |
| - auto_clean    |  | - Charts        |
| - auto_predict  |  | - Summaries     |
| - auto_cluster  |  | - Reports       |
+-----------------+  +-----------------+
Performance
PyFrameX is optimized for clarity and moderate-sized datasets:
- Column-oriented storage for efficient operations
- Lazy evaluation where possible
- Cached statistics to avoid recomputation
- Type-specific optimizations for each column type
- Query optimization with filter pushdown
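"Cached statistics" generally means computing an aggregate once and reusing it until the data changes. A minimal sketch of that idea follows; the class `CachedColumn` is hypothetical and only illustrates the caching and invalidation pattern, not PyFrameX's column classes.

```python
class CachedColumn:
    """A numeric column that remembers statistics until its data changes."""

    def __init__(self, values):
        self._values = list(values)
        self._cache = {}

    def _cached(self, key, compute):
        # Compute on first request, then serve the stored result.
        if key not in self._cache:
            self._cache[key] = compute()
        return self._cache[key]

    def sum(self):
        return self._cached("sum", lambda: sum(self._values))

    def mean(self):
        # Reuses the cached sum instead of scanning the values again.
        return self.sum() / len(self._values)

    def append(self, value):
        self._values.append(value)
        self._cache.clear()  # any mutation invalidates every cached statistic


col = CachedColumn([1, 2, 3])
print(col.sum())
col.append(4)
print(col.mean())
```

Clearing the whole cache on every mutation is the simplest correct policy; finer-grained invalidation would be an optimization on top of it.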
Benchmark (1M rows):
- Loading CSV: ~2-3 seconds
- Filtering: ~0.1-0.5 seconds
- Grouping: ~0.5-1 second
- SQL query: ~0.5-2 seconds
CLI Usage
# Show DataFrame info
pyframex info data.csv
# Show first 10 rows
pyframex head data.csv -n 10
# Execute SQL query
pyframex query data.csv "SELECT * FROM df WHERE age > 30"
# Auto-clean data
pyframex clean data.csv cleaned_data.csv
# Show version
pyframex version
Contributing
Contributions are welcome! Here's how you can help:
- Report bugs - Open an issue on GitHub
- Suggest features - Describe your use case
- Submit PRs - Fix bugs or add features
- Write docs - Improve documentation
- Share examples - Show how you use PyFrameX
License
MIT License - see LICENSE file for details.
Acknowledgments
PyFrameX is inspired by:
- Pandas - The gold standard for DataFrame operations
- Polars - Modern columnar data processing
- DuckDB - Fast in-process SQL
- Excel - Universal data manipulation tool
Contact & Support
- Author: Idriss Bado
- Email: idrissbadoolivier@gmail.com
- GitHub: https://github.com/idrissbado/PyFrameX
- Issues: GitHub Issues
Citation
If you use PyFrameX in your research, please cite:
@software{pyframex2024,
author = {Bado, Idriss},
title = {PyFrameX: Next-Generation Native DataFrame for Python},
year = {2024},
url = {https://github.com/idrissbado/PyFrameX}
}
Star History
If you find PyFrameX useful, please give it a star on GitHub!
Made with ❤️ by Idriss Bado
File details
Details for the file pyframex-0.1.0.tar.gz.
File metadata
- Download URL: pyframex-0.1.0.tar.gz
- Upload date:
- Size: 27.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e141c4e37c684356f1280d1ba80d685dd3dd7a9ca2b7a3aa1978d8fc0c442592 |
| MD5 | 9854214fcd0184a2fb676aec149a7180 |
| BLAKE2b-256 | 513f24067d13bc94dec56c87d8655566b2a5bf291b018c5eb2d33fb11b5a01bb |
File details
Details for the file pyframex-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pyframex-0.1.0-py3-none-any.whl
- Upload date:
- Size: 22.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d3a52aad2c9c272b80dae5f2600477e34065decee7391f1469321edca8017cc5 |
| MD5 | 4de1e176347e3c68885f91ace9ac690a |
| BLAKE2b-256 | dff17fabe9fd4e27e934e705c5fbac650ed6a8e280c799dd7a5285ee067285e4 |