Automated memory optimization for Pandas DataFrames. Same Pandas taste, half the calories (RAM).
Diet Pandas 🐼🥗
Tagline: Same Pandas taste, half the calories (RAM).
🎯 The Problem
Pandas is built for safety and ease of use, not memory efficiency. When you load a CSV, standard Pandas defaults to "safe" but wasteful data types:
- int64 for small integers (wasting 75%+ memory per number)
- float64 for simple metrics (wasting 50% memory per number)
- object for repetitive strings (wasting massive amounts of memory and CPU)
Diet Pandas solves this by acting as a strict nutritionist for your data. It aggressively analyzes data distributions and "downcasts" types to the smallest safe representation, often reducing memory usage by 50% to 80% without losing information.
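The waste is easy to measure with plain pandas; here is a minimal before/after on a single column (no dietpandas required):

```python
import numpy as np
import pandas as pd

# One million small integers: pandas defaults to int64 (8 bytes each)
df = pd.DataFrame({"age": np.random.randint(0, 100, 1_000_000)})
before = df["age"].memory_usage(deep=True)

# The same values fit in uint8 (1 byte each) with no loss of information
df["age"] = df["age"].astype("uint8")
after = df["age"].memory_usage(deep=True)

print(f"{before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")  # ~8.00 MB -> ~1.00 MB
```

Doing this by hand for every column is tedious; automating exactly this decision is what the library is for.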
🚀 Quick Start
Installation
pip install diet-pandas
Basic Usage
import dietpandas as dp
# 1. Drop-in replacement for pandas.read_csv
# Loads faster and uses less RAM automatically
df = dp.read_csv("huge_dataset.csv")
# Diet Complete: Memory reduced by 67.3%
# 450.00MB -> 147.15MB
# 2. Or optimize an existing DataFrame
import pandas as pd
df_heavy = pd.DataFrame({
'year': [2020, 2021, 2022],
'revenue': [1.1, 2.2, 3.3]
})
print(df_heavy.info())
# year int64 (8 bytes each)
# revenue float64 (8 bytes each)
df_light = dp.diet(df_heavy)
# Diet Complete: Memory reduced by 62.5%
# 0.13MB -> 0.05MB
print(df_light.info())
# year uint16 (2 bytes each)
# revenue float32 (4 bytes each)
✨ Features
🚀 Fast Loading with Polars Engine
Diet Pandas uses Polars (a blazing-fast DataFrame library) to parse CSV files, then automatically converts to optimized Pandas DataFrames.
import dietpandas as dp
# 5-10x faster than pandas.read_csv AND uses less memory
df = dp.read_csv("large_file.csv")
🎯 Intelligent Type Optimization
import dietpandas as dp
# Automatic optimization
df = dp.diet(df_original)
# See detailed memory report
report = dp.get_memory_report(df)
print(report)
# column dtype memory_bytes memory_mb percent_of_total
# 0 large_text category 12589875 12.59 45.2
# 1 user_id uint32 4000000 4.00 14.4
🔥 Aggressive Mode (Keto Diet)
For maximum compression, use aggressive mode:
# Safe mode: float64 -> float32 (lossless for most ML tasks)
df = dp.diet(df, aggressive=False)
# Keto mode: float64 -> float16 (extreme compression, some precision loss)
df = dp.diet(df, aggressive=True)
# Diet Complete: Memory reduced by 81.2%
📁 Multiple File Format Support
import dietpandas as dp
# CSV with fast Polars engine
df = dp.read_csv("data.csv")
# Parquet
df = dp.read_parquet("data.parquet")
# Excel
df = dp.read_excel("data.xlsx")
# JSON
df = dp.read_json("data.json")
# HDF5
df = dp.read_hdf("data.h5", key="dataset1")
# Feather
df = dp.read_feather("data.feather")
# All readers automatically optimize memory usage!
🗜️ Sparse Data Optimization
For data with many repeated values (zeros, NaNs, or any repeated value):
# Enable sparse optimization for columns with >90% repeated values
df = dp.diet(df, optimize_sparse_cols=True)
# Perfect for: binary features, indicator variables, sparse matrices
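This kind of optimization maps naturally onto pandas' built-in SparseDtype, shown here with plain pandas as an illustration (the exact dietpandas mechanism may differ):

```python
import numpy as np
import pandas as pd

# A million-row int64 column that is 99% zeros
dense = pd.Series(np.zeros(1_000_000, dtype="int64"))
dense.iloc[::100] = 1  # only 1% non-zero

# Sparse storage keeps just the non-fill values plus their positions
sparse = dense.astype(pd.SparseDtype("int64", fill_value=0))

print(f"dense:  {dense.memory_usage(deep=True) / 1e6:.2f} MB")
print(f"sparse: {sparse.memory_usage(deep=True) / 1e6:.2f} MB")
```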
📅 DateTime Optimization
Automatically optimizes datetime columns for better memory efficiency:
df = pd.DataFrame({
'date': pd.date_range('2020-01-01', periods=1000000),
'value': range(1000000)
})
df_optimized = dp.diet(df, optimize_datetimes=True)
# DateTime columns automatically optimized
✅ Boolean Optimization
Automatically detects and optimizes boolean-like columns:
df = pd.DataFrame({
'is_active': [0, 1, 1, 0, 1], # int64 -> boolean (87.5% memory reduction)
'has_data': ['yes', 'no', 'yes', 'no', 'yes'], # object -> boolean
'approved': ['True', 'False', 'True', 'False', 'True'] # object -> boolean
})
df_optimized = dp.diet(df, optimize_bools=True)
# All three columns converted to memory-efficient boolean type!
Supports multiple boolean representations:
- Numeric: 0, 1
- Strings: 'true'/'false', 'yes'/'no', 'y'/'n', 't'/'f'
- Case-insensitive detection
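A detector following these rules can be sketched in a few lines (to_boolean is a hypothetical helper for illustration, not part of the dietpandas API):

```python
import pandas as pd

# Token sets matching the representations listed above
_TRUE = {"true", "yes", "y", "t", "1"}
_FALSE = {"false", "no", "n", "f", "0"}

def to_boolean(s: pd.Series) -> pd.Series:
    lowered = s.astype(str).str.strip().str.lower()
    if not lowered.isin(_TRUE | _FALSE).all():
        return s  # not boolean-like; leave the column untouched
    # pandas' nullable "boolean" dtype stores 1 byte per value (plus a mask)
    return lowered.isin(_TRUE).astype("boolean")

col = pd.Series(["Yes", "no", "Y", "F"])
print(to_boolean(col).tolist())  # [True, False, True, False]
```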
🎛️ Column-Specific Control
NEW in v0.3.0! Fine-grained control over optimization:
# Skip specific columns (e.g., IDs, UUIDs)
df = dp.diet(df, skip_columns=['user_id', 'uuid'])
# Force categorical conversion on high-cardinality columns
df = dp.diet(df, force_categorical=['country_code', 'product_sku'])
# Use aggressive mode only for specific columns
df = dp.diet(df, force_aggressive=['approximation_field', 'estimated_value'])
# Combine multiple controls
df = dp.diet(
df,
skip_columns=['id'],
force_categorical=['category'],
force_aggressive=['approx_price']
)
🔍 Pre-Flight Analysis
NEW in v0.3.0! Analyze your DataFrame before optimization to see what changes will be made:
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'id': range(1000),
'amount': [1.1, 2.2, 3.3] * 333 + [1.1],
'category': ['A', 'B', 'C'] * 333 + ['A']
})
# Analyze without modifying the DataFrame
analysis = dp.analyze(df)
print(analysis)
#
# column current_dtype recommended_dtype current_memory_mb optimized_memory_mb savings_mb savings_percent reasoning
# 0 id int64 uint16 0.008 0.002 0.006 75.0 Integer range 0-999 fits in uint16
# 1 amount float64 float32 0.008 0.004 0.004 50.0 Standard float optimization
# 2 category object category 0.057 0.001 0.056 98.2 Low cardinality (3 unique values)
# Get summary statistics
summary = dp.get_optimization_summary(analysis)
print(summary)
# {
# 'total_columns': 3,
# 'optimizable_columns': 3,
# 'current_memory_mb': 0.073,
# 'optimized_memory_mb': 0.007,
# 'total_savings_mb': 0.066,
# 'total_savings_percent': 90.4
# }
# Quick estimate without detailed analysis
reduction_pct = dp.estimate_memory_reduction(df)
print(f"Estimated reduction: {reduction_pct:.1f}%")
# Estimated reduction: 90.4%
⚠️ Smart Warnings
NEW in v0.3.0! Get helpful warnings about potential issues:
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'id': range(10000), # High cardinality
'value': [1.123456789] * 10000, # Will lose precision in float16
'empty': [None] * 10000 # All NaN column
})
# Warnings are enabled by default
df_optimized = dp.diet(df, aggressive=True, warn_on_issues=True)
# โ ๏ธ Warning: Column 'empty' is entirely NaN - consider dropping it
# โ ๏ธ Warning: Column 'id' has high cardinality (100.0%) - may not benefit from categorical
# โ ๏ธ Warning: Aggressive mode on column 'value' may lose precision (float64 -> float16)
# Disable warnings if you know what you're doing
df_optimized = dp.diet(df, aggressive=True, warn_on_issues=False)
🧪 Technical Details
How It Works
Diet Pandas uses a "Trojan Horse" architecture:
1. Ingestion Layer (The Fast Lane):
   - Uses Polars or PyArrow for multi-threaded CSV parsing (5-10x faster)
2. Optimization Layer (The Metabolism):
   - Calculates min/max for numeric columns
   - Analyzes string cardinality (unique-values ratio)
   - Maps stats to the smallest safe NumPy types
3. Conversion Layer (The Result):
   - Returns a standard pandas.DataFrame (100% compatible)
   - Works seamlessly with Scikit-Learn, PyTorch, XGBoost, Matplotlib
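The numeric half of the optimization layer boils down to a min/max check against each candidate type's range. A sketch of that idea (illustrative only, using NumPy's iinfo):

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(s: pd.Series) -> str:
    # Try candidate types from narrowest to widest and keep the first
    # whose representable range covers the column's min/max.
    lo, hi = s.min(), s.max()
    candidates = (
        ["uint8", "uint16", "uint32", "uint64"] if lo >= 0
        else ["int8", "int16", "int32", "int64"]
    )
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    return str(s.dtype)

years = pd.Series([2020, 2021, 2022])
print(smallest_int_dtype(years))  # uint16, as in the Quick Start example
```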
Optimization Rules
| Original Type | Optimization | Example |
|---|---|---|
| int64 with only 0/1 | boolean | NEW! Flags, indicators (87.5% reduction) |
| object with 'yes'/'no' | boolean | NEW! Survey responses |
| int64 with values 0-255 | uint8 | User ages, small counts |
| int64 with values -100 to 100 | int8 | Temperature data |
| float64 | float32 | Most ML features |
| object with <50% unique | category | Country names, product categories |
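The integer rows of this table can be reproduced by hand with pandas.to_numeric, which is essentially the "manual downcasting" alternative:

```python
import pandas as pd

ages = pd.Series([0, 17, 255], dtype="int64")
temps = pd.Series([-100, 0, 100], dtype="int64")

# downcast="unsigned" picks the smallest unsigned type that fits,
# downcast="integer" the smallest signed one
print(pd.to_numeric(ages, downcast="unsigned").dtype)  # uint8
print(pd.to_numeric(temps, downcast="integer").dtype)  # int8
```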
📊 Real-World Performance
import pandas as pd
import dietpandas as dp
# Standard Pandas
df = pd.read_csv("sales_data.csv") # 2.3 GB, 45 seconds
print(df.memory_usage(deep=True).sum() / 1e9) # 2.3 GB
# Diet Pandas
df = dp.read_csv("sales_data.csv") # 0.8 GB, 8 seconds
print(df.memory_usage(deep=True).sum() / 1e9) # 0.8 GB
# Diet Complete: Memory reduced by 65.2%
# 2300.00MB -> 800.00MB
🎛️ Advanced Usage
Column-Specific Control NEW!
# Skip optimization for specific columns
df = dp.diet(df, skip_columns=['user_id', 'uuid'])
# Force categorical conversion for high-cardinality columns
df = dp.diet(df, force_categorical=['country_code'])
# Apply aggressive optimization only to specific columns
df = dp.diet(df, force_aggressive=['estimated_value'])
Custom Categorical Threshold
# Convert to category if <30% unique values (default is 50%)
df = dp.diet(df, categorical_threshold=0.3)
Disable Boolean Optimization
# Keep binary columns as integers instead of converting to boolean
df = dp.diet(df, optimize_bools=False)
In-Place Optimization
# Modify DataFrame in place (saves memory)
dp.diet(df, inplace=True)
Disable Optimization for Specific Columns
import pandas as pd
import dietpandas as dp
df = dp.read_csv("data.csv", optimize=False) # Load without optimization
df = df.drop(columns=['id_column']) # Remove high-cardinality columns
df = dp.diet(df) # Now optimize
Verbose Mode
df = dp.diet(df, verbose=True)
# Diet Complete: Memory reduced by 67.3%
# 450.00MB -> 147.15MB
🧩 Integration with Data Science Stack
Diet Pandas returns standard Pandas DataFrames, so it works seamlessly with:
import dietpandas as dp
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Load optimized data
df = dp.read_csv("train.csv")
# Works with Scikit-Learn
X = df.drop('target', axis=1)
y = df['target']
model = RandomForestClassifier()
model.fit(X, y)
# Works with Matplotlib
df['revenue'].plot()
plt.show()
# Works with any Pandas operation
result = df.groupby('category')['sales'].sum()
📊 Comparison with Alternatives
| Solution | Speed | Memory Savings | Pandas Compatible | Learning Curve |
|---|---|---|---|---|
| Diet Pandas | ⚡⚡⚡ Fast | 🎯 50-80% | ✅ 100% | ✅ None |
| Manual downcasting | 🐌 Slow | 🎯 50-80% | ✅ Yes | ❌ High |
| Polars | ⚡⚡⚡ Very Fast | 🎯 60-90% | ❌ No | ⚠️ Medium |
| Dask | ⚡⚡ Medium | 🎯 Varies | ⚠️ Partial | ⚠️ Medium |
🛠️ Development
Setup
git clone https://github.com/yourusername/diet-pandas.git
cd diet-pandas
# Install in development mode
pip install -e ".[dev]"
Running Tests
pytest tests/ -v
Running Examples
python scripts/examples.py
# Or run the interactive demo
python scripts/demo.py
Project Structure
diet-pandas/
├── src/
│   └── dietpandas/
│       ├── __init__.py      # Public API
│       ├── core.py          # Optimization logic
│       └── io.py            # Fast I/O with Polars
├── tests/
│   ├── test_core.py         # Core function tests
│   └── test_io.py           # I/O function tests
├── scripts/
│   ├── demo.py              # Interactive demo
│   ├── examples.py          # Usage examples
│   └── quickstart.py        # Setup script
├── pyproject.toml           # Project configuration
├── README.md                # Documentation
├── CHANGELOG.md             # Version history
├── CONTRIBUTING.md          # Contribution guide
└── LICENSE                  # MIT License
📚 API Reference
Core Functions
diet(df, verbose=True, aggressive=False, categorical_threshold=0.5, inplace=False)
Optimize an existing DataFrame.
Parameters:
- df (pd.DataFrame): DataFrame to optimize
- verbose (bool): Print memory reduction statistics
- aggressive (bool): Use float16 instead of float32 (may lose precision)
- categorical_threshold (float): Convert to category if unique_ratio < threshold
- inplace (bool): Modify DataFrame in place
Returns: Optimized pd.DataFrame
get_memory_report(df)
Get detailed memory usage report per column.
Returns: DataFrame with memory statistics
I/O Functions
read_csv(filepath, optimize=True, aggressive=False, verbose=False, use_polars=True, **kwargs)
Read CSV with automatic optimization.
read_parquet(filepath, optimize=True, aggressive=False, verbose=False, use_polars=True, **kwargs)
Read Parquet with automatic optimization.
read_excel(filepath, optimize=True, aggressive=False, verbose=False, **kwargs)
Read Excel with automatic optimization.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
MIT License - see LICENSE file for details.
🙏 Acknowledgments
- Built on top of the excellent Pandas library
- Uses Polars for high-speed CSV parsing
- Inspired by the need for memory-efficient data science workflows
💬 Contact
- GitHub: @luiz826
- Issues: GitHub Issues
Remember: A lean DataFrame is a happy DataFrame! 🐼🥗
File details
Details for the file diet_pandas-0.3.0.tar.gz.
File metadata
- Download URL: diet_pandas-0.3.0.tar.gz
- Upload date:
- Size: 30.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 39389e1ab3e1a12987770444578010136146e0c6f29022ddac9db86cc1215139 |
| MD5 | 55e546df28b451e12e697c637c2810a4 |
| BLAKE2b-256 | 904d7c645855deb46aa1478b9dc7ed757280ea09925827a44fb6d07c9585ee76 |
File details
Details for the file diet_pandas-0.3.0-py3-none-any.whl.
File metadata
- Download URL: diet_pandas-0.3.0-py3-none-any.whl
- Upload date:
- Size: 21.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d654f1198d5bde2973bada1031c4ac566276fdbc1ebf4768e0b95640ded421d2 |
| MD5 | 7a40bbffd75ede69fb0a37e89cecb658 |
| BLAKE2b-256 | b4a6a3a76ae7b8e5c0222f148e2adf929dad1aa4c56fe8abd67caad758a47231 |