Automated memory optimization for Pandas DataFrames. Same Pandas taste, half the calories (RAM).
Diet Pandas 🐼🥗
Tagline: Same Pandas taste, half the calories (RAM).
The Problem
Pandas is built for safety and ease of use, not memory efficiency. When you load a CSV, standard Pandas defaults to "safe" but wasteful data types:
- int64 for small integers (wasting 75%+ memory per number)
- float64 for simple metrics (wasting 50% memory per number)
- object for repetitive strings (wasting massive amounts of memory and CPU)
Diet Pandas solves this by acting as a strict nutritionist for your data. It aggressively analyzes data distributions and "downcasts" types to the smallest safe representation, often reducing memory usage by 50% to 80% without losing information.
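To see what downcasting means in plain pandas terms, here is a minimal manual sketch of the idea (the column names are illustrative; Diet Pandas automates this analysis for every column):

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 99], "score": [0.5, 0.75, 1.0]})

# Manual downcasting with stock pandas: pick the narrowest safe type.
df["age"] = pd.to_numeric(df["age"], downcast="unsigned")   # int64 -> uint8
df["score"] = pd.to_numeric(df["score"], downcast="float")  # float64 -> float32
print(df.dtypes)
```

Doing this by hand for every column of a wide DataFrame is tedious and error-prone, which is the gap the library fills.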
Quick Start
Installation
pip install diet-pandas
Basic Usage
import dietpandas as dp
# 1. Drop-in replacement for pandas.read_csv
# Loads faster and uses less RAM automatically
df = dp.read_csv("huge_dataset.csv")
# Diet Complete: Memory reduced by 67.3%
# 450.00MB -> 147.15MB
# 2. Or optimize an existing DataFrame
import pandas as pd
df_heavy = pd.DataFrame({
'year': [2020, 2021, 2022],
'revenue': [1.1, 2.2, 3.3]
})
df_heavy.info()
# year int64 (8 bytes each)
# revenue float64 (8 bytes each)
df_light = dp.diet(df_heavy)
# Diet Complete: Memory reduced by 62.5%
# 0.13MB -> 0.05MB
df_light.info()
# year uint16 (2 bytes each)
# revenue float32 (4 bytes each)
Features
Fast Loading with Polars Engine
Diet Pandas uses Polars (a blazing-fast DataFrame library) to parse CSV files, then automatically converts to optimized Pandas DataFrames.
import dietpandas as dp
# 5-10x faster than pandas.read_csv AND uses less memory
df = dp.read_csv("large_file.csv")
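Under the hood the pattern is simply "parse fast, then convert". A rough standalone sketch of that pipeline (the `fast_read_csv` helper is hypothetical, and the real reader also runs the type optimizer afterwards):

```python
import io
import pandas as pd

def fast_read_csv(source):
    # Parse with Polars when it (and pyarrow, which to_pandas needs)
    # is available, fall back to plain pandas otherwise; either way
    # the caller gets a regular pandas DataFrame back.
    try:
        import polars as pl
        import pyarrow  # noqa: F401  (required by to_pandas)
        return pl.read_csv(source).to_pandas()
    except ImportError:
        return pd.read_csv(source)

buf = io.StringIO("a,b\n1,2\n3,4\n")
df = fast_read_csv(buf)
print(df.shape)  # (2, 2)
```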
Intelligent Type Optimization
import dietpandas as dp
# Automatic optimization
df = dp.diet(df_original)
# See detailed memory report
report = dp.get_memory_report(df)
print(report)
# column dtype memory_bytes memory_mb percent_of_total
# 0 large_text category 12589875 12.59 45.2
# 1 user_id uint32 4000000 4.00 14.4
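The same per-column numbers can be computed with stock pandas, which is roughly what such a report is built from (a sketch, not the library's code):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": range(1_000),
    "country": ["BR", "US"] * 500,   # object strings dominate the footprint
})
mem = df.memory_usage(deep=True, index=False)   # bytes per column
pct = (mem / mem.sum() * 100).round(1)
print(pd.DataFrame({"memory_bytes": mem, "percent_of_total": pct}))
```

Note that `deep=True` is essential: without it, pandas reports only pointer sizes for object columns and wildly understates string memory.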
Aggressive Mode (Keto Diet)
For maximum compression, use aggressive mode:
# Safe mode: float64 -> float32 (lossless for most ML tasks)
df = dp.diet(df, aggressive=False)
# Keto mode: float64 -> float16 (extreme compression, some precision loss)
df = dp.diet(df, aggressive=True)
# Diet Complete: Memory reduced by 81.2%
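The precision trade-off is easy to demonstrate with NumPy alone: float16 keeps only about 3 significant decimal digits (and overflows above 65504), which is why keto mode is opt-in:

```python
import numpy as np

x = np.float64(0.1234567)
print(np.float32(x))  # ~7 significant digits survive
print(np.float16(x))  # only ~3 significant digits survive
```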
Multiple File Format Support
import dietpandas as dp
# CSV with fast Polars engine
df = dp.read_csv("data.csv")
# Parquet
df = dp.read_parquet("data.parquet")
# Excel
df = dp.read_excel("data.xlsx")
# JSON
df = dp.read_json("data.json")
# HDF5
df = dp.read_hdf("data.h5", key="dataset1")
# Feather
df = dp.read_feather("data.feather")
# All readers automatically optimize memory usage!
Sparse Data Optimization
For data with many repeated values (zeros, NaNs, or any repeated value):
# Enable sparse optimization for columns with >90% repeated values
df = dp.diet(df, optimize_sparse_cols=True)
# Perfect for: binary features, indicator variables, sparse matrices
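Behind the flag is pandas' own sparse extension type; a minimal standalone sketch of the transformation:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.zeros(100_000))
s.iloc[::1000] = 1.0          # only 0.1% of values are non-zero

# Store only the non-fill values plus their positions.
sparse = s.astype(pd.SparseDtype("float64", fill_value=0.0))

dense_bytes = s.memory_usage(deep=True)
sparse_bytes = sparse.memory_usage(deep=True)
print(dense_bytes, sparse_bytes)  # sparse is dramatically smaller
```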
DateTime Optimization
Automatically optimizes datetime columns for better memory efficiency:
import pandas as pd
import dietpandas as dp
df = pd.DataFrame({
'date': pd.date_range('2020-01-01', periods=1000000),
'value': range(1000000)
})
df_optimized = dp.diet(df, optimize_datetimes=True)
# DateTime columns automatically optimized
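One common datetime win is parsing string dates into the native dtype; the exact optimizations `diet()` applies are internal, but this sketch shows why leaving dates as object strings is so costly:

```python
import pandas as pd

dates_as_text = pd.Series(["2020-01-01"] * 10_000)  # object dtype, one str each
dates = pd.to_datetime(dates_as_text)               # datetime64, 8 bytes each
print(dates_as_text.memory_usage(deep=True),
      dates.memory_usage(deep=True))
```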
Technical Details
How It Works
Diet Pandas uses a "Trojan Horse" architecture:
1. Ingestion Layer (The Fast Lane): uses Polars or PyArrow for multi-threaded CSV parsing (5-10x faster).
2. Optimization Layer (The Metabolism): calculates min/max for numeric columns, analyzes string cardinality (unique-value ratio), and maps those statistics to the smallest safe NumPy types.
3. Conversion Layer (The Result): returns a standard pandas.DataFrame (100% compatible) that works seamlessly with Scikit-Learn, PyTorch, XGBoost, and Matplotlib.
Optimization Rules
| Original Type | Optimization | Example |
|---|---|---|
| int64 with values 0-255 | uint8 | User ages, small counts |
| int64 with values -100 to 100 | int8 | Temperature data |
| float64 | float32 | Most ML features |
| object with <50% unique | category | Country names, product categories |
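The integer rules in the table reduce to a min/max check against each candidate type's range. This is an illustrative re-implementation of that mapping, not the library's actual code:

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(col: pd.Series) -> np.dtype:
    """Pick the narrowest integer dtype that can hold col's full range."""
    lo, hi = col.min(), col.max()
    candidates = ([np.uint8, np.uint16, np.uint32, np.uint64] if lo >= 0
                  else [np.int8, np.int16, np.int32, np.int64])
    for dt in candidates:
        info = np.iinfo(dt)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dt)
    return col.dtype  # nothing smaller fits; keep the original

print(smallest_int_dtype(pd.Series([1, 30, 255])))    # uint8
print(smallest_int_dtype(pd.Series([-40, 0, 100])))   # int8
```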
Real-World Performance
import pandas as pd
import dietpandas as dp
# Standard Pandas
df = pd.read_csv("sales_data.csv") # 2.3 GB, 45 seconds
print(df.memory_usage(deep=True).sum() / 1e9) # 2.3 GB
# Diet Pandas
df = dp.read_csv("sales_data.csv") # 0.8 GB, 8 seconds
print(df.memory_usage(deep=True).sum() / 1e9) # 0.8 GB
# Diet Complete: Memory reduced by 65.2%
# 2300.00MB -> 800.00MB
Advanced Usage
Custom Categorical Threshold
# Convert to category if <30% unique values (default is 50%)
df = dp.diet(df, categorical_threshold=0.3)
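The threshold is just a unique-value ratio; in plain pandas the check reduces to:

```python
import pandas as pd

s = pd.Series(["BR", "US", "BR", "US", "BR", "US", "DE", "BR"])
ratio = s.nunique() / len(s)    # 3 unique / 8 rows = 0.375
if ratio < 0.5:                 # diet()'s default threshold
    s = s.astype("category")    # store codes + a small lookup table
print(s.dtype)  # category
```

Lowering the threshold makes conversion stricter: high-cardinality columns (IDs, free text) stay as objects, since a category table with mostly unique entries saves nothing.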
In-Place Optimization
# Modify DataFrame in place (saves memory)
dp.diet(df, inplace=True)
Disable Optimization for Specific Columns
import pandas as pd
import dietpandas as dp
df = dp.read_csv("data.csv", optimize=False) # Load without optimization
df = df.drop(columns=['id_column']) # Remove high-cardinality columns
df = dp.diet(df) # Now optimize
Verbose Mode
df = dp.diet(df, verbose=True)
# Diet Complete: Memory reduced by 67.3%
# 450.00MB -> 147.15MB
Integration with Data Science Stack
Diet Pandas returns standard Pandas DataFrames, so it works seamlessly with:
import dietpandas as dp
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Load optimized data
df = dp.read_csv("train.csv")
# Works with Scikit-Learn
X = df.drop('target', axis=1)
y = df['target']
model = RandomForestClassifier()
model.fit(X, y)
# Works with Matplotlib
df['revenue'].plot()
plt.show()
# Works with any Pandas operation
result = df.groupby('category')['sales'].sum()
Comparison with Alternatives
| Solution | Speed | Memory Savings | Pandas Compatible | Learning Curve |
|---|---|---|---|---|
| Diet Pandas | Fast | 50-80% | 100% | None |
| Manual downcasting | Slow | 50-80% | Yes | High |
| Polars | Very fast | 60-90% | No | Medium |
| Dask | Medium | Varies | Partial | Medium |
Development
Setup
git clone https://github.com/yourusername/diet-pandas.git
cd diet-pandas
# Install in development mode
pip install -e ".[dev]"
Running Tests
pytest tests/ -v
Running Examples
python scripts/examples.py
# Or run the interactive demo
python scripts/demo.py
Project Structure
diet-pandas/
โโโ src/
โ โโโ dietpandas/
โ โโโ __init__.py # Public API
โ โโโ core.py # Optimization logic
โ โโโ io.py # Fast I/O with Polars
โโโ tests/
โ โโโ test_core.py # Core function tests
โ โโโ test_io.py # I/O function tests
โโโ scripts/
โ โโโ demo.py # Interactive demo
โ โโโ examples.py # Usage examples
โ โโโ quickstart.py # Setup script
โโโ pyproject.toml # Project configuration
โโโ README.md # Documentation
โโโ CHANGELOG.md # Version history
โโโ CONTRIBUTING.md # Contribution guide
โโโ LICENSE # MIT License
API Reference
Core Functions
diet(df, verbose=True, aggressive=False, categorical_threshold=0.5, inplace=False)
Optimize an existing DataFrame.
Parameters:
- df (pd.DataFrame): DataFrame to optimize
- verbose (bool): print memory reduction statistics
- aggressive (bool): use float16 instead of float32 (may lose precision)
- categorical_threshold (float): convert to category if the unique ratio is below the threshold
- inplace (bool): modify the DataFrame in place
Returns: Optimized pd.DataFrame
get_memory_report(df)
Get detailed memory usage report per column.
Returns: DataFrame with memory statistics
I/O Functions
read_csv(filepath, optimize=True, aggressive=False, verbose=False, use_polars=True, **kwargs)
Read CSV with automatic optimization.
read_parquet(filepath, optimize=True, aggressive=False, verbose=False, use_polars=True, **kwargs)
Read Parquet with automatic optimization.
read_excel(filepath, optimize=True, aggressive=False, verbose=False, **kwargs)
Read Excel with automatic optimization.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Acknowledgments
- Built on top of the excellent Pandas library
- Uses Polars for high-speed CSV parsing
- Inspired by the need for memory-efficient data science workflows
Contact
- GitHub: @luiz826
- Issues: GitHub Issues
Remember: A lean DataFrame is a happy DataFrame! 🐼🥗