
Diet Pandas ๐Ÿผ๐Ÿฅ—

Tagline: Same Pandas taste, half the calories (RAM).

PyPI version Python 3.10+ License: MIT Documentation

🎯 The Problem

Pandas is built for safety and ease of use, not memory efficiency. When you load a CSV, standard Pandas defaults to "safe" but wasteful data types:

  • int64 for small integers (wasting 75%+ memory per number)
  • float64 for simple metrics (wasting 50% memory per number)
  • object for repetitive strings (wasting massive amounts of memory and CPU)

Diet Pandas solves this by acting as a strict nutritionist for your data. It aggressively analyzes data distributions and "downcasts" types to the smallest safe representation, often reducing memory usage by 50% to 80% without losing information.
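You can measure the waste with plain pandas before reaching for any tooling:

```python
import numpy as np
import pandas as pd

# One million small integers: pandas defaults to 8-byte int64,
# even though every value fits in a single uint8 byte.
n = 1_000_000
ages = pd.Series(np.random.randint(0, 100, size=n), dtype="int64")

before = ages.memory_usage(index=False)                 # 8 bytes per value
after = ages.astype("uint8").memory_usage(index=False)  # 1 byte per value
print(f"int64: {before / 1e6:.1f} MB -> uint8: {after / 1e6:.1f} MB")
# int64: 8.0 MB -> uint8: 1.0 MB
```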

🚀 Quick Start

Installation

pip install diet-pandas

Basic Usage

import dietpandas as dp

# 1. Drop-in replacement for pandas.read_csv
# Loads faster and uses less RAM automatically
df = dp.read_csv("huge_dataset.csv")
# Diet Complete: Memory reduced by 67.3%
#    450.00MB -> 147.15MB

# 2. Or optimize an existing DataFrame
import pandas as pd
df_heavy = pd.DataFrame({
    'year': [2020, 2021, 2022], 
    'revenue': [1.1, 2.2, 3.3]
})

df_heavy.info()
# year       int64   (8 bytes each)
# revenue    float64 (8 bytes each)

df_light = dp.diet(df_heavy)
# Diet Complete: Memory reduced by 62.5%
#    0.13MB -> 0.05MB

df_light.info()
# year       uint16  (2 bytes each)
# revenue    float32 (4 bytes each)

✨ Features

๐Ÿƒ Fast Loading with Polars Engine

Diet Pandas uses Polars (a blazing-fast DataFrame library) to parse CSV files, then automatically converts to optimized Pandas DataFrames.

import dietpandas as dp

# 5-10x faster than pandas.read_csv AND uses less memory
df = dp.read_csv("large_file.csv")

🎯 Intelligent Type Optimization

import dietpandas as dp

# Automatic optimization
df = dp.diet(df_original)

# See detailed memory report
report = dp.get_memory_report(df)
print(report)
#         column    dtype  memory_bytes  memory_mb  percent_of_total
# 0  large_text  category      12589875      12.59              45.2
# 1     user_id     uint32       4000000       4.00              14.4

🔥 Aggressive Mode (Keto Diet)

For maximum compression, use aggressive mode:

# Safe mode: float64 -> float32 (lossless for most ML tasks)
df = dp.diet(df, aggressive=False)

# Keto mode: float64 -> float16 (extreme compression, some precision loss)
df = dp.diet(df, aggressive=True)
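To see what the keto trade-off costs, compare how a typical value survives each float width (plain numpy, independent of the package):

```python
import numpy as np

x = 3.14159265

# float32 keeps ~7 significant digits: the round-trip error is negligible
print(abs(float(np.float32(x)) - x))  # on the order of 1e-8

# float16 keeps only ~3 significant digits: the error becomes visible
print(abs(float(np.float16(x)) - x))  # on the order of 1e-3
```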
# Diet Complete: Memory reduced by 81.2%

📊 Multiple File Format Support

import dietpandas as dp

# CSV with fast Polars engine
df = dp.read_csv("data.csv")

# Parquet
df = dp.read_parquet("data.parquet")

# Excel
df = dp.read_excel("data.xlsx")

# JSON
df = dp.read_json("data.json")

# HDF5
df = dp.read_hdf("data.h5", key="dataset1")

# Feather
df = dp.read_feather("data.feather")

# All readers automatically optimize memory usage!

๐Ÿ—œ๏ธ Sparse Data Optimization

For data with many repeated values (zeros, NaNs, or any repeated value):

# Enable sparse optimization for columns with >90% repeated values
df = dp.diet(df, optimize_sparse_cols=True)
# Perfect for: binary features, indicator variables, sparse matrices
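Pandas' built-in SparseDtype illustrates the mechanism; a minimal sketch with a 99%-zero column (independent of diet-pandas' actual internals):

```python
import numpy as np
import pandas as pd

# A 100,000-row column where only 1 in 100 values is non-zero
dense = pd.Series(np.zeros(100_000))
dense.iloc[::100] = 1.0

sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# The dense column stores all 100,000 floats; the sparse one stores
# only the 1,000 non-zero values plus their positions.
print(dense.memory_usage(index=False))   # 800000 bytes
print(sparse.memory_usage(index=False))  # a small fraction of that
```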

📅 DateTime Optimization

Automatically optimizes datetime columns for better memory efficiency:

df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=1000000),
    'value': range(1000000)
})

df_optimized = dp.diet(df, optimize_datetimes=True)
# DateTime columns automatically optimized

🧪 Technical Details

How It Works

Diet Pandas uses a "Trojan Horse" architecture:

  1. Ingestion Layer (The Fast Lane):

    • Uses Polars or PyArrow for multi-threaded CSV parsing (5-10x faster)
  2. Optimization Layer (The Metabolism):

    • Calculates min/max for numeric columns
    • Analyzes string cardinality (unique values ratio)
    • Maps stats to smallest safe numpy types
  3. Conversion Layer (The Result):

    • Returns a standard pandas.DataFrame (100% compatible)
    • Works seamlessly with Scikit-Learn, PyTorch, XGBoost, Matplotlib
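The min/max mapping in the optimization layer can be sketched in a few lines (a simplified stand-in using numpy's type ranges, not the package's actual code):

```python
import numpy as np
import pandas as pd

def downcast_int(col: pd.Series) -> pd.Series:
    """Cast to the smallest integer dtype whose range covers the column."""
    lo, hi = col.min(), col.max()
    candidates = ["uint8", "uint16", "uint32"] if lo >= 0 else ["int8", "int16", "int32"]
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return col.astype(dtype)
    return col  # nothing smaller is safe; keep the original 64-bit type

ages = pd.Series([23, 45, 67], dtype="int64")
print(downcast_int(ages).dtype)  # uint8
```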

Optimization Rules

Original Type                    Optimized Type   Example
int64 with values 0-255          uint8            User ages, small counts
int64 with values -100 to 100    int8             Temperature data
float64                          float32          Most ML features
object with <50% unique values   category         Country names, product categories
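Pandas itself ships a comparable rule set in pd.to_numeric(downcast=...), which makes a handy cross-check for these rules:

```python
import pandas as pd

counts = pd.Series([0, 128, 255])                        # int64 by default
print(pd.to_numeric(counts, downcast="unsigned").dtype)  # uint8

temps = pd.Series([-100, 0, 100])
print(pd.to_numeric(temps, downcast="integer").dtype)    # int8
```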

📈 Real-World Performance

import pandas as pd
import dietpandas as dp

# Standard Pandas
df = pd.read_csv("sales_data.csv")  # 2.3 GB, 45 seconds
print(df.memory_usage(deep=True).sum() / 1e9)  # 2.3 GB

# Diet Pandas
df = dp.read_csv("sales_data.csv")  # 0.8 GB, 8 seconds
print(df.memory_usage(deep=True).sum() / 1e9)  # 0.8 GB
# Diet Complete: Memory reduced by 65.2%
#    2300.00MB -> 800.00MB

๐ŸŽ›๏ธ Advanced Usage

Custom Categorical Threshold

# Convert to category if <30% unique values (default is 50%)
df = dp.diet(df, categorical_threshold=0.3)
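Under the plausible assumption that the ratio is simply nunique()/len(), the threshold check behind this knob looks like the following (maybe_categorize is an illustrative helper, not part of the package):

```python
import pandas as pd

def maybe_categorize(col: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Convert to category when the share of unique values is below threshold."""
    unique_ratio = col.nunique() / len(col)
    return col.astype("category") if unique_ratio < threshold else col

countries = pd.Series(["US", "DE", "US", "US", "FR", "US", "DE", "US"])
print(maybe_categorize(countries).dtype)  # category (3 unique / 8 rows = 0.375)
```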

In-Place Optimization

# Modify DataFrame in place (saves memory)
dp.diet(df, inplace=True)

Disable Optimization for Specific Columns

import pandas as pd
import dietpandas as dp

df = dp.read_csv("data.csv", optimize=False)  # Load without optimization
df = df.drop(columns=['id_column'])  # Remove high-cardinality columns
df = dp.diet(df)  # Now optimize

Verbose Mode

df = dp.diet(df, verbose=True)
# Diet Complete: Memory reduced by 67.3%
#    450.00MB -> 147.15MB

🧩 Integration with Data Science Stack

Diet Pandas returns standard Pandas DataFrames, so it works seamlessly with:

import dietpandas as dp
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# Load optimized data
df = dp.read_csv("train.csv")

# Works with Scikit-Learn
X = df.drop('target', axis=1)
y = df['target']
model = RandomForestClassifier()
model.fit(X, y)

# Works with Matplotlib
df['revenue'].plot()
plt.show()

# Works with any Pandas operation
result = df.groupby('category')['sales'].sum()

🆚 Comparison with Alternatives

Solution             Speed             Memory Savings   Pandas Compatible   Learning Curve
Diet Pandas          ⚡⚡⚡ Fast          🎯 50-80%        ✅ 100%             ✅ None
Manual downcasting   🐌 Slow           🎯 50-80%        ✅ Yes              ❌ High
Polars               ⚡⚡⚡ Very Fast     🎯 60-90%        ❌ No               ⚠️ Medium
Dask                 ⚡⚡ Medium         🎯 Varies        ⚠️ Partial          ⚠️ Medium

๐Ÿ› ๏ธ Development

Setup

git clone https://github.com/yourusername/diet-pandas.git
cd diet-pandas

# Install in development mode
pip install -e ".[dev]"

Running Tests

pytest tests/ -v

Running Examples

python scripts/examples.py

# Or run the interactive demo
python scripts/demo.py

Project Structure

diet-pandas/
├── src/
│   └── dietpandas/
│       ├── __init__.py      # Public API
│       ├── core.py          # Optimization logic
│       └── io.py            # Fast I/O with Polars
├── tests/
│   ├── test_core.py         # Core function tests
│   └── test_io.py           # I/O function tests
├── scripts/
│   ├── demo.py              # Interactive demo
│   ├── examples.py          # Usage examples
│   └── quickstart.py        # Setup script
├── pyproject.toml           # Project configuration
├── README.md                # Documentation
├── CHANGELOG.md             # Version history
├── CONTRIBUTING.md          # Contribution guide
└── LICENSE                  # MIT License

๐Ÿ“ API Reference

Core Functions

diet(df, verbose=True, aggressive=False, categorical_threshold=0.5, inplace=False)

Optimize an existing DataFrame.

Parameters:

  • df (pd.DataFrame): DataFrame to optimize
  • verbose (bool): Print memory reduction statistics
  • aggressive (bool): Use float16 instead of float32 (may lose precision)
  • categorical_threshold (float): Convert to category if unique_ratio < threshold
  • inplace (bool): Modify DataFrame in place

Returns: Optimized pd.DataFrame

get_memory_report(df)

Get detailed memory usage report per column.

Returns: DataFrame with memory statistics

I/O Functions

read_csv(filepath, optimize=True, aggressive=False, verbose=False, use_polars=True, **kwargs)

Read CSV with automatic optimization.

read_parquet(filepath, optimize=True, aggressive=False, verbose=False, use_polars=True, **kwargs)

Read Parquet with automatic optimization.

read_excel(filepath, optimize=True, aggressive=False, verbose=False, **kwargs)

Read Excel with automatic optimization.

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

MIT License - see LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built on top of the excellent Pandas library
  • Uses Polars for high-speed CSV parsing
  • Inspired by the need for memory-efficient data science workflows

📬 Contact


Remember: A lean DataFrame is a happy DataFrame! 🐼🥗
