Persistent Pandas DataFrame storage and retrieval using a SQL database, HDF5, CSV files, or pickle files.

TrashPandas: Persistent Pandas DataFrame Storage and Retrieval

What is it?

TrashPandas is a modern Python package that provides persistent Pandas DataFrame storage and retrieval using SQL databases, CSV files, HDF5, or pickle files. Version 1.0.0 brings significant improvements including SQLAlchemy 2.x support, comprehensive type hints, modern Python features, and enhanced error handling.

✨ Main Features

  • Multiple Storage Backends: SQL databases, CSV files, HDF5, and pickle files
  • Preserve Data Integrity: Maintains indexes and data types during storage/retrieval
  • Format Conversion: Transfer DataFrames between different storage formats
  • Modern Python Support: Full type hints, context managers, and iterator protocol
  • Bulk Operations: Efficient batch processing with store_many(), load_many(), delete_many()
  • Compression Support: Optional compression for CSV and pickle storage
  • Comprehensive Error Handling: Custom exception hierarchy with detailed error messages
  • SQLAlchemy 2.x: Full support for the latest SQLAlchemy with async capabilities
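To see why preserving indexes and data types matters, the following sketch uses plain pandas only (no TrashPandas calls) to round-trip a DataFrame through CSV text. The column names and values are invented for illustration. A naive read loses the named index and the datetime dtype; both have to be restored explicitly, which is the kind of bookkeeping a storage layer can take over.

```python
import io

import pandas as pd

# Sample frame with a named index and a datetime column (illustrative data).
df = pd.DataFrame(
    {
        "age": [23, 34, 44],
        "joined": pd.to_datetime(["2020-01-01", "2021-06-15", "2022-03-30"]),
    },
    index=pd.Index(["Joe", "Bob", "John"], name="name"),
)

# Write the frame to CSV text held in memory.
buffer = io.StringIO()
df.to_csv(buffer)

# Naive read: the index comes back as an ordinary column and the
# datetime column comes back as text.
buffer.seek(0)
naive = pd.read_csv(buffer)
index_lost = naive.index.name != df.index.name
dtype_lost = naive["joined"].dtype != df["joined"].dtype

# Explicit read: the caller has to remember which column was the index
# and which columns held dates.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="name", parse_dates=["joined"])

print(f"index lost: {index_lost}, dtype lost: {dtype_lost}")
print(restored.dtypes)
```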

🚀 Quick Start

Installation

# Basic installation
pip install trashpandas

# With HDF5 support
pip install trashpandas[hdf5]

# Development dependencies
pip install trashpandas[dev]

Basic Usage

import pandas as pd
import sqlalchemy as sa
import trashpandas as tp

# Create sample data
df = pd.DataFrame({'name': ['Joe', 'Bob', 'John'], 'age': [23, 34, 44]})

# SQL Storage
with tp.SqlStorage('sqlite:///test.db') as storage:
    storage['people'] = df
    loaded_df = storage['people']
    print(f"Stored {len(storage)} tables")

# CSV Storage with compression
csv_storage = tp.CsvStorage('./data', compression='gzip')
csv_storage.store(df, 'people')

# Pickle Storage
pickle_storage = tp.PickleStorage('./pickles', compression='bz2')
pickle_storage.store(df, 'people')

📖 Example Notebooks

Check out these interactive Jupyter notebooks demonstrating TrashPandas features:

All notebooks are fully executed with outputs included. Click the links above to view them on GitHub or open them in Jupyter Notebook/Lab.

📚 API Reference

Storage Classes

SqlStorage

# Create SQL storage
storage = tp.SqlStorage('sqlite:///test.db')
# or with existing engine
engine = sa.create_engine('sqlite:///test.db')
storage = tp.SqlStorage(engine)

# Basic operations
storage.store(df, 'table_name')
df = storage.load('table_name')
storage.delete('table_name')

# Dictionary-like interface
storage['table_name'] = df
df = storage['table_name']
del storage['table_name']

# Bulk operations (df1 and df2 are existing DataFrames)
storage.store_many({'table1': df1, 'table2': df2})
results = storage.load_many(['table1', 'table2'])
storage.delete_many(['table1', 'table2'])

# Context manager
with storage:
    storage['data'] = df

CsvStorage

# Basic CSV storage
storage = tp.CsvStorage('./data')

# With compression
storage = tp.CsvStorage('./data', compression='gzip')

# Operations
storage.store(df, 'table_name')
df = storage.load('table_name')

PickleStorage

# Basic pickle storage
storage = tp.PickleStorage('./pickles')

# With custom extension and compression
storage = tp.PickleStorage('./pickles', file_extension='.pkl', compression='bz2')

# Operations
storage.store(df, 'table_name')
df = storage.load('table_name')

HdfStorage (Optional)

# Requires: pip install trashpandas[hdf5]
storage = tp.HdfStorage('data.h5')
storage.store(df, 'table_name')
df = storage.load('table_name')
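Because the storage classes above share store() and load(), moving data between formats can be written once against that shared interface. The sketch below is a minimal illustration, not part of TrashPandas: copy_all and InMemoryStorage are hypothetical names, and the helper assumes that the source object yields table names when iterated (this README shows that only for SqlStorage).

```python
from typing import Any, Dict, Iterator, List


def copy_all(source: Any, target: Any) -> List[str]:
    """Copy every table from source to target; return the names copied.

    Hypothetical helper: it relies only on iteration over table names,
    load(name), and store(df, name).
    """
    copied = []
    # Materialize the names first so the loop is safe even if the
    # source changes while it is being read.
    for name in list(source):
        target.store(source.load(name), name)
        copied.append(name)
    return copied


class InMemoryStorage:
    """Hypothetical stand-in backend so this sketch runs without a database."""

    def __init__(self) -> None:
        self._tables: Dict[str, Any] = {}

    def store(self, df: Any, table_name: str) -> None:
        self._tables[table_name] = df

    def load(self, table_name: str) -> Any:
        return self._tables[table_name]

    def __iter__(self) -> Iterator[str]:
        return iter(list(self._tables))


source = InMemoryStorage()
source.store([1, 2, 3], "numbers")
source.store(["a", "b"], "letters")

target = InMemoryStorage()
copied = copy_all(source, target)
print(copied)
```

With real backends the call might look like copy_all(tp.SqlStorage('sqlite:///test.db'), tp.CsvStorage('./data')), assuming both objects support the three operations the helper uses.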

Modern Features

Iterator Protocol

storage = tp.SqlStorage('sqlite:///test.db')

# Iterate over table names
for table_name in storage:
    print(f"Table: {table_name}")

# Check if table exists
if 'my_table' in storage:
    df = storage['my_table']

# Get number of tables
print(f"Total tables: {len(storage)}")
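For readers curious how an object can support iteration, the in operator, len(), and item access all at once, here is a toy sketch built on collections.abc.MutableMapping from the standard library. ToyPickleStore is a hypothetical class written for this example; it is not how TrashPandas is implemented, and it stores arbitrary picklable objects rather than DataFrames.

```python
import pickle
import tempfile
from collections.abc import MutableMapping
from pathlib import Path
from typing import Any, Iterator


class ToyPickleStore(MutableMapping):
    """Hypothetical dict-like store that keeps one pickle file per key."""

    def __init__(self, folder: str) -> None:
        self.folder = Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        return self.folder / f"{name}.pickle"

    def __setitem__(self, name: str, value: Any) -> None:
        with self._path(name).open("wb") as handle:
            pickle.dump(value, handle)

    def __getitem__(self, name: str) -> Any:
        path = self._path(name)
        if not path.exists():
            raise KeyError(name)
        with path.open("rb") as handle:
            return pickle.load(handle)

    def __delitem__(self, name: str) -> None:
        path = self._path(name)
        if not path.exists():
            raise KeyError(name)
        path.unlink()

    def __iter__(self) -> Iterator[str]:
        # Iterating yields stored names in sorted order.
        return iter(sorted(p.stem for p in self.folder.glob("*.pickle")))

    def __len__(self) -> int:
        return sum(1 for _ in self.folder.glob("*.pickle"))


with tempfile.TemporaryDirectory() as tmp:
    store = ToyPickleStore(tmp)
    store["people"] = {"name": ["Joe", "Bob"], "age": [23, 34]}
    store["orders"] = [1, 2, 3]

    names = list(store)            # uses __iter__
    has_people = "people" in store  # MutableMapping derives this from __getitem__
    count = len(store)             # uses __len__

    del store["orders"]
    count_after_delete = len(store)

print(names, has_people, count, count_after_delete)
```

Defining the five methods above is enough; MutableMapping supplies the in operator, keys(), items(), and similar helpers on top of them.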

Context Managers

# Automatic resource cleanup
with tp.SqlStorage('sqlite:///test.db') as storage:
    storage['data'] = df
# The connection is closed automatically when the block exits
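The cleanup guarantee comes from Python's context manager protocol: __exit__ runs whether the block finishes normally or raises. The sketch below shows the mechanism with the standard library's sqlite3 module. ManagedConnection is a hypothetical class for illustration and makes no claim about how SqlStorage manages its engine.

```python
import sqlite3


class ManagedConnection:
    """Hypothetical wrapper that closes its SQLite connection on exit."""

    def __init__(self, database: str) -> None:
        self.connection = sqlite3.connect(database)
        self.closed = False

    def close(self) -> None:
        if not self.closed:
            self.connection.close()
            self.closed = True

    def __enter__(self) -> "ManagedConnection":
        return self

    def __exit__(self, exc_type, exc, traceback) -> bool:
        self.close()
        return False  # do not swallow exceptions raised inside the block


# Normal exit: the connection is closed once the block ends.
with ManagedConnection(":memory:") as managed:
    managed.connection.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    managed.connection.execute("INSERT INTO people VALUES ('Joe', 23)")
    rows = managed.connection.execute("SELECT name, age FROM people").fetchall()
closed_after_block = managed.closed

# Exit through an exception: cleanup still runs before the error propagates.
try:
    with ManagedConnection(":memory:") as failing:
        raise ValueError("simulated failure")
except ValueError:
    closed_after_error = failing.closed

print(rows, closed_after_block, closed_after_error)
```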

Bulk Operations

# Store multiple DataFrames efficiently (users_df, orders_df, and products_df are existing DataFrames)
dataframes = {
    'users': users_df,
    'orders': orders_df,
    'products': products_df
}
storage.store_many(dataframes)

# Load multiple tables
tables = ['users', 'orders', 'products']
results = storage.load_many(tables)

# Delete multiple tables
storage.delete_many(tables)

Compression Support

# CSV with compression
csv_storage = tp.CsvStorage('./data', compression='gzip')

# Pickle with compression
pickle_storage = tp.PickleStorage('./pickles', compression='bz2')

# Supported compression types: 'gzip', 'bz2', 'xz', 'zstd'
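To get a feel for what compression buys, the sketch below writes the same DataFrame twice with plain pandas, once uncompressed and once with compression='gzip', then reads the compressed file back. It does not call TrashPandas; whether CsvStorage forwards its compression argument to pandas in this way is an assumption. The data is deliberately repetitive, so the size difference is larger than typical real data would show.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Repetitive sample data (illustrative), which compresses very well.
df = pd.DataFrame(
    {"name": ["Joe", "Bob", "John"] * 1000, "age": [23, 34, 44] * 1000}
)

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "people.csv"
    gzip_path = Path(tmp) / "people.csv.gz"

    df.to_csv(plain_path, index=False)
    df.to_csv(gzip_path, index=False, compression="gzip")

    plain_size = plain_path.stat().st_size
    gzip_size = gzip_path.stat().st_size

    # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    is_gzip = gzip_path.read_bytes()[:2] == b"\x1f\x8b"

    restored = pd.read_csv(gzip_path, compression="gzip")

same_rows = restored.to_dict("list") == df.to_dict("list")
print(f"plain: {plain_size} bytes, gzip: {gzip_size} bytes, intact: {same_rows}")
```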

Error Handling

from trashpandas.exceptions import TableNotFoundError, MetadataCorruptedError

try:
    df = storage.load('nonexistent_table')
except TableNotFoundError as e:
    print(f"Table not found: {e.table_name}")
except MetadataCorruptedError as e:
    print(f"Metadata corrupted: {e.details}")
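A custom exception hierarchy lets callers catch one specific failure, or any storage failure, with ordinary except clauses. The sketch below shows the general pattern using hypothetical Toy* classes. The real classes live in trashpandas.exceptions, and their base class and constructor signatures are not documented here, so nothing below should be read as their actual definition.

```python
from typing import Any, Dict


class ToyStorageError(Exception):
    """Hypothetical base class for all storage failures."""


class ToyTableNotFoundError(ToyStorageError):
    """Raised when a requested table does not exist."""

    def __init__(self, table_name: str) -> None:
        super().__init__(f"Table '{table_name}' not found")
        self.table_name = table_name


class ToyMetadataCorruptedError(ToyStorageError):
    """Raised when stored metadata cannot be interpreted."""

    def __init__(self, table_name: str, details: str) -> None:
        super().__init__(f"Metadata for '{table_name}' is corrupted: {details}")
        self.table_name = table_name
        self.details = details


def load(tables: Dict[str, Any], table_name: str) -> Any:
    """Look up a table, translating KeyError into a domain-specific error."""
    try:
        return tables[table_name]
    except KeyError:
        raise ToyTableNotFoundError(table_name) from None


def describe_failure(tables: Dict[str, Any], table_name: str) -> str:
    try:
        load(tables, table_name)
    except ToyTableNotFoundError as error:
        # Most specific handler first.
        return f"missing: {error.table_name}"
    except ToyStorageError as error:
        # Fallback for any other storage failure.
        return f"storage problem: {error}"
    return "ok"


results = [
    describe_failure({"people": []}, "people"),
    describe_failure({}, "orders"),
]
print(results)
```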

🔄 Migration from 0.x to 1.0

Breaking Changes

  1. SQLAlchemy 2.x Required: Update your SQLAlchemy version

    pip install "SQLAlchemy>=2.0.0"
    
  2. Path Parameters: Storage classes now accept pathlib.Path objects in addition to strings

    # 0.x and 1.0: string paths still work
    storage = tp.CsvStorage('/path/to/data')

    # 1.0 (recommended): pathlib.Path
    from pathlib import Path
    storage = tp.CsvStorage(Path('/path/to/data'))
    
  3. Method Signatures: Some internal methods have updated signatures; existing calls remain backward compatible

    # 0.x call, unchanged in 1.0
    storage.store(df, 'table')

    # 1.0 adds an optional schema parameter
    storage.store(df, 'table', schema='my_schema')
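Accepting both strings and pathlib.Path objects, as item 2 describes, usually comes down to normalizing the argument once at the boundary. The following is a minimal sketch of that pattern using only the standard library; normalize_folder is a hypothetical helper, not a TrashPandas function.

```python
import os
from pathlib import Path
from typing import Union

# Either a plain string or any object implementing os.PathLike.
PathLike = Union[str, "os.PathLike[str]"]


def normalize_folder(folder: PathLike) -> Path:
    """Return folder as a pathlib.Path, whether it arrived as str or Path."""
    return Path(folder)


from_string = normalize_folder("/path/to/data")
from_path = normalize_folder(Path("/path/to/data"))

# Both spellings refer to the same location after normalization.
same_location = from_string == from_path

# Path objects compose with the / operator instead of string concatenation.
csv_file = from_path / "people.csv"
print(same_location, csv_file.name)
```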
    

New Features

  1. Context Managers: Wrap a storage object in a with statement for automatic cleanup
  2. Iterator Protocol: Iterate over storage objects
  3. Bulk Operations: Efficient batch processing
  4. Compression: Optional compression for file-based storage
  5. Better Error Handling: Comprehensive exception hierarchy

🛠️ Development

Setup Development Environment

git clone https://github.com/eddiethedean/trashpandas.git
cd trashpandas
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=trashpandas

# Run specific test file
pytest tests/test_sql.py

Code Quality

# Linting with ruff
ruff check src tests

# Type checking with mypy
mypy src

# Format code
ruff format src tests

📋 Requirements

  • Python 3.8+
  • pandas >= 1.3.0
  • SQLAlchemy >= 2.0.0
  • h5py >= 3.0.0 (optional, for HDF5 support)

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

🙏 Acknowledgments

  • pandas for the excellent DataFrame library
  • SQLAlchemy for robust database connectivity
  • h5py for HDF5 support
  • The Python community for inspiration and feedback

TrashPandas - Making DataFrame persistence simple and reliable! 🐼
