Persistent Pandas DataFrame storage and retrieval using a SQL database, HDF5, CSV files, or pickle files.
Project description
TrashPandas: Persistent Pandas DataFrame Storage and Retrieval
What is it?
TrashPandas is a modern Python package that provides persistent Pandas DataFrame storage and retrieval using SQL databases, CSV files, HDF5, or pickle files. Version 1.0.2 brings significant improvements including SQLAlchemy 2.x support, comprehensive type hints, modern Python features, enhanced error handling, and improved CI/CD reliability.
✨ Main Features
- Multiple Storage Backends: SQL databases, CSV files, HDF5, and pickle files
- Preserve Data Integrity: Maintains indexes and data types during storage/retrieval
- Format Conversion: Transfer DataFrames between different storage formats
- Modern Python Support: Full type hints, context managers, and iterator protocol
- Bulk Operations: Efficient batch processing with
store_many(),load_many(),delete_many() - Compression Support: Optional compression for CSV and pickle storage
- Comprehensive Error Handling: Custom exception hierarchy with detailed error messages
- Schema Validation: Robust validation for SQL schema names and metadata
- Cross-Platform Compatibility: Tested on multiple Python versions (3.8-3.12) and operating systems
- Comprehensive Testing: 252+ tests with 76% code coverage
🚀 Quick Start
Installation
# Basic installation
pip install trashpandas
# With HDF5 support
pip install trashpandas[hdf5]
# Development dependencies
pip install trashpandas[dev]
Basic Usage
import pandas as pd
import sqlalchemy as sa
import trashpandas as tp
# Create sample data
df = pd.DataFrame({'name': ['Joe', 'Bob', 'John'], 'age': [23, 34, 44]})
# SQL Storage
with tp.SqlStorage('sqlite:///test.db') as storage:
storage['people'] = df
loaded_df = storage['people']
print(f"Stored {len(storage)} tables")
# CSV Storage with compression
csv_storage = tp.CsvStorage('./data', compression='gzip')
csv_storage.store(df, 'people')
# Pickle Storage
pickle_storage = tp.PickleStorage('./pickles', compression='bz2')
pickle_storage.store(df, 'people')
📖 Example Notebooks
Check out these interactive Jupyter notebooks demonstrating TrashPandas features:
- Basic Usage - Introduction to CSV, SQL, and Pickle storage
- Advanced Features - Compression, bulk operations, and data type preservation
- Format Conversion - Converting DataFrames between different storage formats
- Query Capabilities - Advanced SQL querying with WHERE clauses and filtering
All notebooks are fully executed with outputs included. Click the links above to view them on GitHub or open them in Jupyter Notebook/Lab.
📚 API Reference
Storage Classes
SqlStorage
# Create SQL storage
storage = tp.SqlStorage('sqlite:///test.db')
# or with existing engine
engine = sa.create_engine('sqlite:///test.db')
storage = tp.SqlStorage(engine)
# Basic operations
storage.store(df, 'table_name')
df = storage.load('table_name')
storage.delete('table_name')
# Dictionary-like interface
storage['table_name'] = df
df = storage['table_name']
del storage['table_name']
# Bulk operations
storage.store_many({'table1': df1, 'table2': df2})
results = storage.load_many(['table1', 'table2'])
storage.delete_many(['table1', 'table2'])
# Context manager
with storage:
storage['data'] = df
CsvStorage
# Basic CSV storage
storage = tp.CsvStorage('./data')
# With compression
storage = tp.CsvStorage('./data', compression='gzip')
# Operations
storage.store(df, 'table_name')
df = storage.load('table_name')
PickleStorage
# Basic pickle storage
storage = tp.PickleStorage('./pickles')
# With custom extension and compression
storage = tp.PickleStorage('./pickles', file_extension='.pkl', compression='bz2')
# Operations
storage.store(df, 'table_name')
df = storage.load('table_name')
HdfStorage (Optional)
# Requires: pip install trashpandas[hdf5]
storage = tp.HdfStorage('data.h5')
storage.store(df, 'table_name')
df = storage.load('table_name')
Modern Features
Iterator Protocol
storage = tp.SqlStorage('sqlite:///test.db')
# Iterate over table names
for table_name in storage:
print(f"Table: {table_name}")
# Check if table exists
if 'my_table' in storage:
df = storage['my_table']
# Get number of tables
print(f"Total tables: {len(storage)}")
Context Managers
# Automatic resource cleanup
with tp.SqlStorage('sqlite:///test.db') as storage:
storage['data'] = df
# Connection automatically closed
Bulk Operations
# Store multiple DataFrames efficiently
dataframes = {
'users': users_df,
'orders': orders_df,
'products': products_df
}
storage.store_many(dataframes)
# Load multiple tables
tables = ['users', 'orders', 'products']
results = storage.load_many(tables)
# Delete multiple tables
storage.delete_many(tables)
Compression Support
# CSV with compression
csv_storage = tp.CsvStorage('./data', compression='gzip')
# Pickle with compression
pickle_storage = tp.PickleStorage('./pickles', compression='bz2')
# Supported compression types: 'gzip', 'bz2', 'xz', 'zstd'
Error Handling
from trashpandas.exceptions import TableNotFoundError, MetadataCorruptedError
try:
df = storage.load('nonexistent_table')
except TableNotFoundError as e:
print(f"Table not found: {e.table_name}")
except MetadataCorruptedError as e:
print(f"Metadata corrupted: {e.details}")
🔄 Migration from 0.x to 1.0.2
Breaking Changes
-
SQLAlchemy 2.x Required: Update your SQLAlchemy version
pip install "SQLAlchemy>=2.0.0"
-
Path Parameters: Storage classes now accept
pathlib.Pathobjects# Old storage = tp.CsvStorage('/path/to/data') # New (still works) storage = tp.CsvStorage('/path/to/data') # New (recommended) from pathlib import Path storage = tp.CsvStorage(Path('/path/to/data'))
-
Method Signatures: Some internal methods have updated signatures
# Old storage.store(df, 'table') # New (backward compatible) storage.store(df, 'table') storage.store(df, 'table', schema='my_schema') # New optional parameter
New Features in 1.0.2
- Enhanced Schema Validation: Improved validation for SQL schema names and metadata
- Better Error Handling: More specific exception types and detailed error messages
- Improved Compatibility: Fixed numpy/PyTables compatibility issues across Python versions
- Robust CI/CD: Comprehensive testing across Python 3.8-3.12 with reliable builds
- Context Managers: Use
withstatements for automatic cleanup - Iterator Protocol: Iterate over storage objects
- Bulk Operations: Efficient batch processing
- Compression: Optional compression for file-based storage
🛠️ Development
Setup Development Environment
git clone https://github.com/eddiethedean/trashpandas.git
cd trashpandas
pip install -e ".[dev]"
Running Tests
# Run all tests with tox (recommended)
tox
# Run specific Python version
tox -e py311
tox -e py312
# Run linting
tox -e lint
# Run type checking
tox -e mypy
# Run with coverage
tox -e coverage
# Or run pytest directly
pytest
pytest --cov=trashpandas
pytest tests/test_sql.py
Code Quality
# Linting with ruff
ruff check src tests
# Type checking with mypy
mypy src
# Format code
ruff format src tests
📋 Requirements
- Python 3.8+
- pandas >= 1.3.0
- SQLAlchemy >= 2.0.0
- h5py >= 3.10.0 (optional, for HDF5 support)
- tables >= 3.8.0 (optional, for HDF5 support)
- numpy >= 1.21.0, < 2.0.0 (for PyTables compatibility)
📝 Recent Changes (v1.0.2)
🐛 Bug Fixes
- Fixed schema validation in
load_metadata_sqlfunction - Resolved numpy/PyTables compatibility issues across Python versions
- Improved error handling for edge cases in metadata operations
🔧 Improvements
- Enhanced CI/CD pipeline with comprehensive testing
- Added proper dependency version constraints for stability
- Improved cross-platform compatibility testing
- Better error messages and exception handling
🧪 Testing
- All 252 tests passing with 76% code coverage
- Comprehensive testing across Python 3.8-3.12
- Robust CI/CD with reliable builds on GitHub Actions
🤝 Contributing
Contributions are welcome! Please see our Contributing Guide for details.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- pandas for the excellent DataFrame library
- SQLAlchemy for robust database connectivity
- h5py for HDF5 support
- The Python community for inspiration and feedback
TrashPandas - Making DataFrame persistence simple and reliable! 🐼
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file trashpandas-1.0.2.tar.gz.
File metadata
- Download URL: trashpandas-1.0.2.tar.gz
- Upload date:
- Size: 47.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9370abe61e9d1e2a8a54563abe1ca4376616b6754e35e3dfe439abf611a2f442
|
|
| MD5 |
348dfafeda4268482e3d049b15ab9e48
|
|
| BLAKE2b-256 |
652b401cd550be72ce04df28864cc8f4270c8ff78e353a539c6b02cb3b0311f9
|
File details
Details for the file trashpandas-1.0.2-py3-none-any.whl.
File metadata
- Download URL: trashpandas-1.0.2-py3-none-any.whl
- Upload date:
- Size: 34.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
60f3647343a45aa974b6e6620ce7ccf27a99d9d830ffee46e876706630613f5a
|
|
| MD5 |
8c978deaba6aa3c7883c1e5c811ea738
|
|
| BLAKE2b-256 |
1995995bf9633fc0f69a17f523fb647470bdf04ecbd5ecd2bec5f97f908b6fbc
|