BulkFlow
A high-performance Python package for efficiently loading large CSV datasets into PostgreSQL databases. Features chunked processing, automatic resume capability, and comprehensive error handling.
Key Features
- 🚀 High Performance: Optimized chunk-based processing for handling large datasets efficiently
- 🔄 Resume Capability: Automatically resume interrupted imports from the last successful position
- 🛡️ Error Resilience: Comprehensive error handling with detailed logging and failed row tracking
- 🔍 Data Validation: Preview data before import and validate row structure
- 📊 Progress Tracking: Real-time progress updates with ETA and processing speed
- 🔄 Duplicate Handling: Smart handling of duplicate records
- 🔌 Connection Pooling: Efficient database connection management
- 📝 Detailed Logging: Comprehensive logging of all operations and errors
Installation
```bash
pip install bulkflow
```
Quick Start
- Create a database configuration file (`db_config.json`):
```json
{
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}
```
- Run the import:
```bash
bulkflow path/to/your/file.csv your_table_name
```
Project Structure
```
bulkflow/
├── src/
│   ├── models/      # Data models
│   ├── processors/  # Core processing logic
│   ├── database/    # Database operations
│   └── utils/       # Utility functions
```
Usage Examples
Basic Usage
```python
from bulkflow import process_file

db_config = {
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}

file_path = "path/to/your/file.csv"
table_name = "your_table_name"

process_file(file_path, db_config, table_name)
```
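For long-running imports you may want your own logging around the call. A minimal sketch, reusing `db_config` from above and assuming `process_file` raises an exception on fatal errors (the exact exception types are not documented here):

```python
import logging

from bulkflow import process_file

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("csv_import")

try:
    process_file("path/to/your/file.csv", db_config, "your_table_name")
except Exception:
    # Failed rows and import state are written to disk (see Error Handling
    # below), so re-running can resume from the last successful position.
    logger.exception("Import failed; re-run to resume from the saved state")
    raise
```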
CLI Usage
```bash
# Basic usage
bulkflow data.csv target_table

# Custom config file
bulkflow data.csv target_table --config my_config.json
```
Error Handling
BulkFlow provides comprehensive error handling via two on-disk artifacts (both are inspected in the sketch after this list):

- Failed Rows File: `failed_rows_YYYYMMDD_HHMMSS.csv`
  - Records individual row failures
  - Includes row number, content, error reason, and timestamp
- Import State File: `import_state.json`
  - Tracks overall import progress
  - Enables resume capability
  - Records failed chunk information
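Both artifacts are plain CSV and JSON, so they can be inspected with the standard library. A minimal sketch; the timestamp in the `failed_rows_*` filename and the `rows_processed` key are illustrative placeholders (check your own files for the real names and schema):

```python
import csv
import json

# Count recorded row failures; substitute the timestamp of your actual file.
with open("failed_rows_20250101_120000.csv", newline="") as f:
    failed_rows = list(csv.DictReader(f))
print(f"{len(failed_rows)} rows failed")

# Peek at the saved import state. "rows_processed" is a hypothetical key;
# inspect your own import_state.json for the actual schema.
with open("import_state.json") as f:
    state = json.load(f)
print(state.get("rows_processed", "unknown"))
```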
Performance Optimization
BulkFlow automatically optimizes performance by:
- Calculating optimal chunk sizes based on available memory
- Using connection pooling for database operations
- Implementing efficient duplicate handling strategies
- Minimizing memory usage through streaming processing (see the sketch below)
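As a simplified illustration of these ideas (not BulkFlow's actual internals), the sketch below derives a chunk size from available memory with `psutil`, streams rows from the CSV, and inserts each chunk through a `psycopg2` connection pool. Every number in the sizing heuristic is an assumption:

```python
import csv

import psutil  # assumption: used here only to estimate available memory
from psycopg2.extras import execute_values
from psycopg2.pool import SimpleConnectionPool

# Heuristic: budget ~5% of free memory per chunk, assuming roughly
# 1 KiB per row. These numbers are purely illustrative.
available_bytes = psutil.virtual_memory().available
chunk_size = max(1_000, min(100_000, available_bytes // 20 // 1024))

# A small pool lets repeated loads reuse connections instead of reconnecting.
pool = SimpleConnectionPool(
    1, 4,
    dbname="your_database", user="your_username",
    password="your_password", host="localhost", port=5432,
)

def load_in_chunks(path: str, table: str) -> None:
    conn = pool.getconn()
    try:
        with open(path, newline="") as f, conn.cursor() as cur:
            reader = csv.reader(f)
            header = next(reader)
            # Table/column names are interpolated directly here;
            # only pass trusted identifiers.
            insert_sql = f"INSERT INTO {table} ({', '.join(header)}) VALUES %s"
            batch = []
            for row in reader:
                batch.append(row)
                if len(batch) >= chunk_size:
                    execute_values(cur, insert_sql, batch)
                    conn.commit()  # commit per chunk: a crash loses at most one chunk
                    batch.clear()
            if batch:  # flush the final partial chunk
                execute_values(cur, insert_sql, batch)
                conn.commit()
    finally:
        pool.putconn(conn)
```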
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Inspired by the need for robust, production-ready data import solutions
- Built with modern Python best practices
- Designed for real-world use cases and large-scale data processing
Support
If you encounter any issues or have questions:
- Check the Issues page
- Create a new issue if your problem isn't already listed
- Provide as much context as possible in your issue description
- Try to fix the issue yourself and submit a Pull Request if you can
Author
Created and maintained by Chris Willingham
AI Contribution
The majority of this project's code was generated using AI assistance, specifically:
- Cline - AI coding assistant
- Claude 3.5 Sonnet (new) - Large language model by Anthropic
- In fact... the entire project was generated by AI. I'm kinda freakin' out right now!
- Even the name was generated by AI... I'm not sure if I count as the author. All hail our robot overlords!
File details
Details for the file `bulkflow-0.1.4.tar.gz`.
File metadata
- Download URL: bulkflow-0.1.4.tar.gz
- Upload date:
- Size: 13.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 1d081a5fdbce17468681e6778fa827862ccea9730d4b492d663645f3734197aa |
| MD5 | b497a102f52961409677dce13c643ed5 |
| BLAKE2b-256 | 0f0baeaa2e099dac9bf8b3aca3147457df8961e5642ac39b8ff177a427fc9243 |
Provenance
The following attestation bundles were made for `bulkflow-0.1.4.tar.gz`:

Publisher: `publish.yml` on clwillingham/bulkflow

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bulkflow-0.1.4.tar.gz
- Subject digest: 1d081a5fdbce17468681e6778fa827862ccea9730d4b492d663645f3734197aa
- Sigstore transparency entry: 148295214
- Sigstore integration time:
File details
Details for the file `bulkflow-0.1.4-py3-none-any.whl`.
File metadata
- Download URL: bulkflow-0.1.4-py3-none-any.whl
- Upload date:
- Size: 14.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | d3647a228adc432606d4065251c3211c83ab88209536dce8c6d6a4e3d5d49751 |
| MD5 | a761874131bcb1d32a62614f7053dbf9 |
| BLAKE2b-256 | 2c725d52a6ed8a18777f51d56afe08b3c2f697eb5d881514adc8ddf647dbccbe |
Provenance
The following attestation bundles were made for `bulkflow-0.1.4-py3-none-any.whl`:

Publisher: `publish.yml` on clwillingham/bulkflow

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: bulkflow-0.1.4-py3-none-any.whl
- Subject digest: d3647a228adc432606d4065251c3211c83ab88209536dce8c6d6a4e3d5d49751
- Sigstore transparency entry: 148295215
- Sigstore integration time: