
A high-performance CSV to PostgreSQL data loader with chunked processing and error handling

Project description

BulkFlow

A high-performance Python package for efficiently loading large CSV datasets into PostgreSQL databases. Features chunked processing, automatic resume capability, and comprehensive error handling.

Key Features

  • 🚀 High Performance: Optimized chunk-based processing for handling large datasets efficiently
  • 🔄 Resume Capability: Automatically resume interrupted imports from the last successful position
  • 🛡️ Error Resilience: Comprehensive error handling with detailed logging and failed row tracking
  • 🔍 Data Validation: Preview data before import and validate row structure
  • 📊 Progress Tracking: Real-time progress updates with ETA and processing speed
  • 🔄 Duplicate Handling: Smart handling of duplicate records
  • 🔌 Connection Pooling: Efficient database connection management
  • 📝 Detailed Logging: Comprehensive logging of all operations and errors

Installation

pip install bulkflow

Quick Start

  1. Create a database configuration file (db_config.json):
{
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}
  2. Run the import:
bulkflow path/to/your/file.csv your_table_name

Project Structure

bulkflow/
├── src/
│   ├── models/          # Data models
│   ├── processors/      # Core processing logic
│   ├── database/        # Database operations
│   └── utils/           # Utility functions

Usage Examples

Basic Usage

from bulkflow import process_file

db_config = {
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}

file_path = "path/to/your/file.csv"   # CSV file to load
table_name = "your_table_name"        # target PostgreSQL table

process_file(file_path, db_config, table_name)

CLI Usage

# Basic usage
bulkflow data.csv target_table

# Custom config file
bulkflow data.csv target_table --config my_config.json

Error Handling

BulkFlow provides comprehensive error handling through two files, sketched after this list:

  1. Failed Rows File: failed_rows_YYYYMMDD_HHMMSS.csv

    • Records individual row failures
    • Includes row number, content, error reason, and timestamp
  2. Import State File: import_state.json

    • Tracks overall import progress
    • Enables resume capability
    • Records failed chunk information
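The exact contents of these files are internal to BulkFlow and may change between releases. As a rough, illustrative sketch only (the field names here are assumptions, not the package's documented schema), the import state file might hold something like:

{
    "file_path": "path/to/your/file.csv",
    "table_name": "your_table_name",
    "last_successful_row": 150000,
    "failed_chunks": [
        {"start_row": 120001, "end_row": 125000, "error": "connection timeout"}
    ]
}

Because this file is what enables the resume capability, an interrupted run restarted with the same command can pick up from the last successful position.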

Performance Optimization

BulkFlow automatically optimizes performance (a simplified sketch of these techniques follows the list) by:

  • Calculating optimal chunk sizes based on available memory
  • Using connection pooling for database operations
  • Implementing efficient duplicate handling strategies
  • Minimizing memory usage through streaming processing
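BulkFlow handles all of this internally, but the general pattern is worth seeing. The following is a simplified, illustrative sketch (not BulkFlow's actual code) of how chunked, pooled loading can be written with psycopg2; the function names, the 64 MB memory budget, and ON CONFLICT DO NOTHING as the duplicate strategy are assumptions made for the example:

import csv

from psycopg2 import pool
from psycopg2.extras import execute_values


def estimate_chunk_size(csv_path, memory_budget_bytes=64 * 1024 * 1024, sample_rows=1000):
    """Pick a chunk size so a single chunk stays roughly within the memory budget."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                                  # skip the header row
        sample = [row for _, row in zip(range(sample_rows), reader)]
    if not sample:
        return 1000
    avg_row_bytes = sum(len(",".join(row)) for row in sample) / len(sample)
    return max(1000, int(memory_budget_bytes / max(avg_row_bytes, 1)))


def load_csv(csv_path, db_config, table_name):
    chunk_size = estimate_chunk_size(csv_path)
    conn_pool = pool.SimpleConnectionPool(1, 4, **db_config)   # reuse connections
    conn = conn_pool.getconn()
    try:
        with open(csv_path, newline="") as f, conn.cursor() as cur:
            reader = csv.reader(f)
            columns = next(reader)                    # header row names the target columns
            insert_sql = (
                f"INSERT INTO {table_name} ({', '.join(columns)}) "
                "VALUES %s ON CONFLICT DO NOTHING"    # one simple duplicate strategy
            )
            chunk = []
            for row in reader:                        # streaming: one chunk in memory at a time
                chunk.append(row)
                if len(chunk) >= chunk_size:
                    execute_values(cur, insert_sql, chunk)
                    conn.commit()
                    chunk.clear()
            if chunk:                                 # flush the final partial chunk
                execute_values(cur, insert_sql, chunk)
                conn.commit()
    finally:
        conn_pool.putconn(conn)
        conn_pool.closeall()

Keeping only one chunk in memory at a time is what allows files far larger than RAM to be processed, and reusing pooled connections avoids reconnecting for every batch.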

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Inspired by the need for robust, production-ready data import solutions
  • Built with modern Python best practices
  • Designed for real-world use cases and large-scale data processing

Support

If you encounter any issues or have questions:

  1. Check the Issues page
  2. Create a new issue if your problem isn't already listed
  3. Provide as much context as possible in your issue description
  4. Try to fix the issue yourself and submit a Pull Request if you can

Author

Created and maintained by Chris Willingham

AI Contribution

The majority of this project's code was generated using AI assistance, specifically:

  • Cline - AI coding assistant
  • Claude 3.5 Sonnet (new) - Large language model by Anthropic
  • In fact... the entire project was generated by AI. I'm kinda freaking out right now!
  • Even the name was generated by AI... I'm not sure I even count as the author. All hail our robot overlords!



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bulkflow-0.1.4.tar.gz (13.8 kB)

Built Distribution

bulkflow-0.1.4-py3-none-any.whl (14.7 kB)

File details

Details for the file bulkflow-0.1.4.tar.gz.

File metadata

  • Download URL: bulkflow-0.1.4.tar.gz
  • Upload date:
  • Size: 13.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for bulkflow-0.1.4.tar.gz

  • SHA256: 1d081a5fdbce17468681e6778fa827862ccea9730d4b492d663645f3734197aa
  • MD5: b497a102f52961409677dce13c643ed5
  • BLAKE2b-256: 0f0baeaa2e099dac9bf8b3aca3147457df8961e5642ac39b8ff177a427fc9243


Provenance

The following attestation bundles were made for bulkflow-0.1.4.tar.gz:

Publisher: publish.yml on clwillingham/bulkflow


File details

Details for the file bulkflow-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: bulkflow-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 14.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for bulkflow-0.1.4-py3-none-any.whl

  • SHA256: d3647a228adc432606d4065251c3211c83ab88209536dce8c6d6a4e3d5d49751
  • MD5: a761874131bcb1d32a62614f7053dbf9
  • BLAKE2b-256: 2c725d52a6ed8a18777f51d56afe08b3c2f697eb5d881514adc8ddf647dbccbe


Provenance

The following attestation bundles were made for bulkflow-0.1.4-py3-none-any.whl:

Publisher: publish.yml on clwillingham/bulkflow

