BulkFlow

A high-performance Python package for efficiently loading large CSV datasets into PostgreSQL databases. Features chunked processing, automatic resume capability, and comprehensive error handling.

Key Features

  • 🚀 High Performance: Optimized chunk-based processing for handling large datasets efficiently
  • 🔄 Resume Capability: Automatically resume interrupted imports from the last successful position
  • 🛡️ Error Resilience: Comprehensive error handling with detailed logging and failed row tracking
  • 🔍 Data Validation: Preview data before import and validate row structure
  • 📊 Progress Tracking: Real-time progress updates with ETA and processing speed
  • 🔄 Duplicate Handling: Smart handling of duplicate records
  • 🔌 Connection Pooling: Efficient database connection management
  • 📝 Detailed Logging: Comprehensive logging of all operations and errors

Installation

pip install bulkflow

Quick Start

  1. Create a database configuration file (db_config.json):
{
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}
  2. Run the import:
bulkflow path/to/your/file.csv your_table_name
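
If an import is interrupted, running the same command again resumes automatically from the last successful position (see Error Handling below).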

Project Structure

bulkflow/
├── src/
│   ├── models/          # Data models
│   ├── processors/      # Core processing logic
│   ├── database/        # Database operations
│   └── utils/           # Utility functions

Usage Examples

Basic Usage

from bulkflow import process_file

db_config = {
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}

file_path = "path/to/your/file.csv"
table_name = "your_table_name"

process_file(file_path, db_config, table_name)

CLI Usage

# Basic usage
bulkflow data.csv target_table

# Custom config file
bulkflow data.csv target_table --config my_config.json

Error Handling

BulkFlow provides comprehensive error handling:

  1. Failed Rows File: failed_rows_YYYYMMDD_HHMMSS.csv

    • Records individual row failures
    • Includes row number, content, error reason, and timestamp
  2. Import State File: import_state.json

    • Tracks overall import progress
    • Enables resume capability
    • Records failed chunk information
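
As an illustration, the most recent failed-rows file can be loaded for inspection with the standard library. This is a minimal sketch: the column names used below (row_number, error_reason) are assumptions, so check the header of the file BulkFlow actually writes.

import csv
import glob

# Find the newest failed-rows file; the YYYYMMDD_HHMMSS suffix means
# lexical sort order matches chronological order.
candidates = sorted(glob.glob("failed_rows_*.csv"))
if candidates:
    with open(candidates[-1], newline="") as f:
        for row in csv.DictReader(f):
            # Column names are assumed for illustration.
            print(f"row {row['row_number']}: {row['error_reason']}")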

Performance Optimization

BulkFlow automatically optimizes performance by:

  • Calculating optimal chunk sizes based on available memory
  • Using connection pooling for database operations
  • Implementing efficient duplicate handling strategies
  • Minimizing memory usage through streaming processing
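
The exact heuristics are internal to BulkFlow, but the chunk-sizing idea in the first bullet can be sketched as follows: sample a few rows to estimate the average row width, then size each chunk so a batch fits within a fixed memory budget. The function name, budget, and sample size here are illustrative, not part of BulkFlow's API.

import csv

def estimate_chunk_size(csv_path, memory_budget=256 * 1024 * 1024, sample_rows=1000):
    """Illustrative sketch: pick a chunk size whose rows fit within memory_budget bytes."""
    total_bytes = rows_seen = 0
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            total_bytes += sum(len(field) for field in row) + len(row)  # rough per-row size
            rows_seen += 1
            if rows_seen >= sample_rows:
                break
    if rows_seen == 0:
        return 10_000  # fallback for an empty file
    avg_row_bytes = max(total_bytes / rows_seen, 1)
    return max(int(memory_budget / avg_row_bytes), 1)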

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Inspired by the need for robust, production-ready data import solutions
  • Built with modern Python best practices
  • Designed for real-world use cases and large-scale data processing

Support

If you encounter any issues or have questions:

  1. Check the Issues page
  2. Create a new issue if your problem isn't already listed
  3. Provide as much context as possible in your issue description
  4. Try to fix the issue yourself and submit a Pull Request if you can

Author

Created and maintained by Chris Willingham

AI Contribution

The majority of this project's code was generated using AI assistance, specifically:

  • Cline - AI coding assistant
  • Claude 3.5 Sonnet (new) - Large language model by Anthropic

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bulkflow-0.1.2.tar.gz (13.7 kB)

Built Distribution

bulkflow-0.1.2-py3-none-any.whl (14.6 kB)

File details

Details for the file bulkflow-0.1.2.tar.gz.

File metadata

  • Download URL: bulkflow-0.1.2.tar.gz
  • Size: 13.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for bulkflow-0.1.2.tar.gz:

  • SHA256: ad16ef74840801467849e8fbc838e983904c68c33fe0f039d81da129db366f7e
  • MD5: 461f85ed73e0e7da897a10c94ef34cd6
  • BLAKE2b-256: 85b2da4b6c1f868394b470376d8e1539f22cde62c0829b4fad67c94169f3ef2c

Provenance

The following attestation bundles were made for bulkflow-0.1.2.tar.gz:

Publisher: publish.yml on clwillingham/bulkflow

File details

Details for the file bulkflow-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: bulkflow-0.1.2-py3-none-any.whl
  • Size: 14.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for bulkflow-0.1.2-py3-none-any.whl:

  • SHA256: 7add0d6ac9d14486714e02aaa405c4d734a419516c47c2700d2d1aec7c52b4e1
  • MD5: 389324094f56dc0c9d38cb0c5e93e5d0
  • BLAKE2b-256: 7eb06fcfb1663d93de03989dde3b892ddbab3bfc74f44ea41a740c099f3dc4e4

Provenance

The following attestation bundles were made for bulkflow-0.1.2-py3-none-any.whl:

Publisher: publish.yml on clwillingham/bulkflow
