A high-performance CSV to PostgreSQL data loader with chunked processing and error handling

Project description

BulkFlow

A high-performance Python package for efficiently loading large CSV datasets into PostgreSQL databases. Features chunked processing, automatic resume capability, and comprehensive error handling.

Key Features

  • 🚀 High Performance: Optimized chunk-based processing for handling large datasets efficiently
  • 🔄 Resume Capability: Automatically resume interrupted imports from the last successful position
  • 🛡️ Error Resilience: Comprehensive error handling with detailed logging and failed row tracking
  • 🔍 Data Validation: Preview data before import and validate row structure
  • 📊 Progress Tracking: Real-time progress updates with ETA and processing speed
  • 🔄 Duplicate Handling: Smart handling of duplicate records
  • 🔌 Connection Pooling: Efficient database connection management
  • 📝 Detailed Logging: Comprehensive logging of all operations and errors

Installation

pip install bulkflow

Quick Start

  1. Create a database configuration file (db_config.json):
{
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}
  2. Run the import:
python -m bulkflow.main path/to/your/file.csv your_table_name
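
If an import is interrupted, re-running the same command resumes from the last successful position recorded in import_state.json (see Error Handling below).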

Project Structure

bulkflow/
├── src/
│   ├── models/          # Data models
│   ├── processors/      # Core processing logic
│   ├── database/        # Database operations
│   └── utils/           # Utility functions

Usage Examples

Basic Usage

from bulkflow import process_file

db_config = {
    "dbname": "your_database",
    "user": "your_username",
    "password": "your_password",
    "host": "localhost",
    "port": "5432"
}

file_path = "path/to/your/file.csv"
table_name = "your_table_name"

process_file(file_path, db_config, table_name)
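
In practice you may want to reuse the same db_config.json that the CLI reads. A minimal sketch (the file path and table name below are placeholders):

import json

from bulkflow import process_file

# Load connection settings from the same JSON file used by the CLI
with open("db_config.json") as f:
    db_config = json.load(f)

process_file("path/to/your/file.csv", db_config, "your_table_name")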

CLI Usage

# Basic usage
python -m bulkflow.main data.csv target_table

# Custom config file
python -m bulkflow.main data.csv target_table --config my_config.json

Error Handling

BulkFlow provides comprehensive error handling:

  1. Failed Rows File: failed_rows_YYYYMMDD_HHMMSS.csv

    • Records individual row failures
    • Includes row number, content, error reason, and timestamp (see the example after this list)
  2. Import State File: import_state.json

    • Tracks overall import progress
    • Enables resume capability
    • Records failed chunk information
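
For example, a short script can summarize a failed-rows file after a run. The file name and the "error_reason" column header below are assumptions based on the description above:

import csv
from collections import Counter

# File name is illustrative; the real file includes the run timestamp,
# e.g. failed_rows_20240115_093000.csv
with open("failed_rows_20240115_093000.csv", newline="") as f:
    failed = list(csv.DictReader(f))

print(f"{len(failed)} rows failed")

# Group failures by reason; the "error_reason" column name is an assumption
reasons = Counter(row.get("error_reason", "unknown") for row in failed)
for reason, count in reasons.most_common():
    print(f"{count:>6}  {reason}")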

Performance Optimization

BulkFlow automatically optimizes performance by:

  • Calculating optimal chunk sizes based on available memory (see the sketch after this list)
  • Using connection pooling for database operations
  • Implementing efficient duplicate handling strategies
  • Minimizing memory usage through streaming processing
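
As a rough sketch of the chunk-sizing idea only (this is not BulkFlow's internal heuristic; the function name, parameters, and use of psutil are assumptions), a chunk size can be derived from available memory and an estimated row size:

import psutil  # third-party library, used here only to query available memory

def estimate_chunk_size(avg_row_bytes, memory_fraction=0.1,
                        min_rows=1_000, max_rows=100_000):
    """Keep a single in-memory chunk within a fraction of available memory."""
    available = psutil.virtual_memory().available
    rows = int(available * memory_fraction) // max(avg_row_bytes, 1)
    return max(min_rows, min(rows, max_rows))

print(estimate_chunk_size(avg_row_bytes=512))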

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Inspired by the need for robust, production-ready data import solutions
  • Built with modern Python best practices
  • Designed for real-world use cases and large-scale data processing

Support

If you encounter any issues or have questions:

  1. Check the Issues page
  2. Create a new issue if your problem isn't already listed
  3. Provide as much context as possible in your issue description

Author

Created and maintained by Chris Willingham

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bulkflow-0.1.0.tar.gz (13.1 kB)

Uploaded Source

Built Distribution

bulkflow-0.1.0-py3-none-any.whl (13.8 kB)

Uploaded Python 3

File details

Details for the file bulkflow-0.1.0.tar.gz.

File metadata

  • Download URL: bulkflow-0.1.0.tar.gz
  • Upload date:
  • Size: 13.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for bulkflow-0.1.0.tar.gz
Algorithm Hash digest
SHA256 b8e0b5694ec893230cbe9468a73704870d1c7069a8556212dbde083d3d77f9a6
MD5 c6987bd394fdb0afa32b82851d23001d
BLAKE2b-256 271f2f86b256103e7c0eab6d6c7fafbcc5918022e2cf4157400362fc34d705fe

See more details on using hashes here.

Provenance

The following attestation bundles were made for bulkflow-0.1.0.tar.gz:

Publisher: publish.yml on clwillingham/bulkflow

File details

Details for the file bulkflow-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: bulkflow-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for bulkflow-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8f77137390e3615830452d0624d69bc0067a28993724b2eb17b5b1ae300f3391
MD5 9c67289f771e6380678f8f374e20c669
BLAKE2b-256 4e0d1b39545fb038384ad75b52efbe0c7c504e8b02b3584f4d0e918fa39bbe31

See more details on using hashes here.

Provenance

The following attestation bundles were made for bulkflow-0.1.0-py3-none-any.whl:

Publisher: publish.yml on clwillingham/bulkflow
