
A lightweight, Databricks-style autoloader using Polars and SQLite.


🚀 OpenAutoLoader

License: MIT | Python 3.12+ | Powered by Polars

OpenAutoLoader is an incremental data ingestion engine. It loads files from raw cloud storage into production-ready Delta Lake tables using the Rust-based Polars engine.

Stop writing complex Spark jobs for simple file ingestion. OpenAutoLoader provides a "Databricks-style" Auto Loader experience in a lightweight Python package.


💡 Why OpenAutoLoader?

Traditional ingestion often requires heavy JVM clusters (Spark) or manual file tracking. OpenAutoLoader changes that:

  • Zero-Spark Overhead: Runs on standard Python environments with Rust-level performance.
  • Exactly-Once Processing: An integrated SQLite checkpoint records which files have been ingested, so a restarted job does not load the same file twice.
  • Schema First: Automatically infers, saves, and enforces JSON schema contracts to prevent data corruption.
  • Cloud Native: A single API for Local, S3, Azure Blob (ABFSS), and GCS.
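
To make the checkpointing idea in the list above concrete, here is a minimal sketch using only Python's standard sqlite3 module. It is not OpenAutoLoader's actual implementation: the table name processed_files and the helper functions open_checkpoint, filter_new, and mark_processed are hypothetical names chosen for illustration.

```python
import sqlite3


def open_checkpoint(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a checkpoint database with one row per ingested file."""
    conn = sqlite3.connect(db_path)
    # Hypothetical table layout; the real package may store different columns.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files ("
        "path TEXT PRIMARY KEY, batch_id TEXT NOT NULL)"
    )
    return conn


def filter_new(conn: sqlite3.Connection, candidates: list[str]) -> list[str]:
    """Return the candidate paths that are not yet recorded in the checkpoint."""
    seen = {row[0] for row in conn.execute("SELECT path FROM processed_files")}
    return [path for path in candidates if path not in seen]


def mark_processed(conn: sqlite3.Connection, paths: list[str], batch_id: str) -> None:
    """Record paths as ingested; call this only after the write has succeeded."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR IGNORE INTO processed_files (path, batch_id) VALUES (?, ?)",
            [(path, batch_id) for path in paths],
        )
```

Because path is the primary key and the insert uses OR IGNORE, recording the same file a second time is a no-op, which is what lets a restarted job skip files it has already committed.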

🛠️ Installation

# Core (Local files only)
pip install open-auto-loader

# Full Cloud Support (Recommended)
pip install "open-auto-loader[all]"

🚀 Quick Start: S3 to Delta Lake

from open_auto_loader import OpenAutoLoader

# Define your cloud credentials
storage_options = {
    "aws_access_key_id": "YOUR_ACCESS_KEY",
    "aws_secret_access_key": "YOUR_SECRET_KEY",
    "region": "ap-south-1"
}

# Initialize the loader
loader = OpenAutoLoader(
    source="s3://my-raw-bucket/incoming_logs/",
    target="s3://my-silver-bucket/tables/user_logs",
    check_point="./metadata/checkpoints.db",
    schema_path="./metadata/schemas/",
    storage_options=storage_options
)

# Run the ingestion batch
loader.run(batch_id="daily_run_2026_03_18")

🏗️ Architecture: How it Works

  1. Scanner: Uses fsspec to identify new files since the last successful batch_id.
  2. Schema Guard: Checks the file header against the stored JSON contract in schema_path.
  3. Polars Engine: Streams the data using sink_delta(), minimizing memory footprint.
  4. Metadata Injection: Automatically adds _batch_id, _processed_at, and _source_file to every row for full auditability.
  5. Committer: Updates the SQLite checkpoint only after a successful Delta write.
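
Steps 2 and 4 above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: the contract format (a JSON object with a "columns" list) and the functions check_schema and add_audit_columns are hypothetical, and rows are modelled as dictionaries rather than Polars frames to keep the example dependency-free.

```python
import json
from datetime import datetime, timezone

# Hypothetical contract format: an ordered list of expected column names.
EXAMPLE_CONTRACT = json.loads('{"columns": ["user_id", "event"]}')


def check_schema(header: list[str], contract: dict) -> None:
    """Raise ValueError if the file's columns differ from the stored contract."""
    expected = list(contract["columns"])
    if list(header) != expected:
        raise ValueError(f"schema mismatch: expected {expected}, got {list(header)}")


def add_audit_columns(rows: list[dict], batch_id: str, source_file: str) -> list[dict]:
    """Return copies of the rows with the three audit columns appended."""
    # One timestamp per call, so every row of a file shares the same value.
    processed_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            **row,
            "_batch_id": batch_id,
            "_processed_at": processed_at,
            "_source_file": source_file,
        }
        for row in rows
    ]
```

In this sketch a file whose header fails check_schema is rejected before any rows are written, which matches the ordering of the steps above: the guard runs before the write, and the checkpoint is updated only afterwards.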

📋 Compatibility Matrix

Feature Local AWS S3 Azure Blob Google GCS
Incremental Loading
Schema Enforcement
Service Principal Auth N/A
Streaming Sink

🤝 Contributing

Contributions are welcome! Whether it's a bug fix, a new cloud provider, or performance tuning, feel free to open a PR.

Created with ❤️ by Nitish Katkade
