A lightweight, Databricks-style autoloader using Polars and SQLite.
🚀 OpenAutoLoader
OpenAutoLoader is an incremental data ingestion engine. It moves raw files from cloud storage into production-ready Delta Lake tables using Polars and its Rust engine.
Stop writing complex Spark jobs for simple file ingestion. OpenAutoLoader provides a "Databricks-style" Auto Loader experience in a lightweight Python package.
💡 Why OpenAutoLoader?
Traditional ingestion often requires heavy JVM clusters (Spark) or manual file tracking. OpenAutoLoader changes that:
- Zero-Spark Overhead: Runs in a standard Python environment, with the heavy lifting done by Polars' Rust engine instead of a JVM cluster.
- Exactly-Once Processing: Integrated SQLite checkpointing ensures no duplicate data, even if a job restarts.
- Schema First: Automatically infers, saves, and enforces JSON schema contracts to prevent data corruption.
- Cloud Native: A single API for Local, S3, Azure Blob (ABFSS), and GCS.
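
To make the "Schema First" bullet concrete, here is a minimal sketch of inferring, saving, and enforcing a JSON schema contract with only the standard library. The function names (`infer_schema`, `enforce_contract`), the contract layout, and the contract file location are illustrative assumptions, not OpenAutoLoader's actual internals.

```python
import json
from pathlib import Path


def infer_schema(record: dict) -> dict:
    """Map each column name to the Python type name of its value."""
    return {name: type(value).__name__ for name, value in record.items()}


def enforce_contract(record: dict, contract_file: Path) -> dict:
    """Save the contract on first use; afterwards reject mismatching records."""
    observed = infer_schema(record)
    if not contract_file.exists():
        # First run: persist the inferred schema as the contract (hypothetical layout).
        contract_file.parent.mkdir(parents=True, exist_ok=True)
        contract_file.write_text(json.dumps(observed, indent=2))
        return observed
    expected = json.loads(contract_file.read_text())
    if observed != expected:
        raise ValueError(f"Schema drift: expected {expected}, got {observed}")
    return expected
```

The first call writes the contract; later calls with a different column set or type raise `ValueError` before any data is written.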
🛠️ Installation
# Core (Local files only)
pip install open-auto-loader
# Full Cloud Support (Recommended)
pip install "open-auto-loader[all]"
🚀 Quick Start: S3 to Delta Lake
from open_auto_loader import OpenAutoLoader

# Define your cloud credentials
storage_options = {
    "aws_access_key_id": "YOUR_ACCESS_KEY",
    "aws_secret_access_key": "YOUR_SECRET_KEY",
    "region": "ap-south-1",
}

# Initialize the loader
loader = OpenAutoLoader(
    source="s3://my-raw-bucket/incoming_logs/",
    target="s3://my-silver-bucket/tables/user_logs",
    check_point="./metadata/checkpoints.db",
    schema_path="./metadata/schemas/",
    storage_options=storage_options,
)

# Run the ingestion batch
loader.run(batch_id="daily_run_2026_03_18")
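
Hardcoding keys is fine for a first test, but in practice you will probably want to read them from the environment. A minimal sketch, assuming the same `storage_options` keys as in the Quick Start; the helper name `storage_options_from_env` is hypothetical and not part of the package:

```python
import os


def storage_options_from_env(env=os.environ) -> dict:
    """Build the storage_options dict from the standard AWS environment variables."""
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        # Fall back to the region used in the Quick Start if none is set.
        "region": env.get("AWS_DEFAULT_REGION", "ap-south-1"),
    }
```

The result can be passed as `storage_options=storage_options_from_env()` when constructing the loader. A missing key raises `KeyError`, so a misconfigured job fails before it touches any data.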
🏗️ Architecture: How it Works
- Scanner: Uses `fsspec` to identify new files since the last successful `batch_id`.
- Schema Guard: Checks the file header against the stored JSON contract in `schema_path`.
- Polars Engine: Streams the data using `sink_delta()`, minimizing memory footprint.
- Metadata Injection: Automatically adds `_batch_id`, `_processed_at`, and `_source_file` to every row for full auditability.
- Committer: Updates the SQLite checkpoint only after a successful Delta write.
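
The Scanner and Committer steps above can be sketched with the standard library alone. This is an illustration of the checkpoint-after-write pattern, not the package's real code: it swaps `fsspec` for `pathlib` on a local directory, and the table name `processed_files`, the function names, and the `write` callback (standing in for the Delta write) are all assumptions.

```python
import sqlite3
from pathlib import Path


def open_checkpoint(db_path: str) -> sqlite3.Connection:
    """Open (or create) the checkpoint database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files ("
        "path TEXT PRIMARY KEY, batch_id TEXT NOT NULL)"
    )
    return conn


def find_new_files(conn: sqlite3.Connection, source_dir: str) -> list:
    """Scanner: list files in source_dir that are not yet checkpointed."""
    seen = {row[0] for row in conn.execute("SELECT path FROM processed_files")}
    candidates = sorted(str(p) for p in Path(source_dir).iterdir() if p.is_file())
    return [path for path in candidates if path not in seen]


def run_batch(conn: sqlite3.Connection, source_dir: str, batch_id: str, write) -> list:
    """Write each new file, then record all of them in one transaction."""
    new_files = find_new_files(conn, source_dir)
    for path in new_files:
        write(path, batch_id)  # any exception here skips the commit below
    with conn:  # Committer: commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO processed_files (path, batch_id) VALUES (?, ?)",
            [(path, batch_id) for path in new_files],
        )
    return new_files
```

Because the checkpoint is only updated after every write has returned, a failed batch leaves the files unrecorded and they are picked up again on the next run. In this sketch, a crash between the write and the commit would also cause a retry, so the write step should be idempotent.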
📋 Compatibility Matrix
| Feature | Local | AWS S3 | Azure Blob | Google GCS |
|---|---|---|---|---|
| Incremental Loading | ✅ | ✅ | ✅ | ✅ |
| Schema Enforcement | ✅ | ✅ | ✅ | ✅ |
| Service Principal Auth | N/A | ✅ | ✅ | ✅ |
| Streaming Sink | ✅ | ✅ | ✅ | ✅ |
🤝 Contributing
Contributions are welcome! Whether it's a bug fix, a new cloud provider, or performance tuning, feel free to open a PR.
Created with ❤️ by Nitish Katkade