A lightweight, Databricks-style autoloader using Polars and SQLite.
🚀 OpenAutoLoader
OpenAutoLoader is a high-performance, incremental data ingestion library for Python. It provides a "Set and Forget" experience for ingesting raw files from local storage, AWS S3, Azure Blob Storage, and Google Cloud Storage into Delta Lake tables, and is built entirely on the Polars engine.
✨ Key Features
- Incremental Loading: SQLite-backed checkpoint system ensures files are processed exactly once.
- Multi-Cloud Support: Native support for `s3://`, `abfss://`, and `gs://` protocols.
- Schema Governance: Automatically bootstraps and enforces a strict JSON contract to prevent data poisoning.
- Streaming Execution: Leverages Polars' `sink_delta(streaming=True)` to process datasets larger than RAM.
- Audit-Ready: Automatically injects metadata: `_batch_id`, `_processed_at`, and `_file_path`.
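To make the audit columns concrete, here is a minimal plain-Python sketch of what tagging a batch could look like. It is not the library's Polars implementation: `add_audit_columns` is a hypothetical helper, rows are modelled as dictionaries rather than a LazyFrame, the file name `users.csv` is made up, and the UTC ISO timestamp format is an assumption.

```python
from datetime import datetime, timezone


def add_audit_columns(rows, batch_id, file_path):
    """Return copies of `rows` with the three audit columns attached.

    Hypothetical helper: the real library adds these columns through Polars.
    """
    # One timestamp per call, so every row of a batch shares the same value.
    processed_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            **row,
            "_batch_id": batch_id,
            "_processed_at": processed_at,
            "_file_path": file_path,
        }
        for row in rows
    ]


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
tagged = add_audit_columns(
    rows,
    batch_id="daily_batch_001",
    file_path="s3://my-raw-bucket/incoming/users.csv",  # made-up file name
)
```

Each output row keeps its original fields and gains the three metadata fields, which is what lets a record in the target table be traced back to the batch and source file that produced it.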
🛠️ Architecture
OpenAutoLoader uses a modular architecture designed for extensibility:
| Component | Responsibility |
|---|---|
| `OpenAutoLoader` | The Orchestrator. Coordinates discovery, validation, and execution. |
| `FileScanner` | Discovery. Uses fsspec to recursively find new files across cloud providers. |
| `PolarsEngine` | Execution. Handles LazyFrame transformations and high-speed Delta sinks. |
| `SchemaManager` | Governance. Serializes the data contract to JSON and validates new batches. |
| `CheckPointManager` | Persistence. Tracks processed file paths to prevent duplicate ingestion. |
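The persistence step can be illustrated with a small SQLite-backed store. This is a sketch of the general technique, not the internals of `CheckPointManager`: the class name `CheckpointSketch`, the table name `processed_files`, its columns, and the example paths are all assumptions made for illustration.

```python
import sqlite3


class CheckpointSketch:
    """Hypothetical stand-in for a SQLite-backed checkpoint store."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        # The primary key on file_path means a path can be recorded only once.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS processed_files ("
            "file_path TEXT PRIMARY KEY, batch_id TEXT NOT NULL)"
        )

    def filter_new(self, paths):
        """Return only the paths that have not been recorded yet."""
        seen = {
            row[0]
            for row in self.conn.execute("SELECT file_path FROM processed_files")
        }
        return [p for p in paths if p not in seen]

    def mark_processed(self, paths, batch_id):
        """Record paths in a single transaction after a successful write."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT OR IGNORE INTO processed_files (file_path, batch_id) "
                "VALUES (?, ?)",
                [(p, batch_id) for p in paths],
            )


checkpoint = CheckpointSketch()
first = checkpoint.filter_new(["in/a.csv", "in/b.csv"])
checkpoint.mark_processed(first, "daily_batch_001")
second = checkpoint.filter_new(["in/a.csv", "in/b.csv", "in/c.csv"])
```

On the second scan only `in/c.csv` is returned, because the first two paths were already recorded. Marking files as processed only after the sink succeeds is the design choice that keeps a failed batch eligible for a retry.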
🚀 Quick Start
1. Installation
# Core library
pip install open_auto_loader
# With Cloud Drivers (Optional)
pip install s3fs adlfs gcsfs
2. Cloud Ingestion (AWS S3 Example)
from open_auto_loader import OpenAutoLoader
# Define your cloud credentials
storage_options = {
    "aws_access_key_id": "YOUR_KEY",
    "aws_secret_access_key": "YOUR_SECRET",
    "aws_region": "ap-south-1"
}
loader = OpenAutoLoader(
    source="s3://my-raw-bucket/incoming/",
    target="s3://my-silver-bucket/tables/users",
    check_point="./metadata",   # Checkpoints stay local for speed
    schema_path="./contracts",  # Schemas stay local for governance
    format_type="csv",
    storage_options=storage_options
)
loader.run(batch_id="daily_batch_001")
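Hardcoding credentials as in the example above is convenient for a first run, but you may prefer to read them from the environment. The sketch below builds the same `storage_options` dictionary from environment variables. `storage_options_from_env` is a hypothetical helper, not part of the library, and the variable names `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION` are assumptions about your setup.

```python
import os


def storage_options_from_env(env=None):
    """Build the S3 storage_options dict from environment variables.

    Hypothetical helper; `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        # Falls back to the region used in the example above.
        "aws_region": env.get("AWS_REGION", "ap-south-1"),
    }


# Demonstrated with an explicit dict so the example does not depend on
# the machine it runs on.
example_options = storage_options_from_env(
    {"AWS_ACCESS_KEY_ID": "YOUR_KEY", "AWS_SECRET_ACCESS_KEY": "YOUR_SECRET"}
)
```

The resulting dictionary can be passed as `storage_options=` exactly as in the example above.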
☁️ Supported Cloud Protocols
| Provider | Protocol | Required Driver | storage_options keys |
|---|---|---|---|
| Local | `file://` | None | None |
| AWS S3 | `s3://` | `s3fs` | `aws_access_key_id`, `aws_secret_access_key`, `aws_region` |
| Azure Blob | `abfss://` | `adlfs` | `account_name`, `account_key` |
| GCP GCS | `gs://` | `gcsfs` | `token` (path to JSON key) |
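The protocol-to-driver mapping in the table can be checked before a run starts. The following is a minimal sketch using only the standard library. `required_driver` and `REQUIRED_DRIVER` are hypothetical names, the Azure URI is a made-up example, and treating a path without a scheme as local is an assumption.

```python
from urllib.parse import urlparse

# Mirrors the table above: URI scheme -> pip package that must be installed.
REQUIRED_DRIVER = {
    "": None,       # assumption: a bare path such as ./data is local
    "file": None,
    "s3": "s3fs",
    "abfss": "adlfs",
    "gs": "gcsfs",
}


def required_driver(uri):
    """Return the driver package needed for `uri`, or None for local paths."""
    scheme = urlparse(uri).scheme
    if scheme not in REQUIRED_DRIVER:
        raise ValueError(f"Unsupported protocol: {scheme}://")
    return REQUIRED_DRIVER[scheme]


s3_driver = required_driver("s3://my-raw-bucket/incoming/")
azure_driver = required_driver("abfss://container@account.dfs.core.windows.net/raw/")
local_driver = required_driver("./data/incoming")
```

A check like this lets a pipeline fail early with a clear message when a URI uses a protocol that is not in the table, instead of failing partway through discovery.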
📋 Schema Management
OpenAutoLoader implements Schema Locking:
- Bootstrap: On the first run, the library infers types from the first file found and saves a `schema_contract.json`.
- Enforcement: Every subsequent file is validated against this contract before processing.
- Type Safety: If a file exhibits Type Drift (e.g., an Integer column arriving as a String), the batch is aborted to maintain target table integrity.
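The bootstrap-then-enforce flow described above can be sketched with the standard library alone. This is an illustration of the idea, not the format `SchemaManager` actually writes: the function names, the `TypeDriftError` exception, the column-to-type-name JSON layout, and the decision to treat a missing column as drift are all assumptions.

```python
import json
import tempfile
from pathlib import Path


class TypeDriftError(Exception):
    """Raised when an incoming schema disagrees with the saved contract."""


def bootstrap_or_load(contract_path, observed_schema):
    """Save the contract on the first run; load the saved one afterwards."""
    path = Path(contract_path)
    if path.exists():
        return json.loads(path.read_text())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(observed_schema, indent=2))
    return dict(observed_schema)


def validate(contract, observed_schema):
    """Abort by raising if any contracted column is missing or changed type."""
    problems = []
    for column, expected in contract.items():
        actual = observed_schema.get(column)
        if actual != expected:
            problems.append(f"{column}: expected {expected}, got {actual}")
    if problems:
        raise TypeDriftError("; ".join(problems))


with tempfile.TemporaryDirectory() as tmp:
    contract_file = Path(tmp) / "contracts" / "schema_contract.json"
    # First run: the observed schema becomes the contract.
    contract = bootstrap_or_load(contract_file, {"id": "Int64", "name": "String"})
    # Later run: a matching file passes silently.
    validate(contract, {"id": "Int64", "name": "String"})
    # Later run: an Integer column arriving as a String aborts the batch.
    try:
        validate(contract, {"id": "String", "name": "String"})
        drift_detected = False
    except TypeDriftError:
        drift_detected = True
```

Raising an exception before any write happens is what keeps a drifting file from reaching the target table.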
📜 License
MIT License.