
A lightweight, Databricks-style autoloader using Polars and SQLite.


🚀 OpenAutoLoader

OpenAutoLoader is a high-performance, incremental data ingestion library for Python. It provides a "set and forget" experience for ingesting raw files from local storage, AWS S3, Azure Blob Storage, and Google Cloud Storage into Delta Lake tables, and it is built entirely on the Polars engine.


✨ Key Features

  • Incremental Loading: SQLite-backed checkpoint system ensures files are processed exactly once.
  • Multi-Cloud Support: Native support for s3://, abfss://, and gs:// protocols.
  • Schema Governance: Automatically bootstraps and enforces a strict JSON contract to prevent data poisoning.
  • Streaming Execution: Leverages Polars' sink_delta(streaming=True) to process datasets larger than RAM.
  • Audit-Ready: Automatically injects metadata: _batch_id, _processed_at, and _file_path.
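
To make the audit columns concrete, here is a minimal sketch of what injecting them could look like. It uses the standard library's csv module instead of Polars so that it runs without extra dependencies. The helper name with_audit_columns, the sample file name users.csv, and the ISO-8601 timestamp format are assumptions for illustration, not the library's implementation.

```python
import csv
import io
from datetime import datetime, timezone


def with_audit_columns(rows, batch_id, file_path):
    """Yield each row with the three audit columns appended (hypothetical helper)."""
    # One timestamp per file, so every row from the same file shares it.
    processed_at = datetime.now(timezone.utc).isoformat()
    for row in rows:
        yield {
            **row,
            "_batch_id": batch_id,
            "_processed_at": processed_at,
            "_file_path": file_path,
        }


# In-memory CSV standing in for a raw source file.
raw = io.StringIO("id,name\n1,Ada\n2,Grace\n")
audited = list(
    with_audit_columns(
        csv.DictReader(raw),
        batch_id="daily_batch_001",
        file_path="s3://my-raw-bucket/incoming/users.csv",  # hypothetical file
    )
)
```

Each element of audited keeps the original id and name fields and gains the three metadata fields, which is enough to trace any target row back to the batch and source file it came from.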

🛠️ Architecture

OpenAutoLoader uses a modular architecture designed for extensibility:

Component          Responsibility
-----------------  ---------------------------------------------------------------------------------
OpenAutoLoader     The Orchestrator. Coordinates discovery, validation, and execution.
FileScanner        Discovery. Uses fsspec to recursively find new files across cloud providers.
PolarsEngine       Execution. Handles LazyFrame transformations and high-speed Delta sinks.
SchemaManager      Governance. Serializes the data contract to JSON and validates new batches.
CheckPointManager  Persistence. Tracks processed file paths to prevent duplicate ingestion.
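
The persistence idea behind the checkpoint component can be sketched with the standard library's sqlite3 module. This is an illustration under assumptions, not the library's code: the class name SketchCheckpoint, the table name processed_files, and its columns are all hypothetical.

```python
import sqlite3


class SketchCheckpoint:
    """Hypothetical stand-in for a checkpoint store; not the library's class."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        # The primary key turns a repeated insert of the same path into a no-op.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS processed_files ("
            "file_path TEXT PRIMARY KEY, batch_id TEXT NOT NULL)"
        )

    def pending(self, discovered):
        """Return the discovered paths that have not been recorded yet."""
        rows = self.conn.execute("SELECT file_path FROM processed_files")
        seen = {row[0] for row in rows}
        return [path for path in discovered if path not in seen]

    def mark_done(self, paths, batch_id):
        """Record paths as processed inside a single transaction."""
        with self.conn:
            self.conn.executemany(
                "INSERT OR IGNORE INTO processed_files VALUES (?, ?)",
                [(path, batch_id) for path in paths],
            )


checkpoint = SketchCheckpoint()
first_run = checkpoint.pending(["a.csv", "b.csv"])
checkpoint.mark_done(first_run, batch_id="daily_batch_001")
# On the next run only the file that was not seen before is returned.
second_run = checkpoint.pending(["a.csv", "b.csv", "c.csv"])
```

Recording paths only after a batch has been written successfully is what keeps a failed run from silently skipping files on retry; the sketch leaves that ordering to the caller.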

🚀 Quick Start

1. Installation

# Core library
pip install open_auto_loader

# With Cloud Drivers (Optional)
pip install s3fs adlfs gcsfs

2. Cloud Ingestion (AWS S3 Example)

from open_auto_loader import OpenAutoLoader

# Define your cloud credentials
storage_options = {
    "aws_access_key_id": "YOUR_KEY",
    "aws_secret_access_key": "YOUR_SECRET",
    "aws_region": "ap-south-1"
}

loader = OpenAutoLoader(
    source="s3://my-raw-bucket/incoming/",
    target="s3://my-silver-bucket/tables/users",
    check_point="./metadata",       # Checkpoints stay local for speed
    schema_path="./contracts",      # Schemas stay local for governance
    format_type="csv",
    storage_options=storage_options
)

loader.run(batch_id="daily_batch_001")
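
Hard-coding keys as in the example above is fine for a first test, but credentials are usually better read from the environment. The helper below is a hedged sketch: s3_storage_options is a hypothetical name, and it only builds the dictionary using the same three keys as the example. Whether the library itself falls back to environment credentials is not stated here.

```python
import os


def s3_storage_options(env=os.environ):
    """Build a storage_options dict from the standard AWS environment variables."""
    mapping = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "aws_region": "AWS_DEFAULT_REGION",
    }
    # Fail early with a clear message instead of passing empty credentials on.
    missing = [var for var in mapping.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    return {key: env[var] for key, var in mapping.items()}
```

The result can be passed as storage_options=s3_storage_options() in place of the literal dictionary.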

☁️ Supported Cloud Protocols

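Provider    Protocol  Required Driver  storage_options keys
----------  --------  ---------------  ----------------------------------------------------
Local       file://   None             None
AWS S3      s3://     s3fs             aws_access_key_id, aws_secret_access_key, aws_region
Azure Blob  abfss://  adlfs            account_name, account_key
GCP GCS     gs://     gcsfs            token (path to JSON key)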
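
Because the cloud drivers are optional installs, it can help to check that the right one is present before starting a run. The following is a small sketch using only the standard library. The function names required_driver and driver_installed are hypothetical, and the mapping simply mirrors the table above.

```python
from importlib.util import find_spec
from typing import Optional
from urllib.parse import urlparse

# Scheme-to-driver mapping taken from the table; None means no extra driver.
DRIVERS = {"file": None, "s3": "s3fs", "abfss": "adlfs", "gs": "gcsfs"}


def required_driver(uri: str) -> Optional[str]:
    """Return the driver package a URI needs, or None for local paths."""
    # A path without a scheme (for example "./metadata") is treated as local.
    scheme = urlparse(uri).scheme or "file"
    if scheme not in DRIVERS:
        raise ValueError(f"unsupported protocol: {scheme}://")
    return DRIVERS[scheme]


def driver_installed(uri: str) -> bool:
    """True when no driver is needed or the driver package can be imported."""
    driver = required_driver(uri)
    return driver is None or find_spec(driver) is not None
```

For example, required_driver("s3://my-raw-bucket/incoming/") returns "s3fs", while driver_installed reports whether that package is importable in the current environment.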

📋 Schema Management

OpenAutoLoader implements Schema Locking:

  1. Bootstrap: On the first run, the library infers types from the first file found and saves a schema_contract.json.
  2. Enforcement: Every subsequent file is validated against this contract before processing.
  3. Type Safety: If a file exhibits Type Drift (e.g., an Integer column arriving as a String), the batch is aborted to maintain target table integrity.
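
The three steps above can be sketched in a few lines. This is a simplified illustration that infers Python type names from a single row instead of Polars dtypes from a file. The function names and the shape of the JSON contract are assumptions, not the library's actual format.

```python
import json
import tempfile
from pathlib import Path


def infer_schema(row):
    """Map each column to the Python type name of its value."""
    return {column: type(value).__name__ for column, value in row.items()}


def check_against_contract(row, contract_file):
    """Bootstrap the contract on first use, then enforce it on later calls."""
    observed = infer_schema(row)
    if not contract_file.exists():
        contract_file.write_text(json.dumps(observed, indent=2))
        return
    expected = json.loads(contract_file.read_text())
    if observed != expected:
        drifted = sorted(
            column
            for column in expected.keys() | observed.keys()
            if expected.get(column) != observed.get(column)
        )
        # Abort instead of writing data that disagrees with the contract.
        raise TypeError(f"schema drift in columns: {drifted}")


contract = Path(tempfile.mkdtemp()) / "schema_contract.json"
check_against_contract({"id": 1, "name": "Ada"}, contract)    # bootstraps
check_against_contract({"id": 2, "name": "Grace"}, contract)  # matches

try:
    # "id" arrives as a string, so this batch is rejected.
    check_against_contract({"id": "3", "name": "Alan"}, contract)
    drift_detected = False
except TypeError as err:
    drift_detected = True
    drift_message = str(err)
```

Note that the contract file is written only once; later rows never modify it, which is what makes the schema "locked".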

📜 License

MIT License.

