
Config-first, Spark-native ETL/ML engine with a modular plugin system


Ubunye Engine

Ubunye (oo-BOON-yeh) — isiZulu for "unity"

One framework. Every pipeline. Any environment.

Docs · Quickstart · Why Ubunye · Community


Hey there 👋

If you've ever spent more time wiring up Spark boilerplate than writing actual data logic — this is for you.

Ubunye Engine lets you define your entire data pipeline in a simple YAML config and a Python file. No Spark session setup. No connection management. No environment-specific scripts. Just tell the engine what you want, and it handles the how.

It works on your laptop, on YARN, on Kubernetes, on Databricks — same config, same code, everywhere.


Quickstart

pip install ubunye-engine

Scaffold your first pipeline:

ubunye init -d ./pipelines -u fraud_detection -p ingestion -t claim_etl

You'll get:

pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
        config.yaml               ← tell the engine what to do
        transformations.py        ← your logic goes here
        notebooks/
          claim_etl_dev.ipynb     ← interactive dev notebook

Open config.yaml and describe your pipeline:

MODEL: "etl"
VERSION: "0.1.0"

ENGINE:
  profiles:
    dev:
      spark_conf:
        spark.master: "local[*]"
    prod:
      spark_conf:
        spark.master: "yarn"

CONFIG:
  inputs:
    raw_claims:
      format: hive
      db_name: fraud_db
      tbl_name: raw_claims

  transform:
    type: noop

  outputs:
    bronze:
      format: delta
      table: main.fraud.bronze_claims
      mode: overwrite

Add your logic in transformations.py:

def transform(df):
    return df.filter("claim_amount > 0").dropDuplicates(["claim_id"])

Run it:

ubunye run -d ./pipelines -u fraud_detection -p ingestion -t claim_etl -m dev

Or from Python (Databricks notebooks, scripts):

import ubunye

outputs = ubunye.run_task(
    task_dir="./pipelines/fraud_detection/ingestion/claim_etl",
    mode="dev",
)

That's it. You just built and ran a pipeline. Same config runs in production — just swap the mode. The Python API auto-detects and reuses an active SparkSession on Databricks.
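To make the mode switch concrete, selecting a mode amounts to picking one branch of the ENGINE.profiles block. Here is a minimal sketch of that idea, assuming nothing about Ubunye's internals (the helper name `spark_conf_for` is hypothetical, and the config is inlined as a Python dict for brevity):

```python
# The ENGINE.profiles block from the quickstart config, as a plain dict.
config = {
    "ENGINE": {
        "profiles": {
            "dev": {"spark_conf": {"spark.master": "local[*]"}},
            "prod": {"spark_conf": {"spark.master": "yarn"}},
        }
    }
}

def spark_conf_for(config: dict, mode: str) -> dict:
    """Return the spark_conf for the requested profile (hypothetical helper)."""
    profiles = config["ENGINE"]["profiles"]
    if mode not in profiles:
        raise KeyError(f"unknown mode {mode!r}; expected one of {sorted(profiles)}")
    return profiles[mode]["spark_conf"]

print(spark_conf_for(config, "dev"))  # {'spark.master': 'local[*]'}
```

Swapping `"dev"` for `"prod"` is the entire difference between a laptop run and a cluster run; everything else in the config stays untouched.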


Why Ubunye

We've all been there. You join a new team, open the repo, and find five Spark projects — each structured differently, each with its own way of handling configs, credentials, and deployment. One uses a JSON file, another has everything hardcoded, a third has a 300-line bash script that "Dave wrote and it just works."

Ubunye says: let's agree on how pipelines look. One folder structure. One config format. One CLI. Whether you're building an ETL job, a feature pipeline, or an ML training run.

Without Ubunye                          With Ubunye
Every project looks different           One standard: use_case / pipeline / task
Spark setup scattered everywhere        Engine handles it from YAML config
Credentials hardcoded or inconsistent   {{ env.DB_PASSWORD }} everywhere
"Works on my machine"                   Same config runs local, YARN, K8s, Databricks
New teammate needs a week to onboard    ubunye init and they're running in minutes

How It Works

Three simple ideas:

Config over code. Your pipeline is a YAML file. Inputs, outputs, Spark settings, orchestration — all declared, not coded.

Plugins for everything. The format field in your config selects a connector — hive, jdbc, delta, s3, unity, rest_api, and more. Need a new data source? Add a plugin.
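This kind of format-to-connector lookup is commonly built as a small decorator-based registry. The names below (`CONNECTORS`, `register`, `HiveConnector`) are illustrative assumptions, not Ubunye's actual plugin API:

```python
# Hypothetical connector registry keyed by the config's `format` field.
CONNECTORS: dict[str, type] = {}

def register(fmt: str):
    """Class decorator that registers a connector under a format name."""
    def wrap(cls):
        CONNECTORS[fmt] = cls
        return cls
    return wrap

@register("hive")
class HiveConnector:
    def read(self, conf: dict):
        ...  # e.g. spark.table(f"{conf['db_name']}.{conf['tbl_name']}")

def connector_for(fmt: str):
    """Resolve a connector instance for a config's `format` value."""
    try:
        return CONNECTORS[fmt]()
    except KeyError:
        raise ValueError(f"no connector registered for format {fmt!r}")
```

With a registry like this, adding a new data source is just defining one class and registering it; nothing in the engine core changes.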

Folders as architecture. Pipelines are organized as project / use_case / pipeline / task. The CLI uses this for scaffolding, execution, and discovery:

pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
      policy_etl/
    feature_engineering/
      claim_features/
    risk_scoring/
      train_model/
      score_claims/
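Because the convention maps directly onto directory paths, resolving a task is just joining segments. A minimal sketch with a hypothetical helper (not the engine's actual code):

```python
from pathlib import Path

def resolve_task_dir(base: Path, use_case: str, pipeline: str, task: str) -> Path:
    """Map the use_case / pipeline / task convention onto a directory."""
    return base / use_case / pipeline / task

task_dir = resolve_task_dir(Path("pipelines"), "fraud_detection", "ingestion", "claim_etl")
print(task_dir.as_posix())  # pipelines/fraud_detection/ingestion/claim_etl
```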

What Can You Build With It?

ETL pipelines — move data between Hive, JDBC databases, Delta Lake, S3, REST APIs. Config-driven, scheduled, reproducible.

ML training and inference — define your model behind a simple contract, let the engine handle versioning, storage, and deployment.

RAG document pipelines — ingest documents, extract text, chunk, compute embeddings, load into a vector store. All from YAML.

Feature engineering — compute features once, write to a shared table, reuse across use cases.

Data drift detection — monitor feature distributions between runs, flag when things shift.

Check out the Patterns section in our docs for full examples.
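To give a flavor of what a drift check involves, here is a population stability index (PSI) between two binned feature distributions. PSI is a standard drift metric, shown here as a generic illustration rather than Ubunye's specific implementation:

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population stability index between two binned count vectors.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    total_e, total_a = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        pe = max(e / total_e, eps)  # clamp to avoid log(0)
        pa = max(a / total_a, eps)
        score += (pa - pe) * math.log(pa / pe)
    return score

print(psi([100, 200, 300], [100, 200, 300]))  # 0.0 (identical distributions)
```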


Connectors

Format     Description
hive       Apache Hive tables
jdbc       PostgreSQL, MySQL, Teradata, and more
delta      Delta Lake (standalone or Unity Catalog)
s3         S3, HDFS, or local filesystem
unity      Databricks Unity Catalog
binary     Binary files (images, PDFs)
rest_api   REST APIs with pagination and auth

Want to add one? See the plugin guide.
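The pagination that the rest_api connector description mentions usually reduces to a loop like the following. This is a generic sketch, not the connector's real code; `fetch_page` stands in for an HTTP call:

```python
from typing import Callable, Iterator

def paginate(fetch_page: Callable[[int], list], start: int = 0) -> Iterator:
    """Yield records page by page until the API returns an empty page.
    `fetch_page(n)` stands in for an HTTP call like GET /items?page=n."""
    page = start
    while True:
        items = fetch_page(page)
        if not items:
            return
        yield from items
        page += 1

# Fake three-page API for demonstration:
pages = {0: [1, 2], 1: [3], 2: []}
print(list(paginate(pages.get)))  # [1, 2, 3]
```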


Run Anywhere

Same pipeline, no changes:

Environment     How
Local           spark.master: "local[*]" in config
YARN / Hadoop   spark.master: "yarn" in config
Kubernetes      spark.master: "k8s://..." in config
Databricks      Python API (ubunye.run_task()) or Asset Bundles
AWS EMR         Via EMR Steps

Jinja Templating

All config values support Jinja2:

# Environment variables
password: "{{ env.DB_PASSWORD }}"

# CLI variables (--var ds=2025-01-01)
path: "s3a://bucket/{{ ds }}/"

# Defaults
path: "s3a://bucket/{{ ds | default('2025-01-01') }}/"
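Assuming standard Jinja2 semantics, the values above render as follows (this uses the jinja2 library directly; the engine's own rendering entry point may differ):

```python
from jinja2 import Template

# {{ env.DB_PASSWORD }}: environment variables are exposed under `env`.
rendered = Template("{{ env.DB_PASSWORD }}").render(env={"DB_PASSWORD": "s3cret"})
print(rendered)  # s3cret

# {{ ds | default('2025-01-01') }}: falls back when --var ds=... is not given.
path = Template("s3a://bucket/{{ ds | default('2025-01-01') }}/").render()
print(path)  # s3a://bucket/2025-01-01/
```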

CLI

ubunye init     -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # scaffold
ubunye validate -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # check config
ubunye plan     -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # preview plan
ubunye run      -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # execute
ubunye test run -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # test mode
ubunye lineage list -d ./pipelines -u <use_case> -p <pipeline> -t <task>  # run history
ubunye models list -u <use_case> -m <model> -s <store>                 # model versions

Python API

import ubunye

# Run from Databricks or any Python environment
outputs = ubunye.run_task(task_dir="./pipelines/...", mode="DEV", dt="2024-06-01")

# Multiple tasks
results = ubunye.run_pipeline(
    usecase_dir="./pipelines", usecase="fraud", package="etl",
    tasks=["claim_etl", "features"], mode="DEV",
)

What Ubunye Is Not

It's not an agent framework — use LangChain or CrewAI for that. It's not an orchestrator — use Airflow, Prefect, or Dagster. It's not a compute engine — it runs on Spark.

Ubunye is the standardization layer between your data sources and your applications. It makes the plumbing boring so you can focus on what matters.


Roadmap

  • Config-driven ETL pipelines
  • Multi-environment profiles
  • Jinja templating
  • Plugin-based connectors
  • CLI scaffolding and execution
  • Pydantic config validation
  • ML model contract
  • Model registry with versioning
  • Lineage tracking
  • Python API for Databricks
  • Databricks Asset Bundles deployment
  • Dev notebook scaffolding
  • Data drift detection
  • ubunye deploy CLI command

Get Involved

We'd love your help. Whether it's a new connector, a bug fix, a typo, or just telling us what you're building — all contributions matter.


License

MIT License


Built with 🇿🇦 by Ubunye AI Ecosystems
