
Config-first, Spark-native ETL/ML engine with a modular plugin system


Ubunye Engine

Ubunye (oo-BOON-yeh) — isiZulu for "unity"

One framework. Every pipeline. Any environment.

Docs · Quickstart · Why Ubunye · Community


Hey there 👋

If you've ever spent more time wiring up Spark boilerplate than writing actual data logic — this is for you.

Ubunye Engine lets you define your entire data pipeline in a simple YAML config and a Python file. No Spark session setup. No connection management. No environment-specific scripts. Just tell the engine what you want, and it handles the how.

It works on your laptop, on YARN, on Kubernetes, on Databricks — same config, same code, everywhere.


Quickstart

pip install ubunye-engine

Scaffold your first pipeline:

ubunye init -d ./pipelines -u fraud_detection -p ingestion -t claim_etl

You'll get:

pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
        config.yaml               ← tell the engine what to do
        transformations.py        ← your logic goes here
        notebooks/
          claim_etl_dev.ipynb     ← interactive dev notebook

Open config.yaml and describe your pipeline:

MODEL: "etl"
VERSION: "0.1.0"

ENGINE:
  profiles:
    dev:
      spark_conf:
        spark.master: "local[*]"
    prod:
      spark_conf:
        spark.master: "yarn"

CONFIG:
  inputs:
    raw_claims:
      format: hive
      db_name: fraud_db
      tbl_name: raw_claims

  transform:
    type: noop

  outputs:
    bronze:
      format: delta
      table: main.fraud.bronze_claims
      mode: overwrite

Add your logic in transformations.py:

def transform(df):
    return df.filter("claim_amount > 0").dropDuplicates(["claim_id"])

Run it:

ubunye run -d ./pipelines -u fraud_detection -p ingestion -t claim_etl -m dev

Or from Python (Databricks notebooks, scripts):

import ubunye

outputs = ubunye.run_task(
    task_dir="./pipelines/fraud_detection/ingestion/claim_etl",
    mode="dev",
)

That's it. You just built and ran a pipeline. Same config runs in production — just swap the mode. The Python API auto-detects and reuses an active SparkSession on Databricks.
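To make the mode switch concrete, selecting a mode amounts to picking one branch of the ENGINE.profiles block. Here is a minimal sketch of that idea, assuming nothing about Ubunye's internals (the helper name `spark_conf_for` is hypothetical, and the config is inlined as a Python dict for brevity):

```python
# The ENGINE.profiles block from the quickstart config, as a plain dict.
config = {
    "ENGINE": {
        "profiles": {
            "dev": {"spark_conf": {"spark.master": "local[*]"}},
            "prod": {"spark_conf": {"spark.master": "yarn"}},
        }
    }
}

def spark_conf_for(config: dict, mode: str) -> dict:
    """Return the spark_conf for the requested profile (hypothetical helper)."""
    profiles = config["ENGINE"]["profiles"]
    if mode not in profiles:
        raise KeyError(f"unknown mode {mode!r}; expected one of {sorted(profiles)}")
    return profiles[mode]["spark_conf"]

print(spark_conf_for(config, "dev"))  # {'spark.master': 'local[*]'}
```

Swapping `"dev"` for `"prod"` is the entire difference between a laptop run and a cluster run; everything else in the config stays untouched.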


Why Ubunye

We've all been there. You join a new team, open the repo, and find five Spark projects — each structured differently, each with its own way of handling configs, credentials, and deployment. One uses a JSON file, another has everything hardcoded, a third has a 300-line bash script that "Dave wrote and it just works."

Ubunye says: let's agree on how pipelines look. One folder structure. One config format. One CLI. Whether you're building an ETL job, a feature pipeline, or an ML training run.

Without Ubunye                          With Ubunye
Every project looks different           One standard: use_case / pipeline / task
Spark setup scattered everywhere        Engine handles it from YAML config
Credentials hardcoded or inconsistent   {{ env.DB_PASSWORD }} everywhere
"Works on my machine"                   Same config runs local, YARN, K8s, Databricks
New teammate needs a week to onboard    ubunye init and they're running in minutes

How It Works

Three simple ideas:

Config over code. Your pipeline is a YAML file. Inputs, outputs, Spark settings, orchestration — all declared, not coded.

Plugins for everything. The format field in your config selects a connector — hive, jdbc, delta, s3, unity, rest_api, and more. Need a new data source? Add a plugin.
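This kind of format-to-connector lookup is commonly built as a small decorator-based registry. The names below (`CONNECTORS`, `register`, `HiveConnector`) are illustrative assumptions, not Ubunye's actual plugin API:

```python
# Hypothetical connector registry keyed by the config's `format` field.
CONNECTORS: dict[str, type] = {}

def register(fmt: str):
    """Class decorator that registers a connector under a format name."""
    def wrap(cls):
        CONNECTORS[fmt] = cls
        return cls
    return wrap

@register("hive")
class HiveConnector:
    def read(self, conf: dict):
        ...  # e.g. spark.table(f"{conf['db_name']}.{conf['tbl_name']}")

def connector_for(fmt: str):
    """Resolve a connector instance for a config's `format` value."""
    try:
        return CONNECTORS[fmt]()
    except KeyError:
        raise ValueError(f"no connector registered for format {fmt!r}")
```

With a registry like this, adding a new data source is just defining one class and registering it; nothing in the engine core changes.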

Folders as architecture. Pipelines are organized as project / use_case / pipeline / task. The CLI uses this for scaffolding, execution, and discovery:

pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
      policy_etl/
    feature_engineering/
      claim_features/
    risk_scoring/
      train_model/
      score_claims/
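Because the convention maps directly onto directory paths, resolving a task is just joining segments. A minimal sketch with a hypothetical helper (not the engine's actual code):

```python
from pathlib import Path

def resolve_task_dir(base: Path, use_case: str, pipeline: str, task: str) -> Path:
    """Map the use_case / pipeline / task convention onto a directory."""
    return base / use_case / pipeline / task

task_dir = resolve_task_dir(Path("pipelines"), "fraud_detection", "ingestion", "claim_etl")
print(task_dir.as_posix())  # pipelines/fraud_detection/ingestion/claim_etl
```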

What Can You Build With It?

ETL pipelines — move data between Hive, JDBC databases, Delta Lake, S3, REST APIs. Config-driven, scheduled, reproducible.

ML training and inference — define your model behind a simple contract, let the engine handle versioning, storage, and deployment.

RAG document pipelines — ingest documents, extract text, chunk, compute embeddings, load into a vector store. All from YAML.

Feature engineering — compute features once, write to a shared table, reuse across use cases.

Data drift detection — monitor feature distributions between runs, flag when things shift.

Check out the Patterns section in our docs for full examples.
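To give a flavor of what a drift check involves, here is a population stability index (PSI) between two binned feature distributions. PSI is a standard drift metric, shown here as a generic illustration rather than Ubunye's specific implementation:

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population stability index between two binned count vectors.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    total_e, total_a = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        pe = max(e / total_e, eps)  # clamp to avoid log(0)
        pa = max(a / total_a, eps)
        score += (pa - pe) * math.log(pa / pe)
    return score

print(psi([100, 200, 300], [100, 200, 300]))  # 0.0 (identical distributions)
```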


Connectors

Format     Description
hive       Apache Hive tables
jdbc       PostgreSQL, MySQL, Teradata, and more
delta      Delta Lake (standalone or Unity Catalog)
s3         S3, HDFS, or local filesystem
unity      Databricks Unity Catalog
binary     Binary files (images, PDFs)
rest_api   REST APIs with pagination and auth

Want to add one? See the plugin guide.
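The pagination that the rest_api connector description mentions usually reduces to a loop like the following. This is a generic sketch, not the connector's real code; `fetch_page` stands in for an HTTP call:

```python
from typing import Callable, Iterator

def paginate(fetch_page: Callable[[int], list], start: int = 0) -> Iterator:
    """Yield records page by page until the API returns an empty page.
    `fetch_page(n)` stands in for an HTTP call like GET /items?page=n."""
    page = start
    while True:
        items = fetch_page(page)
        if not items:
            return
        yield from items
        page += 1

# Fake three-page API for demonstration:
pages = {0: [1, 2], 1: [3], 2: []}
print(list(paginate(pages.get)))  # [1, 2, 3]
```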


Run Anywhere

Same pipeline, no changes:

Environment     How
Local           spark.master: "local[*]" in config
YARN / Hadoop   spark.master: "yarn" in config
Kubernetes      spark.master: "k8s://..." in config
Databricks      Python API (ubunye.run_task()) or Asset Bundles
AWS EMR         Via EMR Steps

Jinja Templating

All config values support Jinja2:

# Environment variables
password: "{{ env.DB_PASSWORD }}"

# CLI variables (--var ds=2025-01-01)
path: "s3a://bucket/{{ ds }}/"

# Defaults
path: "s3a://bucket/{{ ds | default('2025-01-01') }}/"
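Assuming standard Jinja2 semantics, the values above render as follows (this uses the jinja2 library directly; the engine's own rendering entry point may differ):

```python
from jinja2 import Template

# {{ env.DB_PASSWORD }}: environment variables are exposed under `env`.
rendered = Template("{{ env.DB_PASSWORD }}").render(env={"DB_PASSWORD": "s3cret"})
print(rendered)  # s3cret

# {{ ds | default('2025-01-01') }}: falls back when --var ds=... is not given.
path = Template("s3a://bucket/{{ ds | default('2025-01-01') }}/").render()
print(path)  # s3a://bucket/2025-01-01/
```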

CLI

ubunye init     -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # scaffold
ubunye validate -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # check config
ubunye plan     -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # preview plan
ubunye run      -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # execute
ubunye test run -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # test mode
ubunye lineage list -d ./pipelines -u <use_case> -p <pipeline> -t <task>  # run history
ubunye models list -u <use_case> -m <model> -s <store>                 # model versions

Python API

import ubunye

# Run from Databricks or any Python environment
outputs = ubunye.run_task(task_dir="./pipelines/...", mode="DEV", dt="2024-06-01")

# Multiple tasks
results = ubunye.run_pipeline(
    usecase_dir="./pipelines", usecase="fraud", package="etl",
    tasks=["claim_etl", "features"], mode="DEV",
)

What Ubunye Is Not

It's not an agent framework — use LangChain or CrewAI for that. It's not an orchestrator — use Airflow, Prefect, or Dagster. It's not a compute engine — it runs on Spark.

Ubunye is the standardization layer between your data sources and your applications. It makes the plumbing boring so you can focus on what matters.


Roadmap

  • Config-driven ETL pipelines
  • Multi-environment profiles
  • Jinja templating
  • Plugin-based connectors
  • CLI scaffolding and execution
  • Pydantic config validation
  • ML model contract
  • Model registry with versioning
  • Lineage tracking
  • Python API for Databricks
  • Databricks Asset Bundles deployment
  • Dev notebook scaffolding
  • Data drift detection
  • ubunye deploy CLI command

Get Involved

We'd love your help. Whether it's a new connector, a bug fix, a typo, or just telling us what you're building — all contributions matter.


License

MIT License


Built with 🇿🇦 by Ubunye AI Ecosystems
