
Config-first, Spark-native ETL/ML engine with a modular plugin system


Ubunye Engine

Ubunye (oo-BOON-yeh) — isiZulu for "unity"

One framework. Every pipeline. Any environment.

Docs · Quickstart · Why Ubunye · Community


Hey there 👋

A data pipeline is a program that moves data from one place to another — a database to a file, a REST API to a data warehouse — and usually reshapes the data along the way. Building one from scratch is mostly plumbing: wire up the connection, juggle credentials, learn a framework's quirks, write the same "read → transform → write" scaffold for the tenth time this year. It's a lot of glue code standing between you and the three lines that actually matter.

Ubunye Engine writes that plumbing for you. You describe the pipeline in a short YAML file and put your transformation in a normal Python class. Ubunye takes care of connections, the compute engine (Apache Spark), and the read/write loop.

Same pipeline runs on your laptop today and on a production cluster tomorrow — no code changes.


Quickstart

Install it:

pip install ubunye-engine

Scaffold a new pipeline folder:

ubunye init -d ./pipelines -u demo -p starter -t filter_adults

You get:

pipelines/demo/starter/filter_adults/
  config.yaml              ← describes the pipeline (inputs, outputs, settings)
  transformations.py       ← your code goes here
  notebooks/               ← an interactive dev notebook for exploring

ubunye init gives you a working starting point you can customise. For a minimal run-it-on-your-laptop example, edit config.yaml to read a local CSV and write Parquet:

CONFIG:
  inputs:
    people:
      format: s3              # generic file reader; "file://" paths work too
      file_format: csv
      path: "file:///tmp/people.csv"
      options:
        header: "true"
        inferSchema: "true"

  outputs:
    adults:
      format: s3
      file_format: parquet
      path: "file:///tmp/adults/"
      mode: overwrite

Then open transformations.py and write your logic:

from typing import Any, Dict
from ubunye.core.interfaces import Task


class FilterAdults(Task):
    """Keep only rows where age is 18 or older."""

    def transform(self, sources: Dict[str, Any]) -> Dict[str, Any]:
        people = sources["people"]
        return {"adults": people.filter("age >= 18")}

Two things to notice:

  • sources["people"] matches the inputs.people name from the YAML.
  • The return key "adults" matches the outputs.adults name.
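If you don't already have a file at /tmp/people.csv, create a tiny one first (any CSV with an age column will do; the names here are just sample data):

from pathlib import Path

# Minimal sample input for the quickstart; the header row matches header: "true" in the config.
Path("/tmp/people.csv").write_text(
    "name,age\n"
    "Thandi,34\n"
    "Sipho,12\n"
    "Lerato,29\n"
)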

Run it:

ubunye run -d ./pipelines -u demo -p starter -t filter_adults

That's the whole loop. Ubunye reads /tmp/people.csv, hands you a Spark DataFrame, and writes whatever you return to /tmp/adults/.

Running on Databricks? Call it from a notebook instead:

import ubunye
outputs = ubunye.run_task(task_dir="./pipelines/demo/starter/filter_adults")

Ubunye detects Databricks' active Spark session and reuses it — same pipeline, no code change.

Want to see a realistic end-to-end example — Kaggle Titanic CSV → survival-rate Parquet, with tests and CI? See examples/production/titanic_local/.


Why Ubunye

We've all been there. You join a new team, open the repo, and find five Spark projects — each structured differently, each with its own way of handling configs, credentials, and deployment. One uses a JSON file, another has everything hardcoded, a third has a 300-line bash script that "Dave wrote and it just works."

Ubunye says: let's agree on how pipelines look. One folder structure. One config format. One CLI. Whether you're building an ETL job, a feature pipeline, or an ML training run.

Without Ubunye                         | With Ubunye
Every project looks different          | One standard: use_case / pipeline / task
Spark setup scattered everywhere       | Engine handles it from YAML config
Credentials hardcoded or inconsistent  | {{ env.DB_PASSWORD }} everywhere
"Works on my machine"                  | Same config runs local, YARN, K8s, Databricks
New teammate needs a week to onboard   | ubunye init and they're running in minutes

How It Works

Three simple ideas:

Config over code. Your pipeline is a YAML file. Inputs, outputs, Spark settings, scheduling — all declared, not coded.

Plugins for everything. The format field in your config picks which connector to use. A connector is a small Python class that knows how to read from or write to one specific place (a database, a REST API, a cloud bucket). Built-ins include hive, jdbc, delta, s3, unity, and rest_api. Need a new data source? Write one and register it — Ubunye discovers plugins automatically.
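To give a feel for the shape of a plugin, here is a rough sketch of a custom connector. Everything below (the class name, the method names, the format_name attribute) is an illustrative assumption, not Ubunye's actual interface; see the plugin guide for the real contract.

from typing import Any, Dict

# Hypothetical sketch only: the names and registration mechanism are assumptions.
# In a real plugin you would subclass Ubunye's connector base class instead.
class RedisConnector:
    """Illustrative connector that would read from / write to a Redis store."""

    format_name = "redis"   # assumed: the value users would put in the `format:` field

    def read(self, spark, config: Dict[str, Any]):
        # Connect using values from the task's YAML (host, key, ...) and
        # return a Spark DataFrame for the engine to pass into your Task.
        ...

    def write(self, df, config: Dict[str, Any]) -> None:
        # Persist the DataFrame the Task returned back to the target system.
        ...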

Folders as architecture. Pipelines are organized as project / use_case / pipeline / task. The CLI uses this structure for scaffolding, execution, and discovery:

pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
      policy_etl/
    feature_engineering/
      claim_features/
    risk_scoring/
      train_model/
      score_claims/
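With that layout, the CLI addresses any task by its folder path. The scoring task in the tree above, for example, would run as:

ubunye run -d ./pipelines -u fraud_detection -p risk_scoring -t score_claims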

What Can You Build With It

ETL pipelines — move data between Hive, JDBC databases, Delta Lake, S3, REST APIs. Config-driven, scheduled, reproducible.

ML training and inference — define your model behind a simple contract, let the engine handle versioning, storage, and deployment.

RAG document pipelines — ingest documents, extract text, chunk, compute embeddings, load into a vector store. All from YAML.

Feature engineering — compute features once, write to a shared table, reuse across use cases.

Data drift detection — monitor feature distributions between runs, flag when things shift.

Check out the Patterns section in our docs for full examples.


Examples

Six fully worked pipelines live in examples/production/. Each one is self-contained — its own README, tests, and CI workflow — so you can copy a folder, tweak the config, and have something running in minutes.

Example                        | What it shows                                                                                                   | Where it runs
titanic_local/                 | Simplest end-to-end: Kaggle Titanic CSV → survival rate by passenger class, saved as Parquet. Start here.       | Your laptop
titanic_databricks/            | Same business logic, same file — just a different config. Shows how little changes when you move to the cloud. | Databricks Community Edition
titanic_multitask_local/       | Two tasks chained: one cleans the data, the next summarises it. Shows ubunye run -t task1 -t task2.             | Your laptop
titanic_multitask_databricks/  | Same chain, running on Databricks with Unity Catalog tables instead of local Parquet.                           | Databricks
titanic_ml_databricks/         | The full ML lifecycle: train a classifier, log to MLflow, promote through the model registry, score new rows.   | Databricks
jhb_weather_databricks/        | REST API ingestion (Open-Meteo, no auth) → Unity Catalog Delta table, on a schedule.                            | Databricks

Not sure which one to open? Read examples/production/README.md — it walks through picking a runtime and what the Community Edition / paid workspace differences look like.


Connectors

Format    | Description
hive      | Apache Hive tables
jdbc      | PostgreSQL, MySQL, Teradata, and more
delta     | Delta Lake (standalone or Unity Catalog)
s3        | S3, HDFS, or local filesystem
unity     | Databricks Unity Catalog
binary    | Binary files (images, PDFs)
rest_api  | REST APIs with pagination and auth

Want to add one? See the plugin guide.


Run Anywhere

The same pipeline runs on every Spark-compatible environment. You only change the spark.master setting — the rest is identical:

Where you run it              | What to set
Your laptop                   | spark.master: "local[*]"
Hadoop / YARN cluster         | spark.master: "yarn"
Kubernetes                    | spark.master: "k8s://..."
Databricks notebooks or jobs  | Call ubunye.run_task() from Python — Ubunye picks up the active session
AWS EMR                       | Runs as an EMR Step

Don't recognise some of these? That's fine — you only need one. If you're starting out, local[*] runs Spark on your own machine with no setup.
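As a sketch of what the switch looks like, it is a one-line change in the config. The nesting below (a spark block inside CONFIG) is an assumption about where Spark settings live; check the docs for the exact keys your version expects:

CONFIG:
  spark:                    # assumed location for Spark settings
    master: "local[*]"      # laptop today
    # master: "yarn"        # cluster tomorrow; nothing else in the file changes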


Jinja Templating

Anywhere a string appears in your YAML, you can plug in a variable using {{ … }} syntax (this is called Jinja templating). That's how you keep secrets out of your config, change paths per environment, and inject the run date from the CLI:

# Environment variables
password: "{{ env.DB_PASSWORD }}"

# CLI variables (--var ds=2025-01-01)
path: "s3a://bucket/{{ ds }}/"

# Defaults
path: "s3a://bucket/{{ ds | default('2025-01-01') }}/"

CLI

ubunye init     -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # scaffold
ubunye validate -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # check config
ubunye plan     -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # preview plan
ubunye run      -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # execute
ubunye test run -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # test mode
ubunye lineage list -d ./pipelines -u <use_case> -p <pipeline> -t <task>  # run history
ubunye models list -u <use_case> -m <model> -s <store>                 # model versions

Python API

import ubunye

# Run from Databricks or any Python environment
outputs = ubunye.run_task(task_dir="./pipelines/...", mode="DEV", dt="2024-06-01")

# Multiple tasks
results = ubunye.run_pipeline(
    usecase_dir="./pipelines", usecase="fraud", package="etl",
    tasks=["claim_etl", "features"], mode="DEV",
)

What Ubunye Is Not

It's not an agent framework — use LangChain or CrewAI for that. It's not an orchestrator — use Airflow, Prefect, or Dagster. It's not a compute engine — it runs on Spark.

Ubunye is the standardization layer between your data sources and your applications. It makes the plumbing boring so you can focus on what matters.


Roadmap

  • Config-driven ETL pipelines
  • Multi-environment profiles
  • Jinja templating
  • Plugin-based connectors
  • CLI scaffolding and execution
  • Pydantic config validation
  • ML model contract
  • Model registry with versioning
  • Lineage tracking
  • Python API for Databricks
  • Databricks Asset Bundles deployment
  • Dev notebook scaffolding
  • Data drift detection
  • ubunye deploy CLI command

Get Involved

We'd love your help. Whether it's a new connector, a bug fix, a typo, or just telling us what you're building — all contributions matter.


License

MIT License


Built with 🇿🇦 by Ubunye AI Ecosystems
