Config-first, Spark-native ETL/ML engine with a modular plugin system
Ubunye (oo-BOON-yeh) — isiZulu for "unity"
One framework. Every pipeline. Any environment.
Docs • Quickstart • Why Ubunye • Community
Hey there 👋
A data pipeline is a program that moves data from one place to another — a database to a file, a REST API to a data warehouse — and usually reshapes the data along the way. Building one from scratch is mostly plumbing: wire up the connection, juggle credentials, learn a framework's quirks, write the same "read → transform → write" scaffold for the tenth time this year. It's a lot of glue code standing between you and the three lines that actually matter.
Ubunye Engine writes that plumbing for you. You describe the pipeline in a short YAML file and put your transformation in a normal Python class. Ubunye takes care of connections, the compute engine (Apache Spark), and the read/write loop.
Same pipeline runs on your laptop today and on a production cluster tomorrow — no code changes.
Quickstart
Install it:
pip install ubunye-engine
Scaffold a new pipeline folder:
ubunye init -d ./pipelines -u demo -p starter -t filter_adults
You get:
pipelines/demo/starter/filter_adults/
  config.yaml          ← describes the pipeline (inputs, outputs, settings)
  transformations.py   ← your code goes here
  notebooks/           ← an interactive dev notebook for exploring
ubunye init gives you a working starting point you can customise. For a minimal run-it-on-your-laptop example, edit config.yaml to read a local CSV and write Parquet:
CONFIG:
  inputs:
    people:
      format: s3                      # generic file reader; "file://" paths work too
      file_format: csv
      path: "file:///tmp/people.csv"
      options:
        header: "true"
        inferSchema: "true"
  outputs:
    adults:
      format: s3
      file_format: parquet
      path: "file:///tmp/adults/"
      mode: overwrite
Then open transformations.py and write your logic:
from typing import Any, Dict

from ubunye.core.interfaces import Task


class FilterAdults(Task):
    """Keep only rows where age is 18 or older."""

    def transform(self, sources: Dict[str, Any]) -> Dict[str, Any]:
        people = sources["people"]
        return {"adults": people.filter("age >= 18")}
Two things to notice:
- `sources["people"]` matches the `inputs.people` name from the YAML.
- The return key `"adults"` matches the `outputs.adults` name.
Run it:
ubunye run -d ./pipelines -u demo -p starter -t filter_adults
That's the whole loop. Ubunye reads /tmp/people.csv, hands you a Spark DataFrame, and writes whatever you return to /tmp/adults/.
Running on Databricks? Call it from a notebook instead:
import ubunye
outputs = ubunye.run_task(task_dir="./pipelines/demo/starter/filter_adults")
Ubunye detects Databricks' active Spark session and reuses it — same pipeline, no code change.
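Because run_task returns the outputs your transform produced, you can keep working with them in the same notebook. Assuming they come back as a dict keyed by the output names from the config (an assumption; check the docs for the exact return type), that looks like:

# "adults" is the key FilterAdults.transform returned; the dict shape is assumed here.
adults_df = outputs["adults"]
adults_df.show(5)  # peek at the first rows written to the "adults" output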
Want to see a realistic end-to-end example — Kaggle Titanic CSV → survival-rate Parquet, with tests and CI? See examples/production/titanic_local/.
Why Ubunye
We've all been there. You join a new team, open the repo, and find five Spark projects — each structured differently, each with its own way of handling configs, credentials, and deployment. One uses a JSON file, another has everything hardcoded, a third has a 300-line bash script that "Dave wrote and it just works."
Ubunye says: let's agree on how pipelines look. One folder structure. One config format. One CLI. Whether you're building an ETL job, a feature pipeline, or an ML training run.
| Without Ubunye | With Ubunye |
|---|---|
| Every project looks different | One standard: use_case / pipeline / task |
| Spark setup scattered everywhere | Engine handles it from YAML config |
| Credentials hardcoded or inconsistent | {{ env.DB_PASSWORD }} everywhere |
| "Works on my machine" | Same config runs local, YARN, K8s, Databricks |
| New teammate needs a week to onboard | ubunye init and they're running in minutes |
How It Works
Three simple ideas:
Config over code. Your pipeline is a YAML file. Inputs, outputs, Spark settings, scheduling — all declared, not coded.
Plugins for everything. The format field in your config picks which connector to use. A connector is a small Python class that knows how to read from or write to one specific place (a database, a REST API, a cloud bucket). Built-ins include hive, jdbc, delta, s3, unity, and rest_api. Need a new data source? Write one and register it — Ubunye discovers plugins automatically.
Folders as architecture. Pipelines are organized as project / use_case / pipeline / task. The CLI uses this structure for scaffolding, execution, and discovery:
pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
      policy_etl/
    feature_engineering/
      claim_features/
    risk_scoring/
      train_model/
      score_claims/
What Can You Build With It
ETL pipelines — move data between Hive, JDBC databases, Delta Lake, S3, REST APIs. Config-driven, scheduled, reproducible.
ML training and inference — define your model behind a simple contract, let the engine handle versioning, storage, and deployment.
RAG document pipelines — ingest documents, extract text, chunk, compute embeddings, load into a vector store. All from YAML.
Feature engineering — compute features once, write to a shared table, reuse across use cases.
Data drift detection — monitor feature distributions between runs, flag when things shift.
Check out the Patterns section in our docs for full examples.
Examples
Six fully worked pipelines live in examples/production/. Each one is self-contained — its own README, tests, and CI workflow — so you can copy a folder, tweak the config, and have something running in minutes.
| Example | What it shows | Where it runs |
|---|---|---|
| `titanic_local/` | Simplest end-to-end: Kaggle Titanic CSV → survival rate by passenger class, saved as Parquet. Start here. | Your laptop |
| `titanic_databricks/` | Same business logic, same file — just a different config. Shows how little changes when you move to the cloud. | Databricks Community Edition |
| `titanic_multitask_local/` | Two tasks chained: one cleans the data, the next summarises it. Shows `ubunye run -t task1 -t task2`. | Your laptop |
| `titanic_multitask_databricks/` | Same chain, running on Databricks with Unity Catalog tables instead of local Parquet. | Databricks |
| `titanic_ml_databricks/` | The full ML lifecycle: train a classifier, log to MLflow, promote through the model registry, score new rows. | Databricks |
| `jhb_weather_databricks/` | REST API ingestion (Open-Meteo, no auth) → Unity Catalog Delta table, on a schedule. | Databricks |
Not sure which one to open? Read examples/production/README.md — it walks through picking a runtime and explains the differences between Databricks Community Edition and a paid workspace.
Connectors
| Format | Read | Write | Description |
|---|---|---|---|
| `hive` | ✓ | ✓ | Apache Hive tables |
| `jdbc` | ✓ | ✓ | PostgreSQL, MySQL, Teradata, and more |
| `delta` | ✓ | ✓ | Delta Lake (standalone or Unity Catalog) |
| `s3` | ✓ | ✓ | S3, HDFS, or local filesystem |
| `unity` | ✓ | ✓ | Databricks Unity Catalog |
| `binary` | ✓ | | Binary files (images, PDFs) |
| `rest_api` | ✓ | ✓ | REST APIs with pagination and auth |
Want to add one? See the plugin guide.
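The plugin guide documents the real interface. Purely as a sketch of the general shape (the class, method names, and `format_name` hook below are illustrative assumptions, not Ubunye's actual API), a minimal read-only connector could look roughly like this:

# Hypothetical sketch only; see the plugin guide for the real base class and hooks.
from typing import Any, Dict

import requests  # assumed to be available in your environment


class MyHttpConnector:
    """Reads a JSON array from an HTTP endpoint into a Spark DataFrame."""

    format_name = "my_http"  # the value a config's `format` field would reference

    def read(self, spark, config: Dict[str, Any]):
        # `config` would carry the options declared under the input in config.yaml.
        records = requests.get(config["url"], timeout=30).json()
        return spark.createDataFrame(records)

How the class gets registered and discovered, and what write support looks like, is covered in the plugin guide.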
Run Anywhere
The same pipeline runs on every Spark-compatible environment. You only change the spark.master setting — the rest is identical:
| Where you run it | What to set |
|---|---|
| Your laptop | spark.master: "local[*]" |
| Hadoop / YARN cluster | spark.master: "yarn" |
| Kubernetes | spark.master: "k8s://..." |
| Databricks notebooks or jobs | Call ubunye.run_task() from Python — Ubunye picks up the active session |
| AWS EMR | Runs as an EMR Step |
Don't recognise some of these? That's fine — you only need one. If you're starting out, local[*] runs Spark on your own machine with no setup.
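To make that concrete, here is a sketch of the switch; the exact nesting of Spark settings in config.yaml is an assumption here, so check the docs for the canonical layout:

# Laptop (assumed layout of the Spark settings block)
spark:
  master: "local[*]"

# The same pipeline on a YARN cluster would change only that value:
# spark:
#   master: "yarn"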
Jinja Templating
Anywhere a string appears in your YAML, you can plug in a variable using {{ … }} syntax (this is called Jinja templating). That's how you keep secrets out of your config, change paths per environment, and inject the run date from the CLI:
# Environment variables
password: "{{ env.DB_PASSWORD }}"
# CLI variables (--var ds=2025-01-01)
path: "s3a://bucket/{{ ds }}/"
# Defaults
path: "s3a://bucket/{{ ds | default('2025-01-01') }}/"
CLI
ubunye init -d ./pipelines -u <use_case> -p <pipeline> -t <task> # scaffold
ubunye validate -d ./pipelines -u <use_case> -p <pipeline> -t <task> # check config
ubunye plan -d ./pipelines -u <use_case> -p <pipeline> -t <task> # preview plan
ubunye run -d ./pipelines -u <use_case> -p <pipeline> -t <task> # execute
ubunye test run -d ./pipelines -u <use_case> -p <pipeline> -t <task> # test mode
ubunye lineage list -d ./pipelines -u <use_case> -p <pipeline> -t <task> # run history
ubunye models list -u <use_case> -m <model> -s <store> # model versions
Python API
import ubunye
# Run from Databricks or any Python environment
outputs = ubunye.run_task(task_dir="./pipelines/...", mode="DEV", dt="2024-06-01")
# Multiple tasks
results = ubunye.run_pipeline(
    usecase_dir="./pipelines", usecase="fraud", package="etl",
    tasks=["claim_etl", "features"], mode="DEV",
)
What Ubunye Is Not
It's not an agent framework — use LangChain or CrewAI for that. It's not an orchestrator — use Airflow, Prefect, or Dagster. It's not a compute engine — it runs on Spark.
Ubunye is the standardization layer between your data sources and your applications. It makes the plumbing boring so you can focus on what matters.
Roadmap
- Config-driven ETL pipelines
- Multi-environment profiles
- Jinja templating
- Plugin-based connectors
- CLI scaffolding and execution
- Pydantic config validation
- ML model contract
- Model registry with versioning
- Lineage tracking
- Python API for Databricks
- Databricks Asset Bundles deployment
- Dev notebook scaffolding
- Data drift detection
- `ubunye deploy` CLI command
Get Involved
We'd love your help. Whether it's a new connector, a bug fix, a typo, or just telling us what you're building — all contributions matter.
- 🐛 Report a bug
- 💡 Request a feature
- 📖 Read the contributing guide
- ⭐ Star the repo if you find it useful — it helps more than you'd think
License
Built with 🇿🇦 by Ubunye AI Ecosystems