# Ubunye Engine

Config-first, Spark-native ETL/ML engine with a modular plugin system
Ubunye (oo-BOON-yeh) — isiZulu for "unity"
One framework. Every pipeline. Any environment.
Docs • Quickstart • Why Ubunye • Community
## Hey there 👋
If you've ever spent more time wiring up Spark boilerplate than writing actual data logic — this is for you.
Ubunye Engine lets you define your entire data pipeline in a simple YAML config and a Python file. No Spark session setup. No connection management. No environment-specific scripts. Just tell the engine what you want, and it handles the how.
It works on your laptop, on YARN, on Kubernetes, on Databricks — same config, same code, everywhere.
## Quickstart
```bash
pip install ubunye-engine
```
Scaffold your first pipeline:
```bash
ubunye init -d ./pipelines -u fraud_detection -p ingestion -t claim_etl
```
You'll get:
```
pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
        config.yaml          ← tell the engine what to do
        transformations.py   ← your logic goes here
```
Open `config.yaml` and describe your pipeline:
```yaml
MODEL: "etl"
VERSION: "0.1.0"

ENGINE:
  profiles:
    dev:
      spark_conf:
        spark.master: "local[*]"
    prod:
      spark_conf:
        spark.master: "yarn"

CONFIG:
  inputs:
    raw_claims:
      format: hive
      db_name: fraud_db
      tbl_name: raw_claims
  transform:
    type: noop
  outputs:
    bronze:
      format: delta
      table: main.fraud.bronze_claims
      mode: overwrite
```
Add your logic in `transformations.py`:
```python
def transform(df):
    return df.filter("claim_amount > 0").dropDuplicates(["claim_id"])
```
Run it:
```bash
ubunye run -d ./pipelines -u fraud_detection -p ingestion -t claim_etl --profile dev
```
That's it. You just built and ran a pipeline. The same config runs in production — just swap in `--profile prod`.
## Why Ubunye
We've all been there. You join a new team, open the repo, and find five Spark projects — each structured differently, each with its own way of handling configs, credentials, and deployment. One uses a JSON file, another has everything hardcoded, a third has a 300-line bash script that "Dave wrote and it just works."
Ubunye says: let's agree on how pipelines look. One folder structure. One config format. One CLI. Whether you're building an ETL job, a feature pipeline, or an ML training run.
| Without Ubunye | With Ubunye |
|---|---|
| Every project looks different | One standard: `use_case / pipeline / task` |
| Spark setup scattered everywhere | Engine handles it from YAML config |
| Credentials hardcoded or inconsistent | `{{ env.DB_PASSWORD }}` everywhere (sketch below) |
| "Works on my machine" | Same config runs local, YARN, K8s, Databricks |
| New teammate needs a week to onboard | `ubunye init` and they're running in minutes |
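
That credentials row deserves a concrete look. Here is a minimal sketch of a JDBC input with templated secrets (the option keys below are illustrative; check the `jdbc` connector docs for the real schema):

```yaml
CONFIG:
  inputs:
    claims_db:
      format: jdbc
      # Illustrative option names, not the connector's documented keys.
      url: "jdbc:postgresql://db.internal:5432/claims"
      user: "{{ env.DB_USER }}"
      password: "{{ env.DB_PASSWORD }}"
```

Nothing sensitive lives in the repo; the values are resolved from the environment at run time.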
## How It Works
Three simple ideas:
1. **Config over code.** Your pipeline is a YAML file. Inputs, outputs, Spark settings, orchestration — all declared, not coded.
2. **Plugins for everything.** The `format` field in your config selects a connector — `hive`, `jdbc`, `delta`, `s3`, `unity`, `rest_api`, and more. Need a new data source? Add a plugin.
3. **Folders as architecture.** Pipelines are organized as `project / use_case / pipeline / task`. The CLI uses this layout for scaffolding, execution, and discovery; every task in the tree below can be run directly from the command line (see the example after the tree):
```
pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
      policy_etl/
    feature_engineering/
      claim_features/
    risk_scoring/
      train_model/
      score_claims/
```
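
Every task in the tree maps directly onto CLI flags. For example, running the model-training task (assuming its config defines a `dev` profile):

```bash
ubunye run -d ./pipelines -u fraud_detection -p risk_scoring -t train_model --profile dev
```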
## What Can You Build With It
- **ETL pipelines** — move data between Hive, JDBC databases, Delta Lake, S3, REST APIs. Config-driven, scheduled, reproducible.
- **ML training and inference** — define your model behind a simple contract, let the engine handle versioning, storage, and deployment.
- **RAG document pipelines** — ingest documents, extract text, chunk, compute embeddings, load into a vector store. All from YAML.
- **Feature engineering** — compute features once, write to a shared table, reuse across use cases (config sketch after this list).
- **Data drift detection** — monitor feature distributions between runs, flag when things shift.
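
As a taste, a feature-engineering task reuses exactly the config shape from the quickstart; only the tables change (the names below are made up):

```yaml
CONFIG:
  inputs:
    claims:
      format: hive
      db_name: fraud_db
      tbl_name: raw_claims
  transform:
    type: noop          # feature logic goes in transformations.py
  outputs:
    features:
      format: delta
      table: main.fraud.claim_features   # shared table, reused across use cases
      mode: overwrite
```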
Check out the Patterns section in our docs for full examples.
## Connectors
| Format | Read | Write | Description |
|---|---|---|---|
| `hive` | ✓ | ✓ | Apache Hive tables |
| `jdbc` | ✓ | ✓ | PostgreSQL, MySQL, Teradata, and more |
| `delta` | ✓ | ✓ | Delta Lake (standalone or Unity Catalog) |
| `s3` | ✓ | ✓ | S3, HDFS, or local filesystem |
| `unity` | ✓ | ✓ | Databricks Unity Catalog |
| `binary` | ✓ | | Binary files (images, PDFs) |
| `rest_api` | ✓ | ✓ | REST APIs with pagination and auth |
Want to add one? See the plugin guide.
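
Conceptually, a connector is just a read/write pair registered under a format name. The Python below is a purely hypothetical sketch of that idea (assuming the MongoDB Spark connector is on the classpath); it is not Ubunye's real plugin API, which the plugin guide documents:

```python
# Hypothetical sketch only; Ubunye's actual plugin interface is in the plugin guide.
from pyspark.sql import DataFrame, SparkSession


class MongoConnector:
    """Illustrative shape of a connector plugin: a reader and a writer keyed by a format name."""

    format_name = "mongo"  # hypothetically lets a config declare `format: mongo`

    def read(self, spark: SparkSession, options: dict) -> DataFrame:
        # Build a DataFrame from whatever the config's input options describe.
        return spark.read.format("mongodb").options(**options).load()

    def write(self, df: DataFrame, options: dict) -> None:
        # Persist the DataFrame; `mode` mirrors the `mode:` key used by the built-in outputs.
        mode = options.pop("mode", "append")
        df.write.format("mongodb").options(**options).mode(mode).save()
```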
## Run Anywhere
Same pipeline, no changes:
| Environment | Just set |
|---|---|
| Local | `spark.master: "local[*]"` |
| YARN / Hadoop | `spark.master: "yarn"` |
| Kubernetes | `spark.master: "k8s://..."` |
| Databricks | Via `ORCHESTRATION` config |
| AWS EMR | Via EMR Steps |
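
In config terms, each environment is just another entry under `ENGINE.profiles`, same shape as the quickstart (the profile names and the Kubernetes master URL below are placeholders):

```yaml
ENGINE:
  profiles:
    dev:
      spark_conf:
        spark.master: "local[*]"
    cluster:
      spark_conf:
        spark.master: "yarn"
    k8s:
      spark_conf:
        spark.master: "k8s://https://my-cluster:6443"   # placeholder URL
```

Pick one at run time with `--profile <name>`.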
## Jinja Templating
All config values support Jinja2:
```yaml
# Environment variables
password: "{{ env.DB_PASSWORD }}"

# CLI variables (--var ds=2025-01-01)
path: "s3a://bucket/{{ ds }}/"

# Defaults
path: "s3a://bucket/{{ ds | default('2025-01-01') }}/"
```
## CLI
```bash
ubunye init     -d ./pipelines -u <use_case> -p <pipeline> -t <task>  # scaffold
ubunye run      -d ./pipelines -u <use_case> -p <pipeline> -t <task>  # execute
ubunye validate -d ./pipelines -u <use_case> -p <pipeline> -t <task>  # check config
```
## What Ubunye Is Not
It's not an agent framework — use LangChain or CrewAI for that. It's not an orchestrator — use Airflow, Prefect, or Dagster. It's not a compute engine — it runs on Spark.
Ubunye is the standardization layer between your data sources and your applications. It makes the plumbing boring so you can focus on what matters.
## Roadmap
- Config-driven ETL pipelines
- Multi-environment profiles
- Jinja templating
- Plugin-based connectors
- CLI scaffolding and execution
- Pydantic config validation
- ML model contract
- Model registry with versioning
- Data drift detection
- Lineage tracking
## Get Involved
We'd love your help. Whether it's a new connector, a bug fix, a typo, or just telling us what you're building — all contributions matter.
- 🐛 Report a bug
- 💡 Request a feature
- 📖 Read the contributing guide
- ⭐ Star the repo if you find it useful — it helps more than you'd think
## License
Built with 🇿🇦 by Ubunye AI Ecosystems