
Config-first, Spark-native ETL/ML engine with a modular plugin system


Ubunye Engine

Ubunye (oo-BOON-yeh) — isiZulu for "unity"

One framework. Every pipeline. Any environment.

Docs · Quickstart · Why Ubunye · Community


Hey there 👋

A data pipeline is a program that moves data from one place to another — a database to a file, a REST API to a data warehouse — and usually reshapes the data along the way. Building one from scratch is mostly plumbing: wire up the connection, juggle credentials, learn a framework's quirks, write the same "read → transform → write" scaffold for the tenth time this year. It's a lot of glue code standing between you and the three lines that actually matter.

Ubunye Engine writes that plumbing for you. You describe the pipeline in a short YAML file and put your transformation in a normal Python class. Ubunye takes care of connections, the compute engine (Apache Spark), and the read/write loop.

Same pipeline runs on your laptop today and on a production cluster tomorrow — no code changes.


Quickstart

Install it:

pip install ubunye-engine

Scaffold a new pipeline folder:

ubunye init -d ./pipelines -u demo -p starter -t filter_adults

You get:

pipelines/demo/starter/filter_adults/
  config.yaml              ← describes the pipeline (inputs, outputs, settings)
  transformations.py       ← your code goes here
  notebooks/               ← an interactive dev notebook for exploring

ubunye init gives you a working starting point you can customise. For a minimal run-it-on-your-laptop example, edit config.yaml to read a local CSV and write Parquet:

CONFIG:
  inputs:
    people:
      format: s3              # generic file reader; "file://" paths work too
      file_format: csv
      path: "file:///tmp/people.csv"
      options:
        header: "true"
        inferSchema: "true"

  outputs:
    adults:
      format: s3
      file_format: parquet
      path: "file:///tmp/adults/"
      mode: overwrite

Then open transformations.py and write your logic:

from typing import Any, Dict
from ubunye.core.interfaces import Task


class FilterAdults(Task):
    """Keep only rows where age is 18 or older."""

    def transform(self, sources: Dict[str, Any]) -> Dict[str, Any]:
        people = sources["people"]
        return {"adults": people.filter("age >= 18")}

Two things to notice:

  • sources["people"] matches the inputs.people name from the YAML.
  • The return key "adults" matches the outputs.adults name.
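If you don't already have a file at /tmp/people.csv, create a tiny one first (any CSV with an age column will do; the names here are just sample data):

from pathlib import Path

# Minimal sample input for the quickstart; the header row matches header: "true" in the config.
Path("/tmp/people.csv").write_text(
    "name,age\n"
    "Thandi,34\n"
    "Sipho,12\n"
    "Lerato,29\n"
)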

Run it:

ubunye run -d ./pipelines -u demo -p starter -t filter_adults

That's the whole loop. Ubunye reads /tmp/people.csv, hands you a Spark DataFrame, and writes whatever you return to /tmp/adults/.

Running on Databricks? Call it from a notebook instead:

import ubunye
outputs = ubunye.run_task(task_dir="./pipelines/demo/starter/filter_adults")

Ubunye detects Databricks' active Spark session and reuses it — same pipeline, no code change.

Want to see a realistic end-to-end example — Kaggle Titanic CSV → survival-rate Parquet, with tests and CI? See examples/production/titanic_local/.


Why Ubunye

We've all been there. You join a new team, open the repo, and find five Spark projects — each structured differently, each with its own way of handling configs, credentials, and deployment. One uses a JSON file, another has everything hardcoded, a third has a 300-line bash script that "Dave wrote and it just works."

Ubunye says: let's agree on how pipelines look. One folder structure. One config format. One CLI. Whether you're building an ETL job, a feature pipeline, or an ML training run.

Without Ubunye                         | With Ubunye
Every project looks different          | One standard: use_case / pipeline / task
Spark setup scattered everywhere       | Engine handles it from YAML config
Credentials hardcoded or inconsistent  | {{ env.DB_PASSWORD }} everywhere
"Works on my machine"                  | Same config runs local, YARN, K8s, Databricks
New teammate needs a week to onboard   | ubunye init and they're running in minutes

How It Works

Three simple ideas:

Config over code. Your pipeline is a YAML file. Inputs, outputs, Spark settings, scheduling — all declared, not coded.

Plugins for everything. The format field in your config picks which connector to use. A connector is a small Python class that knows how to read from or write to one specific place (a database, a REST API, a cloud bucket). Built-ins include hive, jdbc, delta, s3, unity, and rest_api. Need a new data source? Write one and register it — Ubunye discovers plugins automatically.
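To give a feel for the shape of a plugin, here is a rough sketch of a custom connector. Everything below (the class name, the method names, the format_name attribute) is an illustrative assumption, not Ubunye's actual interface; see the plugin guide for the real contract.

from typing import Any, Dict

# Hypothetical sketch only: the names and registration mechanism are assumptions.
# In a real plugin you would subclass Ubunye's connector base class instead.
class RedisConnector:
    """Illustrative connector that would read from / write to a Redis store."""

    format_name = "redis"   # assumed: the value users would put in the `format:` field

    def read(self, spark, config: Dict[str, Any]):
        # Connect using values from the task's YAML (host, key, ...) and
        # return a Spark DataFrame for the engine to pass into your Task.
        ...

    def write(self, df, config: Dict[str, Any]) -> None:
        # Persist the DataFrame the Task returned back to the target system.
        ...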

Folders as architecture. Pipelines are organized as project / use_case / pipeline / task. The CLI uses this structure for scaffolding, execution, and discovery:

pipelines/
  fraud_detection/
    ingestion/
      claim_etl/
      policy_etl/
    feature_engineering/
      claim_features/
    risk_scoring/
      train_model/
      score_claims/
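With that layout, the CLI addresses any task by its folder path. The scoring task in the tree above, for example, would run as:

ubunye run -d ./pipelines -u fraud_detection -p risk_scoring -t score_claims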

What Can You Build With It

ETL pipelines — move data between Hive, JDBC databases, Delta Lake, S3, REST APIs. Config-driven, scheduled, reproducible.

ML training and inference — define your model behind a simple contract, let the engine handle versioning, storage, and deployment.

RAG document pipelines — ingest documents, extract text, chunk, compute embeddings, load into a vector store. All from YAML.

Feature engineering — compute features once, write to a shared table, reuse across use cases.

Data drift detection — monitor feature distributions between runs, flag when things shift.

Check out the Patterns section in our docs for full examples.


Examples

Six fully worked pipelines live in examples/production/. Each one is self-contained — its own README, tests, and CI workflow — so you can copy a folder, tweak the config, and have something running in minutes.

Example                        | What it shows                                                                                                   | Where it runs
titanic_local/                 | Simplest end-to-end: Kaggle Titanic CSV → survival rate by passenger class, saved as Parquet. Start here.       | Your laptop
titanic_databricks/            | Same business logic, same file — just a different config. Shows how little changes when you move to the cloud. | Databricks Community Edition
titanic_multitask_local/       | Two tasks chained: one cleans the data, the next summarises it. Shows ubunye run -t task1 -t task2.             | Your laptop
titanic_multitask_databricks/  | Same chain, running on Databricks with Unity Catalog tables instead of local Parquet.                           | Databricks
titanic_ml_databricks/         | The full ML lifecycle: train a classifier, log to MLflow, promote through the model registry, score new rows.   | Databricks
jhb_weather_databricks/        | REST API ingestion (Open-Meteo, no auth) → Unity Catalog Delta table, on a schedule.                            | Databricks

Not sure which one to open? Read examples/production/README.md — it walks through picking a runtime and what the Community Edition / paid workspace differences look like.


Connectors

Format    | Description
hive      | Apache Hive tables
jdbc      | PostgreSQL, MySQL, Teradata, and more
delta     | Delta Lake (standalone or Unity Catalog)
s3        | S3, HDFS, or local filesystem
unity     | Databricks Unity Catalog
binary    | Binary files (images, PDFs)
rest_api  | REST APIs with pagination and auth

Want to add one? See the plugin guide.


Run Anywhere

The same pipeline runs on every Spark-compatible environment. You only change the spark.master setting — the rest is identical:

Where you run it              | What to set
Your laptop                   | spark.master: "local[*]"
Hadoop / YARN cluster         | spark.master: "yarn"
Kubernetes                    | spark.master: "k8s://..."
Databricks notebooks or jobs  | Call ubunye.run_task() from Python — Ubunye picks up the active session
AWS EMR                       | Runs as an EMR Step

Don't recognise some of these? That's fine — you only need one. If you're starting out, local[*] runs Spark on your own machine with no setup.
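As a sketch of what the switch looks like, it is a one-line change in the config. The nesting below (a spark block inside CONFIG) is an assumption about where Spark settings live; check the docs for the exact keys your version expects:

CONFIG:
  spark:                    # assumed location for Spark settings
    master: "local[*]"      # laptop today
    # master: "yarn"        # cluster tomorrow; nothing else in the file changes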


Jinja Templating

Anywhere a string appears in your YAML, you can plug in a variable using {{ … }} syntax (this is called Jinja templating). That's how you keep secrets out of your config, change paths per environment, and inject the run date from the CLI:

# Environment variables
password: "{{ env.DB_PASSWORD }}"

# CLI variables (--var ds=2025-01-01)
path: "s3a://bucket/{{ ds }}/"

# Defaults
path: "s3a://bucket/{{ ds | default('2025-01-01') }}/"

CLI

ubunye init     -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # scaffold
ubunye validate -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # check config
ubunye plan     -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # preview plan
ubunye run      -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # execute
ubunye test run -d ./pipelines -u <use_case> -p <pipeline> -t <task>   # test mode
ubunye lineage list -d ./pipelines -u <use_case> -p <pipeline> -t <task>  # run history
ubunye models list -u <use_case> -m <model> -s <store>                 # model versions

Python API

import ubunye

# Run from Databricks or any Python environment
outputs = ubunye.run_task(task_dir="./pipelines/...", mode="DEV", dt="2024-06-01")

# Multiple tasks
results = ubunye.run_pipeline(
    usecase_dir="./pipelines", usecase="fraud", package="etl",
    tasks=["claim_etl", "features"], mode="DEV",
)

What Ubunye Is Not

It's not an agent framework — use LangChain or CrewAI for that. It's not an orchestrator — use Airflow, Prefect, or Dagster. It's not a compute engine — it runs on Spark.

Ubunye is the standardization layer between your data sources and your applications. It makes the plumbing boring so you can focus on what matters.


Roadmap

  • Config-driven ETL pipelines
  • Multi-environment profiles
  • Jinja templating
  • Plugin-based connectors
  • CLI scaffolding and execution
  • Pydantic config validation
  • ML model contract
  • Model registry with versioning
  • Lineage tracking
  • Python API for Databricks
  • Databricks Asset Bundles deployment
  • Dev notebook scaffolding
  • Data drift detection
  • ubunye deploy CLI command

Get Involved

We'd love your help. Whether it's a new connector, a bug fix, a typo, or just telling us what you're building — all contributions matter.


License

MIT License


Built with 🇿🇦 by Ubunye AI Ecosystems
