A DAG execution engine for pipeline orchestration.

dagabaaz

A Python library that runs multi-step workflows as directed acyclic graphs. You define the steps and their dependencies. The engine figures out what to run next, routes data between steps, and handles failures.

pip install dagabaaz

Requires Python 3.12+. Optional: google-re2 for ReDoS-safe regex in pipe expressions.

Why This Exists

Most DAG engines (Airflow, Prefect, Dagster) are platforms. They own the scheduler, the database, the UI, and the execution runtime. If you're building a product where pipelines are a feature rather than the whole product, you don't want a platform. You want a library you call from your own code.

Persistence and dispatch are behind a Protocol. Bring your own database and queue.

How It Works

Each step in a pipeline is called a node. Nodes produce artifacts (files or data records) that flow to downstream nodes. A single execution of a pipeline is called a run.

When a run starts:

  1. Root nodes (no dependencies) get tasks dispatched.
  2. Your workers execute tasks. Each task produces artifacts.
  3. When all tasks at a node finish, the engine finds downstream nodes whose dependencies are now satisfied.
  4. For each ready node, artifacts are collected from upstream, optional edge filters are applied, and new tasks are dispatched.
  5. This repeats until every node is done or a failure stops the run.
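
The loop above can be pictured with a small standalone simulation. This is only a sketch of the barrier idea, not the engine's code: `Node`, `ready_nodes`, and `simulate` are hypothetical names, and every task is assumed to succeed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    slug: str
    depends_on: list[str] = field(default_factory=list)


def ready_nodes(nodes, finished, launched):
    """Slugs whose dependencies are all finished and that have not launched yet."""
    return [
        n.slug
        for n in nodes
        if n.slug not in launched and all(d in finished for d in n.depends_on)
    ]


def simulate(nodes):
    """Launch nodes wave by wave, pretending every task completes successfully."""
    finished, launched, waves = set(), set(), []
    while True:
        wave = ready_nodes(nodes, finished, launched)
        if not wave:
            break
        launched.update(wave)
        waves.append(wave)
        finished.update(wave)  # stand-in for workers reporting completion
    return waves


pipeline = [
    Node("source"),
    Node("branch_a", ["source"]),
    Node("branch_b", ["source"]),
    Node("merge", ["branch_a", "branch_b"]),
]
waves = simulate(pipeline)
```

Root nodes form the first wave, and `merge` only appears once both branches are in the finished set.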

The engine does not execute tasks itself. Your DagStore implementation handles persistence and queue operations. The engine calls methods on it to dispatch work, check progress, and record outcomes.

Quick Start

1. Define a pipeline

A pipeline is a list of DagNode objects. Each node has a slug (unique ID), a plugin name, and optional dependencies.

from dagabaaz.models import DagNode
from dagabaaz.constants import FanMode

nodes = [
    DagNode(slug="source", plugin="fetch"),
    DagNode(slug="process", plugin="transform", depends_on=["source"]),
    DagNode(
        slug="export",
        plugin="export",
        depends_on=["process"],
        fan_mode=FanMode.AGGREGATE,
    ),
]

2. Implement DagStore

The engine talks to your infrastructure through the DagStore protocol. It has 20 methods covering task dispatch, state queries, and lifecycle transitions. See store.py for the full protocol.

Here are the three most important ones:

class MyStore:
    def get_barrier_state(self, run_id, node_index):
        # Return (run_status, total_tasks, completed_tasks)
        ...

    def try_claim_node_launch(self, run_id, node_index) -> bool:
        # Returns True if this call claimed the node
        ...

    def dispatch_task(self, run_id, node_index, plugin_name, input_artifact_id) -> str:
        # Create task record, push to your job queue, return task_id
        ...
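
For local experiments, the three methods above could be backed by plain dictionaries. The sketch below is an assumption-heavy illustration: the method names and parameters come from the snippet above, but the status strings, the task ID format, and the list used as a queue are hypothetical, and the remaining protocol methods are omitted.

```python
import itertools


class InMemoryStore:
    """Toy store covering only the three methods shown above."""

    def __init__(self):
        self.run_status = {}   # run_id -> status string (assumed values)
        self.tasks = {}        # task_id -> task record
        self.claimed = set()   # (run_id, node_index) pairs already launched
        self.queue = []        # stand-in for a real job queue
        self._ids = itertools.count(1)

    def get_barrier_state(self, run_id, node_index):
        node_tasks = [
            t for t in self.tasks.values()
            if t["run_id"] == run_id and t["node_index"] == node_index
        ]
        completed = sum(1 for t in node_tasks if t["status"] == "completed")
        return self.run_status.get(run_id, "running"), len(node_tasks), completed

    def try_claim_node_launch(self, run_id, node_index) -> bool:
        key = (run_id, node_index)
        if key in self.claimed:
            return False
        self.claimed.add(key)
        return True

    def dispatch_task(self, run_id, node_index, plugin_name, input_artifact_id) -> str:
        task_id = f"task-{next(self._ids)}"
        self.tasks[task_id] = {
            "run_id": run_id,
            "node_index": node_index,
            "plugin": plugin_name,
            "input_artifact_id": input_artifact_id,
            "status": "pending",
        }
        self.queue.append(task_id)  # a real store would push to its queue here
        return task_id
```

A production store would do the same work with database rows and a real queue, ideally inside one transaction.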

3. Start a run

from dagabaaz.orchestrator import start_run

root_indices = start_run(store, run_id="run-1", nodes=nodes)
# The engine found root nodes and called store.dispatch_task for each one.

4. Handle task completion

After your worker executes a task, call back into the engine so it can dispatch the next steps:

from dagabaaz.orchestrator import on_task_complete, OrchestratorCallbacks

callbacks = OrchestratorCallbacks(
    on_run_completed=lambda run_id: print(f"Run {run_id} done"),
    on_run_failed=lambda run_id: print(f"Run {run_id} failed"),
    on_run_crashed=lambda run_id: print(f"Run {run_id} crashed"),
)

on_task_complete(
    store,
    task_id="task-1",
    callbacks=callbacks,
    # No routing nodes in this example. See Terminology for passthrough.
    resolve_passthrough=lambda plugin: False,
)

The engine checks if the completed task's node is fully done, finds which downstream nodes are now ready, and dispatches them.

Pipeline Patterns

Linear

Three steps in sequence. Each runs after its dependency completes.

nodes = [
    DagNode(slug="fetch", plugin="fetch"),
    DagNode(slug="transform", plugin="transform", depends_on=["fetch"]),
    DagNode(slug="export", plugin="export", depends_on=["transform"]),
]

Fan-out / scatter-gather

A source produces multiple files. Two branches process each file in parallel. The merge node collects results that came from the same original file. If the source produced 10 files and there are 2 branches, the merge node gets 10 tasks, each with 2 results.

nodes = [
    DagNode(slug="source", plugin="fetch"),
    DagNode(slug="branch_a", plugin="process_a", depends_on=["source"]),
    DagNode(slug="branch_b", plugin="process_b", depends_on=["source"]),
    DagNode(
        slug="merge",
        plugin="merge",
        depends_on=["branch_a", "branch_b"],
        fan_mode=FanMode.GROUPED,
    ),
]
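
To make the 10-files, 2-branches arithmetic concrete, here is a standalone sketch of grouping by origin. The `Artifact` dataclass and `group_by_origin` are hypothetical names for illustration and are not the library's types.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    producer: str   # slug of the node that produced it
    origin_id: str  # root artifact that started this chain


def group_by_origin(artifacts):
    """Bucket artifacts so each group holds the results for one original file."""
    groups = defaultdict(list)
    for artifact in artifacts:
        groups[artifact.origin_id].append(artifact)
    return dict(groups)


# 10 source files, each processed once by branch_a and once by branch_b.
artifacts = [
    Artifact(f"{branch}-{i}", branch, f"file-{i}")
    for i in range(10)
    for branch in ("branch_a", "branch_b")
]
groups = group_by_origin(artifacts)
```

Each of the 10 groups would become one merge task carrying 2 results.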

Conditional routing with edge filters

Edge filters route artifacts to different branches by type. Video files go to one branch, subtitles to another.

from dagabaaz.models import DagNode, EdgeFilter, FilterRule
from dagabaaz.constants import FanMode, FilterOperator

nodes = [
    DagNode(slug="source", plugin="fetch"),
    DagNode(
        slug="video",
        plugin="transcode",
        depends_on=["source"],
        fan_mode=FanMode.AGGREGATE,
        edge_filters={
            "source": EdgeFilter(
                rules=[FilterRule(field="file_type", operator=FilterOperator.EQ, value="video")]
            )
        },
    ),
    DagNode(
        slug="subtitle",
        plugin="parse_subs",
        depends_on=["source"],
        edge_filters={
            "source": EdgeFilter(
                rules=[FilterRule(field="file_type", operator=FilterOperator.EQ, value="subtitle")]
            )
        },
    ),
]

When source produces a mix of .mp4 and .srt files, the engine routes each type to the correct branch. If a branch receives no artifacts (e.g., no subtitles), it is marked filtered and does not block downstream nodes.
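
The routing behaviour can be sketched without the library. Below, `Rule`, `passes`, and `route` are hypothetical stand-ins for the real filter types, artifacts are plain dicts, and only a few operators are modelled.

```python
import operator
from dataclasses import dataclass

# Assumed operator names for this sketch; the library's enum may differ.
OPS = {"eq": operator.eq, "ne": operator.ne, "gt": operator.gt, "lt": operator.lt}


@dataclass
class Rule:
    field: str
    op: str
    value: object


def passes(artifact: dict, rules) -> bool:
    """An artifact passes only if every rule matches (AND logic)."""
    return all(
        r.field in artifact and OPS[r.op](artifact[r.field], r.value)
        for r in rules
    )


def route(artifacts, filters_by_branch):
    """Return the artifacts each branch would receive."""
    return {
        branch: [a for a in artifacts if passes(a, rules)]
        for branch, rules in filters_by_branch.items()
    }


filters = {
    "video": [Rule("file_type", "eq", "video")],
    "subtitle": [Rule("file_type", "eq", "subtitle")],
}
mixed = [
    {"name": "a.mp4", "file_type": "video"},
    {"name": "a.srt", "file_type": "subtitle"},
]
routed = route(mixed, filters)
videos_only = route([{"name": "b.mp4", "file_type": "video"}], filters)
```

In the second call the subtitle branch receives an empty list, which corresponds to the filtered case described above.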

Integrating with a Job Queue

The engine dispatches work but does not execute it. Your DagStore.dispatch_task pushes a job to whatever queue you use. When a worker finishes, it calls back into the engine.

Dispatch. The engine calls store.dispatch_task(...). Your implementation inserts a task row, pushes a job to the queue with the task ID as payload, and returns the task ID. Skipped, filtered, and failed tasks don't go through the queue. Their rows are inserted directly, already in a terminal state.

Consume. A worker pulls a job from the queue, extracts the task ID, runs the plugin, and produces artifacts. Use build_task_input to assemble the data the plugin needs:

from dagabaaz.task_input import build_task_input

input_data = build_task_input(
    store,
    run_id="run-1",
    node_index=1,
    input_artifact_id="artifact-xyz",
    nodes=nodes,
)

Complete. The worker calls on_task_complete. The engine checks whether the node is fully done, finds which downstream nodes are now ready, and dispatches them.

On failure, on_task_failed marks the run as failed and cancels remaining tasks. On infrastructure crash, on_task_crashed does the same with a different terminal status.

Concurrency

The engine assumes that calls to on_task_complete are serialized per run: at most one call is in flight for a given run at any time.

With Postgres, take SELECT pg_advisory_xact_lock(hashtext(run_id)) in the same transaction before calling the engine. Your store's try_claim_node_launch (INSERT ON CONFLICT DO NOTHING) prevents double-launches if two workers race. try_claim_run_terminal (UPDATE WHERE status NOT IN terminal_set) ensures exactly one caller wins a state transition.
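
The two claim patterns can be tried out with the standard-library sqlite3 module standing in for Postgres. This is a sketch under assumptions: the table layout is invented, the parameters of try_claim_run_terminal are guessed (only its name appears above), and INSERT OR IGNORE plays the role of INSERT ON CONFLICT DO NOTHING.

```python
import sqlite3

# Terminal run states, as listed under Terminology.
TERMINAL = ("completed", "failed", "crashed", "cancelled")

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node_launches (
        run_id TEXT NOT NULL,
        node_index INTEGER NOT NULL,
        PRIMARY KEY (run_id, node_index)
    );
    CREATE TABLE runs (run_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    INSERT INTO runs VALUES ('run-1', 'running');
    """
)


def try_claim_node_launch(run_id, node_index) -> bool:
    """True only for the first caller; the primary key rejects later inserts."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO node_launches (run_id, node_index) VALUES (?, ?)",
        (run_id, node_index),
    )
    conn.commit()
    return cur.rowcount == 1


def try_claim_run_terminal(run_id, new_status) -> bool:
    """True only if this call moved the run out of a non-terminal state."""
    placeholders = ",".join("?" for _ in TERMINAL)
    cur = conn.execute(
        f"UPDATE runs SET status = ? WHERE run_id = ? AND status NOT IN ({placeholders})",
        (new_status, run_id, *TERMINAL),
    )
    conn.commit()
    return cur.rowcount == 1
```

Both functions rely on the affected-row count, so whichever caller gets there second sees zero rows changed and backs off.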

With Redis, use a distributed lock per run ID. With a single worker, no locking is needed.

Crash recovery

If a worker dies mid-task, the queue's visibility timeout returns the job to pending. The next worker re-runs the task and re-enters the engine. Idempotent claims prevent double-launches.

If your queue is Postgres-native (e.g., postkit, graphile-worker, or pgboss), the task row insert and queue push happen in the same transaction, so dispatch is atomic.

Terminology

Node: A single step in a pipeline. Each node wraps a plugin and declares which other nodes it depends on.

Artifact: A piece of data produced by a task. Usually a file, but can be a data record. Artifacts flow between nodes.

Task: One concrete execution of a node. A single node can spawn multiple tasks depending on its fan mode.

Run: One execution of a pipeline. Contains all tasks across all nodes. Has its own lifecycle (running, completed, failed, etc.).

Fan mode: Controls how a node receives artifacts from upstream. Single (one task per artifact), aggregate (one task gets all artifacts), or grouped (one task per group of related artifacts). Set on the node, not the plugin.

Origin artifact: Every artifact remembers which root artifact started its processing chain. This is how grouped mode knows which artifacts belong together.

Edge filter: Rules on an edge that decide which artifacts pass through. All rules must match (AND logic). Can filter by file type, extension, size, name, or metadata fields.

Input binding: How a task input field gets its value. Four sources: upstream artifact field, literal config value, user-provided run input, or an expression template with transforms.

Barrier sync: A node runs only when all of its dependencies have all of their tasks finished. If any task fails, the run fails.

Skipped: A node is skipped when its upstream is dead. Cascades: if B is skipped, everything downstream of B is also skipped.

Filtered: A node is filtered when it has no artifacts to work with. Does not cascade. Downstream nodes still try to collect artifacts.

Passthrough: When a node has no artifacts, the engine can look further upstream to find some. But it only looks through routing nodes (like gates). If a processing node has no output, the engine treats that as intentional and stops looking.

Terminal state: A final state with no further transitions. Tasks: completed, failed, crashed, cancelled, skipped, filtered. Runs: completed, failed, crashed, cancelled.

Expression Language

Input bindings can use {namespace.key | pipe} expressions:

"{source.file_path}"                              # artifact field
"{source.title | upper | truncate(50)}"           # with transforms
"{list(branch_a.url, branch_b.url) | join(,)}"   # multiple sources
"{input.api_url | required}"                      # run input
"{config.output_format | default(mp4)}"           # config value

30 built-in pipes: upper, lower, trim, title, replace, strip, lstrip, rstrip, default, required, first, last, nth, join, basename, dirname, stem, ext, urlencode, urldecode, int, string, truncate, prepend, append, match, json_get, flatten, compact, pad.

Expressions are validated at pipeline save time and evaluated at task execution time. No eval().
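
A pipe expression can be evaluated with a lookup table instead of eval(). The sketch below is not the library's parser: `evaluate`, the `PIPES` table, and the exact behaviour of each pipe are assumptions, only five pipes are modelled, and arguments containing `|` or `,` are not handled.

```python
import re

# Hypothetical pipe semantics, chosen for illustration only.
PIPES = {
    "upper": lambda v: str(v).upper(),
    "lower": lambda v: str(v).lower(),
    "trim": lambda v: str(v).strip(),
    "truncate": lambda v, n: str(v)[: int(n)],
    "default": lambda v, fallback: fallback if v in (None, "") else v,
}

_PIPE_RE = re.compile(r"^(\w+)(?:\((.*)\))?$")


def evaluate(template: str, context: dict):
    """Resolve one {namespace.key | pipe | pipe(arg)} expression against context."""
    body = template.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a single {...} expression")
    ref, *pipes = [part.strip() for part in body[1:-1].split("|")]
    namespace, _, key = ref.partition(".")
    value = context.get(namespace, {}).get(key)
    for pipe in pipes:
        match = _PIPE_RE.match(pipe)
        if match is None or match.group(1) not in PIPES:
            raise ValueError(f"unknown pipe: {pipe!r}")
        name, raw_args = match.group(1), match.group(2)
        args = [a.strip() for a in raw_args.split(",")] if raw_args else []
        value = PIPES[name](value, *args)  # table lookup, never eval()
    return value


title = evaluate("{source.title | upper | truncate(5)}", {"source": {"title": "hello world"}})
fmt = evaluate("{config.output_format | default(mp4)}", {"config": {}})
```

Because unknown pipe names raise immediately, the same lookup could also serve as a validation pass at save time.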

Module Map

Module        Purpose
orchestrator  Decides which nodes to run next after a task completes
topology      Caches dependency graphs so they aren't rebuilt on every task completion
graph         Graph algorithms: which nodes are ready, collecting artifacts from upstream
filter        Evaluates edge filter rules and groups artifacts by origin
models        Data types: DagNode, DagArtifact, EdgeFilter, binding sources
constants     Enums and status sets (RunStatus, TaskStatus, FanMode)
store         DagStore and TaskInputStore protocols
bindings      Resolves how each task input field gets its value
expressions   Parses and evaluates {slug.key | pipe} templates
pipes         30 built-in transform functions for expressions
task_input    Assembles the input data dict passed to plugins on the worker side
schema        Generates the input form schema for a pipeline
retry         Computes which nodes to re-run after a failure
plugins       PluginMeta protocol for plugin metadata

Design Decisions

Protocol, not base class. DagStore is a typing.Protocol, not a base class. Your store implementation just needs the right method signatures.
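
Structural typing with typing.Protocol looks like this in miniature. `LaunchClaimer` is a hypothetical one-method slice of the idea, not the real DagStore, and `SetBackedClaimer` deliberately inherits from nothing.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LaunchClaimer(Protocol):
    """Hypothetical one-method protocol, for illustration only."""

    def try_claim_node_launch(self, run_id: str, node_index: int) -> bool: ...


class SetBackedClaimer:
    # No base class: matching the method signature is enough.
    def __init__(self) -> None:
        self._claimed: set[tuple[str, int]] = set()

    def try_claim_node_launch(self, run_id: str, node_index: int) -> bool:
        key = (run_id, node_index)
        if key in self._claimed:
            return False
        self._claimed.add(key)
        return True


def launch_once(store: LaunchClaimer, run_id: str, node_index: int) -> str:
    """Code written against the protocol accepts any conforming object."""
    if store.try_claim_node_launch(run_id, node_index):
        return "launched"
    return "already claimed"


claimer = SetBackedClaimer()
first = launch_once(claimer, "run-1", 0)
second = launch_once(claimer, "run-1", 0)
```

A static type checker accepts `SetBackedClaimer` wherever `LaunchClaimer` is expected. The `runtime_checkable` decorator is optional and only adds isinstance support, which checks method presence rather than signatures.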

Fan mode belongs to the node, not the plugin. The pipeline author decides how a node aggregates its inputs, not the plugin author. Plugins can suggest a default, but the node definition is what the engine uses.

Filtered != skipped. If a node has no artifacts to work with, it's marked filtered and downstream nodes still run. If a node's upstream is dead, it's marked skipped and everything downstream is also skipped. Mixing these up causes either false cascades or tasks running on missing data.
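
The difference in cascade behaviour can be shown with a toy classifier. This is a sketch, not the engine's logic: `classify` is a hypothetical helper, and because this text does not define what makes a node dead, the set of dead nodes is simply taken as an input.

```python
def classify(nodes, dead, empty):
    """Assign a state to each node.

    nodes: list of (slug, depends_on) pairs in topological order.
    dead:  slugs treated as dead (assumed to be given by the caller).
    empty: slugs that end up with no artifacts to work with.
    """
    state = {}
    for slug, deps in nodes:
        if slug in dead:
            state[slug] = "dead"
        elif any(state[d] in ("dead", "skipped") for d in deps):
            state[slug] = "skipped"   # cascades to everything downstream
        elif slug in empty:
            state[slug] = "filtered"  # does not cascade
        else:
            state[slug] = "run"
    return state


chain = [("a", []), ("b", ["a"]), ("c", ["b"])]
after_dead = classify(chain, dead={"a"}, empty=set())
after_empty = classify(chain, dead=set(), empty={"b"})
```

With `a` dead, both `b` and `c` are skipped. With `b` merely empty, `b` is filtered and `c` still runs.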

Passthrough-aware artifact lookup. When a node's immediate dependency has no artifacts, the engine walks up the dependency chain looking for the nearest ancestor that does. But it only walks through routing nodes. Processing nodes are barriers.
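
The upstream walk could be sketched as a short recursion. Everything here is hypothetical: `nearest_artifacts`, the dict-based graph, and the `is_routing` predicate keyed by slug (the Quick Start's resolve_passthrough callback receives a plugin instead).

```python
def nearest_artifacts(slug, depends_on, artifacts, is_routing):
    """Collect input artifacts for `slug`, looking through empty routing nodes.

    depends_on: slug -> list of upstream slugs
    artifacts:  slug -> list of artifacts that node produced
    is_routing: predicate telling whether a node is a routing node
    """
    found = []
    for dep in depends_on.get(slug, []):
        if artifacts.get(dep):
            found.extend(artifacts[dep])
        elif is_routing(dep):
            # Empty routing node: keep walking toward its own dependencies.
            found.extend(nearest_artifacts(dep, depends_on, artifacts, is_routing))
        # Empty processing node: treated as intentional, so the walk stops here.
    return found


graph = {"gate": ["source"], "process": ["gate"]}
produced = {"source": ["clip.mp4"], "gate": []}

via_gate = nearest_artifacts("process", graph, produced, is_routing=lambda s: s == "gate")
via_processing = nearest_artifacts("process", graph, produced, is_routing=lambda s: False)
```

When `gate` counts as a routing node, `process` receives the source artifact. When it is treated as a processing node, `process` receives nothing.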

No barrel exports. Import from submodules directly: from dagabaaz.models import DagNode. The module path tells you where each symbol lives.

License

MIT
