Lightweight DB-backed coordination primitive with leases and fencing tokens

Sentinel

Distributed execution is hard to get right. Workers crash mid-flight. Retries overlap. Processes freeze while holding a lock. Side effects partially succeed and leave you guessing.

Most tools respond to this by pretending it isn't a problem — they retry silently, hide uncertainty, and hope the work was idempotent. Sentinel doesn't. It gives you a coordination layer built around an honest model of what can go wrong, and explicit tools for handling it when it does.

At its core, Sentinel is a PostgreSQL-backed execution primitive that guarantees one active execution generation at a time, rejects stale workers with fencing tokens, and surfaces uncertain outcomes instead of burying them.


Philosophy

The dominant pattern in distributed task execution is optimistic: assume work is safe to retry, hide failures behind automatic replays, and let the application figure out the mess when duplicates show up downstream.

That works until it doesn't. And when it doesn't, you're debugging a payment that charged twice, an invoice that sent three times, or a downstream system in an inconsistent state you can't easily reconstruct.

Sentinel starts from a different assumption: some work is not safe to replay, and your coordination layer should know the difference.

When a worker crashes mid-execution, Sentinel doesn't guess. It marks the execution state as uncertain and hands that back to you. You decide whether to reset and retry, force-complete, or escalate. That's not a limitation — that's the correct behavior for correctness-sensitive systems.

A few things Sentinel will never do:

  • Silently replay work it can't verify completed
  • Pretend uncertainty doesn't exist to give you a cleaner API
  • Guarantee something it can't actually guarantee

If that trade-off doesn't fit your use case (for example, your work is truly idempotent and automatic retries are fine), Sentinel may be more ceremony than you need. It's worth being honest about that.


What Sentinel Is Good At

  • Payment processing and financial operations
  • Webhook ingestion and deduplication
  • Distributed task ownership across competing workers
  • Long-running jobs where you need heartbeat-backed liveness
  • Workflows where the cost of a duplicate is higher than the cost of a manual reconciliation

What Sentinel Is Not

  • A general-purpose task queue (use Celery, Dramatiq, or similar)
  • A distributed transaction system
  • A guarantee against duplicate side effects in downstream services
  • A replacement for idempotency keys at the API layer

Sentinel coordinates execution. What happens inside that execution (whether your database write is transactional, whether your API call is idempotent) is still your responsibility.


Where Sentinel Fits

Sentinel lives at the boundary between work arriving and work executing: after your queue or stream delivers an event, and before your code touches the outside world.

Kafka / SQS / Flink
        ↓
event delivered to your worker
        ↓
    Sentinel   ← coordination happens here
        ↓
side effect runs (charge card, send email, write DB, call API)

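Kafka can give you strong delivery guarantees, including exactly-once semantics inside its own transactional read-process-write pipeline. It cannot guarantee exactly-once execution of the external side effects your consumer performs next. Sentinel addresses that gap.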

If your worker crashes after Kafka commits the offset but before the payment goes through, Kafka considers the job done. Sentinel is what catches it.
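
As a rough sketch of where the call sits inside a consumer: the handler below wraps the side effect in `sentinel.once()`, keyed by a business identifier. The function name `handle_order_event`, the `charge` callable, and the event fields are hypothetical; only the `once()` call shape comes from this README.

```python
from typing import Any, Callable, Mapping


def handle_order_event(
    sentinel: Any,
    event: Mapping[str, Any],
    charge: Callable[[str, int], dict],
) -> Any:
    """Run the side effect for one delivered event under Sentinel coordination.

    `sentinel` is expected to expose the once() method shown in this README.
    `event` and `charge` are hypothetical stand-ins for your message and
    your payment call.
    """
    order_id = event["order_id"]

    def side_effect() -> dict:
        # Everything that touches the outside world goes inside this function.
        return charge(event["customer_id"], event["amount_cents"])

    # The key is derived from the business identifier, not the message offset,
    # so a redelivered message maps to the same execution.
    return sentinel.once(
        key=f"payment-order-{order_id}",
        fn=side_effect,
        ttl_ms=3000,
        hard_ttl_ms=30000,
    )
```

Deriving the key from the order rather than from the message means a redelivery after a crash lands on the same execution record instead of starting a new one.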


Why Not Just Use...

Temporal

Temporal is a full workflow engine. It manages retries, timelines, activity state, and long-running saga orchestration. It's powerful and the right tool for complex multi-step workflows.

Sentinel is not that. It's a single primitive — coordinate execution of one unit of work, surface the outcome honestly. No workflow DSL, no activity workers, no server to operate. If you're already running Temporal, Sentinel is probably redundant. If you just need to ensure a payment handler doesn't double-execute, Temporal is a lot of infrastructure for a narrow problem.

Kafka

Kafka is a durable distributed log. It solves delivery and ordering. It does not solve execution. Sentinel is what you reach for after Kafka has done its job — when the message is in your worker and you need to guarantee what happens next.

etcd / ZooKeeper

Both are distributed coordination systems built for infrastructure concerns: leader election, cluster membership, service discovery. They're designed to be run as part of your platform, not called from application code. Using etcd for execution coordination means building the lease model, fencing tokens, and execution state tracking yourself on top of a general-purpose primitive. Sentinel is that layer already built, opinionated, and pointed at application-level execution rather than infrastructure coordination.

Redis (SETNX / Redlock)

Redis-based locking is common and fast. It also has well-documented failure modes: Redlock in particular has been the subject of serious distributed systems criticism around clock skew and network partition behavior. More importantly, Redis locks give you mutual exclusion but not execution state. You still have to model claimed vs executing vs completed yourself, and you still have to handle the uncertain outcome when a lock expires mid-execution. Sentinel does all of that. Redis support is on the roadmap as a backend option, but the coordination semantics will remain the same.

The honest version: Sentinel is an opinionated, lightweight primitive that makes one specific bet — that explicit uncertainty handling is worth more than automatic retries for correctness-sensitive work. If that bet fits your problem, it's significantly less infrastructure than the alternatives. If it doesn't, use something else.


Installation

pip install sentinel-coordination

Requires Python 3.9+ and a PostgreSQL database.


Database Setup

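import psycopg
from sentinel import init_db

def get_conn():
    return psycopg.connect("postgresql://postgres:postgres@localhost/testdb")

conn = get_conn()
init_db(conn)  # creates the coordination tables if they don't exist
conn.close()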

This creates the coordination tables Sentinel needs. Safe to run multiple times.


Getting Started

import psycopg
from sentinel import Sentinel

def get_conn():
    return psycopg.connect("postgresql://postgres:postgres@localhost/testdb")

sentinel = Sentinel(
    get_conn=get_conn,
    default_ttl_ms=3000
)

The Once API

sentinel.once() is the primary interface. Given a key and a function, it guarantees that function runs at most once per key across any number of competing workers and returns the cached result to anyone else who asks.

def process_payment():
    charge_card(amount=99_00, customer_id="cus_abc")
    return {"ok": True, "payment_id": "pay_123"}

result = sentinel.once(
    key="payment-order-789",
    fn=process_payment,
    ttl_ms=3000,
    hard_ttl_ms=30000
)

Reading the result

if result.success:
    # Execution completed. result.response has your return value.
    print(result.response)

elif result.cached:
    # A previous worker already completed this. Same result, no re-execution.
    print("Already done:", result.response)

elif result.status == "executing" and result.execution_alive:
    print("Execution currently in progress")

elif result.status == "executing" and not result.execution_alive:
    # A worker claimed this and hasn't finished. We don't know the outcome.
    # Don't retry blindly. Read the reconciliation section below.
    print("Execution outcome uncertain — reconciliation required")
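
If you branch on results in several places, it can help to collapse those checks into one function. This is a minimal sketch, assuming result objects expose exactly the attributes used above (`success`, `cached`, `status`, `execution_alive`). The function name `classify` and the labels it returns are hypothetical, not part of Sentinel.

```python
from types import SimpleNamespace


def classify(result) -> str:
    """Map a result object to one of a few outcome labels.

    Only reads the attributes used in the example above:
    success, cached, status, execution_alive.
    """
    if result.success:
        return "done"
    if result.cached:
        return "already_done"
    if result.status == "executing":
        return "in_progress" if result.execution_alive else "uncertain"
    # Any state not covered in this README is reported as-is for the caller.
    return f"unhandled:{result.status}"


# Stand-in result for illustration; real results come from sentinel.once().
stale = SimpleNamespace(
    success=False, cached=False, status="executing", execution_alive=False
)
print(classify(stale))  # uncertain
```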

Why result.status == "executing" matters

This is the state most systems hide from you. It surfaces when a worker claimed execution, entered the side-effect zone, and then disappeared: it crashed, froze, or timed out. The work may have completed. It may have half-completed. Sentinel doesn't know, and it won't pretend otherwise.

What you do with that is up to you. That's the point.


Execution States

Every execution tracked by Sentinel moves through four states:

State        Meaning
claimed      Work has been claimed. Execution hasn't started. Safe to reset and retry.
executing    Execution has started. Side effects may be in flight. Replay is potentially unsafe.
completed    Execution finished. Result is cached and reusable.
reconciling  Execution entered recovery mode. Automatic progress is blocked until reconciliation resolves execution truth.

The claimed → executing transition is the important one. Before that boundary, a reset is safe. After it, you're in uncertain territory and Sentinel will tell you so.
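
One way to picture the four states is as a small transition table. The sketch below is inferred from the descriptions in this README (including the reconciliation calls described under Reconciliation) and is an assumption about the model, not Sentinel's internal implementation. The names `State`, `ALLOWED`, `can_transition`, and `safe_to_replay` are hypothetical.

```python
from enum import Enum


class State(str, Enum):
    CLAIMED = "claimed"
    EXECUTING = "executing"
    COMPLETED = "completed"
    RECONCILING = "reconciling"


# Transitions inferred from this README; an assumption, not Sentinel's source.
ALLOWED = {
    State.CLAIMED: {State.EXECUTING},
    # An execution either finishes or is handed to reconciliation.
    State.EXECUTING: {State.COMPLETED, State.RECONCILING},
    # force_complete -> completed, reset_to_claimed -> claimed.
    State.RECONCILING: {State.COMPLETED, State.CLAIMED},
    State.COMPLETED: set(),
}


def can_transition(src: State, dst: State) -> bool:
    return dst in ALLOWED[src]


def safe_to_replay(state: State) -> bool:
    # Only work that never crossed into executing can be retried blindly.
    return state is State.CLAIMED


print(can_transition(State.EXECUTING, State.CLAIMED))  # False
print(safe_to_replay(State.CLAIMED))                   # True
```

Note that there is no direct edge from executing back to claimed in this model: the only way back is through reconciling.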


Reconciliation

When execution ends up in an uncertain state, Sentinel gives you explicit tools to resolve it rather than forcing a guess.

# Move the execution into the reconciling state. force_complete and
# reset_to_claimed can only be used once the state is reconciling.
sentinel.reconcile.reconcile(key="payment-order-789")

# Mark as complete with a known result. Use when you can verify externally.
sentinel.reconcile.force_complete(key="payment-order-789", response={"ok": True})

# Reset to claimed so the work can run again. Use when the side effect did not happen.
sentinel.reconcile.reset_to_claimed(key="payment-order-789")

The typical reconciliation pattern:

  1. Detect status == "executing" on a result
  2. Use reconcile to start reconciliation
  3. Check your downstream system (did the payment go through?)
  4. If yes: force_complete with the known result
  5. If no: reset_to_claimed and let it retry. If you cannot tell either way, leave it in reconciling and escalate rather than retrying blindly

This is more work than a silent retry. It's also the only approach that doesn't risk charging a customer twice.
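
The pattern above can be sketched as a single function. The three `sentinel.reconcile.*` calls are the ones shown in this section; `reconcile_payment` and `lookup_payment` are hypothetical, and the sketch assumes `lookup_payment` returns `None` only when the downstream system positively confirms that no payment exists.

```python
from typing import Any, Callable, Optional


def reconcile_payment(
    sentinel: Any,
    key: str,
    lookup_payment: Callable[[str], Optional[dict]],
) -> str:
    """Resolve an uncertain execution by checking the downstream system.

    lookup_payment (hypothetical) returns the payment record if the charge
    went through, or None if the downstream system confirms it did not.
    """
    # Step 2: enter reconciliation so the other two calls are permitted.
    sentinel.reconcile.reconcile(key=key)

    # Step 3: ask the source of truth. If this raises, the execution simply
    # stays in reconciling, which blocks automatic progress.
    payment = lookup_payment(key)

    if payment is not None:
        # Step 4: the side effect happened; record the known result.
        sentinel.reconcile.force_complete(key=key, response=payment)
        return "force_completed"

    # Step 5: the side effect did not happen; allow a fresh attempt.
    sentinel.reconcile.reset_to_claimed(key=key)
    return "reset_to_claimed"
```

Letting a failed lookup propagate as an exception is deliberate in this sketch: an unanswered question leaves the key parked in reconciling instead of defaulting to a retry.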


Leases

If you need lower-level coordination without the full execution lifecycle, the lease API gives you a distributed mutex with heartbeat renewal and fencing token protection.

def process_invoice():
    with sentinel.lease(
        key="invoice-123",
        ttl_ms=3000,
        hard_ttl_ms=30000
    ) as lease:

        if lease is None:
            # Another worker holds the lease; skip this round.
            print("Already held by another worker")
            return

        # Lease is held. Heartbeats renew it automatically up to hard_ttl_ms.
        do_work()

Leases are useful when you want coordination without tracking execution state. For example, you can ensure that only one worker runs a polling loop at a time.


Fencing Tokens

Every lease acquisition generates a monotonically increasing fencing token. Sentinel uses this to reject stale workers: if a worker pauses (GC, network partition, slow disk) and comes back after its lease has expired and been re-acquired by someone else, its operations will be rejected.

This protects against a class of bugs that are easy to miss: the worker that thinks it still holds the lease but doesn't.

Fencing tokens are only effective if downstream state transitions validate them.

If your execution modifies shared state outside Sentinel — for example updating a database row, processing a workflow step, or mutating application-owned execution state — you should include the fencing token in the write condition.

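Example (PostgreSQL; a real query should also match the lease row for this execution's key, using whatever key column Sentinel's schema defines):

UPDATE payments SET status = 'completed' FROM sentinel_leases WHERE payments.payment_id = %s AND sentinel_leases.fencing_token = %s;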

This prevents stale workers from overwriting newer authoritative execution generations.

Sentinel enforces fencing internally for lease coordination and canonical execution completion, but downstream systems must also participate in fencing validation if they maintain mutable execution state.

This is a necessary distributed systems practice whenever execution authority can change over time.
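
To make the idea concrete without a database, here is a minimal in-memory sketch of a fenced write. It is a variant of the SQL condition above: instead of comparing against the lease table, the record itself remembers the highest token that has written to it and rejects anything older. `FencedRecord` is a hypothetical illustration, not part of Sentinel.

```python
class FencedRecord:
    """A record that remembers the highest fencing token that wrote to it."""

    def __init__(self) -> None:
        self.status = "pending"
        self.last_token = 0

    def write(self, status: str, token: int) -> bool:
        # Reject any writer whose token is older than one already seen.
        if token < self.last_token:
            return False
        self.status = status
        self.last_token = token
        return True


record = FencedRecord()
print(record.write("completed", token=2))  # True
print(record.write("failed", token=1))     # False
print(record.status)                       # completed
```

The worker holding token 1 paused, lost its lease, and came back after token 2 had already written. Its write is refused, so the newer result stands.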


TTL and Hard TTL

sentinel.once(
    key="...",
    fn=fn,
    ttl_ms=3000,       # Heartbeat interval and lease window
    hard_ttl_ms=30000  # Absolute maximum lifetime of this execution
)

ttl_ms controls how often the heartbeat needs to renew the lease. hard_ttl_ms is the ceiling: no matter how healthy the heartbeat, execution cannot extend past this point.

For short work, they can be equal. For long-running jobs, use a short ttl_ms to detect dead workers quickly and a large hard_ttl_ms to give live workers room to finish.

If you omit hard_ttl_ms, it defaults to ttl_ms, meaning heartbeat extension won't meaningfully extend the lease. This is intentional: explicit is better than surprising behavior for long-running work.
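
The interaction between the two values is easier to see as arithmetic. The function below is an illustrative model of the behavior described in this section, assuming each heartbeat pushes expiry to "now + ttl_ms" but never past "start + hard_ttl_ms". It is not Sentinel's actual implementation, and `next_expiry_ms` is a hypothetical name.

```python
from typing import Optional


def next_expiry_ms(
    started_ms: int,
    now_ms: int,
    ttl_ms: int,
    hard_ttl_ms: Optional[int] = None,
) -> int:
    """Expiry time after a heartbeat at now_ms, capped by the hard TTL."""
    if hard_ttl_ms is None:
        # Mirrors the documented default: hard_ttl_ms falls back to ttl_ms.
        hard_ttl_ms = ttl_ms
    return min(now_ms + ttl_ms, started_ms + hard_ttl_ms)


# Heartbeat at t=2000 with a 3 s window and a 30 s ceiling.
print(next_expiry_ms(0, 2000, ttl_ms=3000, hard_ttl_ms=30000))   # 5000
# Heartbeat at t=29000: the ceiling wins.
print(next_expiry_ms(0, 29000, ttl_ms=3000, hard_ttl_ms=30000))  # 30000
# No hard_ttl_ms: the heartbeat cannot push expiry past the first window.
print(next_expiry_ms(0, 2000, ttl_ms=3000))                      # 3000
```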


Namespaces

If you're running multiple systems against the same database, namespaces keep your coordination keys isolated.

sentinel = Sentinel(
    get_conn=get_conn,
    namespace="payments"
)

Tradeoffs

Sentinel makes specific choices that won't suit everyone.

PostgreSQL only. The coordination layer runs on PostgreSQL. If you need Redis-backed coordination or want to avoid adding DB load for execution state, Sentinel isn't the right fit today. Redis support is on the roadmap.

Explicit over automatic. Uncertain states are surfaced, not resolved for you. This is a feature for correctness-sensitive systems and friction for everything else.

Python only. No Go client, no multi-language support yet. If your workers are polyglot, you'll need a different solution or a coordination service layer in front of Sentinel. A Go client is on the roadmap.

No built-in retries. Sentinel coordinates execution. It doesn't implement retry logic, backoff, or dead-letter queues. You bring those or compose them yourself.

Not a queue. Sentinel doesn't dispatch work or schedule tasks. It coordinates execution of work you've already routed to a worker.
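
Because retries and backoff are left to you, one thing you may want to compose yourself is waiting while another live worker is still executing. The sketch below backs off exponentially in that one situation and returns every other outcome, including the uncertain one, untouched. `once_with_wait` and its parameters are hypothetical, and it assumes calling `sentinel.once()` again for the same key is how you observe a finished execution, as described in the Once API section.

```python
import time
from typing import Any, Callable


def once_with_wait(
    sentinel: Any,
    key: str,
    fn: Callable[[], Any],
    attempts: int = 5,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call sentinel.once(), backing off while another live worker executes.

    This never retries an uncertain outcome; it only waits out contention.
    """
    result = None
    for attempt in range(attempts):
        result = sentinel.once(key=key, fn=fn, ttl_ms=3000, hard_ttl_ms=30000)
        in_flight = (
            not result.success
            and not result.cached
            and result.status == "executing"
            and result.execution_alive
        )
        if not in_flight:
            # Completed, cached, or uncertain: hand it back to the caller.
            return result
        # Exponential backoff: base, 2x base, 4x base, ...
        sleep(base_delay_s * (2 ** attempt))
    return result
```

The `sleep` parameter exists so the delay can be replaced in tests; in production the default `time.sleep` is used.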


Known Failure Boundaries

Sentinel intentionally prevents automatic re-execution once work has crossed the execution boundary.

If a worker enters the executing state and then crashes, freezes, loses heartbeat authority, or disappears before canonical completion occurs, Sentinel will not automatically restart the work, even after the lease expires.

This is intentional.

At that point, Sentinel cannot safely determine whether the side effect:

  • fully completed,
  • partially completed,
  • or never completed at all.

Instead of risking duplicate execution, Sentinel preserves the execution state and requires explicit reconciliation.

This creates an important tradeoff:

  • Sentinel prevents overlapping or duplicate authoritative execution
  • But uncertain execution outcomes may require reconciliation logic before progress can continue

This is why expired executing states surface as reconciliation-required rather than automatically resetting back to claimed.

Sentinel chooses correctness of execution authority over automatic replay.


Project Status

Sentinel is early-stage software under active development. The core execution semantics are stabilizing, but APIs and reconciliation flows may evolve as the project matures.


Roadmap

  • Retry support with configurable backoff
  • Redis-backed coordination
  • Async support
  • Append-only execution logs
  • Stronger reconciliation tooling
  • Metrics and observability hooks

License

MIT
