Lightweight DB-backed coordination primitive with leases and fencing tokens
Sentinel
Distributed execution is hard to get right. Workers crash mid-flight. Retries overlap. Processes freeze while holding a lock. Side effects partially succeed and leave you guessing.
Most tools respond to this by pretending it isn't a problem — they retry silently, hide uncertainty, and hope the work was idempotent. Sentinel doesn't. It gives you a coordination layer built around an honest model of what can go wrong, and explicit tools for handling it when it does.
At its core, Sentinel is a PostgreSQL-backed execution primitive that guarantees one active execution generation at a time, rejects stale workers with fencing tokens, and surfaces uncertain outcomes instead of burying them.
Philosophy
The dominant pattern in distributed task execution is optimistic: assume work is safe to retry, hide failures behind automatic replays, and let the application figure out the mess when duplicates show up downstream.
That works until it doesn't. And when it doesn't, you're debugging a payment that charged twice, an invoice that sent three times, or a downstream system in an inconsistent state you can't easily reconstruct.
Sentinel starts from a different assumption: some work is not safe to replay, and your coordination layer should know the difference.
When a worker crashes mid-execution, Sentinel doesn't guess. It marks the execution state as uncertain and hands that back to you. You decide whether to reset and retry, force-complete, or escalate. That's not a limitation — that's the correct behavior for correctness-sensitive systems.
A few things Sentinel will never do:
- Silently replay work it can't verify completed
- Pretend uncertainty doesn't exist to give you a cleaner API
- Guarantee something it can't actually guarantee
If that trade-off doesn't fit your use case, because your work is truly idempotent and automatic retries are fine, Sentinel may be more ceremony than you need. It's worth being honest about that.
What Sentinel Is Good At
- Payment processing and financial operations
- Webhook ingestion and deduplication
- Distributed task ownership across competing workers
- Long-running jobs where you need heartbeat-backed liveness
- Workflows where the cost of a duplicate is higher than the cost of a manual reconciliation
What Sentinel Is Not
- A general-purpose task queue (use Celery, Dramatiq, or similar)
- A distributed transaction system
- A guarantee against duplicate side effects in downstream services
- A replacement for idempotency keys at the API layer
Sentinel coordinates execution. What happens inside that execution (whether your database write is transactional, whether your API call is idempotent) is still your responsibility.
Where Sentinel Fits
Sentinel lives at the boundary between work arriving and work executing, after your queue or stream delivers an event, and before your code touches the outside world.
```
Kafka / SQS / Flink
        ↓
event delivered to your worker
        ↓
Sentinel   ← coordination happens here
        ↓
side effect runs (charge card, send email, write DB, call API)
```
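Kafka's exactly-once semantics cover reading, processing, and writing records within Kafka itself. They cannot guarantee exactly-once execution of what your consumer does next, such as charging a card or calling an external API. Sentinel is built for that gap.
If your worker crashes after Kafka commits the offset but while the payment is in flight, Kafka considers the job done. Sentinel still records the execution as executing and surfaces it for reconciliation, which is how you catch it.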
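One practical consequence of this placement: the key you pass to Sentinel should come from the event's business identity, not from the delivery, so a redelivered message maps to the same execution instead of starting a new one. A minimal sketch, assuming a hypothetical event dict with an order_id field and a hypothetical charge callback; only the sentinel.once(key=..., fn=..., ttl_ms=..., hard_ttl_ms=...) call shape is taken from this README.

```python
from typing import Any, Callable


def execution_key(event: dict) -> str:
    # Hypothetical: derive the key from the business identity of the event.
    # A redelivered message carries the same order_id, so it maps to the
    # same Sentinel key instead of a fresh execution.
    return f"payment-order-{event['order_id']}"


def handle_event(sentinel: Any, event: dict, charge: Callable[[dict], dict]):
    # Coordination happens here: after delivery, before the side effect.
    return sentinel.once(
        key=execution_key(event),
        fn=lambda: charge(event),  # the side effect only runs inside once()
        ttl_ms=3000,
        hard_ttl_ms=30000,
    )
```

A key built from something per-delivery (a random UUID, the Kafka offset) would defeat the coordination, because every redelivery would look like new work.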
Why Not Just Use...
Temporal
Temporal is a full workflow engine. It manages retries, timelines, activity state, and long-running saga orchestration. It's powerful and the right tool for complex multi-step workflows.
Sentinel is not that. It's a single primitive — coordinate execution of one unit of work, surface the outcome honestly. No workflow DSL, no activity workers, no server to operate. If you're already running Temporal, Sentinel is probably redundant. If you just need to ensure a payment handler doesn't double-execute, Temporal is a lot of infrastructure for a narrow problem.
Kafka
Kafka is a durable distributed log. It solves delivery and ordering. It does not solve execution. Sentinel is what you reach for after Kafka has done its job: when the message is in your worker and you need to coordinate what happens next.
etcd / ZooKeeper
Both are distributed coordination systems built for infrastructure concerns: leader election, cluster membership, and service discovery. They're typically run as part of your platform rather than called directly from application code. Using etcd for execution coordination means building the lease model, fencing tokens, and execution state tracking yourself on top of a general-purpose primitive. Sentinel is that layer already built, opinionated, and pointed at application-level execution rather than infrastructure coordination.
Redis (SETNX / Redlock)
Redis-based locking is common and fast. It also has well-documented failure modes: Redlock in particular has drawn serious criticism from distributed systems practitioners over its behavior under clock skew and network partitions. More importantly, Redis locks give you mutual exclusion but not execution state. You still have to model claimed vs executing vs completed yourself, and you still have to handle the uncertain outcome when a lock expires mid-execution. Sentinel does all of that. Redis support is on the roadmap as a backend option, but the coordination semantics will remain the same.
The honest version: Sentinel is an opinionated, lightweight primitive that makes one specific bet — that explicit uncertainty handling is worth more than automatic retries for correctness-sensitive work. If that bet fits your problem, it's significantly less infrastructure than the alternatives. If it doesn't, use something else.
Installation
```bash
pip install sentinel-coordination
```
Requires Python 3.9+ and a PostgreSQL database.
Database Setup
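```python
import psycopg
from sentinel import init_db

# psycopg 3: leaving the with-block commits (if no error) and closes the connection.
with psycopg.connect("postgresql://postgres:postgres@localhost/testdb") as conn:
    init_db(conn)
```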
This creates the coordination tables Sentinel needs. Safe to run multiple times.
Getting Started
```python
import psycopg
from sentinel import Sentinel

def get_conn():
    # Return a new psycopg connection to your coordination database.
    return psycopg.connect("postgresql://postgres:postgres@localhost/testdb")

sentinel = Sentinel(
    get_conn=get_conn,
    default_ttl_ms=3000,
)
```
The Once API
sentinel.once() is the primary interface. Given a key and a function, it guarantees that function runs at most once per key across any number of competing workers and returns the cached result to anyone else who asks.
```python
def process_payment():
    # charge_card stands in for your own payment call (the side effect).
    charge_card(amount=99_00, customer_id="cus_abc")
    return {"ok": True, "payment_id": "pay_123"}

result = sentinel.once(
    key="payment-order-789",
    fn=process_payment,
    ttl_ms=3000,
    hard_ttl_ms=30000,
)
```
Reading the result
```python
if result.success:
    # Execution completed. result.response has your return value.
    print(result.response)
elif result.cached:
    # A previous worker already completed this. Same result, no re-execution.
    print("Already done:", result.response)
elif result.status == "executing" and result.execution_alive:
    # Another worker holds the execution and is still heartbeating.
    print("Execution currently in progress")
elif result.status == "executing" and not result.execution_alive:
    # A worker claimed this and hasn't finished. We don't know the outcome.
    # Don't retry blindly. Read the reconciliation section below.
    print("Execution outcome uncertain: reconciliation required")
```
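If several call sites need the same decision, the branches above can be folded into one helper that returns an action instead of printing. A minimal sketch: the attribute names (success, cached, response, status, execution_alive) come from the example above, while the action strings and the fallback branch for other statuses are assumptions for illustration.

```python
from typing import Any


def next_action(result: Any) -> str:
    # Same branch order as above: fresh success first, then a cached result.
    if result.success:
        return "done"
    if result.cached:
        return "done (cached)"
    if result.status == "executing":
        # Another worker is still heartbeating: wait or poll later.
        if result.execution_alive:
            return "wait"
        # Claimed, entered execution, then went silent: outcome unknown.
        return "reconcile"
    # Hypothetical fallback for any other status (e.g. "reconciling"):
    # surface it instead of retrying blindly.
    return "escalate"
```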
Why result.status == "executing" matters
The uncertain case (status "executing" with execution_alive false) is the state most systems hide from you. It surfaces when a worker claimed execution, entered the side-effect zone, and then disappeared: it crashed, froze, or timed out. The work may have completed. It may have half-completed. Sentinel doesn't know, and it won't pretend otherwise.
What you do with that is up to you. That's the point.
Execution States
Every execution tracked by Sentinel is in one of four states:
| State | Meaning |
|---|---|
| claimed | Work has been claimed. Execution hasn't started. Safe to reset and retry. |
| executing | Execution has started. Side effects may be in flight. Replay is potentially unsafe. |
| completed | Execution finished. Result is cached and reusable. |
| reconciling | Execution entered recovery mode. Automatic progress is blocked until reconciliation resolves execution truth. |
The claimed → executing transition is the important one. Before that boundary, a reset is safe. After it, you're in uncertain territory and Sentinel will tell you so.
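The lifecycle can be modeled as a small transition table. This is a sketch of the transitions described in this README (starting execution, completing it, entering reconciliation, and the two ways out of reconciliation), not Sentinel's internal implementation; the event names are hypothetical.

```python
# Transitions as described in this README; event names are hypothetical.
TRANSITIONS = {
    ("claimed", "start"): "executing",              # crossing the side-effect boundary
    ("executing", "complete"): "completed",         # canonical completion, result cached
    ("executing", "reconcile"): "reconciling",      # outcome uncertain, enter recovery
    ("reconciling", "force_complete"): "completed", # verified externally
    ("reconciling", "reset_to_claimed"): "claimed", # cleared for retry
}


def transition(state: str, event: str) -> str:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed from {state!r}") from None


def safe_to_retry(state: str) -> bool:
    # Only before the claimed -> executing boundary is a reset known to be safe.
    return state == "claimed"
```

Note that there is no edge from executing back to claimed: the only way back is through reconciling, which is the point of the design.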
Reconciliation
When execution ends up in an uncertain state, Sentinel gives you explicit tools to resolve it rather than forcing a guess.
```python
# Step 1: enter the reconciling state. force_complete and
# reset_to_claimed can only be used while in reconciling.
sentinel.reconcile.reconcile(key="payment-order-789")

# Step 2a: you verified externally that the side effect happened,
# so record the known result as the completed response.
sentinel.reconcile.force_complete(key="payment-order-789", response={"ok": True})

# Step 2b (instead of 2a): move back to claimed so the work can be retried.
# Use this when the side effect did not happen or a retry is acceptable.
sentinel.reconcile.reset_to_claimed(key="payment-order-789")
```
The typical reconciliation pattern:
- Detect status == "executing" on a result
- Use reconcile to start reconciliation
- Check your downstream system (did the payment go through?)
- If yes: force_complete with the known result
- If no or unknown: reset_to_claimed and let it retry
This is more work than a silent retry. It's also the only approach that doesn't risk charging a customer twice.
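The pattern can be wrapped in a small helper. A minimal sketch: resolve_uncertain and the check_downstream callback (for example, a lookup against your payment provider) are hypothetical; only the three sentinel.reconcile calls come from this README. One deliberate difference from the list above: this variant leaves genuinely unknown outcomes in reconciling for escalation instead of resetting them, which is the more conservative choice when a duplicate charge is worse than a delay.

```python
from typing import Any, Callable, Optional, Tuple

# Hypothetical downstream check for a Sentinel key. Returns
# ("charged", result), ("not_charged", None) or ("unknown", None).
DownstreamCheck = Callable[[str], Tuple[str, Optional[dict]]]


def resolve_uncertain(sentinel: Any, key: str, check_downstream: DownstreamCheck) -> str:
    # Enter reconciling first; force_complete and reset_to_claimed require it.
    sentinel.reconcile.reconcile(key=key)

    outcome, result = check_downstream(key)
    if outcome == "charged":
        # Verified externally: record the known result as the response.
        sentinel.reconcile.force_complete(key=key, response=result)
        return "completed"
    if outcome == "not_charged":
        # Verified the side effect never happened: let the work run again.
        sentinel.reconcile.reset_to_claimed(key=key)
        return "claimed"
    # Still unknown: stay in reconciling and escalate to a human.
    return "escalate"
```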
Leases
If you need lower-level coordination without the full execution lifecycle, the lease API gives you a distributed mutex with heartbeat renewal and fencing token protection.
```python
def process_invoice():
    with sentinel.lease(
        key="invoice-123",
        ttl_ms=3000,
        hard_ttl_ms=30000,
    ) as lease:
        if lease is None:
            print("Already held by another worker")
            return
        # Lease is held. Heartbeats renew it automatically up to hard_ttl_ms.
        do_work()  # your own function
```
Leases are useful when you want coordination without tracking execution state: for example, ensuring only one worker runs a polling loop at a time.
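The polling-loop case might look like this. A minimal sketch, assuming the lease context manager behaves as in the example above (yielding None when another worker holds the key); the key "invoice-poller", the poll callback, and the interval are hypothetical.

```python
import time
from typing import Any, Callable


def poll_once(sentinel: Any, poll: Callable[[], None]) -> bool:
    # Only the worker that wins the lease runs this polling cycle.
    with sentinel.lease(key="invoice-poller", ttl_ms=3000, hard_ttl_ms=30000) as lease:
        if lease is None:
            return False  # another worker holds the lease; skip this cycle
        poll()
        return True


def run_poller(sentinel: Any, poll: Callable[[], None], interval_s: float = 5.0) -> None:
    # Every worker runs this loop; at most one polls per cycle.
    while True:
        poll_once(sentinel, poll)
        time.sleep(interval_s)
```

Keeping each cycle short relative to hard_ttl_ms means a live worker never needs the lease longer than the ceiling allows.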
Fencing Tokens
Every lease acquisition generates a monotonically increasing fencing token. Sentinel uses it to reject stale workers: if a worker pauses (GC, network partition, slow disk) and comes back after its lease has expired and been re-acquired by someone else, its operations are rejected.
This protects against a class of bugs that are easy to miss: the worker that thinks it still holds the lease but doesn't.
Fencing tokens are only effective if downstream state transitions validate them.
If your execution modifies shared state outside Sentinel — for example updating a database row, processing a workflow step, or mutating application-owned execution state — you should include the fencing token in the write condition.
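Example, assuming your payments table stores the highest fencing token it has accepted in a fencing_token column:
```sql
UPDATE payments SET status = 'completed', fencing_token = %(token)s
WHERE payment_id = %(payment_id)s AND (fencing_token IS NULL OR fencing_token <= %(token)s);
```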
This prevents stale workers from overwriting newer authoritative execution generations.
Sentinel enforces fencing internally for lease coordination and canonical execution completion, but downstream systems must also participate in fencing validation if they maintain mutable execution state.
This is a necessary distributed systems practice whenever execution authority can change over time.
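To see the conditional write reject a stale worker, here is a self-contained sketch of the same pattern using Python's built-in sqlite3 module as a stand-in for PostgreSQL (placeholders are ? instead of %s). The payments table, its fencing_token column, and the token values are hypothetical; in a real system the token would come from the lease that granted execution authority.

```python
import sqlite3

# Accept the write only if no newer generation has already written this row.
FENCED_UPDATE = """
    UPDATE payments
    SET status = ?, fencing_token = ?
    WHERE payment_id = ?
      AND (fencing_token IS NULL OR fencing_token <= ?)
"""


def fenced_update(conn: sqlite3.Connection, payment_id: str, status: str, token: int) -> bool:
    cur = conn.execute(FENCED_UPDATE, (status, token, payment_id, token))
    conn.commit()
    # rowcount is 0 when the row already carries a newer token.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (payment_id TEXT PRIMARY KEY, status TEXT, fencing_token INTEGER)"
)
conn.execute("INSERT INTO payments VALUES ('pay_123', 'pending', NULL)")

# Worker A held token 7, then paused. Worker B re-acquired with token 8.
newer = fenced_update(conn, "pay_123", "completed", token=8)  # worker B: applied
stale = fenced_update(conn, "pay_123", "failed", token=7)     # worker A wakes up: rejected
```

The IS NULL check lets the first writer through on a fresh row; without it, NULL <= 8 evaluates to NULL and even the legitimate write would be dropped.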
TTL and Hard TTL
```python
sentinel.once(
    key="...",
    fn=fn,
    ttl_ms=3000,        # Heartbeat interval and lease window
    hard_ttl_ms=30000,  # Absolute maximum lifetime of this execution
)
```
ttl_ms controls how often the heartbeat needs to renew the lease. hard_ttl_ms is the ceiling: no matter how healthy the heartbeat, execution cannot extend past this point.
For short work, they can be equal. For long-running jobs, use a short ttl_ms to detect dead workers quickly and a large hard_ttl_ms to give live workers room to finish.
If you omit hard_ttl_ms, it defaults to ttl_ms, which means heartbeat renewal can't extend the lease past its initial window. This is intentional: for long-running work, an explicit ceiling is better than surprising behavior.
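A simple model of how the two values interact, assuming the soft deadline is the last heartbeat plus ttl_ms and the ceiling is the acquisition time plus hard_ttl_ms. This models the behavior described above; it is not Sentinel's code, and lease_deadline_ms is a hypothetical name.

```python
from typing import Optional


def lease_deadline_ms(
    started_at_ms: int,
    last_heartbeat_ms: int,
    ttl_ms: int,
    hard_ttl_ms: Optional[int] = None,
) -> int:
    # If hard_ttl_ms is omitted, it falls back to ttl_ms.
    if hard_ttl_ms is None:
        hard_ttl_ms = ttl_ms
    # Each heartbeat pushes the soft deadline forward by ttl_ms...
    soft = last_heartbeat_ms + ttl_ms
    # ...but never past the absolute ceiling fixed at acquisition time.
    hard = started_at_ms + hard_ttl_ms
    return min(soft, hard)
```

With ttl_ms=3000 and hard_ttl_ms=30000, a worker that stops heartbeating is detected within about 3 seconds, while a healthy one can run for up to 30. With hard_ttl_ms omitted, a worker heartbeating at 2000 ms still expires at 3000 ms.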
Namespaces
If you're running multiple systems against the same database, namespaces keep your coordination keys isolated.
```python
sentinel = Sentinel(
    get_conn=get_conn,
    namespace="payments",
)
```
Tradeoffs
Sentinel makes specific choices that won't suit everyone.
PostgreSQL only. The coordination layer runs on PostgreSQL. If you need Redis-backed coordination or want to avoid adding DB load for execution state, Sentinel isn't the right fit today. Redis support is on the roadmap.
Explicit over automatic. Uncertain states are surfaced, not resolved for you. This is a feature for correctness-sensitive systems and friction for everything else.
Python only. There's no Go client and no multi-language support yet. If your workers are polyglot, you'll need a different solution or a coordination service layer in front of Sentinel. A Go client is on the roadmap.
No built-in retries. Sentinel coordinates execution. It doesn't implement retry logic, backoff, or dead-letter queues. You bring those or compose them yourself.
Not a queue. Sentinel doesn't dispatch work or schedule tasks. It coordinates execution of work you've already routed to a worker.
Known Failure Boundaries
Sentinel intentionally prevents automatic re-execution once work has crossed the execution boundary.
If a worker enters the executing state and then crashes, freezes, loses heartbeat authority, or disappears before canonical completion occurs, Sentinel will not automatically restart the work, even after the lease expires.
This is intentional.
At that point, Sentinel cannot safely determine whether the side effect:
- fully completed,
- partially completed,
- or never completed at all.
Instead of risking duplicate execution, Sentinel preserves the execution state and requires explicit reconciliation.
This creates an important tradeoff:
- Sentinel prevents overlapping or duplicate authoritative execution
- But uncertain execution outcomes may require reconciliation logic before progress can continue
This is why expired executing states surface as reconciliation-required rather than automatically resetting back to claimed.
Sentinel chooses correctness of execution authority over automatic replay.
Project Status
Sentinel is early-stage software under active development. The core execution semantics are stabilizing, but APIs and reconciliation flows may evolve as the project matures.
Roadmap
- Retry support with configurable backoff
- Redis-backed coordination
- Async support
- Append-only execution logs
- Stronger reconciliation tooling
- Metrics and observability hooks
License
MIT
File details

Details for the file sentinel_coordination-0.2.0.tar.gz.

File metadata
- Download URL: sentinel_coordination-0.2.0.tar.gz
- Size: 18.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 51a67bad4de1526e1e728f536fec49af3c60b1decc87a464999cb82f6ccce8d2 |
| MD5 | a771a947a5fa93eae0ab626a1eb0c4b2 |
| BLAKE2b-256 | 8b5499fe9a70601b95abc08fdc933d75c2f938693c6c376bad8b26ae24921615 |

Details for the file sentinel_coordination-0.2.0-py3-none-any.whl.

File metadata
- Download URL: sentinel_coordination-0.2.0-py3-none-any.whl
- Size: 15.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 8d35a706a27f8193605ff815c11eb1f934c4f007e7894a203320f3c0cd1c6975 |
| MD5 | 75f913ffb480443e333bee677701e7b9 |
| BLAKE2b-256 | 96059ab86f9671689642db8364125ccaa3d788a303715b1350c6c5db2b7610d4 |