Reliability layer for Celery. Zero job loss with task resurrection, idempotency, and graceful shutdown.

These details have not been verified by PyPI

Project description

Relier

Your Celery workers will crash tonight. Your tasks should still complete.

Relier makes Celery reliable. One decorator wraps your existing tasks with crash recovery, exactly-once execution, two-tier timeouts, graceful shutdown, admission control, and a Dead Letter Queue (DLQ), without changing your function bodies or your Redis broker.

Every task either completes, hands off to another worker, or lands in the Dead Letter Queue with a traceable reason. Nothing silently disappears.


What changes

Vanilla Celery:

@celery_app.task
def charge_customer(customer_id: str, amount_cents: int):
    return stripe.charge(customer_id, amount_cents)

charge_customer.delay("cus_abc", 5000)
# - Worker dies mid-charge      -> task lost
# - Network blip causes retry   -> customer charged twice
# - Stripe hangs                -> task hangs the worker forever
# - Traffic spike               -> queue floods, cascade failure

With Relier (same function, four added kwargs):

from relier.tasks.decorator import rl_task

@rl_task(
    queue="high_priority",
    idempotent=True,        # exactly-once via atomic Redis Lua
    soft_timeout=8,         # cleanup hook fires at 8s
    hard_timeout=10,        # cancelled at 10s
)
async def charge_customer(customer_id: str, amount_cents: int):
    return await stripe.charge(customer_id, amount_cents)

await charge_customer.apush("cus_abc", 5000)
# - Worker dies     -> Phoenix re-queues within ~9 s (p99), same args; idempotency
#                      stops a double-charge
# - Network blip    -> cached result returned, no second charge
# - Stripe hangs    -> cancelled at 10s, quarantined to DLQ with full payload
# - Traffic spike   -> AdmissionRejectedError with Retry-After, HTTP 429 ready

That's the entire migration. Your function's logic doesn't change; in the example above it only becomes async def so it can await the Stripe call. Your call site swaps .delay(...) for await task.apush(...) (async) or task.push(...) (sync, for Flask / Django views / scripts).
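The soft_timeout / hard_timeout pair in the example above can be pictured with plain asyncio. This is a minimal sketch of the two-tier idea, not Relier's implementation: run_with_two_tier_timeout, on_soft_timeout, and demo are hypothetical names, and the real decorator may wire the cleanup hook differently.

```python
import asyncio


async def run_with_two_tier_timeout(coro_fn, soft_timeout, hard_timeout, on_soft_timeout):
    """Run coro_fn(); fire on_soft_timeout at the soft deadline, cancel at the hard one."""
    task = asyncio.ensure_future(coro_fn())
    # Phase 1: wait up to the soft deadline without cancelling anything.
    done, _ = await asyncio.wait({task}, timeout=soft_timeout)
    if not done:
        on_soft_timeout()  # cleanup hook; the task itself keeps running
        # Phase 2: wait out the remaining time until the hard deadline.
        done, _ = await asyncio.wait({task}, timeout=hard_timeout - soft_timeout)
    if not done:
        task.cancel()  # hard deadline reached: cancel the coroutine
        try:
            await task
        except asyncio.CancelledError:
            pass
        return ("cancelled", None)
    return ("completed", task.result())


async def demo():
    events = []

    async def slow():  # stands in for a hung upstream call
        await asyncio.sleep(1.0)
        return "charged"

    async def fast():
        await asyncio.sleep(0.01)
        return "charged"

    hung = await run_with_two_tier_timeout(slow, 0.05, 0.1, lambda: events.append("soft"))
    ok = await run_with_two_tier_timeout(fast, 0.05, 0.1, lambda: events.append("soft"))
    return hung, ok, events
```

Running asyncio.run(demo()) cancels the slow call after the hard deadline and fires the soft hook once; the fast call completes without touching the hook.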

What Relier solves

Problem	Vanilla Celery	With Relier
Worker OOM-killed mid-task	Lost forever, no trace	Phoenix re-queues within ~9 s (p99)
Non-idempotent retries	Your problem to solve	`idempotent=True`: atomic Lua, exactly-once
No task timeouts	Zombie tasks block workers	Two-tier soft/hard timeout with cleanup hooks
Ungraceful deploys	~40% of in-flight tasks silently lost	SIGTERM drain + handoff to other workers
No visibility	`celery inspect`, then squint	`rl tasks inflight --follow`, structured output
Traffic spikes	Queue floods, cascade failures	Atomic admission control, `Retry-After`
Poison-pill tasks	Crash workers forever	Quarantined to DLQ after `max_resurrections`
Schema drift on rolling deploy	Old payloads on new code fail silently	Versioned envelope + sequential migrations

All eight covered. Same Celery programming model. Same Redis broker. No new infrastructure to operate beyond what you already have.
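To make the "Worker OOM-killed" and "Poison-pill" rows concrete, here is a rough in-memory sketch of heartbeat-based resurrection with a max_resurrections cap. It is an illustration under stated assumptions, not Relier's code: the Resurrector class, its field names, and the heartbeat_ttl parameter are hypothetical, and the real resurrector works against Redis rather than a Python dict.

```python
from dataclasses import dataclass, field


@dataclass
class InFlightTask:
    task_id: str
    last_heartbeat: float
    resurrections: int = 0


@dataclass
class Resurrector:
    heartbeat_ttl: float      # seconds of silence before a task counts as dead
    max_resurrections: int    # after this many re-queues, quarantine instead
    inflight: dict = field(default_factory=dict)
    requeued: list = field(default_factory=list)
    dlq: list = field(default_factory=list)

    def heartbeat(self, task_id, now):
        """Record that the worker running task_id is still alive."""
        task = self.inflight.setdefault(task_id, InFlightTask(task_id, now))
        task.last_heartbeat = now

    def sweep(self, now):
        """Re-queue tasks whose heartbeat went stale; quarantine repeat offenders."""
        for task in list(self.inflight.values()):
            if now - task.last_heartbeat <= self.heartbeat_ttl:
                continue  # heartbeat is fresh, nothing to do
            if task.resurrections >= self.max_resurrections:
                del self.inflight[task.task_id]
                self.dlq.append((task.task_id, "max_resurrections exceeded"))
            else:
                task.resurrections += 1
                task.last_heartbeat = now  # grace period for the next worker
                self.requeued.append(task.task_id)
```

A task that keeps killing its worker is re-queued max_resurrections times and then lands in the DLQ with a reason, rather than looping forever.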

What Relier is and is not

Relier is a thin wrapper around Celery, not a replacement for it.

You keep your workers (celery -A relier.tasks.app worker), your Redis broker, your queue names, your @task intuition. Relier adds a lifecycle layer on top: heartbeat tracking, resurrection, idempotency, timeouts, graceful shutdown. Your function bodies don't change. Your infrastructure doesn't change. You add one decorator, switch .delay() to .push(), and you're done.

Relier is not Temporal or Hatchet.

Temporal and Hatchet are workflow engines. They model multi-step workflows with deterministic replay, activity retries across process restarts, and saga compensation. That's a fundamentally different problem and a fundamentally different programming model. If you need long-running workflows spanning hours, human approval steps, or saga rollbacks, use one of those.

Relier is for teams that already have Celery tasks and want them to stop disappearing. No workflow model. No deterministic replay. No new service to operate. Same Redis you already have.

Relier is not a DAG runner.

Prefect, Airflow, Dagster, and Luigi schedule and orchestrate pipelines of dependent tasks. They have UIs, schedulers, and retry policies baked into a pipeline definition. Relier has none of that.

Relier makes individual Celery tasks reliable. What those tasks do, when they run, and how they depend on each other is still your problem and Celery's.

vs. building it yourself. Most teams write some subset of this: an idempotency table, sometimes a heartbeat-based resurrector, occasionally a DLQ. The pieces are individually well-understood. Composing them correctly (fence tokens for the GC-pause-victim case, AOF + noeviction preflight checks, thundering-herd defences on resurrection batches) is what Relier exists to spare you from. The chaos suite ships first-party so you can verify the guarantees hold on your own cluster, not just trust ours.
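The GC-pause-victim case is the easiest of these to get wrong, so a small sketch may help. Assume worker A stalls, its task is re-leased to worker B, and A later wakes up and tries to write its result. A fence token lets the store reject the stale write. FencedStore and its methods are hypothetical names for illustration; Relier's actual protocol lives in Redis and may differ in detail.

```python
class FencedStore:
    """In-memory stand-in for a result store that rejects stale writers."""

    def __init__(self):
        self._next_token = 0
        self._highest_issued = {}  # task_id -> newest fence token handed out
        self.results = {}

    def acquire_lease(self, task_id):
        """Hand out a strictly increasing fence token with each new lease."""
        self._next_token += 1
        self._highest_issued[task_id] = self._next_token
        return self._next_token

    def commit(self, task_id, token, result):
        """Accept the write only if no newer lease has been issued since."""
        if token < self._highest_issued.get(task_id, 0):
            return False  # stale worker, e.g. one that woke up after a long pause
        self.results[task_id] = result
        return True


store = FencedStore()
token_a = store.acquire_lease("job-1")    # worker A starts, then stalls
token_b = store.acquire_lease("job-1")    # task is re-queued to worker B
late_write = store.commit("job-1", token_a, "from-A")
fresh_write = store.commit("job-1", token_b, "from-B")
```

Here late_write is rejected and fresh_write is accepted, so only worker B's result is stored.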

Install

pip install relier

Requirements: Python 3.11+, Redis 7+ with AOF persistence and maxmemory-policy noeviction. Relier preflight-checks both and refuses to start if either is wrong.
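If you want to check the two Redis settings yourself before installing, the logic is small. The sketch below is an assumption about what such a check looks like, not Relier's preflight code; preflight_problems is a hypothetical helper that takes a dict shaped like the reply to Redis CONFIG GET (for example, what redis-py's config_get returns).

```python
def preflight_problems(redis_config):
    """Return a list of human-readable problems; an empty list means the check passes.

    redis_config is a dict shaped like a Redis `CONFIG GET` reply, e.g.
    {"appendonly": "yes", "maxmemory-policy": "noeviction"}.
    """
    problems = []
    if redis_config.get("appendonly") != "yes":
        problems.append("AOF persistence is off (expected appendonly=yes)")
    if redis_config.get("maxmemory-policy") != "noeviction":
        problems.append("maxmemory-policy must be noeviction")
    return problems


good = preflight_problems({"appendonly": "yes", "maxmemory-policy": "noeviction"})
bad = preflight_problems({"appendonly": "no", "maxmemory-policy": "allkeys-lru"})
```

With these inputs, good is empty and bad lists both problems.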

Quickstart

# tasks.py
from relier.tasks.decorator import rl_task

# charge_card and email_invoice stand in for your own async helpers.
@rl_task(idempotent=True, hard_timeout=30)
async def send_invoice(invoice_id: str) -> dict:
    await charge_card(invoice_id)
    await email_invoice(invoice_id)
    return {"invoice_id": invoice_id}

# main.py (FastAPI)
from fastapi import FastAPI

from tasks import send_invoice

app = FastAPI()

@app.post("/invoices/{invoice_id}/send")
async def dispatch(invoice_id: str):
    await send_invoice.apush(invoice_id)
    return {"status": "queued"}

# Three processes - bare metal, no Docker required
celery -A relier.tasks.app worker -l info -Q high_priority,default,low_priority,re-queue
rl run-resurrector
uvicorn main:app

Or get the full stack (Redis + workers + resurrector + OTel + Grafana):

make dev          # docker-compose.yml, single-node Redis with AOF
make prod         # docker-compose.prod.yml, Redis HA with Sentinel + backup

Full quickstart: docs/quickstart.md.

Verify it works (chaos suite, first-party)

# Seed a long-running task, SIGKILL the worker that's running it,
# watch Phoenix re-queue it onto a healthy worker, live.
rl chaos worker-kill --seed --watch --watch-duration 60

Five chaos scenarios ship with Relier: worker-kill, network-partition, load-spike, task-corrupt, slow-task. They let you prove the reliability claims against your own cluster, your own task code, your own Redis. Most projects ship a test suite; Relier also ships a chaos suite.

Full guide: docs/chaos-guide.md.

Performance

Measured by the built-in bench suite (docker compose -f docker-compose.bench.yml up --build) on Linux with prefork workers and synthetic 0.5 s tasks. All claims are verified end-to-end, not by microbenchmarks against a mock.

Numbers below: Relier v0.1.0, captured 2026-05-25 against commit 41884c5. Re-run with make bench-docker to compare on your hardware.

Linux (Docker, python:3.11-slim, prefork=4) | Redis 7.2 AOF | 500 tasks × 5 kills

Metric                              Relier 0.1            Vanilla Celery
--------------------------------------------------------------------------
Task delivery rate (5 SIGKILL)      100%                  92.0%
OOM recovery avg / p99              7.3 s / 9.4 s         ∞ lost
Dual-OOM (2 concurrent tasks)       2/2 · 7.5 s           both lost
Idempotency (50 submissions)        1 execution           50 executions
Admission control p99 / max         0.763 ms / 1.72 ms    n/a
Graceful shutdown (3 cycles)        100%                  0%
Dispatch overhead (net avg)         +2.28 ms              n/a
File descriptor leak                Δ 0 (stable)          n/a
--------------------------------------------------------------------------

+2.28 ms per dispatch pays for: atomic admission check, SHA-256-signed envelope wrap, heartbeat registration. On any task that does real work (a DB query, an HTTP call, an AI inference), this is invisible.

At 3.1 ms average per dispatch, a single async producer sustains ~320 apush() calls/second per thread. FastAPI producers fan out well past 1,000/second.
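The throughput figure is simple arithmetic on the 3.1 ms average, spelled out below. The last step assumes concurrent dispatches scale linearly, which real systems only approximate; concurrent_needed is a hypothetical name for illustration.

```python
import math

avg_dispatch_ms = 3.1                       # average per-dispatch latency quoted above
calls_per_second = 1000 / avg_dispatch_ms   # one dispatch awaited at a time
# Assumption: concurrent dispatches scale linearly (an idealisation).
concurrent_needed = math.ceil(1000 / calls_per_second)

print(round(calls_per_second), concurrent_needed)  # 323 4
```

So roughly four dispatches in flight at once are enough, on paper, to pass 1,000 per second.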

The admission control Lua script stays under 1 ms at p99 (0.763 ms), meaning the tail-latency cost of the admission check is bounded for the vast majority of requests.
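For readers unfamiliar with fixed-window limiting, here is a minimal in-memory sketch of the decision such a script has to make, including the Retry-After value. It is an illustration only: FixedWindowLimiter and its parameters are hypothetical, and Relier performs the equivalent check atomically inside Redis.

```python
import math


class FixedWindowLimiter:
    """In-memory fixed-window limiter; a real one would run atomically in Redis."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._counts = {}  # window index -> number of admitted requests

    def admit(self, now):
        """Return (admitted, retry_after_seconds)."""
        window = int(now // self.window_seconds)
        count = self._counts.get(window, 0)
        if count < self.limit:
            self._counts = {window: count + 1}  # also drops counters for old windows
            return True, 0
        # Rejected: tell the caller how long until the current window ends.
        window_end = (window + 1) * self.window_seconds
        return False, math.ceil(window_end - now)


limiter = FixedWindowLimiter(limit=2, window_seconds=10)
decisions = [limiter.admit(t) for t in (0.0, 1.0, 3.5, 10.0)]
```

The third request is rejected with a 7-second Retry-After, and the fourth is admitted because a new window has started. A caller would typically map a rejection to an HTTP 429 response.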

[Figure: bench dashboard at end of run]

Full methodology, per-test breakdowns, and Docker Compose instructions: docs/benchmarks.md.

What's in the box

Zero job loss (Phoenix Pattern): heartbeat-based crash detection, atomic re-queue with lease + fence tokens.
Exactly-once via idempotency: atomic Redis Lua, claim/in-flight/completed states.
Two-tier timeouts: soft (cleanup hook) + hard (asyncio cancellation), enforced on async tasks.
Checkpointing: ctx.set_partial(state) in the soft-timeout hook saves progress to Redis; the next resurrection resumes from that state instead of starting over.
Graceful shutdown: SIGTERM drain phase, handoff to Phoenix for tasks that won't finish in time.
Dead Letter Queue: full payload + reason + resurrection history. CLI to inspect, release, retry, purge.
Admission control: atomic Lua-based fixed-window limiter, returns Retry-After.
SLO burn-rate tracking: 1h / 6h / 3d windows, Google SRE-style burn rates, JSON or table output.
Schema versioning: signed envelopes with sequential migrations for rolling deploys, so old and new workers can run simultaneously without payload mismatches.
Full OpenTelemetry: every lifecycle event emits spans and metrics. Bundled OTel -> Prometheus -> Grafana stack.
Redis HA out of the box: Sentinel-based failover, replicas, hourly RDB backups, optional S3 offsite.
Async-first, sync-compatible: apush for asyncio (FastAPI), push for sync code (Flask, Django, scripts).
Chaos suite: five scenarios to verify the guarantees on your cluster.
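The claim / in-flight / completed states mentioned in the idempotency item can be sketched in a few lines. This is an in-memory illustration under assumptions, not Relier's Lua script: IdempotencyStore and run_once are hypothetical names, and the sketch ignores what happens when the function raises while the key is in flight.

```python
class IdempotencyStore:
    """In-memory stand-in; in Redis the claim would be a single atomic script."""

    def __init__(self):
        self._entries = {}  # key -> ("in_flight", None) or ("completed", result)

    def claim(self, key):
        """Return ("claimed", None), ("in_flight", None) or ("completed", result)."""
        entry = self._entries.get(key)
        if entry is None:
            self._entries[key] = ("in_flight", None)
            return ("claimed", None)
        return entry

    def complete(self, key, result):
        self._entries[key] = ("completed", result)


def run_once(store, key, fn, *args):
    """Execute fn at most once per key; later callers get the cached result."""
    state, cached = store.claim(key)
    if state == "completed":
        return cached
    if state == "in_flight":
        raise RuntimeError(f"{key} is already running elsewhere")
    result = fn(*args)
    store.complete(key, result)
    return result


calls = []


def charge(customer_id, amount_cents):
    calls.append((customer_id, amount_cents))
    return {"charged": amount_cents}


store = IdempotencyStore()
first = run_once(store, "cus_abc:5000", charge, "cus_abc", 5000)
second = run_once(store, "cus_abc:5000", charge, "cus_abc", 5000)
```

Both calls return the same result, but charge runs only once; the second call is served from the completed entry.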

Full feature reference: docs/.
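As one illustration of the schema-versioning item, the sketch below wraps a payload in a signed, versioned envelope and upgrades old payloads one version at a time. Everything here is assumed for illustration: the envelope fields, the two migrations, and the use of HMAC-SHA256 for the signature are not taken from Relier's source.

```python
import hashlib
import hmac
import json

CURRENT_VERSION = 3

# Hypothetical migrations: each upgrades a payload by exactly one version.
MIGRATIONS = {
    1: lambda p: {**p, "currency": "usd"},                       # v1 -> v2
    2: lambda p: {**{k: v for k, v in p.items() if k != "amount"},
                  "amount_cents": p["amount"] * 100},            # v2 -> v3
}


def _signature(secret, version, payload):
    body = json.dumps({"v": version, "payload": payload}, sort_keys=True)
    return hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()


def wrap(secret, version, payload):
    """Build an envelope carrying the schema version and a signature."""
    return {"v": version, "payload": payload, "sig": _signature(secret, version, payload)}


def unwrap(secret, envelope):
    """Verify the signature, then migrate the payload up to CURRENT_VERSION."""
    expected = _signature(secret, envelope["v"], envelope["payload"])
    if not hmac.compare_digest(expected, envelope["sig"]):
        raise ValueError("envelope signature mismatch")
    version, payload = envelope["v"], envelope["payload"]
    while version < CURRENT_VERSION:
        payload = MIGRATIONS[version](payload)  # apply v -> v+1
        version += 1
    return payload


secret = b"demo-secret"
old_envelope = wrap(secret, 1, {"customer_id": "cus_abc", "amount": 50})
upgraded = unwrap(secret, old_envelope)
```

A v1 payload produced by an old worker passes through both migrations before new code sees it, and a tampered envelope fails the signature check instead of being migrated.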

Documentation


Quickstart	5-minute working setup
Celery Primer	If you've never used Celery
Core Concepts	What each mechanism does and why
Integration Recipes	FastAPI, Flask, Django, scripts
Patterns Cookbook	Idempotency keys, checkpoints, dedicated workers
Troubleshooting & FAQ	First place to look when things break
API Reference	Every `@rl_task` option, every dispatch method
Configuration	Every `RELIER_*` env var
CLI Reference	Every `rl` subcommand, what it touches in Redis
Deployment	Bare metal, Docker dev, Docker prod, Kubernetes
Durability & HA	What's protected against which failure mode
Architecture	Internals: async bridge, Redis keys, Lua scripts
Metrics Reference	OTel metric names and labels for dashboards
Chaos Guide	How to verify the guarantees yourself

Production status

Relier is pre-1.0. The API is stabilising but may change before 1.0. The internals (Redis key layout, Lua scripts, fence-token protocol) are production-grade and have been validated against the bundled chaos suite, including under network partitions and mass worker failure.

If you're considering it for production: read Durability & HA first, then run the chaos suite against a staging cluster that mirrors your prod setup. File issues for anything that surprises you. Those are the inputs that get the project to 1.0.

Contributing

Issues and pull requests welcome. Particularly valuable:

Real-world workloads that don't fit the current Patterns Cookbook
Failure modes the durability matrix doesn't cover
Documentation gaps you hit while integrating
Performance numbers from your environment (make bench output plus a one-line spec)

git clone https://github.com/getrelier/relier
cd relier
cp .env.example .env             # fill in your Redis URL
make setup                       # venv + dev deps + pre-commit
make test                        # unit tests
make test-integration            # integration tests against test-container Redis
make bench                       # synthetic bench smoke (no Docker, ~2 min)
make bench-docker                # full bench in Docker with Prometheus + Grafana

Open a PR against main. Quality gates: make lint check test must pass; make test-integration is recommended if you touched anything in core/ or tasks/.

Community

Issues: bugs, feature requests, and questions via the issue templates
Discussions: github.com/getrelier/relier/discussions for ideas, integrations, and show and tell
X / Twitter: @relierdev for release announcements and short-form updates
Releases: watch this repo for new releases; the changelog is in each GitHub Release

Licence

MIT. See LICENSE.

Acknowledgements

Built on Celery, Redis, asyncio, and OpenTelemetry. The Phoenix Pattern owes its name to the obvious metaphor; the fence-token approach is borrowed from Martin Kleppmann's writeups on distributed locking. The explicit-checkpoint philosophy is shared with Faust, Temporal (despite its different model), and AWS Step Functions; when production systems converge on a design choice, it's worth noticing.

Project details

Release history

0.1.2 (May 26, 2026)
0.1.1 (May 25, 2026), this version
0.1.0 (May 25, 2026)

Download files

Source Distribution

relier-0.1.1.tar.gz (3.6 MB)

Uploaded May 25, 2026 Source

Built Distribution

relier-0.1.1-py3-none-any.whl (130.8 kB)

Uploaded May 25, 2026 Python 3

File details

Details for the file relier-0.1.1.tar.gz.

File metadata

Download URL: relier-0.1.1.tar.gz
Upload date: May 25, 2026
Size: 3.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for relier-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`1ce95368c0f2e875090da9aab8f1e8945d15d7d8cdd131c2d6f2f6da969c861b`
MD5	`817815a2bb7e8a3ec32b71bb51a369f1`
BLAKE2b-256	`a502004982063cfdb73924bf760ee485a801b461244009310bba2b09c2cb4b6d`

Provenance

The following attestation bundles were made for relier-0.1.1.tar.gz:

Publisher: publish.yml on getrelier/relier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: relier-0.1.1.tar.gz
- Subject digest: 1ce95368c0f2e875090da9aab8f1e8945d15d7d8cdd131c2d6f2f6da969c861b
- Sigstore transparency entry: 1631182295
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: getrelier/relier@2e09c92f87b1185ce27c11ac3904e8450793ff8c
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/getrelier
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2e09c92f87b1185ce27c11ac3904e8450793ff8c
- Trigger Event: push

File details

Details for the file relier-0.1.1-py3-none-any.whl.

File metadata

Download URL: relier-0.1.1-py3-none-any.whl
Upload date: May 25, 2026
Size: 130.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for relier-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`24d1dccd3144bac829790b96a2a911e3390486a13e0b92a4b227b22a325c80cc`
MD5	`dbbce0a55c1cab8af3c5e95af8552572`
BLAKE2b-256	`c1a0aac21eafd42d81acaaa0cf84197219a91225a661fee2986f1f497f356a4a`

Provenance

The following attestation bundles were made for relier-0.1.1-py3-none-any.whl:

Publisher: publish.yml on getrelier/relier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: relier-0.1.1-py3-none-any.whl
- Subject digest: 24d1dccd3144bac829790b96a2a911e3390486a13e0b92a4b227b22a325c80cc
- Sigstore transparency entry: 1631182320
- Sigstore integration time: May 25, 2026
Source repository:
- Permalink: getrelier/relier@2e09c92f87b1185ce27c11ac3904e8450793ff8c
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/getrelier
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2e09c92f87b1185ce27c11ac3904e8450793ff8c
- Trigger Event: push
