A code-to-graph-to-memory profiler for Dask: know which chunk killed a worker and what source line produced it.
Project description
DaskGenie
A memory profiler and live dashboard for Dask that ties a worker's memory back to the line of your code that caused it.
DaskGenie fuses memray-deep allocation
tracing with Dask's task graph and streams it to a real-time dashboard. When a
worker dies from an out-of-memory error, you can see the suspect task, the chunk
it was holding, and the exact source line that allocated the array that killed
it — not just "a worker disappeared". It works on dask.distributed and on the
local (threaded / processes / synchronous) schedulers, across dask.array,
dask.dataframe, dask.bag, dask.delayed, and xarray on Zarr/NetCDF.
Features
- Source attribution. Maps every Dask task-graph layer back to the user
source line that built it — the call site in your code, never into
dask/numpy/xarrayinternals. - Deep memory (memray, as a library). Opt-in per-run allocation tracing,
epoch-rotated and folded to the first line of your code, so the dashboard shows
job.py:42 build = 12.8 GBand a per-worker flamegraph / memray-style tree — you never touch a capture file. - Worker-death post-mortem. A scheduler plugin records the tasks in flight when a worker vanishes; the collector joins in the chunk metadata and the allocation lines at the high-water mark to answer which chunk, which line.
- Real-time dashboard. A Next.js app streaming over WebSocket: live worker table, a zoomable task stream (global + per-worker), the whole task-graph DAG, memory-over-time with a click-to-inspect spike explorer, per-layer allocations over time, and the deep flamegraph.
- Any scheduler.
register()installs worker + scheduler plugins ondask.distributed;LocalProfilerhooks the callback API for the threaded, processes, and synchronous schedulers. - Team-friendly. Every run records the machine (hostname + IP) that opened it, so a shared collector becomes one place to see everyone's runs.
- Prometheus + TimescaleDB. The collector exposes
/metricsfor Grafana and stores to TimescaleDB (or self-contained SQLite for local use / tests).
Installation
pip install daskgenie
Optional extras:
pip install 'daskgenie[deep]' # memray-backed deep memory profiling
pip install 'daskgenie[collector]' # the FastAPI collector service (server side)
pip install 'daskgenie[examples]' # xarray, zarr, pandas, ... for the examples
DaskGenie requires Python 3.11 or newer. Deep memory profiling needs memray (Linux/macOS + CPython); where it isn't importable the profiler silently degrades to lightweight RSS/managed-memory sampling.
Quick start
Bring up the dashboard stack — the collector (API + /metrics + WebSocket),
TimescaleDB, and the Next.js dashboard — with Docker:
docker compose up -d --build
# dashboard → http://localhost:3000 collector API → http://localhost:8765
Then point a job at the collector. Each register() opens a run that appears
live on the dashboard:
from distributed import Client, LocalCluster
import daskgenie as dg
import daskgenie.client as genie
client = Client(LocalCluster(processes=True))
run_id = genie.register(client, "http://localhost:8765", run_name="nightly ETL", deep=True)
with dg.track() as source_map:
result = build_pipeline() # your dask work
genie.upload_graph("http://localhost:8765", run_id, source_map, collection=result)
result.compute()
Open the run and explore it as it runs:
- Overview — live stats, memory-over-time, the hottest allocation line.
- Timeline — a large, zoomable memory chart; click any point to see what was running and which source lines were allocating at that instant, plus a stacked per-layer allocation timeline.
- Workers — a live, native-Dask-style table (RSS vs limit, CPU, threads, executing/ready).
- Task stream — global + per-worker task lanes on a zoomable time axis.
- Graph — the real connected task DAG (canvas for large graphs), coloured by layer, death-suspect nodes highlighted; click a node for its source and chunks.
- Memory — the deep view: allocation flamegraph, peak bytes by source line, and peak memory by task.
- Post-mortem — each worker death: the allocation lines at the high-water mark, the suspect task, its source line, and the chunk it was holding.
Local schedulers
For the non-distributed schedulers (.compute(scheduler="threads"|"processes"| "synchronous"), the default for bare dask collections) use LocalProfiler,
which hooks Dask's callback API instead of installing cluster plugins:
import daskgenie as dg
with dg.track() as source_map:
result = build_pipeline()
with dg.LocalProfiler("http://localhost:8765", run_name="threaded job",
source_map=source_map, collection=result, deep=True) as prof:
result.compute(scheduler="threads")
# prof.run_id shows up in the dashboard like any other run
The post-mortem: which chunk killed this worker
With register() active, a scheduler plugin watches for worker deaths. On a
death it records the tasks that were in flight; the collector joins in the chunk
metadata the worker had already reported and (with deep=True) the allocation
lines at the high-water mark. See it end to end on a LocalCluster whose worker
is really OOM-killed:
docker compose up -d --build
uv run --extra demo --extra deep python examples/deep_oom.py
OOM vs. clean shutdown is a heuristic: the scheduler doesn't tell a plugin why a worker left, so DaskGenie flags a suspected OOM only when tasks were in flight at an unexpected removal, and never over-claims.
Configuration
Everything is configured through environment variables:
| Variable | Component | Purpose |
|---|---|---|
DASKGENIE_DSN |
collector | Postgres/TimescaleDB DSN; selects the Timescale backend (default in Docker). |
DASKGENIE_DB |
collector | SQLite path (or :memory:) when no DSN is set. |
DASKGENIE_HOST / DASKGENIE_PORT |
collector | Bind address (default 127.0.0.1:8765). |
COLLECTOR_URL |
dashboard | Where the Next.js server proxies /api (e.g. http://collector:8765). |
NEXT_PUBLIC_COLLECTOR_WS |
dashboard | WebSocket base the browser connects to (default ws://<host>:8765). |
The register() and LocalProfiler calls take deep=, sample_interval=,
flush_interval=, and deep_epoch_seconds= to trade overhead for resolution.
Notes and limitations
- Source attribution is strongest for
dask.arrayanddask.delayed. Fordask.dataframe(which builds graphs through dask-expr) and xarray, the heavy work runs inside library/C code, so per-line allocations fold to framework frames — the graph, memory, per-layer and flamegraph views still work, but layer→line mapping is sparse there. - memray is Linux/macOS + CPython only, and a single tracker runs per process; deep mode is opt-in and costs roughly 1.5–2× runtime.
- On a hard OOM kill the worker's in-flight memray epoch can die before it flushes, so a specific post-mortem's line attribution is best-effort — the Memory tab remains the reliable place to see the culprit line.
- The OOM label is a heuristic, not a certainty (see the post-mortem note).
- TimescaleDB is the default store in Docker; SQLite is the zero-setup backend used locally and by the test suite.
Examples
Runnable scripts covering the breadth of Dask live in
examples/ — distributed OOMs, the deep-memory demo, a big
minutes-long pipeline, a self-limiting crash, and one per collection type
(dask.delayed, dask.dataframe, dask.bag, xarray on Zarr and NetCDF). See
examples/README.md.
Development
uv sync --group dev --extra collector --extra deep
uv run pytest
uv run ruff check .
uv run ruff format .
uv run mypy src/
# collector + dashboard separately (dashboard proxies /api to the collector)
uv run python -m daskgenie.collector --port 8765 # terminal 1
cd web && npm install && npm run dev # terminal 2 → :3000
How source attribution works
HighLevelGraph.from_collections is the one classmethod almost every Dask
collection operation calls to register a new named graph layer. track()
patches it for the duration of the with block: each time a new layer appears,
it walks the call stack outward until it finds the first frame that isn't inside
dask/distributed/xarray/numpy/zarr/daskgenie or a site-packages
install — that's your call site. The deep memory engine reuses the same
library-path filter to fold memray stacks to the first user frame. The
library-path filter is configurable via track(extra_library_paths=[...]).
License
MIT — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file daskgenie-0.1.0.tar.gz.
File metadata
- Download URL: daskgenie-0.1.0.tar.gz
- Upload date:
- Size: 284.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3c15f0e5b4328c99c6704207e7be5b107503169223a1a9f68fd03b824fb7686e
|
|
| MD5 |
b1070aa4374d643aa14f90ac1efb9267
|
|
| BLAKE2b-256 |
a8d2fe7a08adb40806733ffe78995cf69bfb6f6c7079d758dd275c9f44e405c2
|
Provenance
The following attestation bundles were made for daskgenie-0.1.0.tar.gz:
Publisher:
workflow-pypi.yml on polymood/DaskGenie
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
daskgenie-0.1.0.tar.gz -
Subject digest:
3c15f0e5b4328c99c6704207e7be5b107503169223a1a9f68fd03b824fb7686e - Sigstore transparency entry: 2062206542
- Sigstore integration time:
-
Permalink:
polymood/DaskGenie@26df083d832d8cb3d41cba29287e33a6691729cc -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/polymood
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
workflow-pypi.yml@26df083d832d8cb3d41cba29287e33a6691729cc -
Trigger Event:
push
-
Statement type:
File details
Details for the file daskgenie-0.1.0-py3-none-any.whl.
File metadata
- Download URL: daskgenie-0.1.0-py3-none-any.whl
- Upload date:
- Size: 52.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d34954a8d2757fc7ad9a468bb03215ce917082bac79a30f4472f2f69828dfbe5
|
|
| MD5 |
630e9582efa569a32d9b96abf8b4f33b
|
|
| BLAKE2b-256 |
46346adaa5b28253376e86f788aa5cc6f607746a9a5cc0b85e86e9c8ad0b16de
|
Provenance
The following attestation bundles were made for daskgenie-0.1.0-py3-none-any.whl:
Publisher:
workflow-pypi.yml on polymood/DaskGenie
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
daskgenie-0.1.0-py3-none-any.whl -
Subject digest:
d34954a8d2757fc7ad9a468bb03215ce917082bac79a30f4472f2f69828dfbe5 - Sigstore transparency entry: 2062207221
- Sigstore integration time:
-
Permalink:
polymood/DaskGenie@26df083d832d8cb3d41cba29287e33a6691729cc -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/polymood
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
workflow-pypi.yml@26df083d832d8cb3d41cba29287e33a6691729cc -
Trigger Event:
push
-
Statement type: