CaptionFlow
scalable, fault-tolerant vLLM-powered image captioning. a self-contained, distributed community captioning system: a fast websocket-based orchestrator paired with lightweight gpu workers delivers high throughput on batched requests through vLLM.
- orchestrator: hands out work in chunked shards, collects captions, checkpoints progress, and keeps simple stats.
- workers (vLLM): connect to the orchestrator, stream in image samples, batch them, and generate 1..N captions per image using prompts supplied by the orchestrator.
- config-driven: all components read YAML config; flags can override.
no conda. just venv+pip.
install
python -m venv .venv
source .venv/bin/activate # windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e . # installs the `caption-flow` command
quickstart (single box)
- copy + edit the sample configs
cp examples/orchestrator/local_image_files.yaml my-orchestrator.yaml
cp examples/worker.yaml my-worker.yaml
cp examples/monitor.yaml my-monitor.yaml # optional terminal interface
set a unique shared token in both my-orchestrator.yaml and my-worker.yaml (see auth.worker_tokens in the orchestrator config and worker.token in the worker config).
if you use private hugging face datasets/models, export HUGGINGFACE_HUB_TOKEN before starting anything.
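the token relationship between the two files can be sketched as follows. only the auth.worker_tokens and worker.token key names come from the sample configs; the surrounding structure and the assumption that the token list holds plain strings are illustrative:

```python
# the dicts below mirror what the two YAML files deserialize to
# (illustrative shape -- only the key names are from the docs).
orchestrator_cfg = {"auth": {"worker_tokens": ["my-shared-secret"]}}
worker_cfg = {"worker": {"token": "my-shared-secret"}}

def worker_is_authorized(orch_cfg: dict, wrk_cfg: dict) -> bool:
    """True when the worker's token appears in the orchestrator's token list."""
    return wrk_cfg["worker"]["token"] in orch_cfg["auth"]["worker_tokens"]

assert worker_is_authorized(orchestrator_cfg, worker_cfg)
```

if the tokens don't match, the orchestrator rejects the worker's websocket handshake, so check this first when a worker won't connect.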
- start the orchestrator
caption-flow orchestrator --config my-orchestrator.yaml
- start one or more vLLM workers
# gpu 0 on the same host
caption-flow worker --config my-worker.yaml --gpu-id 0
# your second GPU
caption-flow worker --config my-worker.yaml --gpu-id 1
# on a remote host
caption-flow worker --config my-worker.yaml --server ws://your.hostname.address:8765
- (optional) start the monitor
caption-flow monitor --config my-monitor.yaml
- export the data
% caption-flow export --help
Usage: caption-flow export [OPTIONS]

  Export caption data to various formats.

Options:
  --format [jsonl|json|csv|txt|huggingface_hub|all]
      Export format (default: jsonl)
      - jsonl: creates a JSON Lines file at the specified --output path
      - csv: exports CSV-compatible data columns to the --output path, containing incomplete metadata
      - json: creates a .json file for each sample inside the --output subdirectory containing complete metadata; useful for webdatasets
      - txt: creates a .txt file for each sample inside the --output subdirectory containing ONLY captions
      - huggingface_hub: creates a dataset on Hugging Face Hub, possibly with --private and --nsfw where necessary
      - all: creates all export formats in the specified --output directory
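to make the format differences concrete, here is a minimal sketch of the jsonl and txt behaviors described above (function names and the sample dict layout are hypothetical, not CaptionFlow's internals):

```python
import json
from pathlib import Path

def export_jsonl(samples, output: str) -> None:
    """default format: one JSON object per line, metadata included."""
    with open(output, "w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")

def export_txt(samples, output_dir: str) -> None:
    """one .txt file per sample containing ONLY its captions."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for sample in samples:
        (out / f"{sample['key']}.txt").write_text("\n".join(sample["captions"]))
```

jsonl keeps everything in one appendable file; txt trades metadata away for drop-in compatibility with caption-per-file training pipelines.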
how it’s wired
orchestrator
- websocket server (default 0.0.0.0:8765) with three client roles: workers, data-feeders, and admin.
- dataset control: the orchestrator centrally defines the dataset (huggingface or local) and its version/name. it chunk-slices shards and assigns work.
- data serving to remote workers: local files can be captioned by remote workers that don't have access to the same files, automatically.
- vLLM config broadcast: model, tp size, dtype, max seq len, memory targets, batching, sampling params, and inference prompts are all pushed to workers; workers can apply many changes without a model reload.
- storage + checkpoints: captions buffer to disk with periodic checkpoints. chunk state is tracked so restarts don't double-work.
- auth: token lists for worker, monitor, and admin roles.
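the chunk-slicing step can be sketched as follows (a toy version under assumed semantics; CaptionFlow's real chunk bookkeeping also tracks assignment and completion state):

```python
def chunk_shard(num_samples: int, chunk_size: int) -> list[tuple[int, int]]:
    """slice a shard's sample index range into fixed-size work chunks.
    each (start, end) half-open pair is handed out to exactly one worker."""
    return [(start, min(start + chunk_size, num_samples))
            for start in range(0, num_samples, chunk_size)]

chunk_shard(10, 4)  # → [(0, 4), (4, 8), (8, 10)]
```

because chunks are just index ranges, a crashed worker's chunk can be reissued to another worker without re-reading the shard.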
vLLM worker
- one process per gpu. select the device with --gpu-id (or worker.gpu_id in YAML).
- gets its marching orders from the orchestrator: dataset info, model, prompts, batch size, and sampling.
- resilient: detects disconnects, abandons the current chunk cleanly, clears queues, reconnects, and resumes.
- batched generate(): images are resized down for consistent batching; each image can get multiple captions (one per prompt).
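the "one caption per prompt, drop empty generations" behavior can be sketched like this (generate_fn stands in for the real batched vLLM call; this is an illustration, not the worker's actual code):

```python
def caption_images(images, prompts, generate_fn):
    """return, per image, one caption per prompt, skipping empty generations."""
    captions = [[] for _ in images]
    for prompt in prompts:
        # a real worker batches all images for this prompt into one vLLM call
        for i, image in enumerate(images):
            text = generate_fn(image, prompt).strip()
            if text:
                captions[i].append(text)
    return captions
```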
dataset formats
- URL-list datasets compatible with the datasets library, hosted on the hugging face hub or locally
- webdatasets shards containing full image data; also can be hosted on the hub
- local folder filled with images; orchestrator will serve the data to workers
configuration path
config discovery order
for any component, the CLI looks for config in this order (first match wins):
- --config /path/to/file.yaml
- ./<component>.yaml (current directory)
- ~/.caption-flow/<component>.yaml
- $XDG_CONFIG_HOME/caption-flow/<component>.yaml
- /etc/caption-flow/<component>.yaml
- any $XDG_CONFIG_DIRS entries under caption-flow/
- ./examples/<component>.yaml (fallback)
tls / certificates
use the built-in helpers during development:
# self-signed certs for quick local testing
caption-flow generate_cert --self-signed --domain localhost --output-dir ./certs
# inspect any certificate file
caption-flow inspect_cert ./certs/fullchain.pem
then point the orchestrator at the resulting cert/key (or run --no-ssl for dev-only ws://).
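for dev-only self-signed setups, the client-side effect of --no-verify-ssl can be sketched with the stdlib ssl module (an illustration of the concept, not CaptionFlow's code; never do this in production):

```python
import ssl

# dev-only: accept a self-signed certificate by disabling verification.
# check_hostname must be turned off before verify_mode can be CERT_NONE.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
```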
tips & notes
- multi-gpu: start one worker process per gpu (set --gpu-id or worker.gpu_id).
- throughput: tune vllm.batch_size in the orchestrator config (or override with --batch-size at worker start). higher isn't always better; watch VRAM.
- prompts: add more strings under vllm.inference_prompts to get multiple captions per image; the worker returns only non-empty generations.
- private HF: if your dataset/model needs auth, export HUGGINGFACE_HUB_TOKEN before caption-flow worker ....
- self-signed ssl: pass --no-verify-ssl to workers/monitors in dev.
- recovery: if you hard-crash mid-run, caption-flow scan_chunks --fix can reset abandoned chunks so the orchestrator can reissue them cleanly.
roadmap
- hot config reload via the admin websocket path.
- dedicated data-feeder clients (separate from gpu workers) that push samples into the orchestrator.
- richer monitor TUI.
PRs welcome. keep it simple and fast.
architecture
┌─────────────┐ WebSocket ┌─────────────┐
│ Worker │◄──────────────────►│ │
│ │ │ │ ┌──────────────┐
│ │◄───────────────────│ │────►│Arrow/Parquet │
└─────────────┘ HTTP (img data) │ Orchestrator│ │ Storage │
│ │ └──────────────┘
┌─────────────┐ │ │
│ Worker │◄──────────────────►│ │
│ │ │ │
│ │◄───────────────────│ │
└─────────────┘ HTTP (img data) └─────────────┘
▲
┌─────────────┐ │
│ Monitor │◄──────────────────────────┘
└─────────────┘
Community Clusters
To contribute compute to a cluster:
- Install caption-flow: pip install caption-flow
- Get a worker token from the project maintainer
- Run: caption-flow worker --server wss://project.domain.com:8765 --token YOUR_TOKEN
Your contributions will be tracked and attributed in the final dataset!
License
AGPLv3
File details
Details for the file caption_flow-0.3.4.tar.gz.
File metadata
- Download URL: caption_flow-0.3.4.tar.gz
- Upload date:
- Size: 108.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba929a874e4678396027f0c0c22a68bbe3e501d01ef528ba341447400f807d4a |
| MD5 | cb0a91d5d4ffb4a58584c55e99539250 |
| BLAKE2b-256 | c4a807976726b5ff13b1e6bb50cec57db9d5b1b55ba10d64d5c50884892b0492 |
File details
Details for the file caption_flow-0.3.4-py3-none-any.whl.
File metadata
- Download URL: caption_flow-0.3.4-py3-none-any.whl
- Upload date:
- Size: 118.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 67a08606b2dff1a0ac6a6b193917abae37abd29e071383fe9d3390f000f5e341 |
| MD5 | 5dffa345a5a38e55048768e16a7cc4a1 |
| BLAKE2b-256 | 1bcfb0c437db11abe646f6bfe5e905023f4ba1cac165563c41b1829b124c6c5c |