Self-contained distributed community captioning system
CaptionFlow
scalable, fault-tolerant vLLM-powered image captioning. this "first round" focuses on a fast websocket orchestrator plus lightweight gpu workers that batch requests through vLLM.
- orchestrator: hands out work in chunked shards, collects captions, checkpoints progress, and keeps simple stats.
- workers (vLLM): connect to the orchestrator, stream in image samples, batch them, and generate 1..N captions per image using prompts supplied by the orchestrator.
- config-driven: all components read YAML config; flags can override.
- tui monitor (optional): a monitor client is wired into the CLI; ship a monitor module to enable it.
no conda. just venv+pip.
install
python -m venv .venv
source .venv/bin/activate # windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e . # installs the `caption-flow` command
quickstart (single box)
- copy + edit the sample configs
cp orchestrator.yaml my-orchestrator.yaml
cp worker.yaml my-worker.yaml
cp monitor.yaml my-monitor.yaml # optional; requires a monitor module
set a unique shared token in both my-orchestrator.yaml and my-worker.yaml (see auth.worker_tokens in the orchestrator config and worker.token in the worker config). if you use private hugging face datasets/models, export HUGGINGFACE_HUB_TOKEN before starting workers.
- start the orchestrator
caption-flow orchestrator --config my-orchestrator.yaml
- start one or more vLLM workers
# gpu 0 on the same host
caption-flow worker --config my-worker.yaml --gpu-id 0
# your second GPU
caption-flow worker --config my-worker.yaml --gpu-id 1
- (optional) start the monitor
caption-flow monitor --config my-monitor.yaml
- (optional) scan/fix chunks on disk if you had crashes
caption-flow scan_chunks --data-dir ./caption_data --checkpoint-dir ./checkpoints --fix
how it’s wired
orchestrator
- websocket server (default 0.0.0.0:8765) with three client roles: workers, data-feeders, and admin.
- dataset control: the orchestrator centrally defines the dataset (huggingface or local) and version/name. it chunk-slices shards and assigns work.
- vLLM config broadcast: model, tp size, dtype, max seq len, memory targets, batching, sampling params, and inference prompts are all pushed to workers; workers can apply many changes without a model reload.
- storage + checkpoints: captions buffer to disk with periodic checkpoints. chunk state is tracked so restarts don't double-work.
- auth: token lists for worker, monitor, and admin roles.
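the "restarts don't double-work" bookkeeping can be pictured with a minimal sketch. the real checkpoint format is internal to the orchestrator; the JSON layout and function names here are illustrative only:

```python
import json
from pathlib import Path


def load_completed_chunks(checkpoint_file: Path) -> set:
    """Return the set of chunk ids finished in a previous run."""
    if checkpoint_file.exists():
        return set(json.loads(checkpoint_file.read_text()))
    return set()


def checkpoint_chunk(checkpoint_file: Path, done: set, chunk_id: str) -> None:
    """Mark a chunk finished and persist the state."""
    done.add(chunk_id)
    checkpoint_file.write_text(json.dumps(sorted(done)))


def assignable(all_chunks: list, done: set) -> list:
    """Chunks that still need work after a restart."""
    return [c for c in all_chunks if c not in done]
```

on restart, the orchestrator only hands out chunks not already in the completed set, so workers never redo finished work.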
start flags you’ll likely use:
--config PATH # yaml config for the orchestrator
--port INT, --host STR # bind controls
--data-dir PATH # overrides storage.data_dir
--cert PATH, --key PATH # enable TLS (or use --no-ssl for ws:// in dev)
--vllm # use the vLLM-style orchestrator (webdataset/hf)
vLLM worker
- one process per gpu. select the device with --gpu-id (or worker.gpu_id in YAML).
- gets its marching orders from the orchestrator: dataset info, model, prompts, batch size, and sampling.
- resilient: detects disconnects, abandons the current chunk cleanly, clears queues, reconnects, and resumes.
- batched generate(): images are resized down for consistent batching; each image can get multiple captions (one per prompt).
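the per-image fan-out (one caption per prompt, empty generations dropped) is easiest to see in a toy sketch. `generate` here is a stand-in for the vLLM call; the real worker submits the whole batch at once:

```python
def caption_batch(images, prompts, generate):
    """Produce up to len(prompts) captions per image, dropping empty generations.

    `generate(image, prompt)` is a placeholder for the batched vLLM call.
    """
    results = {}
    for idx, image in enumerate(images):
        captions = [generate(image, p) for p in prompts]
        # keep only non-empty generations, mirroring the worker's filtering
        results[idx] = [c.strip() for c in captions if c and c.strip()]
    return results
```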
start flags you’ll likely use:
--config PATH # yaml for the worker
--server URL # ws(s)://host:port
--token STR # must match an allowed worker token on the orchestrator
--name STR # display name
--batch-size INT # override vLLM batch size
--vllm # use the vLLM worker implementation
--gpu-id INT # which gpu to use
--precision STR, --model STR # optional overrides for dtype/model
--no-verify-ssl # accept self-signed certs in dev
(optional) monitor
- a CLI entry exists for a TUI monitor; wire in a monitor module to enable it. config lives in monitor.yaml or inside orchestrator.yaml under monitor:.
configuration
config discovery order
for any component, the CLI looks for config in this order (first match wins):
- --config /path/to/file.yaml
- ./<component>.yaml (current directory)
- ~/.caption-flow/<component>.yaml
- $XDG_CONFIG_HOME/caption-flow/<component>.yaml
- /etc/caption-flow/<component>.yaml
- any $XDG_CONFIG_DIRS entries under caption-flow/
- ./examples/<component>.yaml (fallback)
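the discovery order amounts to a first-match-wins scan over those paths. a sketch of the equivalent logic (the function name is illustrative, not the project's actual code):

```python
import os
from pathlib import Path
from typing import Optional


def find_config(component: str, explicit: Optional[str] = None) -> Optional[Path]:
    """Return the first existing config file for `component`, or None."""
    candidates = []
    if explicit:  # --config always wins
        candidates.append(Path(explicit))
    candidates += [
        Path.cwd() / f"{component}.yaml",
        Path.home() / ".caption-flow" / f"{component}.yaml",
        Path(os.environ.get("XDG_CONFIG_HOME", "~/.config")).expanduser()
        / "caption-flow" / f"{component}.yaml",
        Path("/etc/caption-flow") / f"{component}.yaml",
    ]
    for d in os.environ.get("XDG_CONFIG_DIRS", "").split(os.pathsep):
        if d:
            candidates.append(Path(d) / "caption-flow" / f"{component}.yaml")
    candidates.append(Path.cwd() / "examples" / f"{component}.yaml")
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```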
orchestrator.yaml (highlights)
orchestrator:
host: 0.0.0.0
port: 8765
# ssl:
# cert: /path/fullchain.pem
# key: /path/privkey.pem
dataset:
type: huggingface # or "local"
path: <hf-dataset-or-local-path>
name: <logical-name>
version: "1.0"
vllm:
model: Qwen/Qwen2.5-VL-3B-Instruct
tensor_parallel_size: 1
max_model_len: 16384
dtype: float16
gpu_memory_utilization: 0.92
enforce_eager: true
disable_mm_preprocessor_cache: true
limit_mm_per_prompt: { image: 1 }
batch_size: 8
sampling:
temperature: 0.7
top_p: 0.95
max_tokens: 256
repetition_penalty: 1.05
skip_special_tokens: true
stop: ["<|end|>", "<|endoftext|>", "<|im_end|>"]
inference_prompts:
- "describe this image in detail"
- "provide a comprehensive description of the visual content"
- "what are the key elements in this image?"
storage:
data_dir: ./caption_data
checkpoint_dir: ./checkpoints
caption_buffer_size: 100
checkpoint_interval: 1000
# chunking/queueing
chunk_size: 1000
chunks_per_request: 2
chunk_buffer_multiplier: 3
min_chunk_buffer: 10
auth:
worker_tokens:
- { token: "example-worker-token", name: "Example Worker" }
monitor_tokens:
- { token: "letmein", name: "Default monitor" }
admin_tokens:
- { token: "admin-secret-2024", name: "Admin" }
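since flags can override the YAML, the effective config is a simple "CLI wins, unset flags fall through" merge. a hedged sketch of that precedence (the helper is hypothetical, not the project's loader):

```python
def effective_config(yaml_cfg: dict, cli_overrides: dict) -> dict:
    """CLI flag values win over YAML; flags left unset (None) fall through."""
    merged = dict(yaml_cfg)
    for key, value in cli_overrides.items():
        if value is not None:
            merged[key] = value
    return merged
```

e.g. starting the orchestrator with --port 9000 but no --data-dir keeps the YAML's storage path while rebinding the port.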
worker.yaml (highlights)
worker:
server: ws://localhost:8765 # use wss:// in prod
token: example-worker-token
name: local-gpu
gpu_id: 0
vllm: true
# local queues
readahead_size: 256
inference_queue_size: 128
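readahead_size and inference_queue_size bound how far the worker's sample reader can run ahead of inference. the producer/consumer shape can be sketched with stdlib queues (names and structure illustrative; the real worker is async and talks to vLLM):

```python
import queue
import threading


def run_pipeline(samples, infer, readahead_size=256, inference_queue_size=128):
    """Read samples ahead of inference through two bounded queues."""
    readahead = queue.Queue(maxsize=readahead_size)
    results = queue.Queue(maxsize=inference_queue_size)
    DONE = object()  # sentinel marking end of stream

    def reader():
        for s in samples:
            readahead.put(s)  # blocks when the reader is too far ahead
        readahead.put(DONE)

    def inferencer():
        while (s := readahead.get()) is not DONE:
            results.put(infer(s))
        results.put(DONE)

    threading.Thread(target=reader, daemon=True).start()
    threading.Thread(target=inferencer, daemon=True).start()

    out = []
    while (r := results.get()) is not DONE:
        out.append(r)
    return out
```

the bounded maxsize is what keeps memory flat: a stalled GPU back-pressures the reader instead of letting it buffer the whole shard.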
monitor.yaml (optional)
monitor:
server: ws://localhost:8765
token: letmein
refresh_rate: 1.0
show_contributors: true
show_quality_metrics: true
max_activity_items: 20
show_chunk_progress: true
show_worker_queues: true
show_throughput_graph: true
tls / certificates
use the built-in helpers during development:
# self-signed certs for quick local testing
caption-flow generate_cert --self-signed --domain localhost --output-dir ./certs
# inspect any certificate file
caption-flow inspect_cert ./certs/fullchain.pem
then point the orchestrator at the resulting cert/key (or run --no-ssl for dev-only ws://).
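on the client side, --no-verify-ssl amounts to disabling certificate verification. the stdlib equivalent, for any client that accepts an ssl.SSLContext (shown as a sketch, not the worker's actual code):

```python
import ssl


def dev_ssl_context(verify: bool = True) -> ssl.SSLContext:
    """Default verifying context, or a dev-only context for self-signed certs."""
    ctx = ssl.create_default_context()
    if not verify:
        # DEV ONLY: accepts any certificate, including self-signed ones.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx
```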
tips & notes
- multi-gpu: start one worker process per gpu (set --gpu-id or worker.gpu_id).
- throughput: tune vllm.batch_size in the orchestrator config (or override with --batch-size at worker start). higher isn't always better; watch VRAM.
- prompts: add more strings under vllm.inference_prompts to get multiple captions per image; the worker returns only non-empty generations.
- private HF: if your dataset/model needs auth, export HUGGINGFACE_HUB_TOKEN before caption-flow worker ....
- self-signed ssl: pass --no-verify-ssl to workers/monitors in dev.
- recovery: if you hard-crash mid-run, caption-flow scan_chunks --fix can reset abandoned chunks so the orchestrator can reissue them cleanly.
roadmap
- hot config reload via the admin websocket path.
- dedicated data-feeder clients (separate from gpu workers) that push samples into the orchestrator.
- richer monitor TUI.
PRs welcome. keep it simple and fast.
architecture
┌─────────────┐ WebSocket ┌─────────────┐
│ Worker │◄──────────────────►│ │
└─────────────┘ │ │ ┌──────────────┐
│ Orchestrator│────►│Arrow/Parquet │
┌─────────────┐ │ │ │ Storage │
│ Worker │◄──────────────────►│ │ └──────────────┘
└─────────────┘ └─────────────┘
▲
┌─────────────┐ │
│ Monitor │◄──────────────────────────┘
└─────────────┘
Storage Schema
captions.parquet
- job_id: Unique job identifier
- dataset: Dataset name
- shard: Shard identifier
- item_key: Item within shard
- caption: Generated caption text
- contributor_id: Worker who generated it
- timestamp: Generation time
- quality_score: Optional quality metric
jobs.parquet
- job_id: Unique identifier
- dataset: Dataset name
- shard: Shard identifier
- status: pending/processing/completed/failed
- assigned_to: Worker ID
- timestamp: Status change time
contributors.parquet
- contributor_id: Unique identifier
- name: Display name
- total_captions: Lifetime count
- trust_level: Quality tier (0-5)
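if you consume the caption store directly, rows follow the captions.parquet schema above. a plain-dict sketch of a row check (field names from the schema; actual Parquet column types may differ):

```python
# Fields of captions.parquet, per the schema above.
CAPTION_FIELDS = {
    "job_id", "dataset", "shard", "item_key",
    "caption", "contributor_id", "timestamp", "quality_score",
}


def validate_caption_row(row: dict) -> bool:
    """True when a row has every required field and a non-empty caption.

    quality_score is optional per the schema.
    """
    required = CAPTION_FIELDS - {"quality_score"}
    return required <= row.keys() and bool(row["caption"])
```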
Development
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black src/
ruff --fix src/
# Type checking
mypy src/
Community Contribution
To contribute compute:
- Install caption-flow: pip install caption-flow
- Get a worker token from the project maintainer
- Run:
caption-flow worker --server wss://project.domain.com:8765 --token YOUR_TOKEN
Your contributions will be tracked and attributed in the final dataset!
License
MIT