A fast, efficient inference engine for Moondream

Kestrel

High-performance inference engine for the Moondream vision-language model.

Kestrel provides async, micro-batched serving with streaming support, paged KV caching, and optimized CUDA kernels. It's designed for production deployments where throughput and latency matter.

Features

Async micro-batching — Cooperative scheduler batches heterogeneous requests without compromising per-request latency
Streaming — Real-time token streaming for query and caption tasks
Multi-task — Visual Q&A, captioning, point detection, object detection, and segmentation
Paged KV cache — Efficient memory management for high concurrency
Prefix caching — Radix tree-based caching for repeated prompts and images
LoRA adapters — Parameter-efficient fine-tuning support with automatic cloud loading
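
To make the prefix caching idea concrete, here is a minimal sketch of a longest-prefix lookup over token IDs. It is not Kestrel's implementation: `PrefixCache`, `insert`, and `longest_prefix` are hypothetical names, and a plain trie stands in for a compressed radix tree. A real cache would also map each node to KV cache pages and handle eviction.

```python
from dataclasses import dataclass, field


@dataclass
class _Node:
    # Children keyed by token ID; a plain trie stands in for a radix tree.
    children: dict[int, "_Node"] = field(default_factory=dict)


class PrefixCache:
    """Toy prefix index (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._root = _Node()

    def insert(self, tokens: list[int]) -> None:
        """Record a token sequence so later requests can match its prefixes."""
        node = self._root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def longest_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens have been seen before."""
        node = self._root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])
reused = cache.longest_prefix([1, 2, 3, 9, 9])
print(reused)
```

In this toy version, a request that shares its first tokens with an earlier one (for example, the same image followed by a different question) would only need to compute the tokens after the matched prefix.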

Requirements

Python 3.10+
NVIDIA GPU. Optimized kernels are provided for SM80 (A100), SM86 (A40, A10, RTX 3090), SM87 (Jetson Orin), SM89 (L4, L40S), and SM90 (H100). Other GPUs may work but have not been tested.
MOONDREAM_API_KEY environment variable (get this from moondream.ai)

Installation

pip install kestrel

For Jetson Orin, see the Jetson setup guide.

Model Access

Kestrel supports both Moondream 3 and Moondream 2:

Model	Repository	Notes
Moondream 2	vikhyatk/moondream2	Public, no approval needed
Moondream 3	moondream/moondream3-preview	Requires access approval

For Moondream 3, request access to the repository (granted automatically), then authenticate by running huggingface-cli login or by setting HF_TOKEN.

Quick Start

import asyncio

from kestrel.config import RuntimeConfig
from kestrel.engine import InferenceEngine


async def main():
    # Weights are automatically downloaded from HuggingFace on first run.
    # Use model="moondream2" or model="moondream3-preview".
    cfg = RuntimeConfig(model="moondream2")

    # Create the engine (loads model and warms up)
    engine = await InferenceEngine.create(cfg)

    # Load an image (JPEG, PNG, or WebP bytes)
    with open("photo.jpg", "rb") as f:
        image = f.read()

    try:
        # Visual question answering
        result = await engine.query(
            image=image,
            question="What's in this image?",
            settings={"temperature": 0.2, "max_tokens": 512},
        )
        print(result.output["answer"])
    finally:
        # Clean up, even if the query raises
        await engine.shutdown()


asyncio.run(main())

Tasks

Kestrel supports several vision-language tasks through dedicated methods on the engine.

Query (Visual Q&A)

Ask questions about an image:

result = await engine.query(
    image=image,
    question="How many people are in this photo?",
    settings={
        "temperature": 0.2,  # Lower = more deterministic
        "top_p": 0.9,
        "max_tokens": 512,
    },
)
print(result.output["answer"])

Caption

Generate image descriptions:

result = await engine.caption(
    image,
    length="normal",  # "short", "normal", or "long"
    settings={"temperature": 0.2, "max_tokens": 512},
)
print(result.output["caption"])

Point

Locate objects as normalized (x, y) coordinates:

result = await engine.point(image, "person")
print(result.output["points"])
# [{"x": 0.5, "y": 0.3}, {"x": 0.8, "y": 0.4}]

Coordinates are normalized to [0, 1] where (0, 0) is top-left.
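
Since the points are normalized, they usually need scaling before they can be drawn on the image. A minimal sketch, assuming you already know the image's width and height in pixels; `points_to_pixels` is a hypothetical helper, not part of Kestrel:

```python
def points_to_pixels(points, width, height):
    """Scale normalized points to integer pixel coordinates."""
    pixels = []
    for p in points:
        # Clamp to the last valid index so a coordinate of 1.0 stays inside the image.
        px = min(round(p["x"] * width), width - 1)
        py = min(round(p["y"] * height), height - 1)
        pixels.append((px, py))
    return pixels


# Sample values in the same shape as result.output["points"].
points = [{"x": 0.5, "y": 0.3}, {"x": 0.8, "y": 0.4}]
pixel_points = points_to_pixels(points, width=640, height=480)
print(pixel_points)
```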

Detect

Detect objects as bounding boxes:

result = await engine.detect(
    image,
    "car",
    settings={"max_objects": 10},
)
print(result.output["objects"])
# [{"x_min": 0.1, "y_min": 0.2, "x_max": 0.5, "y_max": 0.6}, ...]

Bounding box coordinates are normalized to [0, 1].
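
To crop or draw a detection, the normalized box has to be scaled by the image size. A minimal sketch, assuming the width and height are known; `box_to_pixels` is a hypothetical helper, not part of Kestrel:

```python
def box_to_pixels(box, width, height):
    """Return (left, top, right, bottom) in pixels for one normalized box."""
    return (
        round(box["x_min"] * width),
        round(box["y_min"] * height),
        round(box["x_max"] * width),
        round(box["y_max"] * height),
    )


# Sample values in the same shape as result.output["objects"].
objects = [{"x_min": 0.1, "y_min": 0.2, "x_max": 0.5, "y_max": 0.6}]
boxes = [box_to_pixels(obj, width=1000, height=500) for obj in objects]
print(boxes)
```

The (left, top, right, bottom) ordering matches the box tuple that Pillow's `Image.crop` accepts, so the result can be passed to it directly if you use Pillow.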

Segment

Generate a segmentation mask (Moondream 3 only):

result = await engine.segment(image, "dog")
seg = result.output["segments"][0]
print(seg["svg_path"])  # SVG path data for the mask
print(seg["bbox"])      # {"x_min": ..., "y_min": ..., "x_max": ..., "y_max": ...}

Note: Segmentation requires Moondream 3 and separate model weights. Contact moondream.ai for access.

Streaming

For longer responses, you can stream tokens as they're generated:

with open("photo.jpg", "rb") as f:
    image = f.read()

stream = await engine.query(
    image=image,
    question="Describe this scene in detail.",
    stream=True,
    settings={"max_tokens": 1024},
)

# Print tokens as they arrive
async for chunk in stream:
    print(chunk.text, end="", flush=True)

# Get the final result with metrics
result = await stream.result()
print(f"\n\nGenerated {result.metrics.output_tokens} tokens")

Streaming is supported for query and caption methods.
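
If you want the streamed text as a single string as well, the chunks can be accumulated as they arrive. The sketch below runs without a GPU by using a stand-in stream; `FakeChunk` and `fake_stream` are hypothetical, and the only thing assumed about a real stream is that it yields objects with a `text` attribute, as in the loop above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeChunk:
    text: str


async def fake_stream():
    # Hypothetical stand-in for the stream returned by engine.query(..., stream=True).
    for piece in ["A ", "red ", "car."]:
        await asyncio.sleep(0)
        yield FakeChunk(piece)


async def collect_text(stream) -> str:
    """Concatenate the text of every chunk from an async iterator."""
    parts = []
    async for chunk in stream:
        parts.append(chunk.text)
    return "".join(parts)


full_text = asyncio.run(collect_text(fake_stream()))
print(full_text)
```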

Response Format

All methods return an EngineResult with these fields:

result.output          # Dict with task-specific output ("answer", "caption", "points", etc.)
result.finish_reason   # "stop" (natural end) or "length" (hit max_tokens)
result.metrics         # Timing and token counts

The metrics object contains:

result.metrics.input_tokens     # Number of input tokens (including image)
result.metrics.output_tokens    # Number of generated tokens
result.metrics.prefill_time_ms  # Time to process input
result.metrics.decode_time_ms   # Time to generate output
result.metrics.ttft_ms          # Time to first token
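
These fields are enough to derive a decode throughput figure. A minimal sketch, using a `SimpleNamespace` with made-up numbers in place of a real `result.metrics`; `decode_tokens_per_second` is a hypothetical helper:

```python
from types import SimpleNamespace


def decode_tokens_per_second(metrics) -> float:
    """Generated tokens per second of decode time (0.0 if no time was recorded)."""
    if metrics.decode_time_ms <= 0:
        return 0.0
    return metrics.output_tokens * 1000.0 / metrics.decode_time_ms


# Made-up values for illustration; real numbers come from result.metrics.
sample = SimpleNamespace(
    input_tokens=740,
    output_tokens=120,
    prefill_time_ms=35.0,
    decode_time_ms=600.0,
    ttft_ms=40.0,
)
rate = decode_tokens_per_second(sample)
print(f"{rate:.1f} tokens/s")
```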

Using Finetunes

If you've created a finetuned model through the Moondream API, you can use it by passing the adapter ID:

result = await engine.query(
    image=image,
    question="What's in this image?",
    settings={"adapter": "01J5Z3NDEKTSV4RRFFQ69G5FAV@1000"},
)

The adapter ID format is {finetune_id}@{step} where:

finetune_id is the ID of your finetune job
step is the training step/checkpoint to use

Adapters are automatically downloaded and cached on first use.
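
When adapter IDs come from user input or config files, it can help to validate them before sending a request. A minimal sketch of splitting the `{finetune_id}@{step}` format; `parse_adapter_id` is a hypothetical helper, and it assumes the step is a non-negative integer, as in the example above:

```python
from typing import NamedTuple


class AdapterRef(NamedTuple):
    finetune_id: str
    step: int


def parse_adapter_id(adapter_id: str) -> AdapterRef:
    """Split '{finetune_id}@{step}' into its two parts."""
    finetune_id, sep, step = adapter_id.rpartition("@")
    # Assumption: step is written with plain ASCII digits only.
    if not sep or not finetune_id or not (step.isascii() and step.isdigit()):
        raise ValueError(f"expected '<finetune_id>@<step>', got {adapter_id!r}")
    return AdapterRef(finetune_id, int(step))


ref = parse_adapter_id("01J5Z3NDEKTSV4RRFFQ69G5FAV@1000")
print(ref.finetune_id, ref.step)
```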

Configuration

RuntimeConfig

RuntimeConfig(
    model="moondream3-preview",  # or "moondream2"
    max_batch_size=4,            # Max concurrent requests
)
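
Micro-batching only helps when several requests are in flight at once, so callers typically submit work concurrently rather than awaiting one request at a time. The sketch below shows one way to do that with `asyncio.gather`. `StubEngine` is a hypothetical stand-in so the example runs without a GPU, and the client-side semaphore is an assumption; the engine may well queue excess requests on its own.

```python
import asyncio
from types import SimpleNamespace


class StubEngine:
    """Hypothetical stand-in for InferenceEngine; not part of Kestrel."""

    async def query(self, image, question, settings=None):
        await asyncio.sleep(0.01)  # Simulate inference latency
        return SimpleNamespace(output={"answer": f"answer to: {question}"})


async def ask_all(engine, image, questions, limit=4):
    # Cap in-flight requests; 4 mirrors max_batch_size in the config above.
    gate = asyncio.Semaphore(limit)

    async def ask(question):
        async with gate:
            result = await engine.query(image=image, question=question)
            return result.output["answer"]

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(ask(q) for q in questions))


answers = asyncio.run(ask_all(StubEngine(), b"", ["q1", "q2", "q3"]))
print(answers)
```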

Environment Variables

Variable	Description
`MOONDREAM_API_KEY`	Required. Get this from moondream.ai.
`HF_HOME`	Override HuggingFace cache directory for downloaded weights (default: `~/.cache/huggingface`).
`HF_TOKEN`	HuggingFace token for gated models like Moondream 3. Alternatively, run `huggingface-cli login`.
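
A quick preflight check can surface a missing variable before the engine starts loading weights. A minimal sketch; `missing_env` is a hypothetical helper, and only `MOONDREAM_API_KEY` is treated as required, per the table above:

```python
import os


def missing_env(required=("MOONDREAM_API_KEY",), environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Pass a dict to test the check without touching the real environment.
missing = missing_env(environ={"HF_TOKEN": "hf_example"})
print(missing)
```

Calling `missing_env()` with no arguments checks the real process environment.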

Triton Inference Server

Kestrel can be deployed as a Triton Inference Server backend. See the Triton setup guide.

Benchmarks

Throughput and latency for the query skill are tracked in PERFORMANCE.md, with results broken out by GPU.

License

Kestrel requires a Moondream API key. See moondream.ai/pricing for plans.

Release history

Version	Release date
0.3.1	May 1, 2026
0.3.0	May 1, 2026
0.2.1	Mar 25, 2026
0.2.0 (this version)	Mar 18, 2026
0.1.3	Feb 21, 2026
0.1.2	Feb 12, 2026
0.1.1	Feb 3, 2026
0.0.2	Jan 16, 2026
0.0.1	Jul 7, 2020

Download files

Source Distribution

kestrel-0.2.0.tar.gz (146.0 kB), uploaded Mar 18, 2026, Source

Built Distribution

kestrel-0.2.0-py3-none-any.whl (168.1 kB), uploaded Mar 18, 2026, Python 3

File details

Details for the file kestrel-0.2.0.tar.gz.

File metadata

Download URL: kestrel-0.2.0.tar.gz
Upload date: Mar 18, 2026
Size: 146.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.0

File hashes

Hashes for kestrel-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`8b7295036939c238717496925ebc0f34b7edc4aac7d0db67fe7e0ed54ea9ee5e`
MD5	`f9bc4ca0cf4274f07321631b8918ae4a`
BLAKE2b-256	`9a4c130645e9115d5b12798e9e8882cba602435a55ddfaa50796a40153291ba1`
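
A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. A minimal sketch; `sha256_of` and `verify` are hypothetical helper names, and the file path is whatever location you downloaded to:

```python
import hashlib

# SHA256 listed above for kestrel-0.2.0.tar.gz
EXPECTED = "8b7295036939c238717496925ebc0f34b7edc4aac7d0db67fe7e0ed54ea9ee5e"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path, expected=EXPECTED):
    """Return True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("kestrel-0.2.0.tar.gz")` would return `True` only if the local file matches the published digest.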

File details

Details for the file kestrel-0.2.0-py3-none-any.whl.

File metadata

Download URL: kestrel-0.2.0-py3-none-any.whl
Upload date: Mar 18, 2026
Size: 168.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.0

File hashes

Hashes for kestrel-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`84dd69fb3c529d6b31400fda98f209b5aebe29dd124dae9691e931f6c19b4c16`
MD5	`de2b097506857d2a2d97183947e7c337`
BLAKE2b-256	`ca7aa42ea5336c4e4ca8e7b482f665cce8325f36cc78a9b4421a56ffd1fcfe08`
