
Self-hostable OpenAI-compatible multi-modality AI server: 20 modalities (chat, image, audio, video, 3D, embeddings, OCR, segmentation), plugin runtimes (PyTorch, diffusers, llama-cpp-python).


Muse


Model-agnostic multi-modality generation server. OpenAI-compatible HTTP is the canonical interface:

  • text-to-speech on /v1/audio/speech
  • speech-to-text on /v1/audio/transcriptions and /v1/audio/translations
  • text-to-music on /v1/audio/music and text-to-sound-effects on /v1/audio/sfx
  • text-to-image on /v1/images/generations, image inpainting on /v1/images/edits, image variations on /v1/images/variations
  • image-to-image super-resolution on /v1/images/upscale
  • promptable segmentation on /v1/images/segment
  • text-to-animation on /v1/images/animations
  • text-to-video on /v1/video/generations
  • image-to-vector on /v1/images/embeddings
  • audio-to-vector on /v1/audio/embeddings
  • text-to-vector on /v1/embeddings
  • text-to-text (LLM, tool calls, streaming) on /v1/chat/completions
  • text moderation/classification on /v1/moderations
  • text rerank (Cohere-compat) on /v1/rerank
  • text summarization (Cohere-compat) on /v1/summarize

Modality tags are MIME-style (audio/embedding, audio/generation, audio/speech, audio/transcription, chat/completion, embedding/text, image/animation, image/embedding, image/generation, image/segmentation, image/upscale, text/classification, text/rerank, text/summarization, video/generation).

Three ways to add a model, in order of how often you'll reach for them:

  1. Pull a GGUF or sentence-transformers model from HuggingFace by URI. No script, no edits:
    muse search qwen3 --modality chat/completion --max-size-gb 10
    muse pull hf://Qwen/Qwen3-8B-GGUF@q4_k_m
    
  2. Drop a .py script into ~/.muse/models/ for a one-off model with custom code (see docs/MODEL_SCRIPTS.md).
  3. Add a whole new modality (rare) by dropping a subpackage into src/muse/modalities/ or $MUSE_MODALITIES_DIR. The subpackage exports MODALITY + build_router and discovery picks it up. Optional: drop a hf.py next to __init__.py exporting an HF_PLUGIN dict; muse's HF resolver picks it up the same way and muse search/muse pull hf://... work for the new modality.

All three surfaces are discovered at runtime; there is no hardcoded catalog, no allowlist, and no registration calls.
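
To make the third surface concrete, here is a minimal hypothetical sketch of a modality subpackage's __init__.py. Only the two exported names (MODALITY and build_router) come from the description above; the tag, the build_router signature, the route path, and the request/response fields are illustrative assumptions, not muse's actual wire contract (see src/muse/modalities/ for real examples).

# $MUSE_MODALITIES_DIR/text_reverse/__init__.py  (hypothetical example)
from fastapi import APIRouter
from pydantic import BaseModel

MODALITY = "text/reverse"  # MIME-style modality tag

class ReverseRequest(BaseModel):
    input: str
    model: str

def build_router() -> APIRouter:
    router = APIRouter()

    @router.post("/v1/text/reverse")
    def reverse(req: ReverseRequest) -> dict:
        # A real modality dispatches to the model named in req.model;
        # this stub only shows the exported-symbol shape discovery expects.
        return {"model": req.model, "output": req.input[::-1]}

    return router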

The CLI is deliberately admin-only (serve, pull, search, models). Generation is reached via the HTTP API, consumed by Python clients, curl, or wrappers such as muse mcp.

Install

pip install -e ".[server,audio,images]"

Optional extras:

  • audio: PyTorch + transformers for TTS backends
  • audio-kokoro: Kokoro TTS (needs system espeak-ng)
  • images: diffusers + Pillow for SD-Turbo and future image backends
  • server: FastAPI + uvicorn + sse-starlette (only needed on the serving host)
  • dev: pytest + coverage tools

Quick start

# Pull bundled models by id (creates a dedicated venv + installs deps + downloads weights)
muse pull soprano-80m
muse pull sd-turbo

# Or pull anything resolvable from HuggingFace by URI
muse pull hf://Qwen/Qwen3-8B-GGUF@q4_k_m
muse pull hf://sentence-transformers/all-MiniLM-L6-v2

# Admin: list what's in the catalog
muse models list

# Start the server (instant boot; serves OpenAI-compatible endpoints).
# As of v0.40.0 muse is lazy-load: enabled models stay on disk until
# the first request that names them, then spawn a worker on demand.
muse serve --host 0.0.0.0 --port 8000

# Optional: pre-warm a model so the first real request is hot
muse models warmup soprano-80m

From any client, generation is an HTTP call:

# Text-to-speech
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello world","model":"soprano-80m"}' \
  --output hello.wav

# Embeddings (accepts single string or list)
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input":"hello world","model":"all-minilm-l6-v2"}'

# Image embeddings (input is data: URL or http(s):// URL; mirrors /v1/embeddings)
IMG_B64=$(base64 -w0 cat.png)
curl -X POST http://localhost:8000/v1/images/embeddings \
  -H "Content-Type: application/json" \
  -d "{\"input\":\"data:image/png;base64,${IMG_B64}\",\"model\":\"dinov2-small\"}"

# Audio embeddings (multipart upload; one or more `file` parts; mirrors /v1/embeddings envelope)
curl -X POST http://localhost:8000/v1/audio/embeddings \
  -F "file=@clip.wav" \
  -F "model=mert-v1-95m"

# Chat (OpenAI-compatible incl. tools and streaming)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-8b-gguf-q4-k-m","messages":[{"role":"user","content":"Capital of France?"}]}'

# Rerank (Cohere-compat); pulls bge-reranker-v2-m3 by default
curl -X POST http://localhost:8000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "what is muse?",
    "documents": [
      "muse is an audio server",
      "muse is a multi-modality generation server",
      "muse is the goddess of inspiration"
    ],
    "model": "bge-reranker-v2-m3",
    "top_n": 2,
    "return_documents": true
  }'

# Summarize (Cohere-compat); pulls bart-large-cnn by default
curl -X POST http://localhost:8000/v1/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "muse is a model-agnostic multi-modality generation server. It hosts text, image, audio, and video models behind a unified HTTP API that mirrors OpenAI where possible.",
    "length": "short",
    "format": "paragraph",
    "model": "bart-large-cnn"
  }'

# Music generation (capability-gated; default model: stable-audio-open-1.0)
curl -X POST http://localhost:8000/v1/audio/music \
  -H "Content-Type: application/json" \
  -d '{"prompt":"ambient piano with light rain","model":"stable-audio-open-1.0","duration":10.0}' \
  --output music.wav

# Sound effects generation (same model, different intent)
curl -X POST http://localhost:8000/v1/audio/sfx \
  -H "Content-Type: application/json" \
  -d '{"prompt":"footsteps on gravel","model":"stable-audio-open-1.0","duration":3.0}' \
  --output footsteps.wav

# Image inpainting (multipart: image + mask + prompt)
# White mask pixels are regenerated; black pixels are kept.
curl -X POST http://localhost:8000/v1/images/edits \
  -F "image=@scene.png" \
  -F "mask=@mask.png" \
  -F "prompt=add a moon to the sky" \
  -F "model=sd-turbo" \
  -F "size=512x512" \
  -F "n=1"

# Image variations (multipart: image only, no prompt)
curl -X POST http://localhost:8000/v1/images/variations \
  -F "image=@scene.png" \
  -F "model=sd-turbo" \
  -F "size=512x512" \
  -F "n=2"

# Image upscale (multipart: 4x super-resolution; SD x4 supports scale=4 only)
curl -s -X POST http://localhost:8000/v1/images/upscale \
  -F "image=@source.png" \
  -F "model=stable-diffusion-x4-upscaler" \
  -F "scale=4" \
  -F "prompt=high detail" \
  | jq -r '.data[0].b64_json' \
  | base64 -d > upscaled.png

# Image segmentation (multipart: SAM-2 promptable masks)
# Mode 1: automatic (sweep grid of point prompts internally)
curl -s -X POST http://localhost:8000/v1/images/segment \
  -F "image=@scene.png" \
  -F "model=sam2-hiera-tiny" \
  -F "mode=auto" \
  -F "max_masks=8"

# Mode 2: foreground click points
curl -s -X POST http://localhost:8000/v1/images/segment \
  -F "image=@scene.png" \
  -F "model=sam2-hiera-tiny" \
  -F "mode=points" \
  -F 'points=[[150, 200]]'

# Mode 3: bounding boxes
curl -s -X POST http://localhost:8000/v1/images/segment \
  -F "image=@scene.png" \
  -F "model=sam2-hiera-tiny" \
  -F "mode=boxes" \
  -F 'boxes=[[50, 60, 250, 240]]' \
  -F "mask_format=rle"

# Video generation (since v0.27.0; GPU required, tight even on 8GB VRAM)
# Default response_format=mp4; "webm" and "frames_b64" also supported.
curl -s -X POST http://localhost:8000/v1/video/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a flag waving in the wind",
    "model": "wan2-1-t2v-1-3b",
    "duration_seconds": 5.0,
    "fps": 5,
    "size": "832x480",
    "steps": 30
  }' \
  | jq -r '.data[0].b64_json' \
  | base64 -d > flag.mp4

The per-modality Python clients wrap the same HTTP API:

from muse.modalities.audio_speech import SpeechClient
from muse.modalities.image_generation import (
    GenerationsClient, ImageEditsClient, ImageVariationsClient,
)
from muse.modalities.embedding_text import EmbeddingsClient
from muse.modalities.chat_completion import ChatClient

# MUSE_SERVER env var sets the base URL for remote use; default http://localhost:8000
wav_bytes = SpeechClient().infer("Hello world")
pngs = GenerationsClient().generate("a cat on mars, cinematic", n=1)
vectors = EmbeddingsClient().embed(["alpha", "beta"])   # list[list[float]]
chat = ChatClient().chat(
    model="qwen3-8b-gguf-q4-k-m",
    messages=[{"role": "user", "content": "Capital of France?"}],
)

# Image inpainting and variations (since v0.21.0)
src = open("scene.png", "rb").read()
msk = open("mask.png", "rb").read()
edited = ImageEditsClient().edit(
    "add a moon to the sky", image=src, mask=msk, model="sd-turbo",
)
variants = ImageVariationsClient().vary(image=src, model="sd-turbo", n=2)
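
# Building mask.png programmatically is an ordinary Pillow job (Pillow ships
# with the images extra). Illustrative sketch; the coordinates are placeholders,
# white pixels mark the region to regenerate, black pixels are kept:
from PIL import Image, ImageDraw
scene = Image.open("scene.png")
mask_img = Image.new("L", scene.size, 0)   # all black: keep everything
ImageDraw.Draw(mask_img).rectangle(
    [scene.width // 2, 0, scene.width, scene.height // 3], fill=255,
)
mask_img.save("mask.png")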

# Image upscale (since v0.25.0): 4x super-resolution
from muse.modalities.image_upscale import ImageUpscaleClient
from pathlib import Path
upscaled = ImageUpscaleClient().upscale(
    image=Path("source.png").read_bytes(),
    model="stable-diffusion-x4-upscaler",
    scale=4,
    prompt="razor sharp detail",
)
Path("upscaled.png").write_bytes(upscaled[0])

# Image segmentation (since v0.26.0): SAM-2 promptable masks
from muse.modalities.image_segmentation import ImageSegmentationClient
seg = ImageSegmentationClient()
src_bytes = Path("scene.png").read_bytes()
result_auto = seg.segment(
    image=src_bytes, model="sam2-hiera-tiny", mode="auto", max_masks=8,
)
result_points = seg.segment(
    image=src_bytes, model="sam2-hiera-tiny", mode="points",
    points=[[150, 200]],
)
result_boxes = seg.segment(
    image=src_bytes, model="sam2-hiera-tiny", mode="boxes",
    boxes=[[50, 60, 250, 240]], mask_format="rle",
)
# Each result is a dict {id, model, mode, image_size, masks: [...]}
# masks[i]["mask"] is a base64 PNG (mask_format=png_b64) or
# a {"size": [H, W], "counts": str} dict (mask_format=rle)

# Video generation (since v0.27.0): GPU required, tight even on 8GB VRAM
# Wan2.1 T2V 1.3B (~3GB at fp16) is the default low-VRAM bundle;
# CogVideoX-2b (~9GB) and LTX-Video (~16GB) are curated additions.
from muse.modalities.video_generation import VideoGenerationClient
vid = VideoGenerationClient()
mp4_bytes = vid.generate(
    "a flag waving in the wind",
    model="wan2-1-t2v-1-3b",
    duration_seconds=5.0,
    fps=5,
    size="832x480",
    steps=30,
)
Path("flag.mp4").write_bytes(mp4_bytes)

VRAM caveats for video/generation: even Wan 1.3B at fp16 is tight on 8GB cards; 12GB+ recommended for headroom. CogVideoX-2b realistically wants 16GB. LTX-Video needs 16GB+. Mochi-1 (24GB+) and HunyuanVideo (60GB+) are documented but not curated; their dedicated runtimes ship in v1.next.

The OpenAI Python SDK works against muse with no modifications:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
client.chat.completions.create(model="qwen3-8b-gguf-q4-k-m", messages=[...])
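
Streaming goes through the same SDK; a short sketch using the chat model pulled earlier (stream=True yields incremental chunks):

stream = client.chat.completions.create(
    model="qwen3-8b-gguf-q4-k-m",
    messages=[{"role": "user", "content": "Capital of France?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content   # None for non-content chunks
    if delta:
        print(delta, end="", flush=True)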

Vision (v0.42.0+):

from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

with open("photo.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()

r = client.chat.completions.create(
    model="smolvlm-256m-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(r.choices[0].message.content)

muse serve auto-restarts crashed worker processes with exponential backoff. Individual model failures don't take down the server or other modalities.

As of v0.40.0 muse is lazy-load by default. muse serve brings the gateway up instantly with zero workers running. The first request to each model triggers a cold load (worker spawn + weights), so expect 5-30s of latency on that first hit; subsequent requests are hot. Memory pressure is handled by on-demand LRU eviction backed by live pynvml + psutil measurements: a 12GB GPU can have 30 models catalog-enabled and serve them all, just not simultaneously. Operators who want eager-boot semantics put a warmup loop in their startup script:

muse serve &
sleep 1
for m in $(muse models list --json | jq -r '.[].id'); do
    muse models warmup "$m"
done

muse models list shows a five-state status indicator: enabled_loaded (filled circle) for resident workers, enabled_unloaded (half circle) for catalog-enabled-but-unloaded, plus the existing disabled, recommended, and available states. /v1/models gains loaded, last_loaded_at, and unservable_reason per entry. Headroom margins are tunable via MUSE_GPU_HEADROOM_GB (default 1.0) and MUSE_CPU_HEADROOM_GB (default 2.0); declared caps via MUSE_GPU_BUDGET_GB and MUSE_CPU_BUDGET_GB are optional and combined with live measurements as min(declared, live).
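
Those per-entry fields are also the quickest way to check what is currently resident. A stdlib-only sketch, assuming /v1/models uses the usual OpenAI-style {"object": "list", "data": [...]} envelope:

import json, urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    entries = json.load(resp).get("data", [])

for m in entries:
    # loaded / last_loaded_at / unservable_reason are the fields described above
    state = "loaded" if m.get("loaded") else "unloaded"
    print(m["id"], state, m.get("last_loaded_at"))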

CLI (admin-only)

| Command | Description |
| --- | --- |
| muse serve | start the HTTP server (instant boot; lazy-load on first request) |
| muse pull <model-id-or-uri> | download weights + install deps + run probe (accepts bundled id OR resolver URI like hf://org/repo@variant; --no-probe opts out) |
| muse search <query> [--modality M] | search HuggingFace for pullable GGUF / sentence-transformers models |
| muse models list [--modality X] | list known/pulled models with five-state status (enabled_loaded / enabled_unloaded / disabled / recommended / available) |
| muse models info <model-id> | show catalog entry |
| muse models remove <model-id> | unregister from catalog |
| muse models enable <model-id> | mark a pulled model active in the catalog (allowed to lazy-load) |
| muse models disable <model-id> | mark a pulled model inactive in the catalog (refuses to lazy-load) |
| muse models warmup <model-id> | pre-load a model into a worker without serving traffic; first real request is hot |
| muse models refresh <id> \| --all \| --enabled | re-install muse[server,extras] into per-model venv(s) (after pip install -U muse) |
| muse mcp [--http] | run an MCP server bridging muse to LLM clients (29 tools) |

No per-modality subcommands (muse speak, muse audio ...). Those would be hardcoded modality-to-verb mappings that grow with every new modality. Keeping the CLI modality-agnostic means embeddings, transcriptions, and video land without CLI churn.

HTTP endpoints

| Endpoint | Purpose |
| --- | --- |
| GET /health | liveness + enabled modalities |
| GET /v1/models | all registered models, aggregated |
| POST /v1/audio/speech | synthesize speech (OpenAI-compatible) |
| GET /v1/audio/speech/voices | list voices for a model |
| POST /v1/audio/transcriptions | transcribe audio to text (OpenAI-compatible) |
| POST /v1/audio/translations | transcribe + translate audio to English (OpenAI-compatible) |
| POST /v1/images/generations | generate images (OpenAI-compatible; supports img2img via image + strength) |
| POST /v1/images/edits | inpaint masked regions (OpenAI-compatible; multipart with image + mask + prompt) |
| POST /v1/images/variations | generate alternates of one image (OpenAI-compatible; multipart, no prompt) |
| POST /v1/images/upscale | image super-resolution (multipart; SD x4 supports scale=4 only) |
| POST /v1/images/segment | promptable segmentation (multipart; auto / points / boxes modes) |
| POST /v1/images/animations | text-to-animation generation |
| POST /v1/embeddings | text embeddings (OpenAI-compatible) |
| POST /v1/images/embeddings | image embeddings (OpenAI-shape envelope mirroring /v1/embeddings) |
| POST /v1/audio/embeddings | audio embeddings (multipart upload + OpenAI-shape envelope mirroring /v1/embeddings) |
| POST /v1/chat/completions | chat (OpenAI-compatible incl. tools, structured output, streaming) |
| POST /v1/moderations | text moderation/classification (OpenAI-compatible) |
| POST /v1/rerank | text rerank (Cohere-compat) |
| POST /v1/summarize | text summarization (Cohere-compat) |
| POST /v1/audio/music | music generation (capability-gated; muse-native shape) |
| POST /v1/audio/sfx | sound-effect generation (capability-gated; muse-native shape) |
| POST /v1/video/generations | text-to-video generation (mp4/webm/frames_b64; GPU-required) |

Error shape is uniform: {"error": {"code", "message", "type"}} across 404 (model not found) and 422 (validation). Matches OpenAI's envelope so clients written against their API work against muse.
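
A small client-side sketch of that envelope, using only the standard library and a deliberately nonexistent model id:

import json, urllib.error, urllib.request

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps({
        "model": "no-such-model",
        "messages": [{"role": "user", "content": "hi"}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as exc:        # 404 model-not-found, 422 validation
    err = json.load(exc)["error"]            # uniform {"code", "message", "type"}
    print(exc.code, err["type"], err["code"], err["message"])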

Admin endpoints (v0.28.0+)

Eleven endpoints under /v1/admin/* let you enable, disable, probe, pull, and remove models on a running supervisor without restarting it. The admin surface is closed-by-default: set MUSE_ADMIN_TOKEN to any non-empty value to enable it, then send Authorization: Bearer <token> on every request.

| Endpoint | Purpose |
| --- | --- |
| POST /v1/admin/models/{id}/enable | spawn a worker (or restart in place) hosting id; returns 202 + job_id |
| POST /v1/admin/models/{id}/disable | unload id from its worker; synchronous |
| POST /v1/admin/models/{id}/probe | run muse models probe in the model's venv; returns 202 + job_id |
| POST /v1/admin/models/_/pull | pull from a curated alias or resolver URI given in the body; returns 202 + job_id |
| DELETE /v1/admin/models/{id}?purge=bool | remove from catalog (refuses with 409 if the model is loaded) |
| GET /v1/admin/models/{id}/status | merged catalog + live worker view |
| GET /v1/admin/workers | spawned workers + pid/uptime/restart count |
| POST /v1/admin/workers/{port}/restart | SIGTERM the worker on that port; the auto-restart monitor handles bring-up |
| GET /v1/admin/memory | per-device aggregate + per-model breakdown |
| GET /v1/admin/jobs/{job_id} | one async-job record (404 if reaped) |
| GET /v1/admin/jobs | recent jobs, newest first |

Auth setup:

export MUSE_ADMIN_TOKEN="$(openssl rand -hex 32)"  # or any non-empty value
muse serve  # admin endpoints now active under the same port

Five auth scenarios:

  • env var unset, any header: 503 admin_disabled
  • env var set, no header: 401 missing_token
  • env var set, malformed header: 401 missing_token
  • env var set, wrong bearer: 403 invalid_token
  • env var set, correct bearer: route runs

curl examples:

TOKEN="$MUSE_ADMIN_TOKEN"
H="Authorization: Bearer $TOKEN"

# enable a pulled model (worker spawns or joins existing venv-group)
curl -s -X POST -H "$H" http://localhost:8000/v1/admin/models/kokoro-82m/enable

# poll the returned job
curl -s -H "$H" http://localhost:8000/v1/admin/jobs/<job_id>

# disable a loaded model (sync)
curl -s -X POST -H "$H" http://localhost:8000/v1/admin/models/kokoro-82m/disable

# merged status
curl -s -H "$H" http://localhost:8000/v1/admin/models/kokoro-82m/status

# memory aggregate (psutil + pynvml)
curl -s -H "$H" http://localhost:8000/v1/admin/memory

Python (use the AdminClient):

from muse.admin.client import AdminClient

# Reads MUSE_SERVER and MUSE_ADMIN_TOKEN from env when unset.
admin = AdminClient()

job = admin.enable("kokoro-82m")
final = admin.wait(job["job_id"])
print(final["state"], final.get("result"))

print(admin.status("kokoro-82m"))
print(admin.workers())
print(admin.memory())

The muse models enable/disable CLI commands route through this admin API automatically when MUSE_ADMIN_TOKEN is set and the supervisor is reachable, falling back to a catalog-only mutation (effective on next muse serve) otherwise.

MCP server (since v0.29.0)

muse mcp runs a Model Context Protocol server that exposes muse to LLM clients (Claude Desktop, Cursor, etc.) as 29 structured tools: 11 admin tools (gated by MUSE_ADMIN_TOKEN) plus 18 inference tools. Stdio mode is the default (for desktop apps); HTTP+SSE mode (--http --port 8088) is available for remote / web embedders.

muse mcp                                  # stdio mode
muse mcp --http --port 8088               # HTTP+SSE
muse mcp --filter inference               # only inference tools (no admin)
muse mcp --filter admin                   # only admin tools (control panel)
muse mcp --server http://other:8000       # connect to a remote muse server

Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS, %APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "muse": {
      "command": "muse",
      "args": ["mcp"],
      "env": {
        "MUSE_SERVER": "http://localhost:8000",
        "MUSE_ADMIN_TOKEN": "your-admin-token-here"
      }
    }
  }
}

Tools split into two groups:

Admin (11): muse_list_models, muse_get_model_info, muse_search_models, muse_pull_model, muse_remove_model, muse_enable_model, muse_disable_model, muse_probe_model, muse_get_memory_status, muse_get_workers, muse_get_jobs. Long-running ops (pull, probe, enable) return a job_id and the LLM polls muse_get_jobs to track progress.

Inference (18): muse_chat, muse_summarize, muse_rerank, muse_classify, muse_embed_text, muse_generate_image, muse_edit_image, muse_vary_image, muse_upscale_image, muse_segment_image, muse_generate_animation, muse_embed_image, muse_speak, muse_transcribe, muse_generate_music, muse_generate_sfx, muse_embed_audio, muse_generate_video.

Binary inputs accept <name>_b64 (base64), <name>_url (data: or http URL), or <name>_path (local file). Image and audio outputs return as MCP ImageContent / AudioContent blocks plus a JSON summary.

Architecture

  • muse.core: modality-agnostic discovery, registry, catalog, venv management, HF downloader, pip auto-install, FastAPI app factory.
  • muse.cli_impl: serve (supervisor), worker (single-venv process), gateway (HTTP proxy routing by request's model field).
  • muse.modalities/: one subpackage per modality (wire contract: protocol + routes + codec + client).
    • audio_embedding/ (MODALITY "audio/embedding"; multipart upload + OpenAI-shape envelope; includes runtimes/transformers_audio.py)
    • audio_generation/ (MODALITY "audio/generation"; mounts both /v1/audio/music and /v1/audio/sfx on one MIME tag with per-route capability gates)
    • audio_speech/ (MODALITY "audio/speech")
    • audio_transcription/ (MODALITY "audio/transcription"; multipart/form-data upload, OpenAI Whisper wire shape)
    • chat_completion/ (MODALITY "chat/completion"; includes runtimes/llama_cpp.py)
    • embedding_text/ (MODALITY "embedding/text"; includes runtimes/sentence_transformers.py)
    • image_embedding/ (MODALITY "image/embedding"; includes runtimes/transformers_image.py)
    • image_generation/ (MODALITY "image/generation")
    • image_segmentation/ (MODALITY "image/segmentation")
    • image_upscale/ (MODALITY "image/upscale")
    • text_classification/ (MODALITY "text/classification"; OpenAI /v1/moderations wire shape)
    • text_rerank/ (MODALITY "text/rerank"; Cohere /v1/rerank wire shape)
    • text_summarization/ (MODALITY "text/summarization"; Cohere /v1/summarize wire shape)
    • video_generation/ (MODALITY "video/generation"; includes runtimes/wan_runtime.py and runtimes/cogvideox_runtime.py)
  • muse.models/: flat directory of drop-in model scripts, one file per model (MANIFEST + Model class).
    • soprano_80m.py, kokoro_82m.py, bark_small.py (audio/speech)
    • nv_embed_v2.py (embedding/text; MiniLM and Qwen3-Embedding are now resolver-pulled via the generic runtime, see curated.yaml)
    • sd_turbo.py (image/generation)
    • bge_reranker_v2_m3.py (text/rerank)
    • stable_audio_open_1_0.py (audio/generation; Stable Audio Open 1.0, Apache 2.0)
    • bart_large_cnn.py (text/summarization; facebook/bart-large-cnn, Apache 2.0, ~400MB CPU-friendly)
    • dinov2_small.py (image/embedding; facebook/dinov2-small, Apache 2.0, 88MB, 384-dim CPU-friendly)
    • mert_v1_95m.py (audio/embedding; m-a-p/MERT-v1-95M, MIT, 95MB, 768-dim music understanding via mean-pool over time)
    • wan2_1_t2v_1_3b.py (video/generation; Wan-AI/Wan2.1-T2V-1.3B, Apache 2.0, ~3GB at fp16, 5s clips at 832x480, GPU-required)
  • muse.core.resolvers: URI -> ResolvedModel dispatch for muse pull hf://....
    • resolvers_hf registers the hf:// resolver for HuggingFace GGUF + sentence-transformers repos.

muse serve is a supervisor process. It spawns one worker subprocess per venv (each pulled model has its own venv with its own deps) and runs a gateway that proxies by the model field. Dep conflicts between models are structurally impossible.

Three ways to extend muse:

  1. Resolver URI: muse pull hf://Qwen/Qwen3-8B-GGUF@q4_k_m for any GGUF or sentence-transformers HF repo. See docs/RESOLVERS.md.
  2. Model script: drop a .py into ~/.muse/models/ for one-off models with custom code. See docs/MODEL_SCRIPTS.md.
  3. Modality subpackage: drop into src/muse/modalities/ or $MUSE_MODALITIES_DIR for a whole new modality.

See CLAUDE.md for implementation details and contribution guide, docs/MODEL_SCRIPTS.md for writing your own model scripts, docs/RESOLVERS.md for adding a new URI scheme, and docs/CHAT_COMPLETION.md for the chat endpoint specification.

License

MIT
