Unified serving layer for non-text foundation models
Sheaf
vLLM solved inference for text LLMs by defining a standard compute contract and optimizing behind it. The same problem exists for every other class of foundation model — time series, tabular, molecular, geospatial, diffusion, audio — and nobody has solved it. Sheaf is that solution.
Each model type gets a typed request/response contract. Batching, caching, and scheduling are optimized per model type. Ray Serve is the substrate. Feast is a first-class input primitive.
In mathematics, a sheaf tracks locally-defined data that glues consistently across a space. Each model type defines its own local contract; Sheaf ensures they cohere into a unified serving layer.
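To make the idea of a per-model-type contract concrete, here is a minimal standard-library sketch of request classes selected by a `model_type` discriminator. It is an illustration only: the class names, the `rows` field, and `parse_request` are hypothetical and are not Sheaf's actual schemas.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class TimeSeriesRequestSketch:
    # Hypothetical stand-in for a time-series contract.
    model_name: str
    history: list[float]
    horizon: int
    model_type: Literal["time_series"] = "time_series"


@dataclass(frozen=True)
class TabularRequestSketch:
    # Hypothetical stand-in for a tabular contract.
    model_name: str
    rows: list[list[float]]
    model_type: Literal["tabular"] = "tabular"


AnyRequestSketch = Union[TimeSeriesRequestSketch, TabularRequestSketch]

_BY_TYPE = {
    "time_series": TimeSeriesRequestSketch,
    "tabular": TabularRequestSketch,
}


def parse_request(payload: dict) -> AnyRequestSketch:
    """Pick the request class from the model_type discriminator."""
    try:
        cls = _BY_TYPE[payload["model_type"]]
    except KeyError as exc:
        raise ValueError(f"unknown or missing model_type: {exc}") from exc
    return cls(**payload)


req = parse_request(
    {"model_type": "time_series", "model_name": "chronos",
     "history": [1.0, 2.0, 3.0], "horizon": 2}
)
```

Each model type keeps its own fields, while the shared discriminator lets one entry point route any payload to the right contract.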
Install
Try it without installing: a live deployment of `amazon/chronos-bolt-tiny` is running on Modal at https://korbonits--sheaf-demo-modalserver---init----locals---serve.modal.run

curl https://korbonits--sheaf-demo-modalserver---init----locals---serve.modal.run/chronos/health
Requires Python 3.11+. macOS's system `python3` is usually older than that, so bootstrap a 3.11 venv first via `uv` (`uv venv --python 3.11 .venv && source .venv/bin/activate`) or `pyenv`. The `[molecular]` and `[protein]` extras (ESM-3 / ESMC / ESMFold2) additionally require Python 3.12+, and the two are mutually exclusive (they share the `esm` import name from different packages).
pip install sheaf-serve # core only
pip install "sheaf-serve[time-series]" # + Chronos2 / TimesFM / Moirai
pip install "sheaf-serve[tabular]" # + TabPFN
pip install "sheaf-serve[molecular]" # + ESM-3 (Python 3.12+)
pip install "sheaf-serve[protein]" # + ESMC / ESMFold2 deps (Python 3.12+)
# then also (no PyPI release yet — pinned commit per upstream README):
pip install "esm@git+https://github.com/Biohub/esm.git@81b3646c9429ea8458918415ad6a46178cb59833"
pip install "sheaf-serve[genomics]" # + Nucleotide Transformer
pip install "sheaf-serve[small-molecule]" # + MolFormer
pip install "sheaf-serve[materials]" # + MACE-MP
pip install "sheaf-serve[audio]" # + Whisper / faster-whisper
pip install "sheaf-serve[audio-generation]" # + MusicGen
pip install "sheaf-serve[tts]" # + Bark
pip install "sheaf-serve[vision]" # + DINOv2 / OpenCLIP / SAM2 / Depth Anything / DETR
pip install "sheaf-serve[earth-observation]" # + Prithvi
pip install "sheaf-serve[weather]" # + GraphCast
pip install "sheaf-serve[feast]" # + Feast feature store integration
pip install "sheaf-serve[modal]" # + Modal serverless deployment
pip install "sheaf-serve[batch]" # + offline batch inference (Ray Data)
pip install "sheaf-serve[all]" # everything
Quickstart
Direct backend inference:
from sheaf.api.time_series import Frequency, OutputMode, TimeSeriesRequest
from sheaf.backends.chronos import Chronos2Backend

backend = Chronos2Backend(model_id="amazon/chronos-bolt-tiny", device_map="cpu")
backend.load()

req = TimeSeriesRequest(
    model_name="chronos-bolt-tiny",
    history=[312, 298, 275, 260, 255, 263, 285, 320,
             368, 402, 421, 435, 442, 438, 430, 425],
    horizon=12,
    frequency=Frequency.HOURLY,
    output_mode=OutputMode.QUANTILES,
    quantile_levels=[0.1, 0.5, 0.9],
)
response = backend.predict(req)
# response.mean, response.quantiles
Ray Serve (production, autoscaling):
from sheaf import ModelServer
from sheaf.spec import ModelSpec, ResourceConfig
from sheaf.api.base import ModelType

server = ModelServer(models=[
    ModelSpec(
        name="chronos",
        model_type=ModelType.TIME_SERIES,
        backend="chronos2",
        backend_kwargs={"model_id": "amazon/chronos-bolt-small"},
        resources=ResourceConfig(num_gpus=1),
    ),
])
server.run()  # POST /chronos/predict, GET /chronos/health
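With the server running, the `POST /chronos/predict` endpoint can be exercised from any HTTP client. The sketch below uses only the standard library. It assumes the server listens on `http://localhost:8000` and that the JSON body mirrors the `TimeSeriesRequest` fields; the `"1h"` frequency string is a guess for the wire value of `Frequency.HOURLY`, and `build_predict_request` is a hypothetical helper, not part of Sheaf.

```python
import json
import urllib.request


def build_predict_request(base_url: str, deployment: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /{deployment}/predict."""
    url = f"{base_url.rstrip('/')}/{deployment}/predict"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "model_type": "time_series",
    "model_name": "chronos",
    "history": [312, 298, 275, 260, 255, 263, 285, 320],
    "horizon": 12,
    "frequency": "1h",  # assumed wire value for Frequency.HOURLY
}
request = build_predict_request("http://localhost:8000", "chronos", payload)

# Sending needs a running server, so it is left commented out:
# with urllib.request.urlopen(request, timeout=30) as resp:
#     print(json.loads(resp.read()))
```

Building the request separately from sending it keeps the payload easy to inspect before anything goes over the network.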
Feast feature store (resolve features at request time):
# ModelSpec wires Feast; no history is needed in the request
spec = ModelSpec(
    name="chronos",
    model_type=ModelType.TIME_SERIES,
    backend="chronos2",
    feast_repo_path="/feast/feature_repo",
)

# Client sends feature_ref instead of raw history
{
  "model_type": "time_series",
  "model_name": "chronos",
  "feature_ref": {
    "feature_view": "asset_prices",
    "feature_name": "close_history_30d",
    "entity_key": "ticker",
    "entity_value": "AAPL"
  },
  "horizon": 7,
  "frequency": "1d"
}
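Conceptually, resolving a `feature_ref` means swapping the reference for the concrete history before the backend sees the request. The sketch below imitates that step with a plain dictionary standing in for the online store. It is not Sheaf's `FeastResolver` and makes no Feast calls; `FeatureRef`, `resolve_history`, `fill_request`, and the stored values are all placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureRef:
    feature_view: str
    feature_name: str
    entity_key: str
    entity_value: str


# Stand-in for an online store: (feature_view, entity_key, entity_value) -> row.
# The numbers are placeholders, not real prices.
FAKE_ONLINE_STORE = {
    ("asset_prices", "ticker", "AAPL"): {
        "close_history_30d": [1.0, 2.0, 3.0, 4.0],
    },
}


def resolve_history(ref: FeatureRef, store: dict) -> list[float]:
    """Look up the referenced feature and return it as the model's history."""
    row = store.get((ref.feature_view, ref.entity_key, ref.entity_value))
    if row is None or ref.feature_name not in row:
        raise KeyError(f"no value for {ref.feature_view}:{ref.feature_name}")
    return list(row[ref.feature_name])


def fill_request(payload: dict, store: dict) -> dict:
    """Replace feature_ref with a concrete history, leaving other fields alone."""
    resolved = dict(payload)
    ref = FeatureRef(**resolved.pop("feature_ref"))
    resolved["history"] = resolve_history(ref, store)
    return resolved


payload = {
    "model_type": "time_series",
    "model_name": "chronos",
    "feature_ref": {
        "feature_view": "asset_prices",
        "feature_name": "close_history_30d",
        "entity_key": "ticker",
        "entity_value": "AAPL",
    },
    "horizon": 7,
    "frequency": "1d",
}
filled = fill_request(payload, FAKE_ONLINE_STORE)
```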
Modal (serverless, zero-infra):
from sheaf import ModalServer
server = ModalServer(models=[spec], app_name="my-sheaf", gpu="A10G")
app = server.app # modal deploy my_server.py
Docker:
FROM ghcr.io/korbonits/sheaf-serve:v0.10.0
RUN pip install --no-cache-dir 'sheaf-serve[time-series]==0.10.0'
COPY server.py .
CMD ["python", "server.py"]
The base image is sheaf-serve core only; extend with the backend extras you need. See examples/docker/ for a worked example with a runnable server.py.
Kubernetes (KubeRay):
examples/k8s/ ships a RayService manifest that deploys the same ModelSpec shape via the KubeRay operator. sheaf.build_app(spec) returns the Ray Serve Application directly, so it slots into KubeRay's serveConfigV2.applications[].import_path:
# app.py — referenced by the manifest as `import_path: app:app`
from sheaf import build_app
from sheaf.spec import ModelSpec
spec = ModelSpec(name="chronos", ...)
app = build_app(spec)
Typed Python client:
from sheaf.client import SheafClient
from sheaf.api.time_series import Frequency, TimeSeriesRequest

with SheafClient(base_url="http://localhost:8000") as client:
    resp = client.predict(
        "chronos",
        TimeSeriesRequest(
            model_name="chronos",
            history=[1.0, 2.0, 3.0, 4.0, 5.0],
            horizon=3,
            frequency=Frequency.HOURLY,
        ),
    )
    # resp is a typed TimeSeriesResponse, the same Pydantic class the server returned
    print(resp.mean)
`AsyncSheafClient` is the async mirror of `SheafClient`; `client.stream(deployment, request)` yields SSE events for streaming backends like FLUX.
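For readers unfamiliar with the SSE wire format, the following sketch parses a stream of text lines into events using the standard `event:` / `data:` fields. It is independent of Sheaf's client; the event names (`progress`, `result`) and payload fields are assumptions made up for the example.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                      # a blank line ends the event
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):          # comment or keep-alive line
            continue
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream: one progress event, a keep-alive, then the result.
stream = [
    "event: progress\n",
    'data: {"step": 1, "total": 4}\n',
    "\n",
    ": keep-alive\n",
    "event: result\n",
    'data: {"done": true}\n',
    "\n",
]
events = [(e["event"], json.loads(e["data"])) for e in iter_sse_events(stream)]
```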
See examples/ for time series comparison, tabular, audio, vision, and the Feast feature store quickstart.
Protein models
Sheaf serves three protein foundation models, each via its own typed contract:
- ESM-3 (`api/molecular.py`, backend `esm3`): per-sequence pooled embeddings (mean / cls). Use for sequence-level similarity, clustering, and downstream featurization. `[molecular]` extra (Python 3.12+).
- ESMC (`api/protein_language.py`, backend `esmc`): per-token logits + optional per-token embeddings from Biohub's 2026-05-27 release. Use when you need masked-LM logits, per-residue representations, or all-layer hidden states. Default model: `Biohub/ESMC-6B`. `[protein]` extra (Python 3.12+); 300M / 600M variants are Forge API-only and currently raise `NotImplementedError`.
- ESMFold2 (`api/structure.py`, backend `esmfold2`): protein structure prediction with inference-time scaling. Exposes `num_loops`, `num_sampling_steps`, `num_samples`, `seed` as first-class request fields; returns PDB / mmCIF + pLDDT + pTM/ipTM + optional PAE. Default model: `biohub/ESMFold2`. `[protein]` extra (Python 3.12+).
[molecular] (ESM-3) and [protein] (ESMC + ESMFold2) share the esm import name from different upstream packages — install one or the other in a given environment. See docs/adr/0001-esmc-esmfold2-integration.md for the rationale.
Biohub release announcement: https://github.com/Biohub/esm · preprint: https://biohub.ai/papers/esm_protein.pdf.
Supported model types
| Type | Status | Backends |
|---|---|---|
| Time series | ✅ v0.1 | Chronos2, Chronos-Bolt, TimesFM, Moirai |
| Tabular | ✅ v0.1 | TabPFN v2 |
| Audio transcription | ✅ v0.3 | Whisper, faster-whisper |
| Audio generation | ✅ v0.3 | MusicGen |
| Text-to-speech | ✅ v0.3 | Bark |
| Vision embeddings | ✅ v0.3 | OpenCLIP, DINOv2 |
| Segmentation | ✅ v0.3 | SAM2 |
| Depth estimation | ✅ v0.3 | Depth Anything v2 |
| Object detection | ✅ v0.3 | DETR / RT-DETR |
| Protein / molecular | ✅ v0.3 | ESM-3 (Python 3.12+) |
| Protein language modeling | ✅ v0.11 | ESMC 6B (Biohub) |
| Protein structure prediction | ✅ v0.11 | ESMFold2 (Biohub) — inference-time scaling |
| Genomics | ✅ v0.3 | Nucleotide Transformer |
| Small molecule | ✅ v0.3 | MolFormer-XL |
| Materials science | ✅ v0.3 | MACE-MP-0 |
| Earth observation | ✅ v0.3 | Prithvi (IBM/NASA) |
| Weather forecasting | ✅ v0.3 | GraphCast |
| Cross-modal embeddings | ✅ v0.3 | ImageBind (text, vision, audio, depth, thermal) |
| Feast feature store | ✅ v0.3 | Any Feast online store (SQLite, Redis, DynamoDB, …) |
| Modal serverless | ✅ v0.3 | ModalServer — zero-infra GPU deployment |
| Diffusion / image gen | ✅ v0.4 | FLUX (schnell, dev) |
| Video understanding | ✅ v0.4 | VideoMAE, TimeSformer |
| LiDAR / 3D point cloud | ✅ v0.5 | PointNet (pure PyTorch; embed + ModelNet40 classify) |
| Pose estimation | ✅ v0.5 | ViTPose (COCO 17-keypoint, optional person bboxes) |
| Optical flow | ✅ v0.5 | RAFT (raft_large / raft_small via torchvision) |
| Multimodal generation | ✅ v0.5 | SDXL img2img + inpainting |
| Speech synthesis | ✅ v0.5 | Kokoro (voice + speed per request) |
| Offline batch inference | ✅ v0.6 | BatchRunner (Ray Data; tasks + actor-pool modes) |
| Async-job worker | ✅ v0.7 | SheafWorker (Redis Streams; pluggable queue/result ABCs) |
| LoRA adapter multiplexing | ✅ v0.8 | FLUX, SDXL via ModelSpec.lora (local paths + HF Hub sources) |
Roadmap to production
v0.2 — serving layer (complete)
- Ray Serve integration tested end-to-end
- Async `predict()` handlers
- HTTP API with proper request validation (422 on bad input)
- Health check and readiness probe endpoints
- Batching scheduler (BatchPolicy wired into `@serve.batch` per deployment)
- Error handling at the service boundary (backend exceptions → structured HTTP 500)
- Model hot-swap without restart (`ModelServer.update()`)
- Container-friendly auth for TabPFN v2 (`TABPFN_TOKEN` env var)
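The validation and error-handling items can be pictured as a thin wrapper around the backend call that turns every outcome into a status code plus a JSON body. The sketch below shows that pattern in plain Python. The response shape, `RequestValidationError`, and `handle` are hypothetical and do not describe Sheaf's actual handler.

```python
import uuid


class RequestValidationError(Exception):
    """Hypothetical: raised when the request body fails validation."""


def handle(predict, payload: dict) -> tuple[int, dict]:
    """Call predict(payload) and map the outcome to (status, JSON body)."""
    request_id = str(uuid.uuid4())
    try:
        return 200, {"request_id": request_id, "result": predict(payload)}
    except RequestValidationError as exc:
        return 422, {"request_id": request_id, "error": "validation", "detail": str(exc)}
    except Exception as exc:  # backend failure: return structure, not a traceback
        return 500, {
            "request_id": request_id,
            "error": type(exc).__name__,
            "detail": str(exc),
        }


def ok(payload):
    return sum(payload["history"])


def bad_input(payload):
    raise RequestValidationError("horizon must be positive")


def crash(payload):
    raise RuntimeError("backend failed")


status_ok, body_ok = handle(ok, {"history": [1, 2, 3]})
status_bad, body_bad = handle(bad_input, {})
status_crash, body_crash = handle(crash, {})
```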
v0.3 — model types + integrations (complete)
- ESM-3 protein embeddings
- Nucleotide Transformer genomics embeddings
- MolFormer-XL small molecule embeddings
- MACE-MP-0 materials (energy, forces, stress)
- Whisper / faster-whisper audio transcription
- MusicGen audio generation
- Bark text-to-speech
- OpenCLIP image/text embeddings
- DINOv2 image embeddings
- SAM2 segmentation
- Depth Anything v2 depth estimation
- DETR / RT-DETR object detection
- Prithvi earth observation embeddings
- GraphCast weather forecasting
- ImageBind cross-modal embeddings (text, vision, audio, depth, thermal)
- Feast feature store integration (`feature_ref` in requests, `FeastResolver`, `feast_repo_path` on `ModelSpec`)
- Modal serverless deployment (`ModalServer`, a zero-infra alternative to Ray Serve)
v0.4 — generation + video (complete)
- FLUX diffusion / image generation
- VideoMAE / TimeSformer video understanding
v0.5 — observability + new modalities
Ops / DX:
- PyPI publish (v0.4.0)
- Prometheus metrics endpoint per deployment
- Structured logging with request IDs end-to-end
- OpenTelemetry traces through the request path
Serving / infra:
- Streaming responses (`POST /{name}/stream` → SSE; FLUX emits per-step progress events)
- Request caching (`CacheConfig` on `ModelSpec`: in-process LRU, optional TTL)
- `bucket_by` batching: group requests by field value before `@serve.batch`
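As an illustration of the request-caching item, here is a minimal in-process LRU cache with an optional TTL, built on `collections.OrderedDict`. It is a generic sketch of the technique, not Sheaf's cache; the class name, constructor parameters, and injectable clock are assumptions.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LRUTTLCache:
    """Least-recently-used cache whose entries can also expire by age."""

    def __init__(self, max_size: int, ttl_s: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl_s = ttl_s
        self._clock = clock
        self._items: "OrderedDict[Hashable, tuple[float, Any]]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        """Return the cached value, or None on a miss or an expired entry."""
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._ttl_s is not None and self._clock() - stored_at > self._ttl_s:
            del self._items[key]            # expired: drop it and report a miss
            return None
        self._items.move_to_end(key)        # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the least recently used


# A fake clock keeps the demonstration deterministic.
now = [0.0]
cache = LRUTTLCache(max_size=2, ttl_s=10.0, clock=lambda: now[0])
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # touch "a" so "b" becomes the eviction candidate
cache.put("c", 3)     # evicts "b"
```

In a serving context the key would typically be a hash of the request fields that determine the output.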
New model types:
- LiDAR / 3D point cloud (PointNet: pure-PyTorch, no torch-geometric; embed + ModelNet40 classify; install with `pip install 'sheaf-serve[lidar]'`)
- Pose estimation (ViTPose: COCO 17-keypoint skeleton, optional person bboxes; install with `pip install 'sheaf-serve[pose]'`)
- Optical flow (RAFT: raft_large/raft_small via torchvision; (H, W, 2) float32 flow field; install with `pip install 'sheaf-serve[optical-flow]'`)
- Multimodal generation, text+image-conditioned (SDXL img2img + inpainting; install with `pip install 'sheaf-serve[multimodal-generation]'`)
- Speech synthesis with fine-grained control (Kokoro: voice + speed per request; install with `pip install 'sheaf-serve[kokoro]'`)
v0.6 — offline batch inference (complete)
- `BatchRunner`: same backend, same typed contract, offline batch mode; Ray Data `map_batches` substrate, stateless tasks with a worker-local backend cache so `load()` fires once per worker (not once per batch); install with `pip install 'sheaf-serve[batch]'`
- `BatchSpec`: mirrors `ModelSpec` for backend selection; `JsonlSource` / `JsonlSink` in v1; new sources/sinks (S3, Parquet, Delta) slot in as additional `BatchSource` / `BatchSink` subclasses without changing the runner API
- Actor-pool execution mode for warm loads on expensive backends (FLUX, GraphCast, SDXL): opt-in via `BatchSpec.compute="actors"` + `num_actors=N`; `load()` runs once per actor at `__init__` and persists for the actor's lifetime (#13)
- Resumable checkpointing across process restarts (#12)
v0.7 — async-job queue (complete)
- `SheafWorker`: queue-consumer pattern for long-running inference; v1 ships Redis Streams + consumer groups (horizontal scaling), pluggable `JobQueue` / `ResultStore` ABCs for SQS / Kafka follow-ups; install with `pip install 'sheaf-serve[worker]'`
- Job lifecycle: enqueue → processing → result / dead-letter; at-least-once delivery via XACK-after-persist; per-job webhook on completion (best-effort POST)
- Priority lanes + per-tenant fair queuing
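The ordering behind "XACK-after-persist" is what gives at-least-once delivery: a job is acknowledged only after its result (or its failure) has been written. The sketch below imitates that ordering with an in-memory queue rather than Redis Streams. `InMemoryQueue`, `work_once`, and the immediate dead-lettering on `ValueError` are simplifications invented for illustration.

```python
from collections import deque


class InMemoryQueue:
    """Toy queue: a job stays pending until it is explicitly acked."""

    def __init__(self, jobs):
        self._ready = deque(jobs)
        self.pending = {}

    def read(self):
        if not self._ready:
            return None
        job_id, payload = self._ready.popleft()
        self.pending[job_id] = payload
        return job_id, payload

    def ack(self, job_id):
        self.pending.pop(job_id, None)

    def redeliver_pending(self):
        # Unacked jobs become readable again, e.g. after a worker crash.
        self._ready.extend(self.pending.items())
        self.pending.clear()


def work_once(queue, results: dict, infer, dead_letter: dict) -> bool:
    """Process one job; return False when the queue is empty."""
    item = queue.read()
    if item is None:
        return False
    job_id, payload = item
    try:
        output = infer(payload)
    except ValueError as exc:
        dead_letter[job_id] = str(exc)  # persist the failure first
        queue.ack(job_id)
        return True
    results[job_id] = output            # 1. persist the result
    queue.ack(job_id)                   # 2. only then acknowledge
    return True


def infer(x):
    if x < 0:
        raise ValueError("negative input")
    return x * x


queue = InMemoryQueue([("job-1", 3), ("job-2", -1)])
results, dead = {}, {}
while work_once(queue, results, infer, dead):
    pass
```

Because a crash between persist and ack leads to redelivery, result writes should be idempotent; keying results by job ID, as here, is one way to get that.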
v0.8 — LoRA adapter multiplexing (complete)
- `ModelSpec.lora = LoRAConfig(adapters={...}, default="...")`: declare per-deployment adapter registry; one GPU deployment serves many fine-tunes
- Per-request adapter selection via `DiffusionRequest.adapters` / `MultimodalGenerationRequest.adapters` (with optional `adapter_weights` for fusion)
- First targets: FLUX (FLUX.1-schnell + FLUX.1-dev), SDXL (img2img + inpaint)
- Local paths and HF Hub sources both supported (`hf:org/repo[:weight_file]` convention)
- Bucket-by-resolved-adapter inside Ray Serve batch windows: `set_active_adapters` is called exactly once per homogeneous sub-batch
- Hot-add adapters at runtime without `ModelServer.update(spec)` (deferred: adds VRAM-eviction / index-sync surface area)
- Expose `enable_sequential_cpu_offload` on `FluxBackend` so FLUX + LoRA fits on 16-24 GB GPUs (currently only `enable_model_cpu_offload`, which leaves ~22 GB resident; the Modal LoRA quickstart needs an A100 today, and this would unlock A10G)
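Bucketing by resolved adapter amounts to grouping a mixed batch so that the adapter switch happens once per group while results still come back in request order. The sketch below shows the grouping with a fake pipeline. It simplifies the request to a single optional `adapter` key (the real request field is a list), and `FakePipeline` and `run_batch` are hypothetical.

```python
from collections import defaultdict


class FakePipeline:
    """Stand-in pipeline that records every adapter switch."""

    def __init__(self):
        self.switches = []

    def set_active_adapters(self, adapter):
        self.switches.append(adapter)

    def generate(self, prompts):
        active = self.switches[-1]
        return [f"{active}:{p}" for p in prompts]


def run_batch(pipeline, requests, default_adapter):
    """Run a mixed batch, switching adapters once per homogeneous sub-batch."""
    buckets = defaultdict(list)
    for index, req in enumerate(requests):
        adapter = req.get("adapter") or default_adapter  # resolve the default
        buckets[adapter].append(index)

    outputs = [None] * len(requests)
    for adapter, indices in buckets.items():
        pipeline.set_active_adapters(adapter)            # once per sub-batch
        results = pipeline.generate([requests[i]["prompt"] for i in indices])
        for i, result in zip(indices, results):
            outputs[i] = result                          # restore request order
    return outputs


requests = [
    {"prompt": "a", "adapter": "anime"},
    {"prompt": "b"},
    {"prompt": "c", "adapter": "anime"},
]
pipe = FakePipeline()
outputs = run_batch(pipe, requests, default_adapter="base")
```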
v0.9 — typed Python client (complete)
Ships as sheaf.client inside sheaf-serve (not a separate sheaf-client PyPI package — schemas stay in one tree, no codegen, no drift). Splittable into its own package later if external client contributors arrive or install footprint becomes a real cost.
- `SheafClient` (sync) + `AsyncSheafClient` (async, `httpx`-backed); typed `predict(deployment, request) -> response` against the discriminated `AnyResponse` union
- `health()` / `ready()` helpers; structured exceptions (`ValidationError` for 422, `ServerError` for 5xx, `ClientError` for transport / decode failures)
- SSE streaming via `client.stream(deployment, request)` async generator
- `RetryConfig` with exponential backoff: configurable status codes, connection-error retry toggle, and `max_attempts` cap. Streams bypass retry by design (re-running yields interleaved progress events).
- Server-side `request_id` (the UUID minted on the request) is attached to every raised `SheafError` subclass so callers can log-correlate without holding the original request object.
- OpenAPI export via `python -m sheaf.openapi --specs my_module:specs > openapi.json` (or `sheaf.openapi.generate(specs)` programmatically); backends are not loaded during generation, so it runs without GPU.
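The retry behaviour described above (retryable status codes, a connection-error toggle, and an attempt cap with exponential backoff) can be sketched generically as follows. This is not the `RetryConfig` API: apart from `max_attempts`, every field name, default value, and the `call_with_retry` helper is an assumption.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrySketch:
    max_attempts: int = 3
    base_delay_s: float = 0.5          # assumed default
    multiplier: float = 2.0            # assumed default
    retry_statuses: frozenset = frozenset({502, 503, 504})
    retry_connection_errors: bool = True


def call_with_retry(send, config: RetrySketch, sleep=time.sleep):
    """Call send() until it succeeds or attempts run out.

    send() returns (status, body) or raises ConnectionError.
    """
    for attempt in range(1, config.max_attempts + 1):
        last = attempt == config.max_attempts
        try:
            status, body = send()
        except ConnectionError:
            if last or not config.retry_connection_errors:
                raise
        else:
            if last or status not in config.retry_statuses:
                return status, body
        # Delay grows geometrically: base, base * m, base * m^2, ...
        sleep(config.base_delay_s * config.multiplier ** (attempt - 1))


# Two 503s followed by a success; delays are captured instead of slept.
responses = iter([(503, "busy"), (503, "busy"), (200, "ok")])
delays = []
result = call_with_retry(lambda: next(responses), RetrySketch(), sleep=delays.append)
```

Production clients often add random jitter to each delay so that many callers do not retry in lockstep.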
v0.10 — container + Kubernetes deployment
Today sheaf ships three deployment paths: ModelServer (a local Ray cluster you bring), ModalServer (Modal serverless), and BatchRunner / SheafWorker (offline / async). Production K8s clusters running their own Ray are common and have no first-class story yet — every team rolls their own image.
- Reference `Dockerfile` (multi-stage, uv-based; CPU base + CUDA variant) so teams aren't building this from scratch. Pinned to a sheaf release; rebuilt on tag.
- `examples/k8s/` with a `RayService` manifest (KubeRay's canonical Ray-on-K8s shape) and a short `README.md` covering prereqs (KubeRay operator installed), `kubectl apply`, and a port-forward smoke test.
- GitHub Actions workflow that builds + pushes the Dockerfile to `ghcr.io/korbonits/sheaf-serve:vX.Y.Z` on `v*` tag push, mirroring the PyPI publish flow.
v0.11 — Biohub protein-biology release integration
Biohub's "world model of protein biology" landed 2026-05-27 under MIT. Sheaf integrates the two model artifacts as first-class typed contracts; ESM Atlas (dataset) is out of scope. See docs/adr/0001-esmc-esmfold2-integration.md.
- `ESMCBackend`: per-token logits + per-token embeddings via `transformers.AutoModelForMaskedLM`, default `Biohub/ESMC-6B`.
- `ESMFold2Backend`: protein structure prediction with `num_loops` / `num_sampling_steps` / `num_samples` / `seed` as first-class request fields, returning PDB / mmCIF + pLDDT + pTM/ipTM + optional PAE.
- New `STRUCTURE` model category: first non-tensor output category (structure file as text).
- `[protein]` install extra; `esm` from `git+https://github.com/Biohub/esm.git@81b3646c9429ea8458918415ad6a46178cb59833` documented (no PyPI release yet).
- End-to-end GPU smoke: `examples/quickstart_protein_modal.py` runs `ESMFold2Backend` on H100 via Modal (~70s cold start to a persistent volume, sub-second per fold). 53-residue target → 43,088-char mmCIF, pTM=0.2465.
- Forge / Biohub-Platform HTTP-client variants for the ESMC 300M / 600M / ESMFold2-fast API-only models.
Architecture
┌─────────────────────────────────────────┐
│ API Layer │ typed contracts per model type
│ TimeSeriesRequest TabularRequest ... │
├─────────────────────────────────────────┤
│ Scheduling Layer │ model-type-aware batching
│ BatchPolicy RequestQueue │
├─────────────────────────────────────────┤
│ Backend Layer │ pluggable execution + Ray Serve
│ ModelBackend CacheManager Feast │
└─────────────────────────────────────────┘
Adding a new backend takes one class:
from sheaf.backends.base import ModelBackend
from sheaf.registry import register_backend

@register_backend("my-model")
class MyModelBackend(ModelBackend):
    def load(self) -> None:
        self._model = load_my_model()

    def predict(self, request):
        ...

    @property
    def model_type(self):
        return "time_series"
Contributing
Issues and PRs welcome. See CONTRIBUTING.md for development setup.
License
Apache 2.0