# longctx-svc

Local retrieval companion for inference servers. Scoped, session-aware, file-watching. The tool is optional: if you don't run it, your engine behaves exactly as before.

WIP. Apache-2.0.
## Engine-agnostic by design

longctx-svc speaks plain HTTP/JSON. It works with any engine that accepts a prompt; no engine forks are required for the basic path.
| Engine | Mode | Wiring |
|---|---|---|
| vllm-swift | embedded | optional `--retrieval-endpoint URL` flag (engine-side) |
| TheTom/llama-cpp-turboquant (llama-server) | proxy | point client at longctx-svc; longctx-svc forwards to llama-server |
| TheTom/vllm (feature/turboquant-amd-noautotune) | proxy or embedded | OpenAI-compat passthrough; or call `LongctxClient` from a custom hook |
| vLLM (CUDA) | proxy | OpenAI-compat passthrough |
| anything OpenAI-compat | proxy | OpenAI-compat passthrough |
## Mode A — proxy (zero engine changes)

```shell
# 1. Run your engine as usual
llama-server -m model.gguf --port 8080 &
# (or vLLM AMD, vLLM CUDA, vllm-swift, ...)

# 2. Run longctx-svc in front of it
longctx-svc serve --upstream http://localhost:8080

# 3. Point your OpenAI client at longctx-svc instead of the engine
export OPENAI_BASE_URL=http://localhost:8765/v1
```
longctx-svc detects the project from the messages, retrieves the top-K chunks, splices them into the system message, and forwards the request to the upstream. The response (including SSE streams) is passed straight back. If no path is mentioned in the messages, the request is forwarded unmodified.
## Mode B — embedded (engine calls /retrieve)

For tighter integration (e.g. so the engine can reuse retrieved chunks across KV-cache boundaries), engines import `LongctxClient`:
```python
from longctx_svc.client import LongctxClient

cli = LongctxClient.from_env()  # honors LONGCTX_ENDPOINT
if cli is not None:  # tool is optional
    res = cli.retrieve(
        prefill_text=full_prompt,
        query=user_message,
        session_id=session_id,
        top_k=8,
    )
    full_prompt = cli.splice(full_prompt, res)
```
Network failure → empty result → engine falls back to the no-retrieval path. Optional tool stays optional.
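Engines not written in Python can make the same call over plain HTTP. A hedged sketch, assuming the `POST /retrieve` JSON body mirrors the client keyword arguments above (the actual request/response schema may differ; the response is treated as opaque here), with the same fail-open behavior:

```python
import json
from urllib import request

def retrieve(endpoint, prefill_text, query, session_id=None, top_k=8, timeout=2.0):
    """POST /retrieve; any failure degrades to None so callers skip splicing."""
    body = {"prefill_text": prefill_text, "query": query, "top_k": top_k}
    if session_id:
        body["session_id"] = session_id
    req = request.Request(
        endpoint + "/retrieve",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        return None  # network/parse failure -> no-retrieval path

# res = retrieve("http://127.0.0.1:8765", full_prompt, user_message)
# if res is None, proceed without retrieval; otherwise splice as usual.
```

The broad fail-open `except` is the point: a missing or unhealthy longctx-svc must never take the engine down with it.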
## HTTP surface

| Endpoint | Purpose |
|---|---|
| `POST /retrieve` | engine-side retrieval (Mode B) |
| `POST /v1/chat/completions` | OpenAI-compat passthrough (Mode A) |
| `POST /v1/completions` | legacy OpenAI-compat passthrough (Mode A) |
| `GET /longctx/status` | JSON status; `Accept: text/plain` for the Sarah-visible block |
| `GET /healthz` | liveness probe |
## Headers

Every retrieve / proxy response sets:

- `x-longctx-session: <session-id|ephemeral>`
- `x-longctx-scope: <project-root|"">`
- `x-longctx-chunks-used: <n>`
- `x-longctx-scope-status: ready|empty|error|no-scope`

Session affinity is sent on the request side via:

- `x-session-affinity: <id>` (preferred)
- `x-session-id: <id>`
- `metadata.session_id` in the JSON body

No header → ephemeral request, no caching.
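These response headers let a caller confirm that retrieval actually ran. A small sketch that condenses them into a one-line status (the header values in the example are illustrative):

```python
def summarize_longctx_headers(headers: dict) -> str:
    """Render the x-longctx-* response headers into a one-line status string."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    session = h.get("x-longctx-session", "ephemeral")
    status = h.get("x-longctx-scope-status", "no-scope")
    chunks = h.get("x-longctx-chunks-used", "0")
    return f"session={session} status={status} chunks={chunks}"

# Illustrative values, e.g. from a proxied chat completion response:
print(summarize_longctx_headers({
    "x-longctx-session": "dev-1",
    "x-longctx-scope-status": "ready",
    "x-longctx-chunks-used": "8",
}))
```

The defaults (`ephemeral`, `no-scope`, `0`) match the behavior described above for requests that carried no session header or mentioned no path.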
## Install (alpha)

```shell
pip install -e services/longctx-svc
longctx-svc serve    # http://127.0.0.1:8765
```
## Tests

```shell
cd services/longctx-svc
pytest tests/ --no-cov
```

85 tests cover: scope detection, walk + `.gitignore`, chunker, indexer, session manager, the Sarah-journey end-to-end, and the engine-agnostic client + OpenAI-compat proxy.