
longctx-svc

Local retrieval companion for inference servers. Scoped, session-aware, file-watching. The tool is optional: if you don't run it, your engine behaves exactly as before.

WIP. Apache-2.0.

Engine-agnostic by design

longctx-svc speaks plain HTTP/JSON. It works with any engine that accepts a prompt — no engine forks required for the basic path.

Engine                                            Mode                Wiring
vllm-swift                                        embedded            optional --retrieval-endpoint URL flag (engine-side)
TheTom/llama-cpp-turboquant (llama-server)        proxy               point client at longctx-svc; longctx-svc forwards to llama-server
TheTom/vllm (feature/turboquant-amd-noautotune)   proxy or embedded   OpenAI-compat passthrough; or call LongctxClient from a custom hook
vLLM (CUDA)                                       proxy               OpenAI-compat passthrough
anything OpenAI-compat                            proxy               OpenAI-compat passthrough

Mode A — proxy (zero engine changes)

# 1. Run your engine as usual
llama-server -m model.gguf --port 8080 &
# (or vLLM AMD, vLLM CUDA, vllm-swift, ...)

# 2. Run longctx-svc in front of it
longctx-svc serve --upstream http://localhost:8080

# 3. Point your OpenAI client at longctx-svc instead of the engine
export OPENAI_BASE_URL=http://localhost:8765/v1

longctx-svc detects the project from the messages, retrieves the top-K chunks, splices them into the system message, and forwards the request to the upstream engine. The response (including SSE streams) is passed straight back. If no path is mentioned in the messages, the request is forwarded unmodified.
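The detect-and-splice step can be sketched roughly as follows. This is a minimal illustration, not the actual proxy code: the function names, the path heuristic, and the chunk dict shape (path, text) are all assumptions made for the sketch.

```python
# Hypothetical sketch of the proxy's detect-and-splice step.
import re

def mentions_path(messages):
    """Crude path detection: does any message mention a filesystem path?"""
    pat = re.compile(r"(?:~|\.{0,2}/)[\w./-]+")
    return any(pat.search(m.get("content", "")) for m in messages)

def splice_chunks(messages, chunks):
    """Append retrieved chunks to the system message, or prepend one."""
    context = "\n\n".join(f"[{c['path']}]\n{c['text']}" for c in chunks)
    block = f"Relevant project context:\n{context}"
    out = [dict(m) for m in messages]
    for m in out:
        if m["role"] == "system":
            m["content"] = f"{m['content']}\n\n{block}"
            return out
    # No system message yet: add one carrying the retrieved context.
    return [{"role": "system", "content": block}] + out
```

If mentions_path returns False, the proxy forwards the request untouched, matching the "forwarded unmodified" behavior above.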

Mode B — embedded (engine calls /retrieve)

For tighter integration (e.g. so the engine can reuse retrieved chunks across KV cache boundaries), engines import LongctxClient:

from longctx_svc.client import LongctxClient

cli = LongctxClient.from_env()        # honors LONGCTX_ENDPOINT
if cli is not None:                   # tool is optional
    res = cli.retrieve(
        prefill_text=full_prompt,
        query=user_message,
        session_id=session_id,
        top_k=8,
    )
    full_prompt = cli.splice(full_prompt, res)

Network failure → empty result → engine falls back to the no-retrieval path. Optional tool stays optional.
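The fallback contract can be made explicit with a small wrapper. This is a hedged sketch, assuming only the LongctxClient surface shown above (from_env returning None, retrieve, splice); the wrapper itself is hypothetical.

```python
# Hypothetical wrapper for the "optional tool stays optional" contract:
# any failure in retrieval degrades to the plain, no-retrieval prompt.
def retrieve_or_passthrough(cli, full_prompt, query, session_id):
    if cli is None:                       # LONGCTX_ENDPOINT not set
        return full_prompt
    try:
        res = cli.retrieve(prefill_text=full_prompt, query=query,
                           session_id=session_id, top_k=8)
    except Exception:                     # network failure -> no-retrieval path
        return full_prompt
    if not res:                           # empty result -> no-retrieval path
        return full_prompt
    return cli.splice(full_prompt, res)
```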

HTTP surface

Endpoint                     Purpose
POST /retrieve               engine-side retrieval (Mode B)
POST /v1/chat/completions    OpenAI-compat passthrough (Mode A)
POST /v1/completions         legacy OpenAI-compat passthrough (Mode A)
GET /longctx/status          JSON status; Accept: text/plain for the Sarah-visible block
GET /healthz                 liveness probe
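A POST /retrieve call can be assembled as below. The JSON field names are inferred from the LongctxClient parameters in the Mode B example and are assumptions, not a documented wire format.

```python
# Hypothetical request shape for POST /retrieve; field names mirror the
# LongctxClient.retrieve() parameters (prefill_text, query, top_k) and
# are an assumption about the wire format.
def build_retrieve_request(endpoint, prefill_text, query,
                           session_id=None, top_k=8):
    headers = {"content-type": "application/json"}
    if session_id:
        headers["x-session-affinity"] = session_id  # preferred affinity header
    body = {"prefill_text": prefill_text, "query": query, "top_k": top_k}
    return f"{endpoint.rstrip('/')}/retrieve", headers, body
```

Send the returned triple with any HTTP client; omitting session_id makes the request ephemeral.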

Headers

Every retrieve / proxy response sets:

  • x-longctx-session: <session-id|ephemeral>
  • x-longctx-scope: <project-root|"">
  • x-longctx-chunks-used: <n>
  • x-longctx-scope-status: ready|empty|error|no-scope

Session affinity is sent on the request side via:

  1. x-session-affinity: <id> (preferred)
  2. x-session-id: <id>
  3. metadata.session_id in the JSON body

No header → ephemeral request, no caching.
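The three-step precedence above can be expressed as a small resolver. The header names are the documented ones; the resolver function itself is a sketch, not part of the package.

```python
# Sketch of the documented session-affinity precedence:
# x-session-affinity > x-session-id > metadata.session_id > ephemeral (None).
def resolve_session_id(headers, body):
    h = {k.lower(): v for k, v in headers.items()}  # headers are case-insensitive
    return (h.get("x-session-affinity")
            or h.get("x-session-id")
            or (body.get("metadata") or {}).get("session_id"))
```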

Install (alpha)

pip install -e services/longctx-svc
longctx-svc serve              # http://127.0.0.1:8765

Tests

cd services/longctx-svc
pytest tests/ --no-cov

85 tests cover: scope detection, walk + .gitignore, chunker, indexer, session manager, the Sarah-journey end-to-end, and the engine-agnostic client + OpenAI-compat proxy.
