longctx

Open long-context inference stack: retrieval + open weights, no closed parts.
A small library that bundles the components needed to reach Anthropic-class long-context retrieval performance on a single accessible GPU using only open weights.
What it is
longctx is a thin wrapper over standard tools:
- Retrieval: sentence-transformers (bi-encoder) + faiss
- Generation: any OpenAI-compatible LLM endpoint (vLLM, SGLang, llama.cpp server)
- Defaults tuned for Qwen2.5-14B-Instruct-1M; works with any instruction-following open model
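Under the hood the whole stack is plain retrieve-then-generate. Here is a minimal sketch of the equivalent flow built on the raw tools directly, so it's clear what the wrapper buys you. The prompt format, endpoint URL, and k are illustrative assumptions, not the longctx internals:

```python
import faiss
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Embed candidates and build a flat inner-product index over normalized vectors.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
candidates = ["Response 1: ...", "Response 2: ...", "Response 3: ..."]
vecs = embedder.encode(candidates, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(vecs)

# Retrieve the top-k candidates for the query.
query = "What was the third response about regulatory compliance?"
qvec = embedder.encode([query], normalize_embeddings=True)
_, ids = index.search(qvec, k=3)
context = "\n\n".join(candidates[i] for i in ids[0])

# Generate against any OpenAI-compatible server (vLLM, SGLang, llama.cpp).
llm = OpenAI(base_url="http://localhost:5050/v1", api_key="unused")
reply = llm.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-1M",
    messages=[{"role": "user", "content": f"{context}\n\n{query}"}],
)
print(reply.choices[0].message.content)
```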
Why
A stack of longctx defaults running Qwen2.5-14B-Instruct-1M on a single MI300X scored 0.822 on the MRCR v2 8K bin (n=82, mass-validated 2026-05-06), beating the headline number a $29M-funded closed-weight startup published for their custom subquadratic architecture. The architectural-moat narrative wasn't load-bearing for this workload: retrieval plus open weights solve it.
This library exists so the rest of the open ecosystem can reproduce that result with one pip install.
Install
```
pip install longctx
```
For local vLLM serving:
```
pip install longctx[serve]
```
Quickstart
```python
from longctx import LongCtxClient

# Defaults: sentence-transformers/all-MiniLM-L6-v2 + local vLLM at port 5050
client = LongCtxClient()

# Pass your candidate chunks and a query
result = client.ask(
    query="What was the third response about regulatory compliance?",
    candidates=[
        "Response 1: brief on regulatory compliance...",
        "Response 2: legal analysis of...",
        "Response 3: detailed compliance walkthrough...",
        # ... up to thousands of candidates
    ],
    top_k=8,
)

print(result.content)
print(f"Retrieved indices: {result.retrieved_indices}")
print(f"Prompt tokens: {result.prompt_tokens}")
```
Custom embedder
```python
from longctx import LongCtxClient, RetrievalPipeline

# Default uses MiniLM-L6 (23M params, CPU-friendly).
# For higher quality at the cost of compute:
pipeline = RetrievalPipeline(embedder_model="BAAI/bge-large-en-v1.5")
client = LongCtxClient(pipeline=pipeline)
```
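The tradeoff is concrete: bge-large-en-v1.5 is a ~335M-param encoder emitting 1024-dim vectors versus MiniLM's 23M params and 384 dims, so the faiss index grows ~2.7× and encoding slows accordingly. A quick way to confirm the widths with sentence-transformers directly:

```python
from sentence_transformers import SentenceTransformer

# Embedding width drives faiss index size: 384 dims for MiniLM, 1024 for bge-large.
for name in ("sentence-transformers/all-MiniLM-L6-v2", "BAAI/bge-large-en-v1.5"):
    print(name, SentenceTransformer(name).get_sentence_embedding_dimension())
```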
Notes on rerankers
longctx does not enable cross-encoder reranking by default. Off-the-shelf rerankers (ms-marco-MiniLM, bge-reranker-base) degraded retrieval quality on MRCR-style tasks in our 2026-05-06 testing. They are trained for web-search relevance, which doesn't transfer to "find the Nth message of type X" task semantics.
A retrieval-style reranker fine-tuned on appropriate data is on the roadmap. Until then, pure bi-encoder retrieval is the default.
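If you want to reproduce the degradation yourself, here is a minimal sketch of wiring a stock cross-encoder over the bi-encoder's hits using sentence-transformers directly (not a longctx API; the model choice and inputs are illustrative):

```python
from sentence_transformers import CrossEncoder

# Rescore retrieved candidates with a web-search-trained cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What was the third response about regulatory compliance?"
hits = [
    "Response 3: detailed compliance walkthrough...",
    "Response 1: brief on regulatory compliance...",
]
scores = reranker.predict([(query, h) for h in hits])
reranked = [h for _, h in sorted(zip(scores, hits), reverse=True)]
```

On MRCR-style queries the positional cue ("the third response") carries most of the signal, and relevance-trained rerankers flatten exactly that distinction.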
Status
Pre-alpha (v0.2.0 on PyPI). APIs may change.
Headline numbers (mass-validated)
End-to-end validation 2026-05-06 on an AMD MI300X with vLLM-served Qwen2.5-14B-Instruct-1M, using the default LongCtxClient config (sentence-transformers MiniLM-L6 + faiss, top_k=8):
| MRCR v2 8-needle bin | pipeline | n | avg_score | prefix_pass |
|---|---|---|---|---|
| 8K (16K-32K char) | RAG | 82 | 0.822 | 100% |
| 32K (64K-128K char) | RAG | 98 | 0.697 | 97% |
| 64K (128K-256K char) | RAG | 95 | 0.641 | 98% |
| 64K (128K-256K char) | chunked-RAG | 95 | 0.670 | 98% |
Reference baseline: SubQ Inc.'s published MRCR headline = 0.659 (closed-weight, custom subquadratic architecture, $29M funding).
Three of three bins clear the closed-weight headline with the right pipeline (the 64K bin needs chunked-RAG to get there). Plain RAG over standard attention is competitive with claimed state-of-the-art subquadratic architectures on MRCR-style retrieval workloads at every bin we measured.
Other tested generators (single-run, n=30, not mass-validated)
- Qwen2.5-7B-Instruct + RAG: 0.567 (2.4× faster, fits 16GB GPU)
- Qwen2.5-32B-Instruct + RAG: 0.237 (vanilla 32K context window, training-data fit limits the result)
- Qwen3-Next-80B-A3B + RAG: 0.281 (linear-attention hybrid, MoE)
Single-run scores at n=30 have substantial variance (we observed ±0.05 swings between adjacent runs of the same config). Trust the mass-validated numbers above for headline claims.
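The swing size matches a back-of-envelope standard error. A sketch, assuming a per-example score spread of about 0.3 (an assumed figure, not something we measured):

```python
import math

sd = 0.3            # assumed per-example score std dev (not measured)
for n in (30, 82):  # single-run vs. smallest mass-validated sample
    print(f"n={n}: SE ~ {sd / math.sqrt(n):.3f}")  # ~0.055 at n=30, ~0.033 at n=82
```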
Mistral-7B-Instruct-v0.3 and Qwen3-8B failed with the default Qwen2.5-style template (prefix-first instruction). Templates are provided for both: longctx.templates.MISTRAL_VERBATIM_TEMPLATE and longctx.templates.QWEN3_NO_THINK_TEMPLATE. Validation against MRCR for these templates is on the roadmap.
Reproduce
```
longctx-bench --data-dir /path/to/mrcr/v2 --model qwen2.5-14b-instruct-1m \
  --bins 8k 32k 64k --n 80 --include-chunked
```
License
Apache 2.0.