Vendor-neutral RAG + LLM serving infrastructure: swappable LLM protocol and vector store (FAISS/NumPy/Qdrant), cached embedding index, and observability.

RAG + LLM Serving Infrastructure

An installable, vendor-neutral foundation for retrieval-augmented LLM applications: a swappable vector store, a cached embedding index, a provider-agnostic LLM protocol, the observability around them, a FastAPI serving layer, and a retrieval-quality eval gate.

Distilled infrastructure layer — typed, tested, packaged, and runnable on its own.

Install

pip install rag-llm-infra                                   # core (numpy)
pip install "rag-llm-infra[faiss,qdrant,openai,serve]"      # + native backends, OpenAI, serving
pip install -e ".[dev]"                                     # from a local clone, for development

Quickstart — end-to-end RAG (no API key, no network)

git clone https://github.com/MarwaBS/rag-llm-infra && cd rag-llm-infra
pip install -e .
python example.py

embed documents → index in a VectorStore → retrieve top-k for a query
                → build a grounded prompt → answer with an LLMProtocol backend

Runs on the NumPy vector store + the deterministic mock LLM, so it needs no key. In production, swap the demo embedder for EmbeddingEngine and get_llm("mock") for get_llm("openai").
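The flow above can be sketched in a few lines of plain NumPy. This is a minimal illustration, not the contents of example.py: the bag-of-words embedder, the `retrieve` and `build_prompt` helpers, and the `mock_llm` function are all hypothetical stand-ins for the package's own demo embedder, vector store, and mock LLM.

```python
import numpy as np


def build_vocab(texts):
    """Map every distinct lowercase token in the corpus to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(text, vocab):
    """Hypothetical demo embedder: L2-normalised bag-of-words counts."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for token in text.lower().split():
        if token in vocab:  # out-of-vocabulary tokens are ignored
            vec[vocab[token]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


def retrieve(query, docs, doc_matrix, vocab, k=1):
    """Rank documents by inner product with the query embedding."""
    scores = doc_matrix @ embed(query, vocab)
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]


def build_prompt(query, contexts):
    """Ground the question in the retrieved passages."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def mock_llm(prompt):
    """Deterministic stand-in for an LLM: returns the first context line."""
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    return ""


docs = ["FAISS is in-process vector search", "Qdrant is a vector database"]
vocab = build_vocab(docs)
doc_matrix = np.stack([embed(d, vocab) for d in docs])

contexts = retrieve("vector search", docs, doc_matrix, vocab, k=1)
answer = mock_llm(build_prompt("vector search", contexts))
print(answer)
```

Because the vectors are normalised, the inner product here is cosine similarity, which is the same scoring a flat inner-product index would apply.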

Serve it

pip install "rag-llm-infra[serve]"
uvicorn rag_llm_infra.serve:app          # or: docker build -t rag-llm-infra . && docker run -p 8000:8000 rag-llm-infra
curl -XPOST localhost:8000/index -d '{"documents":["FAISS is in-process vector search","Qdrant is a vector database"]}' -H 'content-type: application/json'
curl -XPOST localhost:8000/query -d '{"query":"vector search","k":1}'      -H 'content-type: application/json'
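The same two calls can be made from Python with only the standard library. This is a sketch under assumptions: the base URL matches the uvicorn default used above, `build_request` and `post_json` are hypothetical helper names, and the response is assumed to be JSON (its exact fields are not documented here, so the sketch just returns the parsed body).

```python
import json
import urllib.request


def build_request(base_url, path, payload):
    """Build a JSON POST request mirroring the curl commands."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(base_url, path, payload, timeout=10.0):
    """Send the request and parse the body (assumed to be JSON)."""
    request = build_request(base_url, path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the service running locally:
#   base = "http://localhost:8000"
#   post_json(base, "/index", {"documents": ["FAISS is in-process vector search",
#                                            "Qdrant is a vector database"]})
#   post_json(base, "/query", {"query": "vector search", "k": 1})
```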

What's inside

Module | Responsibility
rag_llm_infra.llm_protocol | LLMProtocol: runtime_checkable Protocol over OpenAI / Anthropic-stub / Mock; factory get_llm()
rag_llm_infra.vector_store | VectorStoreProtocol: in-process FAISS IndexFlatIP, pure-NumPy fallback, real Qdrant (batched search)
rag_llm_infra.evidence_index | EmbeddingEngine: SentenceTransformers embeddings + a memory-pressure-aware cache (insertion-order eviction) guarded by a writer-preferring reader/writer lock, so the slow model.encode runs outside the lock
rag_llm_infra.tracing | OpenTelemetry spans with console-exporter + no-op fallbacks
rag_llm_infra.log_config | structured JSON logging + an llm_call latency/token timer
rag_llm_infra.serve | FastAPI service (/index, /query, /health) wiring the parts together
rag_llm_infra.faithfulness | groundedness(answer, contexts): lexical faithfulness metric for RAG output
rag_llm_infra.fallback | FallbackLLM: budget-aware multi-provider routing; drop-in LLMProtocol
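The caching idea behind EmbeddingEngine (evict in insertion order, keep the slow encode call outside the lock) can be illustrated as follows. This is a simplified sketch, not the package's implementation: `EmbeddingCache` is a hypothetical class, a plain `threading.Lock` stands in for the writer-preferring reader/writer lock, and a fixed entry cap stands in for the memory-pressure check.

```python
import threading


class EmbeddingCache:
    """Insertion-order (FIFO) cache that never holds its lock while encoding."""

    def __init__(self, encode, max_entries=1024):
        self._encode = encode          # slow function: text -> vector
        self._max = max_entries        # stand-in for a memory-pressure signal
        self._lock = threading.Lock()  # simplification of a reader/writer lock
        self._cache = {}               # dicts preserve insertion order

    def get(self, text):
        # Fast path: look up under the lock, release immediately.
        with self._lock:
            hit = self._cache.get(text)
        if hit is not None:
            return hit

        # Slow path: encode with the lock released so other threads proceed.
        vector = self._encode(text)

        with self._lock:
            # Another thread may have filled the entry in the meantime.
            if text not in self._cache:
                while len(self._cache) >= self._max:
                    oldest = next(iter(self._cache))  # first inserted key
                    del self._cache[oldest]
                self._cache[text] = vector
            return self._cache[text]

    def keys(self):
        with self._lock:
            return list(self._cache)


cache = EmbeddingCache(encode=lambda text: [float(len(text))], max_entries=2)
cache.get("alpha")
cache.get("beta")
cache.get("gamma")   # evicts "alpha", the oldest insertion
print(cache.keys())
```

Two threads that miss on the same text may both encode it; the sketch accepts that duplicated work in exchange for never blocking readers behind a model call.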

Quality gates

python -m eval.retrieval_eval      # recall@1 / MRR — retrieval mechanics over the demo embedder
python -m eval.generation_eval     # groundedness (faithfulness) of generated answers

Both run in CI. A retrieval regression (recall@1 below 0.80 or MRR below 0.85) or a faithfulness regression (a grounded answer scoring below threshold, or the metric failing to flag a hallucinated control) fails the build and blocks the merge.
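For readers unfamiliar with the metrics, here is a minimal sketch of how recall@1, MRR, and a lexical groundedness score can be computed and gated. It is not the code in eval/: the function names, the sample rankings, and the token-overlap definition of groundedness are assumptions made for illustration. Only the two retrieval thresholds come from the text above.

```python
import re

RECALL_AT_1_MIN = 0.80  # threshold stated in the README
MRR_MIN = 0.85          # threshold stated in the README


def recall_at_1(ranked, relevant):
    """Fraction of queries whose top-ranked document is the relevant one."""
    hits = sum(1 for docs, rel in zip(ranked, relevant) if docs and docs[0] == rel)
    return hits / len(relevant)


def mean_reciprocal_rank(ranked, relevant):
    """Average of 1/rank of the relevant document (0 when it is missing)."""
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        if rel in docs:
            total += 1.0 / (docs.index(rel) + 1)
    return total / len(relevant)


def lexical_groundedness(answer, contexts):
    """Hypothetical metric: share of answer tokens that appear in the contexts."""
    answer_tokens = set(re.findall(r"[a-z0-9]+", answer.lower()))
    if not answer_tokens:
        return 0.0
    context_tokens = set(re.findall(r"[a-z0-9]+", " ".join(contexts).lower()))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def retrieval_gate_passes(recall, mrr):
    """True when both retrieval metrics meet their thresholds."""
    return recall >= RECALL_AT_1_MIN and mrr >= MRR_MIN


# Illustrative data: five queries, each with one relevant document id.
ranked = [["d1", "d2"], ["d2", "d1"], ["d3", "d1"], ["d1", "d4"], ["d5", "d2"]]
relevant = ["d1", "d2", "d3", "d4", "d5"]

recall = recall_at_1(ranked, relevant)
mrr = mean_reciprocal_rank(ranked, relevant)
print(recall, mrr, retrieval_gate_passes(recall, mrr))
```

A gate script would typically exit non-zero when `retrieval_gate_passes` returns False, which is what makes CI fail the build.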

Engineering principles demonstrated

  • Swap by interface: LLMProtocol / VectorStoreProtocol make the model and the index runtime-swappable.
  • Degrade, don't crash: FAISS / Qdrant / OpenTelemetry / SentenceTransformers are lazily imported with working fallbacks; missing infra never hard-fails import.
  • Measured, not asserted: a retrieval eval gate, not just unit tests; packaged and CI-built end to end.
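The first two principles can be sketched together. This is an illustration under assumptions, not the package's source: `LLMLike`, `MockLLM`, `make_llm`, and `span` are hypothetical names, and the `generate` method signature is a guess at what a provider-agnostic protocol might require. The OpenTelemetry calls used (`trace.get_tracer`, `start_as_current_span`) are part of the public opentelemetry-api package.

```python
from contextlib import contextmanager
from typing import Protocol, runtime_checkable


@runtime_checkable
class LLMLike(Protocol):
    """Structural interface: any object with generate(prompt) qualifies."""

    def generate(self, prompt: str) -> str: ...


class MockLLM:
    """Deterministic backend that needs no key and no network."""

    def generate(self, prompt: str) -> str:
        return f"[mock] {prompt}"


def make_llm(name: str) -> LLMLike:
    """Hypothetical factory: look a backend up by name."""
    registry = {"mock": MockLLM}
    if name not in registry:
        raise ValueError(f"unknown LLM backend: {name!r}")
    return registry[name]()


# Degrade, don't crash: tracing is optional, so a missing library
# downgrades to a no-op instead of failing at import time.
try:
    from opentelemetry import trace

    _tracer = trace.get_tracer("rag_sketch")
except ImportError:
    _tracer = None


@contextmanager
def span(name: str):
    """Open a real span when OpenTelemetry is installed, else do nothing."""
    if _tracer is None:
        yield
        return
    with _tracer.start_as_current_span(name):
        yield


llm = make_llm("mock")
with span("llm.generate"):
    reply = llm.generate("What is FAISS?")
print(reply)
```

Because the protocol is structural, a new backend only has to implement `generate`; it does not need to inherit from anything, which is what keeps call sites unchanged when the provider is swapped.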

Develop / test

pip install -e ".[dev]"     # installs FAISS + Qdrant + serve extras too
ruff check . && pytest && python -m eval.retrieval_eval

CI installs the native backends, so the FAISS and Qdrant tests run there (they skip only when those libraries are absent).

License

MIT — see LICENSE.
