Vendor-neutral RAG + LLM serving infrastructure: swappable LLM protocol and vector store (FAISS/NumPy/Qdrant), cached embedding index, and observability.

RAG + LLM Serving Infrastructure

An installable, vendor-neutral foundation for retrieval-augmented LLM applications: a swappable vector store, a cached embedding index, a provider-agnostic LLM protocol, the observability around them, a FastAPI serving layer, and a retrieval-quality eval gate.

Distilled infrastructure layer — typed, tested, packaged, and runnable on its own.

Install

pip install rag-llm-infra                                   # core (numpy)
pip install "rag-llm-infra[faiss,qdrant,openai,serve]"      # + native backends, OpenAI, serving
pip install "rag-llm-infra[psutil]"                         # + memory-pressure-aware cache trimming
pip install -e ".[dev]"                                     # from a local clone, for development

Quickstart — end-to-end RAG (no API key, no network)

git clone https://github.com/MarwaBS/rag-llm-infra && cd rag-llm-infra
pip install -e .
python example.py

embed documents → index in a VectorStore → retrieve top-k for a query
                → build a grounded prompt → answer with an LLMProtocol backend

Runs on the NumPy vector store + the deterministic mock LLM, so it needs no key. In production, swap the demo embedder for EmbeddingEngine and get_llm("mock") for get_llm("openai").
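The flow above can be sketched in a few lines of NumPy. This is a toy illustration, not the package's code: `BagOfWordsEmbedder`, `NumpyStore`, `MockLLM`, `build_prompt`, and `answer` are hypothetical names, and the bag-of-words embedder only stands in for whatever demo embedder `example.py` actually uses.

```python
import numpy as np


class BagOfWordsEmbedder:
    """Toy embedder (assumption): one dimension per vocabulary word, L2-normalised."""

    def __init__(self, corpus: list[str]) -> None:
        words = sorted({w for text in corpus for w in text.lower().split()})
        self.vocab = {w: i for i, w in enumerate(words)}

    def encode(self, text: str) -> np.ndarray:
        vec = np.zeros(len(self.vocab), dtype=np.float32)
        for word in text.lower().split():
            if word in self.vocab:  # words never seen at index time are ignored
                vec[self.vocab[word]] += 1.0
        norm = float(np.linalg.norm(vec))
        return vec / norm if norm > 0 else vec


class NumpyStore:
    """Brute-force inner-product search over a dense matrix."""

    def __init__(self, embedder: BagOfWordsEmbedder) -> None:
        self.embedder = embedder
        self.texts: list[str] = []
        self.matrix = np.zeros((0, len(embedder.vocab)), dtype=np.float32)

    def add(self, texts: list[str]) -> None:
        rows = np.stack([self.embedder.encode(t) for t in texts])
        self.matrix = np.vstack([self.matrix, rows])
        self.texts.extend(texts)

    def search(self, query: str, k: int = 1) -> list[str]:
        scores = self.matrix @ self.embedder.encode(query)
        top = np.argsort(-scores)[:k]  # highest score first
        return [self.texts[i] for i in top]


class MockLLM:
    """Deterministic stand-in: repeats the first context line of the prompt."""

    def generate(self, prompt: str) -> str:
        contexts = [ln[2:] for ln in prompt.splitlines() if ln.startswith("- ")]
        return "Mock answer based on: " + (contexts[0] if contexts else "no context")


def build_prompt(question: str, contexts: list[str]) -> str:
    bullets = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{bullets}\nQuestion: {question}"


def answer(question: str, store: NumpyStore, llm: MockLLM, k: int = 1) -> str:
    # retrieve top-k, build a grounded prompt, then generate
    return llm.generate(build_prompt(question, store.search(question, k)))


if __name__ == "__main__":
    docs = ["FAISS is in-process vector search", "Qdrant is a vector database"]
    store = NumpyStore(BagOfWordsEmbedder(docs))
    store.add(docs)
    print(answer("vector search", store, MockLLM()))
```

Because every stage sits behind a small class, replacing the embedder or the LLM means swapping one object and leaving the rest of the pipeline untouched.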

Serve it

pip install "rag-llm-infra[serve]"
uvicorn rag_llm_infra.serve:app          # or: docker build -t rag-llm-infra . && docker run -p 8000:8000 rag-llm-infra

curl -XPOST localhost:8000/index -d '{"documents":["FAISS is in-process vector search","Qdrant is a vector database"]}' -H 'content-type: application/json'
curl -XPOST localhost:8000/query -d '{"query":"vector search","k":1}'      -H 'content-type: application/json'
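The same two calls can be made from Python with only the standard library. This is a sketch: the endpoint paths and request bodies mirror the curl commands above, while the helper names `build_request` and `post_json` are hypothetical, and the assumption that the service replies with a JSON body is not confirmed by this page.

```python
import json
import urllib.request


def build_request(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(base_url: str, path: str, payload: dict, timeout: float = 10.0):
    """Send the request and decode the reply (assumes a JSON response body)."""
    with urllib.request.urlopen(build_request(base_url, path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with the server from the uvicorn command running locally:
#   post_json("http://localhost:8000", "/index",
#             {"documents": ["FAISS is in-process vector search",
#                            "Qdrant is a vector database"]})
#   post_json("http://localhost:8000", "/query", {"query": "vector search", "k": 1})
```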

What's inside

Module	Responsibility
`rag_llm_infra.llm_protocol`	`LLMProtocol` — `runtime_checkable` Protocol over OpenAI / Anthropic-stub / Mock; factory `get_llm()`
`rag_llm_infra.vector_store`	`VectorStoreProtocol` — in-process FAISS `IndexFlatIP`, pure-NumPy fallback, real Qdrant (batched search)
`rag_llm_infra.evidence_index`	`EmbeddingEngine` — SentenceTransformers embeddings + a cache (insertion-order eviction) guarded by a writer-preferring reader/writer lock, so the slow `model.encode` runs outside the lock. Memory-pressure-aware trimming activates with the `[psutil]` extra (`pip install "rag-llm-infra[psutil]"`); without it the cache is fixed-size
`rag_llm_infra.tracing`	OpenTelemetry spans with console-exporter + no-op fallbacks
`rag_llm_infra.log_config`	structured JSON logging + an `llm_call` latency/token timer
`rag_llm_infra.serve`	FastAPI service (`/index`, `/query`, `/health`) wiring the parts together
`rag_llm_infra.faithfulness`	`groundedness(answer, contexts)` — lexical faithfulness metric for RAG output
`rag_llm_infra.fallback`	`FallbackLLM` — budget-aware multi-provider routing; drop-in `LLMProtocol`
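To illustrate the `runtime_checkable` Protocol plus factory pattern named in the first row, here is a minimal sketch. The names `TextLLM`, `EchoLLM`, `UpperLLM`, and `make_llm`, and the single `generate` method, are assumptions for illustration; the actual `LLMProtocol` signature and `get_llm()` registry are not shown on this page.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TextLLM(Protocol):
    """Hypothetical interface: anything with generate(prompt) -> str qualifies."""

    def generate(self, prompt: str) -> str: ...


class EchoLLM:
    """Deterministic mock backend; note it does not inherit from TextLLM."""

    def generate(self, prompt: str) -> str:
        return f"[mock] {prompt}"


class UpperLLM:
    """Second backend, to show that swapping needs no shared base class."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


_REGISTRY = {"echo": EchoLLM, "upper": UpperLLM}


def make_llm(name: str) -> TextLLM:
    """Look up a backend by name and verify it structurally at runtime."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown LLM backend: {name!r}")
    llm = _REGISTRY[name]()
    # runtime_checkable isinstance only checks that the method exists,
    # not its signature or return type.
    if not isinstance(llm, TextLLM):
        raise TypeError(f"{name!r} does not implement TextLLM")
    return llm
```

Callers depend only on the protocol, so changing `make_llm("echo")` to `make_llm("upper")` is the whole swap.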

Quality gates

python -m eval.retrieval_eval      # recall@1 / MRR — retrieval mechanics over the demo embedder
python -m eval.generation_eval     # groundedness (faithfulness) of generated answers

Both run in CI. A retrieval regression (recall@1 below 0.80 or MRR below 0.85) or a faithfulness regression (a grounded answer scoring below threshold, or the metric failing to flag a hallucinated control) fails the build, so the change cannot merge.
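For reference, recall@1 and MRR can be computed as follows. This is a generic sketch of the two metrics with the thresholds quoted above, not the contents of `eval.retrieval_eval`; the function names and the document IDs in the example are hypothetical.

```python
def recall_at_k(ranked: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of queries whose gold document appears in the top k results."""
    hits = sum(1 for results, g in zip(ranked, gold) if g in results[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the gold document (0 when it is not retrieved)."""
    total = 0.0
    for results, g in zip(ranked, gold):
        if g in results:
            total += 1.0 / (results.index(g) + 1)
    return total / len(gold)


def passes_gate(recall: float, mrr: float,
                min_recall: float = 0.80, min_mrr: float = 0.85) -> bool:
    """Both thresholds must hold for the build to pass."""
    return recall >= min_recall and mrr >= min_mrr


if __name__ == "__main__":
    # Five queries: four rank the gold document first, one ranks it second.
    ranked = [["d1", "d2"], ["d2", "d1"], ["d3", "d1"], ["d4", "d2"], ["d1", "d5"]]
    gold = ["d1", "d2", "d3", "d4", "d5"]
    r1 = recall_at_k(ranked, gold, k=1)        # 4/5 = 0.8
    mrr = mean_reciprocal_rank(ranked, gold)   # (4 + 0.5) / 5 = 0.9
    raise SystemExit(0 if passes_gate(r1, mrr) else 1)  # non-zero exit fails CI
```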

groundedness is a cheap lexical tripwire, not a faithfulness guarantee. It scores token overlap, so by construction it is negation-blind ("X is not Y" looks grounded), dilutable (a false clause appended to a true answer only dents the score), and unable to verify propositional claims. It catches the common out-of-vocabulary hallucination signature cheaply on every generation; pair it with an LLM judge for semantic faithfulness. The limits are spelled out in the faithfulness module docstring and pinned by tests so they can't be quietly oversold later.
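A token-overlap score of this kind, and the limits just described, can be reproduced with a short sketch. The function below is an assumption about the general technique (the share of answer tokens that also occur in the contexts); the tokenisation and scoring of the package's own `groundedness` may differ, and `lexical_groundedness` is a hypothetical name.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; punctuation and hyphens split words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_groundedness(answer: str, contexts: list[str]) -> float:
    """Share of distinct answer tokens that also appear in any context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(_tokens(c) for c in contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


contexts = ["Qdrant is a vector database, not an in-process library"]

# Out-of-vocabulary hallucination: no overlap, so it is flagged.
hallucinated = lexical_groundedness("Pinecone uses quantum annealing", contexts)      # 0.0

# Negation-blind: the claim contradicts the context but every token matches.
negated = lexical_groundedness("Qdrant is not a vector database", contexts)           # 1.0

# Dilutable: the false clause only lowers the score to 6 of 8 tokens.
diluted = lexical_groundedness("Qdrant is a vector database written in COBOL", contexts)  # 0.75
```

The negated answer scoring a perfect 1.0 is the reason a semantic judge is still needed on top of this check.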

Engineering principles demonstrated

Swap by interface — LLMProtocol / VectorStoreProtocol make the model and the index runtime-swappable.
Degrade, don't crash — FAISS / Qdrant / OpenTelemetry / SentenceTransformers are lazily imported with working fallbacks; missing infra never hard-fails import.
Measured, not asserted — a retrieval eval gate, not just unit tests; packaged and CI-built end to end.
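The "degrade, don't crash" principle usually comes down to a guarded import plus a fallback that offers the same methods. The sketch below is illustrative: `optional_import`, `NumpyIndex`, and `make_index` are hypothetical names rather than the package's internals. The only FAISS call it relies on is `faiss.IndexFlatIP(dim)`, whose `add` and `search` methods the fallback imitates.

```python
import importlib
from types import ModuleType
from typing import Optional

import numpy as np


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it can be imported, otherwise None (never raises)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


class NumpyIndex:
    """Pure-NumPy inner-product index with FAISS-like add/search methods."""

    def __init__(self, dim: int) -> None:
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, x) -> None:
        self.vectors = np.vstack([self.vectors, np.asarray(x, dtype=np.float32)])

    def search(self, queries, k: int):
        q = np.asarray(queries, dtype=np.float32)         # shape (n_queries, dim)
        scores = q @ self.vectors.T                       # shape (n_queries, n_vectors)
        order = np.argsort(-scores, axis=1)[:, :k]        # best match first
        return np.take_along_axis(scores, order, axis=1), order


def make_index(dim: int, backend: str = "faiss"):
    """Use the native backend when importable, else fall back to NumPy."""
    module = optional_import(backend)
    if module is not None and hasattr(module, "IndexFlatIP"):
        return module.IndexFlatIP(dim)
    return NumpyIndex(dim)
```

Because the import happens inside `make_index` and not at module top level, importing this file succeeds even on a machine without the native library.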

Develop / test

pip install -e ".[dev]"     # installs FAISS + Qdrant + serve extras too
ruff check . && pytest && python -m eval.retrieval_eval

CI installs the native backends, so the FAISS and Qdrant tests run there (they skip only when those libraries are absent).

License

MIT — see LICENSE.

Release history

Version	Release date
0.1.2 (this version)	Jul 5, 2026
0.1.1	Jun 17, 2026
0.1.0	Jun 4, 2026

Download files

Source Distribution

rag_llm_infra-0.1.2.tar.gz (49.1 kB), uploaded Jul 5, 2026, Source

Built Distribution

rag_llm_infra-0.1.2-py3-none-any.whl (28.0 kB), uploaded Jul 5, 2026, Python 3

File details

Details for the file rag_llm_infra-0.1.2.tar.gz.

File metadata

Download URL: rag_llm_infra-0.1.2.tar.gz
Upload date: Jul 5, 2026
Size: 49.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for rag_llm_infra-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`3fb7fa69b4dbd4c8553638936e0fb4f0247c03958fd10d5855a63b2093d2d6c5`
MD5	`228907fcee319df3627bf5430fb5a693`
BLAKE2b-256	`1e89d8ea45ab9685efbc111980635bdf0c3ee501e395f6932c32e9eeb37c8896`

File details

Details for the file rag_llm_infra-0.1.2-py3-none-any.whl.

File metadata

Download URL: rag_llm_infra-0.1.2-py3-none-any.whl
Upload date: Jul 5, 2026
Size: 28.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for rag_llm_infra-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da8a5adb4a1a8660f5cc0243951f05957f3216e9af445110f1f2a43487c262ac`
MD5	`fb701d8d4743e77b3fb0dd0a3c09c3d3`
BLAKE2b-256	`4243949f7d4670c175b2584aa3bc66403c4d088aff37dc0c0ef03a85c42c48ed`
