A modular, research-backed RAG building block library
atomic-rag
Note: this is a work in progress. I'm still reviewing the generated code, though the examples work :-)
A modular Python library of research-backed RAG building blocks. Each component solves one specific failure mode of retrieval-augmented generation and can be used independently or composed into a full pipeline.
The design goal is the opposite of LangChain: no magic, no hidden abstractions. Every module has a clear input/output contract (DataPacket), is independently testable, and can be swapped without touching anything else.
Install
Quickstart — full local stack with Ollama:
pip install -e ".[all,dev]"
This installs every dependency needed to run the pipeline end-to-end (Ollama, ChromaDB, BM25, sentence-transformers, MarkItDown) plus the test suite. Then pull the required models:
ollama pull nomic-embed-text # embeddings
ollama pull llama3.2:3b # chat / reasoning
Alternative — requirements.txt:
pip install -r requirements.txt
pip install -e .
Pick only what you need:
pip install -e ".[dev]" # tests only — no runtime deps
pip install -e ".[retrieval]" # ChromaDB + BM25
pip install -e ".[reranker]" # cross-encoder reranking (optional)
pip install -e ".[ollama]" # local models via Ollama
pip install -e ".[openai]" # OpenAI API models
pip install -e ".[markitdown]" # PDF/PPTX/XLSX ingestion
pip install -e ".[ragas]" # Ragas evaluation metrics
Quick Start
Ingest a PDF or Office document:
from atomic_rag.ingestion import MarkItDownIngestor
ingestor = MarkItDownIngestor()
docs = ingestor.ingest("reports/q4-2024.pdf")
for doc in docs:
    print(f"[{doc.chunk_index}] {doc.content[:80]}...")
Ingest a Python codebase (AST-based chunking):
from atomic_rag.ingestion import CodeIngestor
ingestor = CodeIngestor()
docs = ingestor.ingest_directory("src/") # walks recursively, ignores __pycache__ etc.
for doc in docs:
    print(f"[{doc.chunk_index}] {doc.metadata['type']:<8} {doc.metadata.get('name', '')} ({doc.source})")
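To make "AST-based chunking" concrete, here is a minimal sketch that uses only the standard-library `ast` module. It is not `CodeIngestor`'s actual implementation: `CodeChunk` and `chunk_python_source` are hypothetical names, and the `type` and `name` fields only mirror the metadata keys printed above.

```python
import ast
from dataclasses import dataclass


@dataclass
class CodeChunk:
    """Hypothetical chunk record; not the library's document type."""
    type: str      # "function" or "class"
    name: str
    content: str


def chunk_python_source(source: str) -> list[CodeChunk]:
    """Split a module into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # imports, assignments, etc. are skipped in this sketch
        # get_source_segment returns the source text spanned by the node
        text = ast.get_source_segment(source, node)
        chunks.append(CodeChunk(type=kind, name=node.name, content=text))
    return chunks


SAMPLE = '''
import os

def load(path):
    return open(path).read()

class Index:
    def add(self, doc):
        pass
'''

chunks = chunk_python_source(SAMPLE)
for i, chunk in enumerate(chunks):
    print(f"[{i}] {chunk.type:<8} {chunk.name}")
```

Chunking on syntax boundaries keeps each function or class intact, which a fixed-size text splitter cannot guarantee.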
Development
pytest # run all tests (integration tests excluded)
pytest -m integration # run integration tests (requires real dependencies)
pytest tests/test_ingestion.py # run a single test file
pytest tests/test_ingestion.py::TestMarkdownChunker::test_splits_on_h2_headers # single test
pytest --cov=atomic_rag --cov-report=term-missing # with coverage
Architecture
All modules communicate through a single DataPacket object that accumulates state as it moves through the pipeline. Modules never mutate their input — they return a copy with their output fields populated.
DataPacket(query="...")
-> [Phase 2] expanded_queries populated
-> [Phase 3] documents populated (retrieved + reranked, with scores)
-> [Phase 4] context populated (compressed string for the LLM)
-> [Phase 5] answer populated
-> [Eval] eval_scores populated (faithfulness, answer_relevance, context_precision)
Each phase also appends a TraceEntry to packet.trace for observability.
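The "never mutate, return a copy" contract can be sketched with frozen dataclasses. This is an illustration under assumptions, not the library's real `DataPacket`: the field types, the `TraceEntry` fields, and the `expand_query` phase are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TraceEntry:
    """Hypothetical trace record: which phase ran and a short note."""
    phase: str
    note: str


@dataclass(frozen=True)
class DataPacket:
    """Hypothetical stand-in for the library's DataPacket."""
    query: str
    expanded_queries: tuple[str, ...] = ()
    context: str = ""
    answer: str = ""
    trace: tuple[TraceEntry, ...] = ()


def expand_query(packet: DataPacket) -> DataPacket:
    """Toy Phase 2: returns a new packet and leaves the input untouched."""
    variants = (packet.query, f"{packet.query} explained")
    entry = TraceEntry(phase="query", note=f"{len(variants)} variants")
    # replace() builds a copy with only the named fields changed
    return replace(packet, expanded_queries=variants, trace=packet.trace + (entry,))


original = DataPacket(query="what is RRF?")
expanded = expand_query(original)

print(original.expanded_queries)           # still empty
print(expanded.expanded_queries)
print([e.phase for e in expanded.trace])
```

Because every phase returns a fresh packet, any intermediate state can be kept around and inspected after the pipeline has run.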
Phases
| Phase | Problem solved | Key technique | Status |
|---|---|---|---|
| 1 — Ingestion | Messy PDFs destroy table/header structure | Markdown-native parsing (MarkItDown) + AST-based code chunking | done |
| 3 — Retrieval | Vector search misses keywords and acronyms | Hybrid search (vector + BM25) + RRF + cross-encoder reranking | done |
| 4 — Context | LLMs ignore information buried mid-context | Sentence-level cosine filtering (SentenceCompressor) | done |
| 2 — Query | Vague queries miss the relevant documents | HyDE + multi-query expansion | done |
| 5 — Agent | Hallucinations when retrieved context is insufficient | Corrective RAG (C-RAG) with evaluator + fallback | done |
| Eval | No visibility into where the pipeline fails | Faithfulness + answer relevance + Ragas integration | done |
Phase 3 comes before Phase 2 intentionally: hybrid retrieval delivers the highest quality improvement per unit of work, and query intelligence (Phase 2) has diminishing returns until retrieval is solid.
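For readers new to RRF (Reciprocal Rank Fusion), the fusion step in hybrid search can be sketched in a few lines of plain Python. This is a generic illustration, not atomic-rag's retrieval code: the function name and the document ids are made up, and `k = 60` is a commonly used default that the library may or may not share.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids; each list is ordered best-first.

    score(d) = sum over lists of 1 / (k + rank of d), with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["doc_a", "doc_b", "doc_c"]   # made-up ids, best first
bm25_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}  {score:.4f}")
```

RRF uses only rank positions, so vector similarities and BM25 scores never have to be normalized onto a common scale before they are combined.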
Tech Stack
| Layer | Library |
|---|---|
| Parsing | Microsoft MarkItDown (swap: Docling) |
| Vector store | ChromaDB (swap: Qdrant) |
| Keyword search | rank-bm25 |
| Reranking | sentence-transformers cross-encoders |
| LLM / Embedder | Ollama (swap: OpenAI, or any ChatModelBase) |
| Evaluation | Built-in scorers + optional Ragas integration |
Docs
Start at docs/index.md — it has a guided reading order, a full table of contents, and a pipeline diagram.
Quick links:
- DataPacket contract
- Ingestion module
- Retrieval module
- Hybrid search technique
- Cross-encoder reranking
- Markdown-native parsing
- Context module
- Context compression technique
- Query module
- HyDE technique
- Multi-query expansion technique
- Agent module
- Corrective RAG technique
- Evaluation module
- Swapping backends guide
Examples
examples/code_qa/: full pipeline demo that indexes a Python codebase and answers questions via retrieval + compression + C-RAG
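As a rough illustration of the Corrective RAG idea (an evaluator plus a fallback), the control flow can be sketched as below. This is not the code in examples/code_qa/ and not the library's agent API: `corrective_answer`, the callables it takes, the toy scorer, and the `0.5` threshold are all hypothetical.

```python
from typing import Callable


def corrective_answer(
    query: str,
    context: str,
    evaluate: Callable[[str, str], float],
    generate: Callable[[str, str], str],
    fallback: Callable[[str], str],
    threshold: float = 0.5,
) -> tuple[str, str]:
    """Return (answer, route). All callables and the threshold are hypothetical."""
    score = evaluate(query, context)  # how well does the context cover the query?
    if score >= threshold:
        return generate(query, context), "grounded"
    # Context judged insufficient: do not let the generator guess
    return fallback(query), "fallback"


# Toy stand-ins so the sketch runs without an LLM
def keyword_overlap(query: str, context: str) -> float:
    """Fraction of query words that appear somewhere in the context."""
    words = set(query.lower().split())
    hits = sum(1 for word in words if word in context.lower())
    return hits / len(words) if words else 0.0


def toy_generate(query: str, context: str) -> str:
    return f"Based on the context: {context}"


def toy_fallback(query: str) -> str:
    return "I could not find enough context to answer."


good = corrective_answer(
    "hybrid search",
    "Hybrid search combines vector and BM25 results.",
    keyword_overlap, toy_generate, toy_fallback,
)
bad = corrective_answer(
    "hybrid search",
    "The cafeteria opens at nine.",
    keyword_overlap, toy_generate, toy_fallback,
)
print(good[1], "|", bad[1])
```

In a real pipeline the evaluator would typically be an LLM or a trained grader rather than keyword overlap, and the fallback might re-query or search elsewhere instead of declining to answer.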
File details
Details for the file atomic_rag_lib-0.1.0.tar.gz.
File metadata
- Download URL: atomic_rag_lib-0.1.0.tar.gz
- Upload date:
- Size: 59.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e0b693fadc8f0db551ae38ffc956f128b9d7e9fe88b7ae360c0ad56fc13ff4c0 |
| MD5 | d37f7d13c43b2f15c2437dbfe53b5391 |
| BLAKE2b-256 | 527e16a806c51111374531e8017817cd4756eb892c4948441ae156d8a5973687 |
File details
Details for the file atomic_rag_lib-0.1.0-py3-none-any.whl.
File metadata
- Download URL: atomic_rag_lib-0.1.0-py3-none-any.whl
- Upload date:
- Size: 46.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 87606f030707797523b73812b2142b16a93d407d90b099bc2d708dae288c9487 |
| MD5 | 41ccbb60e03a446e78615a035ad65865 |
| BLAKE2b-256 | 6417234dde2b0a44c2c0274b3f26bb7371a2c8e2365b9baffa9bf56f14fff104 |