atomic-rag

Note: WIP. I'm still reviewing the generated code, though the examples work :-)

A modular Python library of research-backed RAG building blocks. Each component solves one specific failure mode of retrieval-augmented generation and can be used independently or composed into a full pipeline.

The design goal is the opposite of LangChain: no magic, no hidden abstractions. Every module has a clear input/output contract (DataPacket), is independently testable, and can be swapped without touching anything else.


Install

Full local stack with Ollama:

pip install "atomic-rag-lib[all]"

Then pull the required Ollama models:

ollama pull nomic-embed-text   # embeddings
ollama pull llama3.2:3b        # chat / reasoning

Pick only what you need:

pip install atomic-rag-lib                     # core only (DataPacket + schema)
pip install "atomic-rag-lib[ollama]"           # local models via Ollama
pip install "atomic-rag-lib[openai]"           # OpenAI API models
pip install "atomic-rag-lib[retrieval]"        # ChromaDB + BM25
pip install "atomic-rag-lib[markitdown]"       # PDF/PPTX/XLSX ingestion
pip install "atomic-rag-lib[reranker]"         # cross-encoder reranking
pip install "atomic-rag-lib[ragas]"            # Ragas evaluation metrics
pip install "atomic-rag-lib[all]"              # everything above

For development (clone the repo first):

git clone https://github.com/rohinp/atomic-rag
cd atomic-rag
pip install -e ".[all,dev]"

Quick Start

Ingest a PDF or Office document:

from atomic_rag.ingestion import MarkItDownIngestor

ingestor = MarkItDownIngestor()
docs = ingestor.ingest("reports/q4-2024.pdf")

for doc in docs:
    print(f"[{doc.chunk_index}] {doc.content[:80]}...")

Ingest a Python codebase (AST-based chunking):

from atomic_rag.ingestion import CodeIngestor

ingestor = CodeIngestor()
docs = ingestor.ingest_directory("src/")  # walks recursively, ignores __pycache__ etc.

for doc in docs:
    print(f"[{doc.chunk_index}] {doc.metadata['type']:<8} {doc.metadata.get('name', '')}  ({doc.source})")

Development

pytest                          # run all tests (integration tests excluded)
pytest -m integration           # run integration tests (requires real dependencies)
pytest tests/test_ingestion.py  # run a single test file
pytest tests/test_ingestion.py::TestMarkdownChunker::test_splits_on_h2_headers  # single test
pytest --cov=atomic_rag --cov-report=term-missing  # with coverage

Architecture

All modules communicate through a single DataPacket object that accumulates state as it moves through the pipeline. Modules never mutate their input — they return a copy with their output fields populated.

DataPacket(query="...")
  -> [Phase 2] expanded_queries populated
  -> [Phase 3] documents populated (retrieved + reranked, with scores)
  -> [Phase 4] context populated (compressed string for the LLM)
  -> [Phase 5] answer populated
  -> [Eval]    eval_scores populated (faithfulness, answer_relevance, context_precision)

Each phase also appends a TraceEntry to packet.trace for observability.
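The accumulate-by-copy contract can be sketched in plain Python. This is an illustrative stand-in, not the library's actual `DataPacket` implementation; the `Packet` class, `expand` stage, and trace strings here are hypothetical, though the field names mirror the pipeline above.

```python
from dataclasses import dataclass, replace

# Hypothetical sketch of the immutable-accumulation pattern:
# each stage returns a copy with its own fields populated,
# and never mutates its input.

@dataclass(frozen=True)
class Packet:
    query: str
    expanded_queries: tuple = ()
    documents: tuple = ()
    context: str = ""
    answer: str = ""
    trace: tuple = ()

def expand(p: Packet) -> Packet:
    # Phase 2: add query variants and record a trace entry.
    variants = (p.query, p.query + " (rephrased)")
    return replace(p, expanded_queries=variants, trace=p.trace + ("expand",))

p0 = Packet(query="What is RRF?")
p1 = expand(p0)
assert p0.expanded_queries == ()       # input untouched
assert len(p1.expanded_queries) == 2   # output accumulated
```

Because each stage only reads its input and returns a new copy, any stage can be unit-tested in isolation or swapped out without side effects elsewhere.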

Phases

| Phase | Problem solved | Key technique | Status |
|-------|----------------|---------------|--------|
| 1 — Ingestion | Messy PDFs destroy table/header structure | Markdown-native parsing (MarkItDown) + AST-based code chunking | done |
| 3 — Retrieval | Vector search misses keywords and acronyms | Hybrid search (vector + BM25) + RRF + cross-encoder reranking | done |
| 4 — Context | LLMs ignore information buried mid-context | Sentence-level cosine filtering (SentenceCompressor) | done |
| 2 — Query | Vague queries miss the relevant documents | HyDE + multi-query expansion | done |
| 5 — Agent | Hallucinations when retrieved context is insufficient | Corrective RAG (C-RAG) with evaluator + fallback | done |
| Eval | No visibility into where the pipeline fails | Faithfulness + answer relevance + Ragas integration | done |
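The Phase 4 technique (sentence-level cosine filtering) can be sketched with toy embeddings. The `cosine`, `compress`, and threshold names below are illustrative, not the SentenceCompressor's actual API; a real pipeline would embed sentences with a model such as nomic-embed-text.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 if either is zero-length.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compress(query_vec, sentences, sentence_vecs, threshold=0.5):
    # Keep only sentences whose embedding is close enough to the query,
    # shrinking the context so relevant facts aren't buried mid-prompt.
    kept = [s for s, v in zip(sentences, sentence_vecs)
            if cosine(query_vec, v) >= threshold]
    return " ".join(kept)

q = [1.0, 0.0]
sents = ["Relevant fact.", "Off-topic aside."]
vecs = [[0.9, 0.1], [0.0, 1.0]]
print(compress(q, sents, vecs))  # → Relevant fact.
```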

Phase 3 before Phase 2 is intentional — hybrid retrieval delivers the highest quality improvement per unit of work. Query intelligence (Phase 2) has diminishing returns until retrieval is solid.
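The RRF step in hybrid retrieval is simple enough to show in full: each ranker contributes 1/(k + rank) per document, and documents ranked well by both vector and BM25 search float to the top. This is a generic sketch of Reciprocal Rank Fusion, not the library's internal code; `k=60` is the value commonly used in the RRF literature.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: fuse several ranked lists of doc ids.

    rankings: list of lists, each ordered best-first.
    Returns doc ids sorted by fused score, best-first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d2"]   # dense retrieval order
bm25_hits = ["d1", "d4", "d3"]     # keyword retrieval order
fused = rrf([vector_hits, bm25_hits])
print(fused)  # → ['d1', 'd3', 'd4', 'd2']
```

Note that `d1` wins even though neither ranker put it first: appearing high in both lists beats a single first-place finish, which is exactly the behavior that makes RRF robust to either retriever's blind spots.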

Tech Stack

| Layer | Library |
|-------|---------|
| Parsing | Microsoft MarkItDown (swap: Docling) |
| Vector store | ChromaDB (swap: Qdrant) |
| Keyword search | rank-bm25 |
| Reranking | sentence-transformers cross-encoders |
| LLM / Embedder | Ollama (swap: OpenAI, or any ChatModelBase) |
| Evaluation | Built-in scorers + optional Ragas integration |

Docs

Start at docs/index.md — it has a guided reading order, a full table of contents, and a pipeline diagram.

