
atomic-rag

A modular, research-backed RAG building block library

Note: work in progress. I'm still reviewing the generated code, though the examples work :-).

A modular Python library of research-backed RAG building blocks. Each component solves one specific failure mode of retrieval-augmented generation and can be used independently or composed into a full pipeline.

The design goal is the opposite of LangChain: no magic, no hidden abstractions. Every module has a clear input/output contract (DataPacket), is independently testable, and can be swapped without touching anything else.
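The contract idea can be sketched in a few lines. This is illustrative only: the field names mirror the pipeline description further down, but the real DataPacket definition in atomic_rag may differ.

```python
from dataclasses import dataclass, replace

# Hypothetical minimal DataPacket; field names follow this README's
# pipeline description, not the library's actual source.
@dataclass(frozen=True)
class DataPacket:
    query: str
    expanded_queries: tuple = ()
    documents: tuple = ()
    context: str = ""
    answer: str = ""

# A "module" is then just a callable DataPacket -> DataPacket, so
# swapping one implementation never touches the rest of the pipeline.
def naive_query_expander(packet: DataPacket) -> DataPacket:
    return replace(packet, expanded_queries=(packet.query, packet.query.lower()))

p = naive_query_expander(DataPacket(query="What Is RRF?"))
```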


Install

Quickstart — full local stack with Ollama:

pip install -e ".[all,dev]"

This installs every dependency needed to run the pipeline end-to-end (the Ollama client, ChromaDB, rank-bm25, sentence-transformers, MarkItDown) plus the test suite. Then pull the required models:

ollama pull nomic-embed-text   # embeddings
ollama pull llama3.2:3b        # chat / reasoning

Alternative — requirements.txt:

pip install -r requirements.txt
pip install -e .

Pick only what you need:

pip install -e ".[dev]"          # tests only — no runtime deps
pip install -e ".[retrieval]"    # ChromaDB + BM25
pip install -e ".[reranker]"     # cross-encoder reranking (optional)
pip install -e ".[ollama]"       # local models via Ollama
pip install -e ".[openai]"       # OpenAI API models
pip install -e ".[markitdown]"   # PDF/PPTX/XLSX ingestion
pip install -e ".[ragas]"        # Ragas evaluation metrics

Quick Start

Ingest a PDF or Office document:

from atomic_rag.ingestion import MarkItDownIngestor

ingestor = MarkItDownIngestor()
docs = ingestor.ingest("reports/q4-2024.pdf")

for doc in docs:
    print(f"[{doc.chunk_index}] {doc.content[:80]}...")

Ingest a Python codebase (AST-based chunking):

from atomic_rag.ingestion import CodeIngestor

ingestor = CodeIngestor()
docs = ingestor.ingest_directory("src/")  # walks recursively, ignores __pycache__ etc.

for doc in docs:
    print(f"[{doc.chunk_index}] {doc.metadata['type']:<8} {doc.metadata.get('name', '')}  ({doc.source})")

Development

pytest                          # run all tests (integration tests excluded)
pytest -m integration           # run integration tests (requires real dependencies)
pytest tests/test_ingestion.py  # run a single test file
pytest tests/test_ingestion.py::TestMarkdownChunker::test_splits_on_h2_headers  # single test
pytest --cov=atomic_rag --cov-report=term-missing  # with coverage

Architecture

All modules communicate through a single DataPacket object that accumulates state as it moves through the pipeline. Modules never mutate their input — they return a copy with their output fields populated.

DataPacket(query="...")
  -> [Phase 2] expanded_queries populated
  -> [Phase 3] documents populated (retrieved + reranked, with scores)
  -> [Phase 4] context populated (compressed string for the LLM)
  -> [Phase 5] answer populated
  -> [Eval]    eval_scores populated (faithfulness, answer_relevance, context_precision)

Each phase also appends a TraceEntry to packet.trace for observability.
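Under these assumptions (a frozen packet, a trace kept as an append-only sequence), the copy-not-mutate discipline looks roughly like the sketch below. The TraceEntry shape here is a guess; the library's version is likely richer.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TraceEntry:   # hypothetical shape, for illustration only
    phase: str
    note: str

@dataclass(frozen=True)
class Packet:       # stand-in for the library's DataPacket
    query: str
    context: str = ""
    trace: tuple = ()

def run_phase(packet: Packet, phase: str, **updates) -> Packet:
    # return a copy with output fields populated and a trace entry appended;
    # the input packet is never mutated
    entry = TraceEntry(phase=phase, note=f"{phase} ok")
    return replace(packet, trace=packet.trace + (entry,), **updates)

p0 = Packet(query="q")
p1 = run_phase(p0, "phase4", context="compressed context")
```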

Phases

Phase 1 — Ingestion (done)
  Problem:   Messy PDFs destroy table/header structure
  Technique: Markdown-native parsing (MarkItDown) + AST-based code chunking

Phase 3 — Retrieval (done)
  Problem:   Vector search misses keywords and acronyms
  Technique: Hybrid search (vector + BM25) + RRF + cross-encoder reranking

Phase 4 — Context (done)
  Problem:   LLMs ignore information buried mid-context
  Technique: Sentence-level cosine filtering (SentenceCompressor)

Phase 2 — Query (done)
  Problem:   Vague queries miss the relevant documents
  Technique: HyDE + multi-query expansion

Phase 5 — Agent (done)
  Problem:   Hallucinations when retrieved context is insufficient
  Technique: Corrective RAG (C-RAG) with evaluator + fallback

Eval (done)
  Problem:   No visibility into where the pipeline fails
  Technique: Faithfulness + answer relevance + Ragas integration
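The sentence-level filtering idea from Phase 4 can be sketched without any embedding model: score each sentence against the query by cosine similarity and keep only those above a threshold. The function names and the threshold below are assumptions, not the SentenceCompressor API, and toy vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compress(query_vec, sentences, sentence_vecs, threshold=0.5):
    # keep only sentences whose embedding is similar enough to the query
    kept = [s for s, v in zip(sentences, sentence_vecs)
            if cosine(query_vec, v) >= threshold]
    return " ".join(kept)

context = compress(
    query_vec=[1.0, 0.0],
    sentences=["Relevant fact.", "Off-topic aside."],
    sentence_vecs=[[0.9, 0.1], [0.0, 1.0]],
)
```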

Phase 3 before Phase 2 is intentional — hybrid retrieval delivers the highest quality improvement per unit of work. Query intelligence (Phase 2) has diminishing returns until retrieval is solid.
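The RRF step in hybrid retrieval is small enough to show in full. This is the standard reciprocal rank fusion formula, score(d) = sum over ranked lists of 1 / (k + rank of d), not atomic_rag's actual implementation:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: merge ranked id lists (e.g. from vector
    # search and BM25) into one ranking; k=60 is the commonly used constant.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # ids from dense retrieval
bm25_hits = ["d1", "d9", "d3"]     # ids from keyword retrieval
fused = rrf_fuse([vector_hits, bm25_hits])
```

Documents that appear high in both lists (here "d1") float to the top even though neither retriever ranked them first.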

Tech Stack

Parsing:         Microsoft MarkItDown (swap: Docling)
Vector store:    ChromaDB (swap: Qdrant)
Keyword search:  rank-bm25
Reranking:       sentence-transformers cross-encoders
LLM / Embedder:  Ollama (swap: OpenAI, or any ChatModelBase)
Evaluation:      Built-in scorers + optional Ragas integration

Docs

Start at docs/index.md — it has a guided reading order, a full table of contents, and a pipeline diagram.

Examples

  • examples/code_qa/ — full pipeline demo: indexes a Python codebase and answers questions via retrieval + compression + C-RAG
