
atomic-rag

A modular, research-backed RAG building block library

Note: work in progress. I'm still reviewing the generated code, though the examples work :-).

A modular Python library of research-backed RAG building blocks. Each component solves one specific failure mode of retrieval-augmented generation and can be used independently or composed into a full pipeline.

The design goal is the opposite of LangChain: no magic, no hidden abstractions. Every module has a clear input/output contract (DataPacket), is independently testable, and can be swapped without touching anything else.
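The contract idea can be sketched in a few lines. This is illustrative only: the field names mirror the pipeline description further down, but the real DataPacket definition in atomic_rag may differ.

```python
from dataclasses import dataclass, replace

# Hypothetical minimal DataPacket; field names follow this README's
# pipeline description, not the library's actual source.
@dataclass(frozen=True)
class DataPacket:
    query: str
    expanded_queries: tuple = ()
    documents: tuple = ()
    context: str = ""
    answer: str = ""

# A "module" is then just a callable DataPacket -> DataPacket, so
# swapping one implementation never touches the rest of the pipeline.
def naive_query_expander(packet: DataPacket) -> DataPacket:
    return replace(packet, expanded_queries=(packet.query, packet.query.lower()))

p = naive_query_expander(DataPacket(query="What Is RRF?"))
```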


Install

Quickstart — full local stack with Ollama:

pip install -e ".[all,dev]"

This installs every dependency needed to run the pipeline end-to-end (the Ollama client, ChromaDB, rank-bm25, sentence-transformers, MarkItDown) plus the test suite. Then pull the required models:

ollama pull nomic-embed-text   # embeddings
ollama pull llama3.2:3b        # chat / reasoning

Alternative — requirements.txt:

pip install -r requirements.txt
pip install -e .

Pick only what you need:

pip install -e ".[dev]"          # tests only — no runtime deps
pip install -e ".[retrieval]"    # ChromaDB + BM25
pip install -e ".[reranker]"     # cross-encoder reranking (optional)
pip install -e ".[ollama]"       # local models via Ollama
pip install -e ".[openai]"       # OpenAI API models
pip install -e ".[markitdown]"   # PDF/PPTX/XLSX ingestion
pip install -e ".[ragas]"        # Ragas evaluation metrics

Quick Start

Ingest a PDF or Office document:

from atomic_rag.ingestion import MarkItDownIngestor

ingestor = MarkItDownIngestor()
docs = ingestor.ingest("reports/q4-2024.pdf")

for doc in docs:
    print(f"[{doc.chunk_index}] {doc.content[:80]}...")

Ingest a Python codebase (AST-based chunking):

from atomic_rag.ingestion import CodeIngestor

ingestor = CodeIngestor()
docs = ingestor.ingest_directory("src/")  # walks recursively, ignores __pycache__ etc.

for doc in docs:
    print(f"[{doc.chunk_index}] {doc.metadata['type']:<8} {doc.metadata.get('name', '')}  ({doc.source})")

Development

pytest                          # run all tests (integration tests excluded)
pytest -m integration           # run integration tests (requires real dependencies)
pytest tests/test_ingestion.py  # run a single test file
pytest tests/test_ingestion.py::TestMarkdownChunker::test_splits_on_h2_headers  # single test
pytest --cov=atomic_rag --cov-report=term-missing  # with coverage

Architecture

All modules communicate through a single DataPacket object that accumulates state as it moves through the pipeline. Modules never mutate their input — they return a copy with their output fields populated.

DataPacket(query="...")
  -> [Phase 2] expanded_queries populated
  -> [Phase 3] documents populated (retrieved + reranked, with scores)
  -> [Phase 4] context populated (compressed string for the LLM)
  -> [Phase 5] answer populated
  -> [Eval]    eval_scores populated (faithfulness, answer_relevance, context_precision)

Each phase also appends a TraceEntry to packet.trace for observability.
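Under these assumptions (a frozen packet, a trace kept as an append-only sequence), the copy-not-mutate discipline looks roughly like the sketch below. The TraceEntry shape here is a guess; the library's version is likely richer.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TraceEntry:   # hypothetical shape, for illustration only
    phase: str
    note: str

@dataclass(frozen=True)
class Packet:       # stand-in for the library's DataPacket
    query: str
    context: str = ""
    trace: tuple = ()

def run_phase(packet: Packet, phase: str, **updates) -> Packet:
    # return a copy with output fields populated and a trace entry appended;
    # the input packet is never mutated
    entry = TraceEntry(phase=phase, note=f"{phase} ok")
    return replace(packet, trace=packet.trace + (entry,), **updates)

p0 = Packet(query="q")
p1 = run_phase(p0, "phase4", context="compressed context")
```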

Phases

Phase 1 — Ingestion (done)
  Problem:   Messy PDFs destroy table/header structure
  Technique: Markdown-native parsing (MarkItDown) + AST-based code chunking

Phase 3 — Retrieval (done)
  Problem:   Vector search misses keywords and acronyms
  Technique: Hybrid search (vector + BM25) + RRF + cross-encoder reranking

Phase 4 — Context (done)
  Problem:   LLMs ignore information buried mid-context
  Technique: Sentence-level cosine filtering (SentenceCompressor)

Phase 2 — Query (done)
  Problem:   Vague queries miss the relevant documents
  Technique: HyDE + multi-query expansion

Phase 5 — Agent (done)
  Problem:   Hallucinations when retrieved context is insufficient
  Technique: Corrective RAG (C-RAG) with evaluator + fallback

Eval (done)
  Problem:   No visibility into where the pipeline fails
  Technique: Faithfulness + answer relevance + Ragas integration
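The sentence-level filtering idea from Phase 4 can be sketched without any embedding model: score each sentence against the query by cosine similarity and keep only those above a threshold. The function names and the threshold below are assumptions, not the SentenceCompressor API, and toy vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compress(query_vec, sentences, sentence_vecs, threshold=0.5):
    # keep only sentences whose embedding is similar enough to the query
    kept = [s for s, v in zip(sentences, sentence_vecs)
            if cosine(query_vec, v) >= threshold]
    return " ".join(kept)

context = compress(
    query_vec=[1.0, 0.0],
    sentences=["Relevant fact.", "Off-topic aside."],
    sentence_vecs=[[0.9, 0.1], [0.0, 1.0]],
)
```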

Phase 3 before Phase 2 is intentional — hybrid retrieval delivers the highest quality improvement per unit of work. Query intelligence (Phase 2) has diminishing returns until retrieval is solid.
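The RRF step in hybrid retrieval is small enough to show in full. This is the standard reciprocal rank fusion formula, score(d) = sum over ranked lists of 1 / (k + rank of d), not atomic_rag's actual implementation:

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: merge ranked id lists (e.g. from vector
    # search and BM25) into one ranking; k=60 is the commonly used constant.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]   # ids from dense retrieval
bm25_hits = ["d1", "d9", "d3"]     # ids from keyword retrieval
fused = rrf_fuse([vector_hits, bm25_hits])
```

Documents that appear high in both lists (here "d1") float to the top even though neither retriever ranked them first.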

Tech Stack

Parsing:         Microsoft MarkItDown (swap: Docling)
Vector store:    ChromaDB (swap: Qdrant)
Keyword search:  rank-bm25
Reranking:       sentence-transformers cross-encoders
LLM / Embedder:  Ollama (swap: OpenAI, or any ChatModelBase)
Evaluation:      Built-in scorers + optional Ragas integration

Docs

Start at docs/index.md — it has a guided reading order, a full table of contents, and a pipeline diagram.

Examples

  • examples/code_qa/ — full pipeline demo: indexes a Python codebase and answers questions via retrieval + compression + C-RAG
