Core utilities for document processing, RAG configuration, querying, and evaluation.

ragbandit-core

Core utilities for:

  • Document ingestion & processing (OCR, chunking, embedding)
  • Building and running Retrieval-Augmented Generation (RAG) pipelines
  • Evaluating answers with automated metrics

Test Coverage

The codebase maintains 87% test coverage with comprehensive integration tests covering all major components. See tests/README.md for details on running tests and coverage reports.

Quick start

pip install ragbandit-core

from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    FootnoteRefiner,
    MistralOCR,
    MistralEmbedder,
    SemanticChunker
)
import os
import logging
from dotenv import load_dotenv
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
    ocr_processor=MistralOCR(api_key=MISTRAL_API_KEY),
    refiners=[
        ReferencesRefiner(api_key=MISTRAL_API_KEY),
        FootnoteRefiner(api_key=MISTRAL_API_KEY),
    ],
)

extended_response = doc_pipeline.process(file_path)
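
The quick start reads the key with os.getenv, which returns None when the variable is missing, so a misconfigured .env file only surfaces later as an API error. A minimal sketch of a fail-fast check follows; require_env is a hypothetical helper written for this example, not part of ragbandit-core.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; ragbandit-core does not ship it.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to your .env file or export it in your shell."
        )
    return value
```

Calling MISTRAL_API_KEY = require_env("MISTRAL_API_KEY") right after load_dotenv() would stop the script before any component is constructed with an empty key.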

Using Alternative OCR and Embedding Providers

The package supports multiple OCR and embedding providers:

from ragbandit.documents import (
    DocumentPipeline,
    DatalabOCR,
    OpenAIEmbedder,
    FixedSizeChunker
)
import os

DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

file_path = "./data/raw/[document_name].pdf"

# Using Datalab OCR and OpenAI embeddings
doc_pipeline = DocumentPipeline(
    ocr_processor=DatalabOCR(
        api_key=DATALAB_API_KEY,
        model="marker",
        mode="balanced"  # Options: fast, balanced, accurate
    ),
    chunker=FixedSizeChunker(chunk_size=500, overlap=100),
    embedder=OpenAIEmbedder(
        api_key=OPENAI_API_KEY,
        model="text-embedding-3-small"  # or text-embedding-3-large
    ),
)

result = doc_pipeline.process(file_path)

Running Steps Manually

For more control, you can run each pipeline step independently:

from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    MistralOCR,
    MistralEmbedder,
    SemanticChunker
)
import os
from dotenv import load_dotenv
load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCR(api_key=MISTRAL_API_KEY),
    refiners=[ReferencesRefiner(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run refiners (optional)
refining_results = pipeline.run_refiners(ocr_result)
final_doc = refining_results[-1]  # Get the last refiner's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)
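
One reason to run the steps manually is to see where the time goes. The sketch below is a generic timing wrapper that uses only the standard library; timed is a hypothetical helper for this example and makes no assumption about what the pipeline methods return.

```python
import time
from typing import Any, Callable


def timed(label: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn(*args, **kwargs), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result
```

With the pipeline above, ocr_result = timed("ocr", pipeline.run_ocr, file_path) would behave like the direct call while also printing the elapsed time for that stage.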

You can also use components independently without a pipeline. The snippet below reuses the imports, MISTRAL_API_KEY, and file_path from the previous example:

# Run OCR directly - Mistral
ocr = MistralOCR(api_key=MISTRAL_API_KEY)
ocr_result = ocr.process(file_path)

# Or use Datalab OCR
from ragbandit.documents import DatalabOCR
DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
datalab_ocr = DatalabOCR(
    api_key=DATALAB_API_KEY,
    mode="accurate",
    max_pages=10  # Optional: limit pages processed
)
ocr_result = datalab_ocr.process(file_path)

# Run refiners directly
from ragbandit.documents import FootnoteRefiner
refiner = FootnoteRefiner(api_key=MISTRAL_API_KEY)
refined_result = refiner.process(ocr_result)

# Run chunker directly
chunker = SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500)
chunk_result = chunker.chunk(refined_result)

# Run embedder directly - Mistral
embedder = MistralEmbedder(api_key=MISTRAL_API_KEY)
embedding_result = embedder.embed_chunks(chunk_result)

# Or use OpenAI embeddings
from ragbandit.documents import OpenAIEmbedder
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
openai_embedder = OpenAIEmbedder(
    api_key=OPENAI_API_KEY,
    model="text-embedding-3-large"  # Higher quality, larger dimensions
)
embedding_result = openai_embedder.embed_chunks(chunk_result)

Available Components

OCR

| Class | Provider | Models | Key params |
| --- | --- | --- | --- |
| MistralOCR | Mistral | mistral-ocr-2512 (default), mistral-ocr-2505 | api_key, model |
| DatalabOCR | Datalab | marker | api_key, mode (fast / balanced / accurate), max_pages, page_range |

Refiners

| Class | What it does |
| --- | --- |
| ReferencesRefiner | Detects and extracts the references/bibliography section. Stores in extracted_data["references_markdown"]. |
| FootnoteRefiner | Detects footnotes, inlines explanations, and collects citations. |
| TableOfContentsRefiner | Detects and removes the table of contents. Stores in extracted_data["toc_markdown"]. |

Chunkers

| Class | Params (defaults) | When to use |
| --- | --- | --- |
| FixedSizeChunker | chunk_size=1000, overlap=200 | Fast, deterministic splitting by character count |
| SentenceChunker | sentences_per_chunk=5, sentence_overlap=1, min_chunk_size=100 | Sentence-aware sliding window, no external deps |
| RecursiveMarkdownChunker | chunk_size=1000, overlap=100 | Heading-aware hierarchical splitting (H1→H2→H3→H4→paragraph→sentence) |
| SemanticChunker | api_key, min_chunk_size=500 | LLM-based semantic boundary detection (uses Mistral) |
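
To make the chunk_size / overlap and sentences_per_chunk / sentence_overlap parameters concrete, here is a minimal sketch of the two sliding-window ideas. It illustrates the general technique only; it is not the library's implementation, the function names are hypothetical, the sentence splitter is a naive regex, and min_chunk_size is not modelled.

```python
import re


def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows; neighbours share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks


def sentence_chunks(
    text: str, sentences_per_chunk: int = 5, sentence_overlap: int = 1
) -> list[str]:
    """Group sentences into windows; neighbours share `sentence_overlap` sentences."""
    if sentences_per_chunk <= 0 or not 0 <= sentence_overlap < sentences_per_chunk:
        raise ValueError("need sentences_per_chunk > 0 and 0 <= overlap < sentences_per_chunk")
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = sentences_per_chunk - sentence_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks
```

A larger overlap repeats more context across neighbouring chunks at the cost of more chunks to embed, which is the trade-off to keep in mind when changing the defaults.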

Embedders

| Class | Provider | Models | Cost / 1M tokens |
| --- | --- | --- | --- |
| MistralEmbedder | Mistral | mistral-embed | $0.10 |
| OpenAIEmbedder | OpenAI | text-embedding-3-small, text-embedding-3-large | $0.02 / $0.13 |
| VoyageAIEmbedder | Voyage AI | voyage-3, voyage-3-large, voyage-3-lite | $0.06 / $0.18 / $0.02 |
| CohereEmbedder | Cohere | embed-v4.0, embed-english-v3.0, embed-multilingual-v3.0 | $0.12 / $0.10 / $0.10 |
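
The per-million-token prices above translate into a cost estimate with simple arithmetic: cost = tokens / 1,000,000 × price. The sketch below hard-codes the prices listed in this section, assuming each price maps to the model in the same position in its row. The dictionary and function names are hypothetical; the package's own cost tracking (TokenUsageTracker) may work differently.

```python
# USD per 1M tokens, copied from the Embedders table in this README.
# Provider pricing changes over time, so verify before relying on these.
PRICE_PER_MILLION_TOKENS = {
    "mistral-embed": 0.10,
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "voyage-3": 0.06,
    "voyage-3-large": 0.18,
    "voyage-3-lite": 0.02,
    "embed-v4.0": 0.12,
    "embed-english-v3.0": 0.10,
    "embed-multilingual-v3.0": 0.10,
}


def estimate_embedding_cost(model: str, total_tokens: int) -> float:
    """Return the estimated embedding cost in USD for `total_tokens` tokens."""
    if model not in PRICE_PER_MILLION_TOKENS:
        raise KeyError(f"no price listed for model {model!r}")
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]
```

For example, embedding 2.5 million tokens with mistral-embed comes to roughly $0.25, while the same volume with text-embedding-3-small comes to roughly $0.05.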

Examples & Notebooks

Example scripts (examples/)

| File | What it shows |
| --- | --- |
| 01_basic_pipeline.py | End-to-end DocumentPipeline.process() with MistralOCR + FixedSizeChunker + MistralEmbedder |
| 02_choosing_components.py | Same doc with two combos (Mistral-only vs mixed providers); compares chunks, dims, cost |
| 03_step_by_step.py | Manual run_ocr() → run_refiners() → run_chunker() → run_embedder() with intermediate inspection |
| 04_cost_tracking.py | TokenUsageTracker standalone: per-model breakdown and total cost |

.venv/bin/python examples/01_basic_pipeline.py

Notebooks (notebooks/)

| File | What it shows |
| --- | --- |
| getting_started.ipynb | Full pipeline walkthrough, one cell per stage |
| component_comparison.ipynb | Compares FixedSize vs Sentence vs RecursiveMarkdown chunking strategies |
| component_explorer.ipynb | Exercises every component with all valid configurations |

Each notebook has a setup cell where you set PDF_PATH and ENV_PATH to point to your own document and API keys.

Package layout

ragbandit-core/
├── examples/          # Runnable example scripts
├── notebooks/         # Jupyter notebooks
├── src/ragbandit/
│   ├── config/        # Pricing and model configuration
│   ├── documents/     # Document ingestion, OCR, chunking, embedding
│   │   ├── chunkers/
│   │   ├── embedders/
│   │   ├── ocr/
│   │   └── refiners/
│   ├── prompt_tools/  # LLM-based tools
│   └── utils/         # Token tracking, logging, client managers
└── tests/

License

MIT
