Core utilities for document processing, RAG configuration, querying, and evaluation.
ragbandit-core
Core utilities for:
- Document ingestion & processing (OCR, chunking, embedding)
- Building and running Retrieval-Augmented Generation (RAG) pipelines
- Evaluating answers with automated metrics
Test Coverage
The codebase maintains 87% test coverage with comprehensive integration tests covering all major components. See tests/README.md for details on running tests and coverage reports.
Quick start
pip install ragbandit-core
import logging
import os

from dotenv import load_dotenv  # provided by the python-dotenv package
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    FootnoteRefiner,
    MistralOCR,
    MistralEmbedder,
    SemanticChunker,
)

# Load API keys from a local .env file into the environment
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
    ocr_processor=MistralOCR(api_key=MISTRAL_API_KEY),
    refiners=[
        ReferencesRefiner(api_key=MISTRAL_API_KEY),
        FootnoteRefiner(api_key=MISTRAL_API_KEY),
    ],
)

extended_response = doc_pipeline.process(file_path)
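`os.getenv` returns `None` when a variable is unset, so a missing key in the Quick start may only surface later, when a component first tries to use it. A minimal sketch of failing early instead; `require_env` is a hypothetical helper for illustration and is not part of ragbandit-core:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of ragbandit-core.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to your .env file or export it in your shell."
        )
    return value


# Example usage (after load_dotenv()):
# MISTRAL_API_KEY = require_env("MISTRAL_API_KEY")
```

Calling it once per key before constructing `DocumentPipeline` keeps configuration errors separate from API errors.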
Using Alternative OCR and Embedding Providers
The package supports multiple OCR and embedding providers:
import os

from ragbandit.documents import (
    DocumentPipeline,
    DatalabOCR,
    OpenAIEmbedder,
    FixedSizeChunker,
)

DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Using Datalab OCR and OpenAI embeddings
doc_pipeline = DocumentPipeline(
    ocr_processor=DatalabOCR(
        api_key=DATALAB_API_KEY,
        model="marker",
        mode="balanced",  # Options: fast, balanced, accurate
    ),
    chunker=FixedSizeChunker(chunk_size=500, overlap=100),
    embedder=OpenAIEmbedder(
        api_key=OPENAI_API_KEY,
        model="text-embedding-3-small",  # or text-embedding-3-large
    ),
)

result = doc_pipeline.process(file_path)
Running Steps Manually
For more control, you can run each pipeline step independently:
import os

from dotenv import load_dotenv
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    MistralOCR,
    MistralEmbedder,
    SemanticChunker,
)

load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCR(api_key=MISTRAL_API_KEY),
    refiners=[ReferencesRefiner(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run refiners (optional)
refining_results = pipeline.run_refiners(ocr_result)
final_doc = refining_results[-1]  # Get the last refiner's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)
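Step 2 indexes `refining_results[-1]`, which raises `IndexError` on an empty list. If a pipeline is built without refiners and `run_refiners()` returns an empty list in that case (an assumption; this README does not say what it returns), a small guard lets the later steps fall back to the OCR result. `last_or_default` is a hypothetical helper, and the strings below stand in for real result objects:

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def last_or_default(results: Sequence[T], default: T) -> T:
    """Return the last element of `results`, or `default` if it is empty."""
    return results[-1] if results else default


# Stand-in values; in the real flow these would be the OCR result and the
# list returned by pipeline.run_refiners(ocr_result).
ocr_result = "ocr-output"
print(last_or_default(["refined-1", "refined-2"], ocr_result))  # refined-2
print(last_or_default([], ocr_result))                          # ocr-output
```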
You can also use components independently without a pipeline:
import os

from ragbandit.documents import (
    DatalabOCR,
    FootnoteRefiner,
    MistralEmbedder,
    MistralOCR,
    OpenAIEmbedder,
    SemanticChunker,
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Run OCR directly - Mistral
ocr = MistralOCR(api_key=MISTRAL_API_KEY)
ocr_result = ocr.process(file_path)

# Or use Datalab OCR
datalab_ocr = DatalabOCR(
    api_key=DATALAB_API_KEY,
    mode="accurate",
    max_pages=10,  # Optional: limit pages processed
)
ocr_result = datalab_ocr.process(file_path)

# Run refiners directly
refiner = FootnoteRefiner(api_key=MISTRAL_API_KEY)
refined_result = refiner.process(ocr_result)

# Run chunker directly
chunker = SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500)
chunk_result = chunker.chunk(refined_result)

# Run embedder directly - Mistral
embedder = MistralEmbedder(api_key=MISTRAL_API_KEY)
embedding_result = embedder.embed_chunks(chunk_result)

# Or use OpenAI embeddings
openai_embedder = OpenAIEmbedder(
    api_key=OPENAI_API_KEY,
    model="text-embedding-3-large",  # Higher quality, larger dimensions
)
embedding_result = openai_embedder.embed_chunks(chunk_result)
Available Components
OCR
| Class | Provider | Models | Key params |
|---|---|---|---|
| `MistralOCR` | Mistral | mistral-ocr-2512 (default), mistral-ocr-2505 | api_key, model |
| `DatalabOCR` | Datalab | marker | api_key, mode (fast / balanced / accurate), max_pages, page_range |
Refiners
| Class | What it does |
|---|---|
| `ReferencesRefiner` | Detects and extracts the references/bibliography section. Stores in `extracted_data["references_markdown"]`. |
| `FootnoteRefiner` | Detects footnotes, inlines explanations, and collects citations. |
| `TableOfContentsRefiner` | Detects and removes the table of contents. Stores in `extracted_data["toc_markdown"]`. |
Chunkers
| Class | Params (defaults) | When to use |
|---|---|---|
| `FixedSizeChunker` | chunk_size=1000, overlap=200 | Fast, deterministic splitting by character count |
| `SentenceChunker` | sentences_per_chunk=5, sentence_overlap=1, min_chunk_size=100 | Sentence-aware sliding window, no external deps |
| `RecursiveMarkdownChunker` | chunk_size=1000, overlap=100 | Heading-aware hierarchical splitting (H1→H2→H3→H4→paragraph→sentence) |
| `SemanticChunker` | api_key, min_chunk_size=500 | LLM-based semantic boundary detection (uses Mistral) |
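To make the two dependency-free strategies in the table concrete, here is a minimal sketch of character-window and sentence-window splitting. It is not the library's implementation: the function names are hypothetical, the sentence splitter is a simple regex, and `min_chunk_size` handling is omitted. Only the default values are taken from the table.

```python
import re
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of `chunk_size` characters overlapping by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def sentence_chunks(text: str, sentences_per_chunk: int = 5, sentence_overlap: int = 1) -> List[str]:
    """Group sentences into sliding windows that share `sentence_overlap` sentences."""
    if not 0 <= sentence_overlap < sentences_per_chunk:
        raise ValueError("sentence_overlap must satisfy 0 <= overlap < sentences_per_chunk")
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = sentences_per_chunk - sentence_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
print(sentence_chunks("A. B. C. D. E.", sentences_per_chunk=3, sentence_overlap=1))
# ['A. B. C.', 'C. D. E.']
```

The overlap repeats the tail of one chunk at the head of the next, so text that straddles a boundary still appears intact in at least one chunk, at the cost of embedding some text twice.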
Embedders
| Class | Provider | Models | Cost / 1M tokens |
|---|---|---|---|
| `MistralEmbedder` | Mistral | mistral-embed | $0.10 |
| `OpenAIEmbedder` | OpenAI | text-embedding-3-small, text-embedding-3-large | $0.02 / $0.13 |
| `VoyageAIEmbedder` | Voyage AI | voyage-3, voyage-3-large, voyage-3-lite | $0.06 / $0.18 / $0.02 |
| `CohereEmbedder` | Cohere | embed-v4.0, embed-english-v3.0, embed-multilingual-v3.0 | $0.12 / $0.10 / $0.10 |
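The per-million-token prices translate into a per-document estimate with simple arithmetic. The sketch below copies the prices from the Embedders table, assuming the slash-separated prices follow the order in which the models are listed; `estimate_embedding_cost` is a hypothetical helper, not part of ragbandit-core, and actual provider pricing may change.

```python
# USD per 1M tokens, copied from the Embedders table (assumed to be listed
# in the same order as the models in each row).
PRICE_PER_MILLION_TOKENS = {
    "mistral-embed": 0.10,
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "voyage-3": 0.06,
    "voyage-3-large": 0.18,
    "voyage-3-lite": 0.02,
    "embed-v4.0": 0.12,
    "embed-english-v3.0": 0.10,
    "embed-multilingual-v3.0": 0.10,
}


def estimate_embedding_cost(model: str, total_tokens: int) -> float:
    """Estimate the embedding cost in USD for `total_tokens` tokens."""
    if model not in PRICE_PER_MILLION_TOKENS:
        raise KeyError(f"No price listed for model {model!r}")
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# A 250,000-token corpus embedded with text-embedding-3-small:
print(f"${estimate_embedding_cost('text-embedding-3-small', 250_000):.4f}")  # $0.0050
```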
Examples & Notebooks
Example scripts (examples/)
| File | What it shows |
|---|---|
| `01_basic_pipeline.py` | End-to-end DocumentPipeline.process() with MistralOCR + FixedSizeChunker + MistralEmbedder |
| `02_choosing_components.py` | Same doc with two combos (Mistral-only vs mixed providers); compares chunks, dims, cost |
| `03_step_by_step.py` | Manual run_ocr() → run_refiners() → run_chunker() → run_embedder() with intermediate inspection |
| `04_cost_tracking.py` | TokenUsageTracker standalone: per-model breakdown and total cost |
.venv/bin/python examples/01_basic_pipeline.py
Notebooks (notebooks/)
| File | What it shows |
|---|---|
| `getting_started.ipynb` | Full pipeline walkthrough, one cell per stage |
| `component_comparison.ipynb` | Compares FixedSize vs Sentence vs RecursiveMarkdown chunking strategies |
| `component_explorer.ipynb` | Exercises every component with all valid configurations |
Each notebook has a setup cell where you set PDF_PATH and ENV_PATH to point to your own document and API keys.
Package layout
ragbandit-core/
├── examples/ # Runnable example scripts
├── notebooks/ # Jupyter notebooks
├── src/ragbandit/
│ ├── config/ # Pricing and model configuration
│ ├── documents/ # Document ingestion, OCR, chunking, embedding
│ │ ├── chunkers/
│ │ ├── embedders/
│ │ ├── ocr/
│ │ └── refiners/
│ ├── prompt_tools/ # LLM-based tools
│ └── utils/ # Token tracking, logging, client managers
└── tests/
License
MIT