Core utilities for document processing, RAG configuration, querying, and evaluation.
ragbandit-core
Core utilities for:
- Document ingestion & processing (OCR, chunking, embedding)
- Building and running Retrieval-Augmented Generation (RAG) pipelines
- Evaluating answers with automated metrics
Test Coverage
The codebase maintains 87% test coverage with comprehensive integration tests covering all major components. See tests/README.md for details on running tests and coverage reports.
Quick start
pip install ragbandit-core
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    FootnoteRefiner,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
import logging
from dotenv import load_dotenv

load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    refiners=[
        ReferencesRefiner(api_key=MISTRAL_API_KEY),
        FootnoteRefiner(api_key=MISTRAL_API_KEY),
    ],
)

extended_response = doc_pipeline.process(file_path)
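`os.getenv` returns `None` when a variable is unset, so a missing `MISTRAL_API_KEY` may only surface later, when the first API call is made. A minimal sketch of failing fast instead; `require_env` is a hypothetical helper for illustration and is not part of ragbandit-core:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of ragbandit-core.
    """
    value = os.getenv(name)
    if not value:
        # Treat both "unset" and "empty string" as missing.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Under that assumption, `MISTRAL_API_KEY = require_env("MISTRAL_API_KEY")` could stand in for the plain `os.getenv` call in the quick start.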
Using Alternative OCR and Embedding Providers
The package supports multiple OCR and embedding providers:
from ragbandit.documents import (
    DocumentPipeline,
    DatalabOCR,
    OpenAIEmbedder,
    FixedSizeChunker
)
import os

DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Using Datalab OCR and OpenAI embeddings
doc_pipeline = DocumentPipeline(
    ocr_processor=DatalabOCR(
        api_key=DATALAB_API_KEY,
        model="marker",
        mode="balanced"  # Options: fast, balanced, accurate
    ),
    chunker=FixedSizeChunker(chunk_size=500, overlap=100),
    embedder=OpenAIEmbedder(
        api_key=OPENAI_API_KEY,
        model="text-embedding-3-small"  # or text-embedding-3-large
    ),
)

result = doc_pipeline.process(file_path)
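To make the `chunk_size=500, overlap=100` settings concrete, here is a rough sketch of what fixed-size chunking with overlap generally means. It is an illustration only: `fixed_size_chunks` is a hypothetical function, and measuring sizes in characters is an assumption, not a description of how `FixedSizeChunker` is implemented.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into windows of `chunk_size` characters.

    Each window shares its first `overlap` characters with the end of the
    previous one. Illustration only; units and boundary handling are
    assumptions, not ragbandit-core's FixedSizeChunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With these defaults the window advances 400 characters at a time, so a 1200-character text yields three chunks, and consecutive chunks share 100 characters.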
Running Steps Manually
For more control, you can run each pipeline step independently:
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
from dotenv import load_dotenv

load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    refiners=[ReferencesRefiner(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run refiners (optional)
refining_results = pipeline.run_refiners(ocr_result)
final_doc = refining_results[-1]  # Get the last refiner's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)
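The `refining_results[-1]` lookup in Step 2 suggests that `run_refiners` returns one result per refiner, in order. The chaining pattern can be sketched with plain functions standing in for refiners. Everything here (`run_chain` and the two toy refiners) is hypothetical and does not reflect ragbandit-core internals:

```python
from typing import Callable

Refiner = Callable[[str], str]


def run_chain(document: str, refiners: list[Refiner]) -> list[str]:
    """Apply refiners in order, feeding each one the previous output.

    Returns every intermediate result; the last element is the fully
    refined document.
    """
    results = []
    current = document
    for refine in refiners:
        current = refine(current)
        results.append(current)
    return results


def strip_footnote_markers(text: str) -> str:
    # Toy stand-in for a footnote refiner.
    return text.replace("[1]", "")


def drop_references(text: str) -> str:
    # Toy stand-in for a references refiner: keep text before the heading.
    return text.split("References")[0].rstrip()


results = run_chain(
    "Body text[1].\nReferences\n[1] Some source",
    [strip_footnote_markers, drop_references],
)
final_doc = results[-1]  # "Body text."
```

In this sketch an empty refiner list produces an empty result list, so the `[-1]` lookup would raise `IndexError`. Whether the same holds for `run_refiners` is not stated here, so a guard may be worth adding when refiners are optional.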
You can also use components independently without a pipeline:
# Continues from the snippet above (os, MISTRAL_API_KEY, file_path and the
# Mistral classes are already imported or defined there)
DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Run OCR directly - Mistral
ocr = MistralOCRDocument(api_key=MISTRAL_API_KEY)
ocr_result = ocr.process(file_path)

# Or use Datalab OCR
from ragbandit.documents import DatalabOCR
datalab_ocr = DatalabOCR(
    api_key=DATALAB_API_KEY,
    mode="accurate",
    max_pages=10  # Optional: limit pages processed
)
ocr_result = datalab_ocr.process(file_path)

# Run refiners directly
from ragbandit.documents import FootnoteRefiner
refiner = FootnoteRefiner(api_key=MISTRAL_API_KEY)
refined_result = refiner.process(ocr_result)

# Run chunker directly
chunker = SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500)
chunk_result = chunker.chunk(refined_result)

# Run embedder directly - Mistral
embedder = MistralEmbedder(api_key=MISTRAL_API_KEY)
embedding_result = embedder.embed_chunks(chunk_result)

# Or use OpenAI embeddings
from ragbandit.documents import OpenAIEmbedder
openai_embedder = OpenAIEmbedder(
    api_key=OPENAI_API_KEY,
    model="text-embedding-3-large"  # Higher quality, larger dimensions
)
embedding_result = openai_embedder.embed_chunks(chunk_result)
Package layout
ragbandit-core/
├── src/ragbandit/
│ ├── documents/ # document ingestion, OCR, chunking,
└── tests/
License
MIT