Core utilities for document processing, RAG configuration, querying, and evaluation.
ragbandit-core
Core utilities for:
- Document ingestion & processing (OCR, chunking, embedding)
- Building and running Retrieval-Augmented Generation (RAG) pipelines
- Evaluating answers with automated metrics
Quick start
pip install ragbandit-core
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesProcessor,
    FootnoteProcessor,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
import logging
from dotenv import load_dotenv

load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY),
    embedder=MistralEmbedder(model="mistral-embed", api_key=MISTRAL_API_KEY),  # noqa
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    processors=[
        ReferencesProcessor(api_key=MISTRAL_API_KEY),
        FootnoteProcessor(api_key=MISTRAL_API_KEY),
    ],
)

extended_response = doc_pipeline.process(file_path)
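The quick start reads `MISTRAL_API_KEY` with `os.getenv`, which returns `None` when the variable is unset, so a missing key would only surface once a component tries to use it. A minimal sketch of failing fast instead; `require_env` is a hypothetical helper written for this example, not part of ragbandit-core:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "add it to your .env file or export it in the shell."
        )
    return value
```

Called after `load_dotenv()`, for example as `MISTRAL_API_KEY = require_env("MISTRAL_API_KEY")`, this raises a clear error before any pipeline component is constructed.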
Running steps manually
For more control, you can run each pipeline step independently:
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesProcessor,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
from dotenv import load_dotenv

load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    processors=[ReferencesProcessor(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY),
    embedder=MistralEmbedder(model="mistral-embed", api_key=MISTRAL_API_KEY),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run processors (optional)
processing_results = pipeline.run_processors(ocr_result)
final_doc = processing_results[-1]  # Get the last processor's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)
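To see why step 2 takes `processing_results[-1]`, here is a toy model of the data flow. It does not use ragbandit at all: documents are plain strings, the two processors are made-up stand-ins, and the assumption that each processor receives the previous processor's output is ours, not something the package documentation confirms.

```python
from typing import Callable, List

# Stand-in type: a "document" is just a string in this sketch.
Processor = Callable[[str], str]


def run_processors(doc: str, processors: List[Processor]) -> List[str]:
    """Apply each processor to the previous output and keep every intermediate result."""
    results = []
    current = doc
    for processor in processors:
        current = processor(current)
        results.append(current)
    return results


def strip_references(doc: str) -> str:
    # Hypothetical processor: drop everything from the "References" heading on.
    return doc.split("References")[0].rstrip()


def strip_footnotes(doc: str) -> str:
    # Hypothetical processor: drop lines that look like footnote definitions.
    return "\n".join(line for line in doc.splitlines() if not line.startswith("[^"))


ocr_text = "Body text\n[^1] a footnote\nReferences\n1. Some paper"
results = run_processors(ocr_text, [strip_references, strip_footnotes])

# With no processors the list is empty, so fall back to the OCR output.
final_doc = results[-1] if results else ocr_text
print(final_doc)
```

In this toy version the list holds one entry per processor, and the last entry reflects all of them applied in order. The fallback on the final line guards against an empty processor list, where indexing with `[-1]` would raise an `IndexError`.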
You can also create separate pipelines for different steps:
# OCR-only pipeline
ocr_pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY)
)
ocr_result = ocr_pipeline.run_ocr(file_path)

# Later, chunk with a different pipeline
chunk_pipeline = DocumentPipeline(
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY)
)
chunks = chunk_pipeline.run_chunker(ocr_result)
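If "later" means a separate run, the OCR result has to be stored somewhere in between. One possible approach is a small on-disk cache, sketched below. `cached_step` is a hypothetical helper, not a ragbandit-core function, and it assumes the result objects can be pickled, which this README does not confirm.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def cached_step(cache_file: Path, compute: Callable[[], Any]) -> Any:
    """Load a result from cache_file if it exists; otherwise compute and store it."""
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    # Create the cache directory on first use.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

Usage might look like `ocr_result = cached_step(Path("./data/cache/ocr.pkl"), lambda: ocr_pipeline.run_ocr(file_path))`, where the cache path is an arbitrary example. Only unpickle files you wrote yourself, since loading untrusted pickle data can execute arbitrary code.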
Package layout
ragbandit-core/
├── src/ragbandit/
│   ├── documents/  # document ingestion, OCR, chunking,
└── tests/
License
MIT