Core utilities for document processing, RAG configuration, querying, and evaluation.

ragbandit-core

Core utilities for:

  • Document ingestion & processing (OCR, chunking, embedding)
  • Building and running Retrieval-Augmented Generation (RAG) pipelines
  • Evaluating answers with automated metrics

Quick start

Install from PyPI:

pip install ragbandit-core

Then build a pipeline and process a document:

from ragbandit.documents import (
    DocumentPipeline,
    ReferencesProcessor,
    FootnoteProcessor,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
import logging
from dotenv import load_dotenv
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY),
    embedder=MistralEmbedder(model="mistral-embed", api_key=MISTRAL_API_KEY),  # noqa
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    processors=[
        ReferencesProcessor(api_key=MISTRAL_API_KEY),
        FootnoteProcessor(api_key=MISTRAL_API_KEY),
    ],
)

extended_response = doc_pipeline.process(file_path)
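
`os.getenv` returns `None` when the variable is missing, so the quick start above would hand `api_key=None` to every component. How the components react to that is not described here. A minimal sketch of failing early instead, using only the standard library; `require_env` is a hypothetical helper and not part of ragbandit-core:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is unset or blank.
    (Hypothetical helper, not part of ragbandit-core.)
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file"
        )
    return value


# Usage: replace the bare os.getenv(...) lookup with the checked one.
# MISTRAL_API_KEY = require_env("MISTRAL_API_KEY")
```

Call it after `load_dotenv()` so that values from a `.env` file are already in the environment.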

Running Steps Manually

For more control, you can run each pipeline step independently:

from ragbandit.documents import (
    DocumentPipeline,
    ReferencesProcessor,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
from dotenv import load_dotenv
load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    processors=[ReferencesProcessor(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY),
    embedder=MistralEmbedder(model="mistral-embed", api_key=MISTRAL_API_KEY),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run processors (optional)
processing_results = pipeline.run_processors(ocr_result)
final_doc = processing_results[-1]  # Get the last processor's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)
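
Step 2 is marked optional, but `processing_results[-1]` raises `IndexError` on an empty list. The sketch below falls back to the OCR result in that case. It assumes `run_processors` returns an empty sequence when no processors are configured, which is not confirmed here; `latest_output` is a hypothetical helper, not a ragbandit-core function:

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def latest_output(ocr_result: T, processing_results: Sequence[T]) -> T:
    """Return the last processor output, or the OCR result if none ran.

    Hypothetical helper; assumes an empty sequence means "no processors".
    """
    return processing_results[-1] if processing_results else ocr_result


# Usage with the manual steps:
# final_doc = latest_output(ocr_result, processing_results)
# chunk_result = pipeline.run_chunker(final_doc)
```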

You can also create separate pipelines for different steps:

# OCR-only pipeline
ocr_pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY)
)
ocr_result = ocr_pipeline.run_ocr(file_path)

# Later, chunk with a different pipeline
chunk_pipeline = DocumentPipeline(
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY)
)
chunks = chunk_pipeline.run_chunker(ocr_result)
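
If the chunking pipeline runs later in a separate process, the OCR result has to be stored in between. The sketch below caches any step's output on disk with the standard-library `pickle` module. It assumes the result objects are picklable, which is not confirmed for ragbandit-core. `load_or_compute` and the cache path are hypothetical names chosen for illustration:

```python
import pickle
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_compute(cache_path: Path, compute: Callable[[], T]) -> T:
    """Load a pickled result from cache_path, or compute and store it.

    Hypothetical helper; assumes the result can be pickled.
    """
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    # Create the cache directory on first use.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


# Usage (hypothetical cache location):
# ocr_result = load_or_compute(
#     Path("./data/cache/ocr.pkl"),
#     lambda: ocr_pipeline.run_ocr(file_path),
# )
```

Only unpickle files you wrote yourself, since loading a pickle can execute arbitrary code.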

Package layout

ragbandit-core/
├── src/ragbandit/
│   ├── documents/   # document ingestion, OCR, chunking, 
└── tests/

License

MIT
