Core utilities for document processing, RAG configuration, querying, and evaluation.

ragbandit-core

Core utilities for:

  • Document ingestion & processing (OCR, chunking, embedding)
  • Building and running Retrieval-Augmented Generation (RAG) pipelines
  • Evaluating answers with automated metrics

Quick start

pip install ragbandit-core

from ragbandit.documents import (
    DocumentPipeline,
    ReferencesProcessor,
    FootnoteProcessor,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
import logging
from dotenv import load_dotenv
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY),
    embedder=MistralEmbedder(model="mistral-embed", api_key=MISTRAL_API_KEY),  # noqa
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    processors=[
        ReferencesProcessor(api_key=MISTRAL_API_KEY),
        FootnoteProcessor(api_key=MISTRAL_API_KEY),
    ],
)

extended_response = doc_pipeline.process(file_path)
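
Note that `os.getenv` returns `None` when a variable is unset, so a missing `MISTRAL_API_KEY` would only surface later, once a component tries to use it. The following is a minimal sketch of failing fast before the pipeline is built. `require_env` and `require_file` are hypothetical helpers written for this illustration; they are not part of ragbandit-core.

```python
import os
from pathlib import Path


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if unset/empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def require_file(path: str) -> Path:
    """Return the path as a Path object, or raise if no such file exists."""
    resolved = Path(path)
    if not resolved.is_file():
        raise FileNotFoundError(f"No such file: {resolved}")
    return resolved
```

With these in place, `MISTRAL_API_KEY = require_env("MISTRAL_API_KEY")` could stand in for the plain `os.getenv` call, and `require_file(file_path)` could be called before `doc_pipeline.process(file_path)`.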

Running steps manually

For more control, you can run each pipeline step independently:

from ragbandit.documents import (
    DocumentPipeline,
    ReferencesProcessor,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
from dotenv import load_dotenv
load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    processors=[ReferencesProcessor(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY),
    embedder=MistralEmbedder(model="mistral-embed", api_key=MISTRAL_API_KEY),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run processors (optional)
processing_results = pipeline.run_processors(ocr_result)
final_doc = processing_results[-1]  # Get the last processor's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)
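
Step 2 is marked optional, but `processing_results[-1]` raises `IndexError` on an empty list. Assuming `run_processors` returns a list (as the indexing above suggests) and that the chunker accepts the OCR result directly (as the OCR-only example in this README does), a small fallback could look like the sketch below. `last_or` is a hypothetical helper, not a ragbandit-core function.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def last_or(results: Sequence[T], fallback: T) -> T:
    """Return the last item of results, or fallback if results is empty."""
    return results[-1] if results else fallback
```

In the manual flow this would read `final_doc = last_or(processing_results, ocr_result)`, so chunking still works when no processors are configured.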

You can also create separate pipelines for different steps:

# OCR-only pipeline
ocr_pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY)
)
ocr_result = ocr_pipeline.run_ocr(file_path)

# Later, chunk with a different pipeline
chunk_pipeline = DocumentPipeline(
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY)
)
chunks = chunk_pipeline.run_chunker(ocr_result)
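
Splitting OCR and chunking across pipelines is most useful when the OCR result can be reused between runs. The README does not describe how results are serialized, so the sketch below simply assumes the result object can be pickled, which is unconfirmed. `load_or_compute` is a hypothetical helper, not part of ragbandit-core.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_path: str, compute: Callable[[], Any]) -> Any:
    """Load a pickled result from cache_path, or compute and store it."""
    path = Path(cache_path)
    if path.exists():
        # Cache hit: skip the expensive call entirely.
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    # Create the cache directory on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

A call such as `load_or_compute("./data/cache/ocr.pkl", lambda: ocr_pipeline.run_ocr(file_path))` would then run OCR only once per cache file. Pickle files should only be loaded if you wrote them yourself, since unpickling untrusted data can execute arbitrary code.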

Package layout

ragbandit-core/
├── src/ragbandit/
│   ├── documents/   # document ingestion, OCR, chunking, 
└── tests/

License

MIT

