Skip to main content

Core utilities for document processing, RAG configuration, querying, and evaluation.

Project description

ragbandit-core

Test Coverage

Core utilities for:

  • Document ingestion & processing (OCR, chunking, embedding)
  • Building and running Retrieval-Augmented Generation (RAG) pipelines
  • Evaluating answers with automated metrics

Test Coverage

The codebase maintains 87% test coverage with comprehensive integration tests covering all major components. See tests/README.md for details on running tests and coverage reports.

Quick start

pip install ragbandit-core
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    FootnoteRefiner,
    MistralOCR,
    MistralEmbedder,
    SemanticChunker
)
import os
import logging
from dotenv import load_dotenv
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
    ocr_processor=MistralOCR(api_key=MISTRAL_API_KEY),
    refiners=[
        ReferencesRefiner(api_key=MISTRAL_API_KEY),
        FootnoteRefiner(api_key=MISTRAL_API_KEY),
    ],
)

extended_response = doc_pipeline.process(file_path)

Using Alternative OCR and Embedding Providers

The package supports multiple OCR and embedding providers:

from ragbandit.documents import (
    DocumentPipeline,
    DatalabOCR,
    OpenAIEmbedder,
    FixedSizeChunker
)
import os

DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

file_path = "./data/raw/[document_name].pdf"

# Using Datalab OCR and OpenAI embeddings
doc_pipeline = DocumentPipeline(
    ocr_processor=DatalabOCR(
        api_key=DATALAB_API_KEY,
        model="marker",
        mode="balanced"  # Options: fast, balanced, accurate
    ),
    chunker=FixedSizeChunker(chunk_size=500, overlap=100),
    embedder=OpenAIEmbedder(
        api_key=OPENAI_API_KEY,
        model="text-embedding-3-small"  # or text-embedding-3-large
    ),
)

result = doc_pipeline.process(file_path)

Running Steps Manually

For more control, you can run each pipeline step independently:

from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    MistralOCR,
    MistralEmbedder,
    SemanticChunker
)
import os
from dotenv import load_dotenv
load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCR(api_key=MISTRAL_API_KEY),
    refiners=[ReferencesRefiner(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run refiners (optional)
refining_results = pipeline.run_refiners(ocr_result)
final_doc = refining_results[-1]  # Get the last refiner's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)

You can also use components independently without a pipeline:

# Run OCR directly - Mistral
ocr = MistralOCR(api_key=MISTRAL_API_KEY)
ocr_result = ocr.process(file_path)

# Or use Datalab OCR
from ragbandit.documents import DatalabOCR
datalab_ocr = DatalabOCR(
    api_key=DATALAB_API_KEY,
    mode="accurate",
    max_pages=10  # Optional: limit pages processed
)
ocr_result = datalab_ocr.process(file_path)

# Run refiners directly
refiner = FootnoteRefiner(api_key=MISTRAL_API_KEY)
refined_result = refiner.process(ocr_result)

# Run chunker directly
chunker = SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500)
chunk_result = chunker.chunk(refined_result)

# Run embedder directly - Mistral
embedder = MistralEmbedder(api_key=MISTRAL_API_KEY)
embedding_result = embedder.embed_chunks(chunk_result)

# Or use OpenAI embeddings
from ragbandit.documents import OpenAIEmbedder
openai_embedder = OpenAIEmbedder(
    api_key=OPENAI_API_KEY,
    model="text-embedding-3-large"  # Higher quality, larger dimensions
)
embedding_result = openai_embedder.embed_chunks(chunk_result)

Package layout

ragbandit-core/
├── src/ragbandit/
│   ├── documents/   # document ingestion, OCR, chunking, 
└── tests/

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ragbandit_core-0.2.4.tar.gz (41.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ragbandit_core-0.2.4-py3-none-any.whl (55.1 kB view details)

Uploaded Python 3

File details

Details for the file ragbandit_core-0.2.4.tar.gz.

File metadata

  • Download URL: ragbandit_core-0.2.4.tar.gz
  • Upload date:
  • Size: 41.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for ragbandit_core-0.2.4.tar.gz
Algorithm Hash digest
SHA256 e994b4bc3ed95d910fd8338ff242b3d564b3d004d7f23436d8cbe2b0557a0d2e
MD5 e5ef5f62d9f80cded92041b1dba0ed93
BLAKE2b-256 27cd7e980fa9285fc56487763c82e24ac94e96e5750902da7b1d01b0eb5cb0c4

See more details on using hashes here.

File details

Details for the file ragbandit_core-0.2.4-py3-none-any.whl.

File metadata

  • Download URL: ragbandit_core-0.2.4-py3-none-any.whl
  • Upload date:
  • Size: 55.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for ragbandit_core-0.2.4-py3-none-any.whl
Algorithm Hash digest
SHA256 0940d7eed7edbbae4cc3c2b15917aa6fee77dae5d79f0c17396fcae416d2b624
MD5 b2afc4e1d470f9ce87de53c42f636760
BLAKE2b-256 4b294237d9e1c1c772cb0e778b668806a714863847b1b9bc4f6439ab49c1d3fa

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page