Core utilities for document processing, RAG configuration, querying, and evaluation.

ragbandit-core

Core utilities for:

  • Document ingestion & processing (OCR, chunking, embedding)
  • Building and running Retrieval-Augmented Generation (RAG) pipelines
  • Evaluating answers with automated metrics

Test Coverage

The codebase maintains 87% test coverage, with integration tests covering all major components. See tests/README.md for details on running the tests and generating coverage reports.

Quick start

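Install from PyPI:

pip install ragbandit-core

Then build a pipeline and process a document. The example loads the API key from a .env file via python-dotenv:

from ragbandit.documents import (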
    DocumentPipeline,
    ReferencesRefiner,
    FootnoteRefiner,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
import logging
from dotenv import load_dotenv
load_dotenv()

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    refiners=[
        ReferencesRefiner(api_key=MISTRAL_API_KEY),
        FootnoteRefiner(api_key=MISTRAL_API_KEY),
    ],
)

extended_response = doc_pipeline.process(file_path)
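
`os.getenv` returns `None` when a variable is unset, so a missing key may only surface later, when a component first contacts its provider. A minimal sketch of failing early instead; `require_env` is a hypothetical helper written for this example and is not part of ragbandit-core:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper for illustration; not part of ragbandit-core.
    Raises RuntimeError if the variable is unset or empty.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

In the quick start, `MISTRAL_API_KEY = require_env("MISTRAL_API_KEY")` could then replace the plain `os.getenv` call.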

Using Alternative OCR and Embedding Providers

The package supports multiple OCR and embedding providers:

from ragbandit.documents import (
    DocumentPipeline,
    DatalabOCR,
    OpenAIEmbedder,
    FixedSizeChunker
)
import os

DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

file_path = "./data/raw/[document_name].pdf"

# Using Datalab OCR and OpenAI embeddings
doc_pipeline = DocumentPipeline(
    ocr_processor=DatalabOCR(
        api_key=DATALAB_API_KEY,
        model="marker",
        mode="balanced"  # Options: fast, balanced, accurate
    ),
    chunker=FixedSizeChunker(chunk_size=500, overlap=100),
    embedder=OpenAIEmbedder(
        api_key=OPENAI_API_KEY,
        model="text-embedding-3-small"  # or text-embedding-3-large
    ),
)

result = doc_pipeline.process(file_path)
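
To make the `chunk_size` and `overlap` parameters concrete, here is a standalone sketch of fixed-size chunking with overlap. It is an illustration of the general technique only: it counts characters, whereas `FixedSizeChunker` may count tokens and may treat boundaries differently. The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split `text` into windows of `chunk_size` characters, each sharing
    `overlap` characters with the previous window.

    Illustration only; not the FixedSizeChunker implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, a 1,200-character text yields three windows starting at offsets 0, 400, and 800, so consecutive chunks share 100 characters.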

Running Steps Manually

For more control, you can run each pipeline step independently:

from ragbandit.documents import (
    DocumentPipeline,
    ReferencesRefiner,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
from dotenv import load_dotenv
load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    refiners=[ReferencesRefiner(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500),
    embedder=MistralEmbedder(api_key=MISTRAL_API_KEY, model="mistral-embed"),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run refiners (optional)
refining_results = pipeline.run_refiners(ocr_result)
final_doc = refining_results[-1]  # Get the last refiner's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)
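
Because refiners are optional, `refining_results[-1]` would raise `IndexError` if `run_refiners` returned an empty list. The sketch below assumes that `run_refiners` returns a plain list with one entry per refiner (this README does not confirm that) and falls back to the OCR output when the list is empty. `final_document` is a hypothetical helper, not part of the package.

```python
from typing import Any, Sequence


def final_document(ocr_result: Any, refining_results: Sequence[Any]) -> Any:
    """Return the last refiner's output, or the OCR result if no refiner ran.

    Assumes `refining_results` is a list with one entry per refiner.
    """
    return refining_results[-1] if refining_results else ocr_result
```

Step 2 could then read `final_doc = final_document(ocr_result, refining_results)`.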

You can also use components independently without a pipeline:

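import os
from ragbandit.documents import (
    MistralOCRDocument,
    FootnoteRefiner,
    SemanticChunker,
    MistralEmbedder,
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
DATALAB_API_KEY = os.getenv("DATALAB_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Run OCR directly - Mistral
ocr = MistralOCRDocument(api_key=MISTRAL_API_KEY)
ocr_result = ocr.process(file_path)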

# Or use Datalab OCR
from ragbandit.documents import DatalabOCR
datalab_ocr = DatalabOCR(
    api_key=DATALAB_API_KEY,
    mode="accurate",
    max_pages=10  # Optional: limit pages processed
)
ocr_result = datalab_ocr.process(file_path)

# Run refiners directly
refiner = FootnoteRefiner(api_key=MISTRAL_API_KEY)
refined_result = refiner.process(ocr_result)

# Run chunker directly
chunker = SemanticChunker(api_key=MISTRAL_API_KEY, min_chunk_size=500)
chunk_result = chunker.chunk(refined_result)

# Run embedder directly - Mistral
embedder = MistralEmbedder(api_key=MISTRAL_API_KEY)
embedding_result = embedder.embed_chunks(chunk_result)

# Or use OpenAI embeddings
from ragbandit.documents import OpenAIEmbedder
openai_embedder = OpenAIEmbedder(
    api_key=OPENAI_API_KEY,
    model="text-embedding-3-large"  # Higher quality, larger dimensions
)
embedding_result = openai_embedder.embed_chunks(chunk_result)
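
The standalone calls above can be chained in a small helper. This is a sketch, not the package's own pipeline: it relies only on the method names shown in this README (`process`, `chunk`, `embed_chunks`), and it assumes that each refiner accepts the previous step's output. The `Protocol` classes and `run_components` are hypothetical names introduced for the example.

```python
from typing import Any, Iterable, Protocol


class Processor(Protocol):
    """Anything with a `process` method (OCR processors, refiners)."""
    def process(self, item: Any) -> Any: ...


class Chunker(Protocol):
    def chunk(self, document: Any) -> Any: ...


class Embedder(Protocol):
    def embed_chunks(self, chunks: Any) -> Any: ...


def run_components(
    file_path: str,
    ocr: Processor,
    refiners: Iterable[Processor],
    chunker: Chunker,
    embedder: Embedder,
) -> Any:
    """Chain standalone components: OCR -> refiners -> chunker -> embedder."""
    document = ocr.process(file_path)
    for refiner in refiners:
        # Assumption: each refiner accepts the previous step's output.
        document = refiner.process(document)
    chunks = chunker.chunk(document)
    return embedder.embed_chunks(chunks)
```

Passing an empty list for `refiners` skips refinement and sends the OCR output straight to the chunker.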

Package layout

ragbandit-core/
├── src/ragbandit/
│   ├── documents/   # document ingestion, OCR, chunking, 
└── tests/

License

MIT
