Kallia

Semantic Document Processing Library

Kallia is a FastAPI-based document processing service that converts documents into intelligent semantic chunks. The library specializes in extracting meaningful content segments from documents while preserving context and semantic relationships.

🚀 Features

  • Document-to-Markdown Conversion: Standardized processing pipeline for consistent output
  • Semantic Chunking: Intelligent content segmentation that respects document structure and meaning
  • PDF Support: Currently supports PDF documents with extensible architecture for additional formats
  • RESTful API: Clean, well-documented API interface with comprehensive error handling
  • Configurable Processing: Adjustable parameters for temperature, token limits, and page selection
  • Docker Ready: Containerized deployment with Docker and docker-compose support
  • Vision-Language Model Integration: Leverages advanced AI models for content understanding

📋 Prerequisites

  • Python 3.9 or higher (3.9, 3.10, 3.11, 3.12, 3.13 supported)
  • Docker (optional, for containerized deployment)
  • Access to a compatible language model API (OpenRouter, Ollama, etc.)

🛠️ Installation

PyPI Installation (Recommended)

Install Kallia directly from PyPI:

pip install kallia

After installation, you can run the application:

# Configure environment variables first
export KALLIA_PROVIDER_API_KEY=your_api_key_here
export KALLIA_PROVIDER_BASE_URL=https://openrouter.ai/api/v1
export KALLIA_PROVIDER_MODEL=qwen/qwen2.5-vl-32b-instruct

# Run the application
python -m kallia.main

Or create a .env file in your working directory:

KALLIA_PROVIDER_API_KEY=your_api_key_here
KALLIA_PROVIDER_BASE_URL=https://openrouter.ai/api/v1
KALLIA_PROVIDER_MODEL=qwen/qwen2.5-vl-32b-instruct
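Before starting the service, it can help to confirm that all three variables are set. The following is a minimal sketch, not part of Kallia: the helper name `missing_provider_vars` is hypothetical, and it only inspects variables already present in the process environment (it does not read a .env file itself).

```python
import os

# The three variable names come from this README; the helper is hypothetical.
REQUIRED_VARS = (
    "KALLIA_PROVIDER_API_KEY",
    "KALLIA_PROVIDER_BASE_URL",
    "KALLIA_PROVIDER_MODEL",
)


def missing_provider_vars(environ=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_provider_vars()` with no arguments checks the current environment; an empty list means the provider configuration is complete.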

Local Development Setup

  1. Clone the repository

    git clone https://github.com/kallia-project/kallia.git
    cd kallia
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Configure environment variables

    cp .env.example .env
    

    Edit .env with your configuration:

    KALLIA_PROVIDER_API_KEY=your_api_key_here
    KALLIA_PROVIDER_BASE_URL=https://openrouter.ai/api/v1
    KALLIA_PROVIDER_MODEL=qwen/qwen2.5-vl-32b-instruct
    
  4. Run the application

    fastapi run kallia/main.py --port 8000
    

Docker Deployment

  1. Using Docker Compose (Recommended)

    docker-compose up -d
    
  2. Manual Docker Build

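    docker build -t overheatsystem/kallia:0.1.1 .
    # Note: inside the container, localhost is the container itself, so use a
    # base URL at which the provider is reachable from the container.
    docker run -p 8000:80 \
      -e KALLIA_PROVIDER_API_KEY=ollama \
      -e KALLIA_PROVIDER_BASE_URL=http://localhost:11434/v1 \
      -e KALLIA_PROVIDER_MODEL=qwen2.5vl:32b \
      overheatsystem/kallia:0.1.1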
    

⚙️ Configuration

Environment Variables

Variable                    Description                                 Example
KALLIA_PROVIDER_API_KEY     API key for your language model provider    ollama
KALLIA_PROVIDER_BASE_URL    Base URL for the API endpoint               http://localhost:11434/v1
KALLIA_PROVIDER_MODEL       Model identifier to use                     qwen2.5vl:32b

Supported Providers

  • OpenRouter: Use OpenRouter API for access to various models
  • Ollama: Local model deployment with Ollama
  • Custom Endpoints: Any OpenAI-compatible API endpoint
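Switching providers amounts to changing the three environment variables. As an illustration, the sketch below renders .env contents for the two providers shown in this README. The values are copied from the examples above; the `PROVIDER_PRESETS` dictionary and `render_env_file` helper are hypothetical and not part of Kallia.

```python
# Values copied from the examples in this README; names here are illustrative.
PROVIDER_PRESETS = {
    "openrouter": {
        "KALLIA_PROVIDER_BASE_URL": "https://openrouter.ai/api/v1",
        "KALLIA_PROVIDER_MODEL": "qwen/qwen2.5-vl-32b-instruct",
    },
    "ollama": {
        "KALLIA_PROVIDER_API_KEY": "ollama",
        "KALLIA_PROVIDER_BASE_URL": "http://localhost:11434/v1",
        "KALLIA_PROVIDER_MODEL": "qwen2.5vl:32b",
    },
}

ENV_ORDER = (
    "KALLIA_PROVIDER_API_KEY",
    "KALLIA_PROVIDER_BASE_URL",
    "KALLIA_PROVIDER_MODEL",
)


def render_env_file(provider, api_key=None):
    """Render .env lines for a preset; api_key overrides the preset's key."""
    settings = dict(PROVIDER_PRESETS[provider])
    if api_key is not None:
        settings["KALLIA_PROVIDER_API_KEY"] = api_key
    # Fall back to the README's placeholder when no key is known.
    settings.setdefault("KALLIA_PROVIDER_API_KEY", "your_api_key_here")
    return "\n".join(f"{name}={settings[name]}" for name in ENV_ORDER)
```

For a custom OpenAI-compatible endpoint, the same three variables would be set by hand to that endpoint's key, base URL, and model name.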

📖 Usage

API Endpoint

POST /chunks

Converts a document into semantic chunks with concise summaries.

Request Body

{
  "url": "https://example.com/document.pdf",
  "page_number": 1,
  "temperature": 0.0,
  "max_tokens": 8192
}

Parameters

  • url (string, required): URL to the document to process
  • page_number (integer, optional): Specific page to process (default: 1)
  • temperature (float, optional): Model temperature for processing (default: 0.0)
  • max_tokens (integer, optional): Maximum tokens for processing (default: 8192)
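On the client side, the documented defaults can be captured in a small data class so that callers only supply what they want to change. This is a sketch under the assumption that the defaults listed above are accurate; `ChunkRequest` is a hypothetical client-side mirror, not Kallia's own Pydantic model.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChunkRequest:
    """Client-side mirror of the documented request body (hypothetical)."""

    url: str
    page_number: int = 1
    temperature: float = 0.0
    max_tokens: int = 8192

    def to_payload(self):
        """Return a JSON-serializable dict; only url is mandatory."""
        if not self.url:
            raise ValueError("url is required")
        return asdict(self)
```

For example, `ChunkRequest("https://example.com/document.pdf", page_number=3).to_payload()` yields a body that keeps the default temperature and token limit.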

Response

{
  "chunks": [
    {
      "original_text": "Original document content...",
      "concise_summary": "Concise summary of the content..."
    }
  ]
}
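A response of this shape can be turned into typed records before further processing. The sketch below assumes only the two fields shown above; the `Chunk` type and `parse_chunks` function are illustrative names, not part of the Kallia API.

```python
import json
from typing import List, NamedTuple


class Chunk(NamedTuple):
    """One semantic chunk as returned by POST /chunks."""

    original_text: str
    concise_summary: str


def parse_chunks(body: str) -> List[Chunk]:
    """Parse a JSON response body into a list of Chunk records."""
    data = json.loads(body)
    return [
        Chunk(item["original_text"], item["concise_summary"])
        for item in data["chunks"]
    ]
```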

Example Usage

cURL

curl -X POST "http://localhost:8000/chunks" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "page_number": 1,
    "temperature": 0.0,
    "max_tokens": 4096
  }'

Python

import requests

response = requests.post(
    "http://localhost:8000/chunks",
    json={
        "url": "https://example.com/document.pdf",
        "page_number": 1,
        "temperature": 0.0,
        "max_tokens": 4096
    }
)

chunks = response.json()["chunks"]
for chunk in chunks:
    print(f"Summary: {chunk['concise_summary']}")
    print(f"Original: {chunk['original_text']}")
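Because page_number selects a single page, one way to cover several pages is to issue one request per page. The following sketch uses only the standard library and assumes the service runs at the local address used in the examples above; `build_request` and `fetch_pages` are hypothetical helper names, and the 300-second timeout is an arbitrary choice.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/chunks"  # local address from the examples above


def build_request(document_url, page_number, api_url=API_URL):
    """Build a POST request for one page of a document."""
    payload = {
        "url": document_url,
        "page_number": page_number,
        "temperature": 0.0,
        "max_tokens": 4096,
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_pages(document_url, pages, timeout=300):
    """Call the endpoint once per page; return chunks keyed by page number."""
    results = {}
    for page in pages:
        request = build_request(document_url, page)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            results[page] = json.load(response)["chunks"]
    return results
```

With the service running, `fetch_pages("https://example.com/document.pdf", [1, 2, 3])` would return a dictionary mapping each page number to its list of chunks.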

🏗️ Project Structure

kallia/
├── kallia/
│   ├── __init__.py
│   ├── main.py              # FastAPI application entry point
│   ├── models.py            # Pydantic models for API
│   ├── constants.py         # Application constants
│   ├── documents.py         # Document processing logic
│   ├── chunker.py           # Semantic chunking implementation
│   ├── utils.py             # Utility functions
│   ├── logger.py            # Logging configuration
│   ├── settings.py          # Application settings
│   ├── exceptions.py        # Custom exceptions
│   ├── messages.py          # Message handling
│   ├── prompts.py           # AI model prompts
│   ├── image_caption_serializer.py
│   └── unordered_list_serializer.py
├── tests/                   # Test suite
│   ├── __init__.py
│   ├── test_pdf_to_markdown.py
│   └── test_markdown_to_chunks.py
├── assets/                  # Test assets
│   └── pdf/
│       └── 01.pdf           # Sample PDF for testing
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Project configuration
├── Dockerfile               # Docker container configuration
├── docker-compose.yml       # Docker Compose setup
├── .env.example             # Environment variables template
└── README.md                # This file

🔧 Development

Code Style

The project follows Python best practices and uses:

  • FastAPI for web framework
  • Pydantic for data validation
  • Structured logging
  • Comprehensive error handling

Testing

The project includes comprehensive tests for core functionality:

# Run tests
python -m pytest tests/

# Run specific tests
python -m pytest tests/test_pdf_to_markdown.py
python -m pytest tests/test_markdown_to_chunks.py

Test coverage includes:

  • PDF to markdown conversion
  • Markdown to semantic chunks processing
  • End-to-end document processing pipeline

📦 Dependencies

Core Dependencies

  • FastAPI: Modern, fast web framework for building APIs
  • Docling: Document processing and conversion library
  • RapidOCR: OCR capabilities for text extraction
  • OpenCV: Computer vision library for image processing

Full Dependency List

See requirements.txt for complete dependency specifications:

  • fastapi[standard]==0.115.14
  • docling==2.38.1
  • rapidocr-onnxruntime==1.4.4
  • opencv-python-headless==4.11.0.86

🚨 Error Handling

The API provides comprehensive error handling with appropriate HTTP status codes:

  • 400 Bad Request: Invalid parameters or unsupported file format
  • 500 Internal Server Error: Processing errors
  • 503 Service Unavailable: External service connectivity issues
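A client can use these status codes to decide whether a retry is worthwhile. The sketch below treats only 503 as retryable, which is one possible client policy and an assumption rather than documented Kallia behavior; `call_with_retry` and its parameters are hypothetical.

```python
import time

# Status meanings as listed above; the retry policy itself is an assumption.
STATUS_HINTS = {
    400: "Invalid parameters or unsupported file format",
    500: "Processing error",
    503: "External service connectivity issue",
}
RETRYABLE_STATUSES = {503}


def call_with_retry(send, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(1, attempts + 1):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == attempts:
            return status, body
        sleep(base_delay * attempt)  # linear backoff between attempts
```

Here `send` would wrap the actual HTTP call, and `STATUS_HINTS.get(status)` can supply a readable message when the final status is an error.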

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

👨‍💻 Author

CK

📈 Version

Current version: 0.1.1


Built with ❤️ for intelligent document processing
