Kallia

Semantic Document Processing Library

Kallia is a FastAPI-based document processing service that converts documents into semantic chunks, each paired with a concise summary. It extracts meaningful content segments while preserving their context and semantic relationships.

🚀 Features

  • Document-to-Markdown Conversion: Standardized processing pipeline for consistent output
  • Semantic Chunking: Intelligent content segmentation that respects document structure and meaning
  • PDF Support: Currently supports PDF documents, with an extensible architecture for additional formats
  • RESTful API: Clean, well-documented API interface with comprehensive error handling
  • Configurable Processing: Adjustable parameters for temperature, token limits, and page selection
  • Docker Ready: Containerized deployment with Docker and docker-compose support
  • Vision-Language Model Integration: Leverages advanced AI models for content understanding

📋 Prerequisites

  • Python 3.11 or higher
  • Docker (optional, for containerized deployment)
  • Access to a compatible language model API (OpenRouter, Ollama, etc.)

🛠️ Installation

Local Development Setup

  1. Clone the repository

    git clone https://github.com/kallia-project/kallia.git
    cd kallia
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Configure environment variables

    cp .env.example .env
    

    Edit .env with your configuration:

    API_KEY=your_api_key_here
    BASE_URL=https://openrouter.ai/api/v1
    MODEL=qwen/qwen2.5-vl-32b-instruct
    
  4. Run the application

    fastapi run kallia/main.py --port 8000
    

Docker Deployment

  1. Using Docker Compose (Recommended)

    docker-compose up -d
    
  2. Manual Docker Build

    docker build -t kallia-project/kallia:0.1.0 .
    docker run -p 8000:80 -e API_KEY=ollama -e BASE_URL=http://localhost:11434/v1 -e MODEL=qwen2.5vl:32b kallia-project/kallia:0.1.0
    

⚙️ Configuration

Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| API_KEY | API key for your language model provider | ollama |
| BASE_URL | Base URL for the API endpoint | http://localhost:11434/v1 |
| MODEL | Model identifier to use | qwen2.5vl:32b |
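
As an illustration of how these three variables might be read at startup, here is a minimal sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical names, not taken from Kallia's `settings.py`, and the fail-fast behavior is an assumption about what a caller would want.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three documented variables."""
    api_key: str
    base_url: str
    model: str


def load_settings(env=None) -> Settings:
    """Read API_KEY, BASE_URL and MODEL, failing early if any is missing."""
    env = os.environ if env is None else env
    names = ("API_KEY", "BASE_URL", "MODEL")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Strip a trailing slash so later URL joins stay predictable.
    return Settings(env["API_KEY"], env["BASE_URL"].rstrip("/"), env["MODEL"])
```

Passing a plain dictionary instead of relying on `os.environ` makes the loader easy to exercise in tests.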

Supported Providers

  • OpenRouter: Use OpenRouter API for access to various models
  • Ollama: Local model deployment with Ollama
  • Custom Endpoints: Any OpenAI-compatible API endpoint
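
Switching providers comes down to changing the three environment variables. The sketch below collects the example values shown in this README into presets and renders them as `.env` lines. `PRESETS`, `env_for`, and `to_dotenv` are hypothetical helper names and are not part of Kallia.

```python
# Example values copied from this README; all names here are hypothetical helpers.
PRESETS = {
    "openrouter": {
        "BASE_URL": "https://openrouter.ai/api/v1",
        "MODEL": "qwen/qwen2.5-vl-32b-instruct",
    },
    "ollama": {
        "BASE_URL": "http://localhost:11434/v1",
        "MODEL": "qwen2.5vl:32b",
    },
}


def env_for(provider: str, api_key: str, **overrides: str) -> dict:
    """Return the three environment variables for a provider preset."""
    if provider not in PRESETS:
        raise KeyError(f"Unknown provider: {provider!r}")
    env = {"API_KEY": api_key, **PRESETS[provider]}
    # Overrides cover custom OpenAI-compatible endpoints or other models.
    env.update(overrides)
    return env


def to_dotenv(env: dict) -> str:
    """Render the mapping as KEY=value lines suitable for a .env file."""
    return "\n".join(f"{key}={value}" for key, value in env.items())
```

For example, `print(to_dotenv(env_for("ollama", "ollama")))` produces the three lines expected in `.env` for a local Ollama setup.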

📖 Usage

API Endpoint

POST /chunks

Converts a document into semantic chunks with concise summaries.

Request Body

{
  "url": "https://example.com/document.pdf",
  "page_number": 1,
  "temperature": 0.0,
  "max_tokens": 8192
}

Parameters

  • url (string, required): URL to the document to process
  • page_number (integer, optional): Specific page to process (default: 1)
  • temperature (float, optional): Model temperature for processing (default: 0.0)
  • max_tokens (integer, optional): Maximum tokens for processing (default: 8192)
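
The parameters and defaults listed above can be captured client-side in a small helper. This is a sketch only: `build_chunks_request` is a hypothetical name, and the validation rules (http(s) URLs, 1-based page numbers, non-negative temperature, positive token limit) are assumptions, since the server's actual checks are not documented here.

```python
def build_chunks_request(url, page_number=1, temperature=0.0, max_tokens=8192):
    """Build a JSON-ready body for POST /chunks using the documented defaults."""
    # Assumption: documents are fetched over http(s).
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        raise ValueError("url must be an http(s) URL string")
    # Assumption: pages are numbered from 1, matching the documented default.
    if not isinstance(page_number, int) or page_number < 1:
        raise ValueError("page_number must be an integer >= 1")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not isinstance(max_tokens, int) or max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    return {
        "url": url,
        "page_number": page_number,
        "temperature": float(temperature),
        "max_tokens": max_tokens,
    }
```

Rejecting obviously invalid values before sending saves a round trip to the service.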

Response

{
  "chunks": [
    {
      "original_text": "Original document content...",
      "concise_summary": "Concise summary of the content..."
    }
  ]
}
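
On the client side, the response shape above maps naturally onto a small typed structure. The following sketch assumes only the two documented fields. `Chunk` and `parse_chunks` are hypothetical names, and the choice of a standard-library dataclass here says nothing about the Pydantic models Kallia uses internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One semantic chunk as returned by POST /chunks."""
    original_text: str
    concise_summary: str


def parse_chunks(payload: dict) -> list:
    """Convert a decoded JSON response into a list of Chunk objects."""
    chunks = payload.get("chunks")
    if not isinstance(chunks, list):
        raise ValueError("response has no 'chunks' list")
    # A missing field raises KeyError, which surfaces malformed items early.
    return [Chunk(item["original_text"], item["concise_summary"]) for item in chunks]
```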

Example Usage

cURL

curl -X POST "http://localhost:8000/chunks" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "page_number": 1,
    "temperature": 0.0,
    "max_tokens": 4096
  }'

Python

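import requests

# Ask the local Kallia service to chunk page 1 of the document.
response = requests.post(
    "http://localhost:8000/chunks",
    json={
        "url": "https://example.com/document.pdf",
        "page_number": 1,
        "temperature": 0.0,
        "max_tokens": 4096,
    },
)
# Fail loudly on 4xx/5xx responses instead of parsing an error body.
response.raise_for_status()

# Each chunk pairs the original text with its concise summary.
chunks = response.json()["chunks"]
for chunk in chunks:
    print(f"Summary: {chunk['concise_summary']}")
    print(f"Original: {chunk['original_text']}")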
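
Because `page_number` selects a single page, covering several pages of one document takes one request per page. The sketch below shows one way to loop over pages using only the standard library. `post_json` and `chunk_pages` are hypothetical helpers, and the 300-second timeout is an assumption about how long model-backed processing might take.

```python
import json
import urllib.request


def post_json(endpoint: str, body: dict, timeout: float = 300.0) -> dict:
    """POST a JSON body with the standard library and decode the JSON reply."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def chunk_pages(document_url, pages, endpoint="http://localhost:8000/chunks", send=post_json):
    """Request chunks for each page in turn and tag every chunk with its page."""
    results = []
    for page in pages:
        # temperature and max_tokens are omitted, so the documented defaults apply.
        reply = send(endpoint, {"url": document_url, "page_number": page})
        for chunk in reply["chunks"]:
            results.append({"page": page, **chunk})
    return results
```

A call such as `chunk_pages("https://example.com/document.pdf", range(1, 4))` would request pages 1 to 3 in order. The `send` parameter lets the HTTP call be replaced with a stub in tests.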

🏗️ Project Structure

kallia/
├── kallia/
│   ├── __init__.py
│   ├── main.py              # FastAPI application entry point
│   ├── models.py            # Pydantic models for API
│   ├── constants.py         # Application constants
│   ├── documents.py         # Document processing logic
│   ├── chunker.py           # Semantic chunking implementation
│   ├── utils.py             # Utility functions
│   ├── logger.py            # Logging configuration
│   ├── settings.py          # Application settings
│   ├── exceptions.py        # Custom exceptions
│   ├── messages.py          # Message handling
│   ├── prompts.py           # AI model prompts
│   ├── image_caption_serializer.py
│   └── unordered_list_serializer.py
├── requirements.txt         # Python dependencies
├── Dockerfile               # Docker container configuration
├── docker-compose.yml       # Docker Compose setup
├── .env.example             # Environment variables template
└── README.md                # This file

🔧 Development

Code Style

The project follows Python best practices and uses:

  • FastAPI as the web framework
  • Pydantic for data validation
  • Structured logging
  • Comprehensive error handling

📦 Dependencies

Core Dependencies

  • FastAPI: Modern, fast web framework for building APIs
  • Docling: Document processing and conversion library
  • RapidOCR: OCR capabilities for text extraction
  • OpenCV: Computer vision library for image processing

Full Dependency List

See requirements.txt for complete dependency specifications:

  • fastapi[standard]==0.115.14
  • docling==2.38.1
  • rapidocr-onnxruntime==1.4.4
  • opencv-python-headless==4.11.0.86

🚨 Error Handling

The API provides comprehensive error handling with appropriate HTTP status codes:

  • 400 Bad Request: Invalid parameters or unsupported file format
  • 500 Internal Server Error: Processing errors
  • 503 Service Unavailable: External service connectivity issues
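
A client can treat these status codes differently: a 400 signals a problem with the request itself, while a 503 may clear up on its own. The sketch below retries only on 503 with exponential backoff. `call_with_retry` is a hypothetical helper, `send` is assumed to be a zero-argument callable returning a `(status, body)` pair, and the retry policy is an assumption rather than documented Kallia behavior.

```python
import time

RETRYABLE = {503}  # assumption: only connectivity errors are worth retrying


def call_with_retry(send, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        if status == 400:
            # Retrying an invalid request cannot succeed, so surface it at once.
            raise ValueError(f"Request rejected: {body}")
        if status not in RETRYABLE or attempt == attempts - 1:
            return status, body
        # Exponential backoff: base_delay, then 2x, then 4x, and so on.
        sleep(base_delay * 2 ** attempt)
```

With the defaults, a service that keeps returning 503 is tried three times with waits of one and two seconds in between, after which the final 503 is returned to the caller.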

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

👨‍💻 Author

CK

📈 Version

Current version: 0.1.0


Built with ❤️ for intelligent document processing
