Kallia
Semantic Document Processing Library
Kallia is a FastAPI-based document processing service that converts documents into intelligent semantic chunks. The library specializes in extracting meaningful content segments from documents while preserving context and semantic relationships.
Features
- Document-to-Markdown Conversion: Standardized processing pipeline for consistent output
- Semantic Chunking: Intelligent content segmentation that respects document structure and meaning
- PDF Support: Currently supports PDF documents with extensible architecture for additional formats
- RESTful API: Clean, well-documented API interface with comprehensive error handling
- Configurable Processing: Adjustable parameters for temperature, token limits, and page selection
- Docker Ready: Containerized deployment with Docker and docker-compose support
- Vision-Language Model Integration: Leverages advanced AI models for content understanding
Prerequisites
- Python 3.9 or higher (3.9, 3.10, 3.11, 3.12, 3.13 supported)
- Docker (optional, for containerized deployment)
- Access to a compatible language model API (OpenRouter, Ollama, etc.)
Installation
PyPI Installation (Recommended)
Install Kallia directly from PyPI:
pip install kallia
After installation, you can run the application:
# Configure environment variables first
export KALLIA_PROVIDER_API_KEY=your_api_key_here
export KALLIA_PROVIDER_BASE_URL=https://openrouter.ai/api/v1
export KALLIA_PROVIDER_MODEL=qwen/qwen2.5-vl-32b-instruct
# Run the application
python -m kallia.main
Or create a .env file in your working directory:
KALLIA_PROVIDER_API_KEY=your_api_key_here
KALLIA_PROVIDER_BASE_URL=https://openrouter.ai/api/v1
KALLIA_PROVIDER_MODEL=qwen/qwen2.5-vl-32b-instruct
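Before starting the service, it can help to confirm that all three variables are set. The following is a minimal sketch and not part of Kallia itself: `REQUIRED_VARS` and `load_provider_config` are hypothetical names introduced for illustration, and the check only reads the process environment (it does not parse a .env file).

```python
import os

# The three variable names come from the configuration above;
# everything else in this sketch is illustrative.
REQUIRED_VARS = (
    "KALLIA_PROVIDER_API_KEY",
    "KALLIA_PROVIDER_BASE_URL",
    "KALLIA_PROVIDER_MODEL",
)


def load_provider_config(env=None):
    """Return the provider settings, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_provider_config()` with no arguments checks the current shell environment; passing a plain dict makes the check easy to try without exporting anything.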
Local Development Setup
- Clone the repository
  git clone https://github.com/kallia-project/kallia.git
  cd kallia
- Install dependencies
  pip install -r requirements.txt
- Configure environment variables
  cp .env.example .env
  Edit .env with your configuration:
  KALLIA_PROVIDER_API_KEY=your_api_key_here
  KALLIA_PROVIDER_BASE_URL=https://openrouter.ai/api/v1
  KALLIA_PROVIDER_MODEL=qwen/qwen2.5-vl-32b-instruct
- Run the application
  fastapi run kallia/main.py --port 8000
Docker Deployment
- Using Docker Compose (Recommended)
  docker-compose up -d
- Manual Docker Build
  docker build -t overheatsystem/kallia:0.1.1 .
  docker run -p 8000:80 \
    -e KALLIA_PROVIDER_API_KEY=ollama \
    -e KALLIA_PROVIDER_BASE_URL=http://localhost:11434/v1 \
    -e KALLIA_PROVIDER_MODEL=qwen2.5vl:32b \
    overheatsystem/kallia:0.1.1
Configuration
Environment Variables
| Variable | Description | Example |
|---|---|---|
| KALLIA_PROVIDER_API_KEY | API key for your language model provider | ollama |
| KALLIA_PROVIDER_BASE_URL | Base URL for the API endpoint | http://localhost:11434/v1 |
| KALLIA_PROVIDER_MODEL | Model identifier to use | qwen2.5vl:32b |
Supported Providers
- OpenRouter: Use OpenRouter API for access to various models
- Ollama: Local model deployment with Ollama
- Custom Endpoints: Any OpenAI-compatible API endpoint
Usage
API Endpoint
POST /chunks
Converts a document into semantic chunks with concise summaries.
Request Body
{
  "url": "https://example.com/document.pdf",
  "page_number": 1,
  "temperature": 0.0,
  "max_tokens": 8192
}
Parameters
- url (string, required): URL to the document to process
- page_number (integer, optional): Specific page to process (default: 1)
- temperature (float, optional): Model temperature for processing (default: 0.0)
- max_tokens (integer, optional): Maximum tokens for processing (default: 8192)
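The parameters and defaults listed above can be mirrored on the client side so that a request body is built consistently. This is a sketch under stated assumptions: `ChunksRequest` is a hypothetical helper, not one of Kallia's own Pydantic models, and the range checks (an http(s) URL, a 1-based page number, a positive token limit) are guesses at sensible input rather than documented server rules.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChunksRequest:
    """Client-side mirror of the documented POST /chunks body (hypothetical helper)."""

    url: str
    page_number: int = 1       # documented default
    temperature: float = 0.0   # documented default
    max_tokens: int = 8192     # documented default

    def __post_init__(self):
        # Assumed sanity checks; the server may accept or reject other values.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError("url must be an http(s) URL")
        if self.page_number < 1:
            raise ValueError("page_number must be 1 or greater")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be positive")

    def to_json(self):
        """Return a dict suitable for the json= argument of requests.post."""
        return asdict(self)


payload = ChunksRequest(url="https://example.com/document.pdf", page_number=2).to_json()
print(payload)
```

Only `url` has to be supplied; the other fields fall back to the defaults from the parameter list.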
Response
{
  "chunks": [
    {
      "original_text": "Original document content...",
      "concise_summary": "Concise summary of the content..."
    }
  ]
}
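A response of this shape can be turned into typed objects before further processing. The sketch below assumes only the two fields shown in the response above; `Chunk` and `parse_chunks` are hypothetical names, not part of Kallia's API.

```python
import json
from dataclasses import dataclass


@dataclass
class Chunk:
    original_text: str
    concise_summary: str


def parse_chunks(body):
    """Convert a decoded /chunks response body into a list of Chunk objects."""
    return [
        Chunk(item["original_text"], item["concise_summary"])
        for item in body.get("chunks", [])
    ]


# Sample body copied from the documented response format.
sample = json.loads("""
{
  "chunks": [
    {
      "original_text": "Original document content...",
      "concise_summary": "Concise summary of the content..."
    }
  ]
}
""")

chunks = parse_chunks(sample)
print(len(chunks), chunks[0].concise_summary)
```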
Example Usage
cURL
curl -X POST "http://localhost:8000/chunks" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "page_number": 1,
    "temperature": 0.0,
    "max_tokens": 4096
  }'
Python
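import requests

# Request semantic chunks for the first page of the document.
response = requests.post(
    "http://localhost:8000/chunks",
    json={
        "url": "https://example.com/document.pdf",
        "page_number": 1,
        "temperature": 0.0,
        "max_tokens": 4096,
    },
)
response.raise_for_status()  # Stop here on a 4xx/5xx response

# Each chunk pairs the original text with its concise summary.
chunks = response.json()["chunks"]
for chunk in chunks:
    print(f"Summary: {chunk['concise_summary']}")
    print(f"Original: {chunk['original_text']}")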
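Because each request takes a single `page_number`, processing several pages of one document means sending one request per page. The following sketch shows one way to do that; `fetch_pages` is a hypothetical helper, and its `post` argument is an assumed injection point (any callable shaped like `requests.post`) so the loop can be exercised without a running server.

```python
def fetch_pages(post, url, pages, endpoint="http://localhost:8000/chunks"):
    """Call POST /chunks once per page and collect the chunks by page number.

    `post` is any callable with the same shape as requests.post.
    temperature and max_tokens are omitted, so the documented defaults apply.
    """
    results = {}
    for page in pages:
        response = post(endpoint, json={"url": url, "page_number": page})
        response.raise_for_status()
        results[page] = response.json()["chunks"]
    return results


# Example call against a local instance (requires the requests package):
#   import requests
#   by_page = fetch_pages(requests.post, "https://example.com/document.pdf", [1, 2, 3])
```

The requests are sent sequentially, which keeps the sketch simple; whether the service handles concurrent requests well is not stated here.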
Project Structure
kallia/
├── kallia/
│   ├── __init__.py
│   ├── main.py                       # FastAPI application entry point
│   ├── models.py                     # Pydantic models for API
│   ├── constants.py                  # Application constants
│   ├── documents.py                  # Document processing logic
│   ├── chunker.py                    # Semantic chunking implementation
│   ├── utils.py                      # Utility functions
│   ├── logger.py                     # Logging configuration
│   ├── settings.py                   # Application settings
│   ├── exceptions.py                 # Custom exceptions
│   ├── messages.py                   # Message handling
│   ├── prompts.py                    # AI model prompts
│   ├── image_caption_serializer.py
│   └── unordered_list_serializer.py
├── tests/                            # Test suite
│   ├── __init__.py
│   ├── test_pdf_to_markdown.py
│   └── test_markdown_to_chunks.py
├── assets/                           # Test assets
│   └── pdf/
│       └── 01.pdf                    # Sample PDF for testing
├── requirements.txt                  # Python dependencies
├── pyproject.toml                    # Project configuration
├── Dockerfile                        # Docker container configuration
├── docker-compose.yml                # Docker Compose setup
├── .env.example                      # Environment variables template
└── README.md                         # This file
Development
Code Style
The project follows Python best practices and uses:
- FastAPI as the web framework
- Pydantic for data validation
- Structured logging
- Comprehensive error handling
Testing
The project includes comprehensive tests for core functionality:
# Run tests
python -m pytest tests/
# Run specific tests
python -m pytest tests/test_pdf_to_markdown.py
python -m pytest tests/test_markdown_to_chunks.py
Test coverage includes:
- PDF to markdown conversion
- Markdown to semantic chunks processing
- End-to-end document processing pipeline
Dependencies
Core Dependencies
- FastAPI: Modern, fast web framework for building APIs
- Docling: Document processing and conversion library
- RapidOCR: OCR capabilities for text extraction
- OpenCV: Computer vision library for image processing
Full Dependency List
See requirements.txt for complete dependency specifications:
fastapi[standard]==0.115.14
docling==2.38.1
rapidocr-onnxruntime==1.4.4
opencv-python-headless==4.11.0.86
Error Handling
The API provides comprehensive error handling with appropriate HTTP status codes:
- 400 Bad Request: Invalid parameters or unsupported file format
- 500 Internal Server Error: Processing errors
- 503 Service Unavailable: External service connectivity issues
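On the client side, these status codes suggest different reactions: a 400 means the request itself should be fixed, while a 503 may succeed on a later attempt. The sketch below illustrates one such policy. `post_with_retry` is a hypothetical helper, the retry count and linear backoff are arbitrary choices, and treating every other 4xx code like a 400 is an assumption rather than documented behavior.

```python
import time


def post_with_retry(post, endpoint, payload, retries=3, delay=1.0):
    """POST to /chunks, retrying only on 503 Service Unavailable.

    `post` is any callable shaped like requests.post; `retries` must be >= 1.
    """
    for attempt in range(1, retries + 1):
        response = post(endpoint, json=payload)
        if response.status_code == 503 and attempt < retries:
            time.sleep(delay * attempt)  # simple linear backoff before retrying
            continue
        if 400 <= response.status_code < 500:
            # 400 is the documented client error; other 4xx codes are
            # handled the same way here as an assumption.
            raise ValueError(f"Request rejected ({response.status_code}): {response.text}")
        if response.status_code >= 500:
            # Covers 500 and a 503 that persisted through the last attempt.
            raise RuntimeError(f"Server error ({response.status_code}): {response.text}")
        return response.json()
```

With the requests package installed, `post_with_retry(requests.post, "http://localhost:8000/chunks", {"url": "https://example.com/document.pdf"})` would apply this policy to a local instance.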
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Author
CK
- Email: ck@kallia.net
- GitHub: @kallia-project
Version
Current version: 0.1.1
Built with ❤️ for intelligent document processing