
Semantic Document Processing Library


Kallia


Kallia is a semantic document processing library that converts documents into intelligent semantic chunks. The library specializes in extracting meaningful content segments from documents while preserving context and semantic relationships.

🚀 Features

  • Document-to-Markdown Conversion: Standardized processing pipeline for various document formats
  • Semantic Chunking: Intelligent content segmentation that respects document structure and meaning
  • PDF Support: Robust PDF processing with extensible architecture for additional formats
  • RESTful API: FastAPI-based service with comprehensive error handling
  • Interactive Playground: Chainlit-powered chat interface for document Q&A
  • Memory Management: Long-term and short-term memory systems for conversational context
  • Configurable Processing: Adjustable parameters (temperature, token limits, page selection)
  • Docker Support: Containerized deployment for both core API and playground

📋 Requirements

  • Python 3.11 or higher
  • FastAPI 0.115.14
  • Docling 2.41.0

🛠️ Installation

Using pip

pip install kallia

From Source

git clone https://github.com/kallia-project/kallia.git
cd kallia
pip install -e .

🏗️ Project Structure

kallia/
├── kallia/
│   ├── core/                    # Core API service
│   │   ├── kallia_core/         # Main library modules
│   │   │   ├── main.py          # FastAPI application
│   │   │   ├── documents.py     # Document processing
│   │   │   ├── chunker.py       # Semantic chunking
│   │   │   ├── memories.py      # Memory management
│   │   │   ├── models.py        # Data models
│   │   │   └── ...
│   │   ├── requirements.txt     # Core dependencies
│   │   ├── Dockerfile           # Core service container
│   │   └── docker-compose.yml   # Core service orchestration
│   └── playground/              # Interactive chat interface
│       ├── kallia_playground/   # Playground modules
│       │   ├── main.py          # Chainlit application
│       │   ├── qa.py            # Q&A functionality
│       │   └── ...
│       ├── requirements.txt     # Playground dependencies
│       ├── Dockerfile           # Playground container
│       └── docker-compose.yml   # Playground orchestration
├── tests/                       # Test suite
├── assets/                      # Sample documents
└── pyproject.toml               # Project configuration

🚀 Quick Start

1. Core API Service

Start the FastAPI service:

cd kallia/core
pip install -r requirements.txt
uvicorn kallia_core.main:app --reload

The API will be available at http://localhost:8000

API Endpoints

Process Documents

POST /documents

Request body:

{
  "url": "path/to/document.pdf",
  "page_number": 1,
  "temperature": 0.7,
  "max_tokens": 4000
}
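
As a minimal sketch of how a client might assemble this request in Python, the helper below builds (but does not send) the POST request using only the standard library. The function name `build_documents_request` and the `max_tokens` check are illustrative assumptions, not part of Kallia; only the endpoint path and the four field names come from the request body above.

```python
import json
import urllib.request


def build_documents_request(
    base_url: str,
    url: str,
    page_number: int = 1,
    temperature: float = 0.7,
    max_tokens: int = 4000,
) -> urllib.request.Request:
    """Build a POST /documents request (hypothetical helper, not a Kallia API)."""
    # Assumed sanity check; the service may enforce different limits.
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")

    payload = {
        "url": url,
        "page_number": page_number,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/documents",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_documents_request("http://localhost:8000", "path/to/document.pdf")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it to a running service; the shape of the response is not documented here, so the sketch stops before that step.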

Create Memories

POST /memories

Request body:

{
  "messages": [
    { "role": "user", "content": "Hello" },
    { "role": "assistant", "content": "Hi there!" }
  ],
  "temperature": 0.7,
  "max_tokens": 4000
}
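
The `messages` list above can be assembled from a stored conversation. The following is a rough sketch under the assumption that the conversation is kept as `(role, content)` pairs; `build_memories_payload` is a hypothetical helper, and skipping empty turns is a design choice of this example rather than documented service behavior.

```python
import json


def build_memories_payload(
    turns: list[tuple[str, str]],
    temperature: float = 0.7,
    max_tokens: int = 4000,
) -> dict:
    """Turn (role, content) pairs into a /memories request body (illustrative helper)."""
    messages = []
    for role, content in turns:
        # Skip empty turns so they are not sent to the service.
        if not content.strip():
            continue
        messages.append({"role": role, "content": content})
    if not messages:
        raise ValueError("at least one non-empty message is required")
    return {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


payload = build_memories_payload([("user", "Hello"), ("assistant", "Hi there!")])
print(json.dumps(payload, indent=2))
```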

2. Interactive Playground

Start the Chainlit chat interface:

cd kallia/playground
pip install -r requirements.txt
chainlit run kallia_playground/main.py

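The playground will be available at http://localhost:8000, which is Chainlit's default port. If the core API is already running on port 8000, start the playground on a different port, for example with chainlit run kallia_playground/main.py --port 8001.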

3. Docker Deployment

Core Service

cd kallia/core
docker-compose up -d

Playground

cd kallia/playground
docker-compose up -d

💡 Usage Examples

Python API

from kallia_core.documents import Documents
from kallia_core.chunker import Chunker
from kallia_core.memories import Memories

# Convert document to markdown
markdown_content = Documents.to_markdown(
    source="document.pdf",
    page_number=1,
    temperature=0.7,
    max_tokens=4000
)

# Create semantic chunks
chunks = Chunker.create(
    text=markdown_content,
    temperature=0.7,
    max_tokens=4000
)

# Generate memories from conversation
messages = [
    {"role": "user", "content": "What is this document about?"},
    {"role": "assistant", "content": "This document discusses..."}
]
memories = Memories.create(messages)
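
Because `page_number` selects a single page in the example above, processing several pages means repeating the convert-then-chunk step per page. The sketch below shows one way to structure that loop. It assumes the conversion step returns a markdown string and the chunking step returns a list, neither of which is confirmed here. `chunk_pages` and the two `fake_*` stand-ins are hypothetical and exist only so the example runs without Kallia installed.

```python
from typing import Callable


def chunk_pages(
    source: str,
    pages: range,
    to_markdown: Callable[..., str],
    create_chunks: Callable[..., list],
) -> dict[int, list]:
    """Convert each page to markdown, then chunk it; returns {page_number: chunks}."""
    results = {}
    for page_number in pages:
        markdown = to_markdown(source=source, page_number=page_number)
        results[page_number] = create_chunks(text=markdown)
    return results


# Stand-in callables so the sketch runs without Kallia installed.
def fake_to_markdown(source: str, page_number: int) -> str:
    return f"# {source} page {page_number}\n\nFirst paragraph.\n\nSecond paragraph."


def fake_create_chunks(text: str) -> list:
    return text.split("\n\n")


result = chunk_pages("document.pdf", range(1, 3), fake_to_markdown, fake_create_chunks)
print(result[1])
```

With Kallia installed, `Documents.to_markdown` and `Chunker.create` could be passed in place of the stand-ins, for instance wrapped with `functools.partial` to supply `temperature` and `max_tokens`.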

REST API

# Process a document
curl -X POST "http://localhost:8000/documents" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://raw.githubusercontent.com/kallia-project/kallia/refs/tags/v0.1.6/assets/pdf/01.pdf",
    "page_number": 1,
    "temperature": 0.7,
    "max_tokens": 4000
  }'

# Create memories
curl -X POST "http://localhost:8000/memories" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello"},
      {"role": "assistant", "content": "Hi there!"}
    ],
    "temperature": 0.7,
    "max_tokens": 4000
  }'

📊 Benchmark Results

Kallia was benchmarked against three other document processing libraries (LlamaIndex, PyMuPDF, and Unstructured) using a RAG (Retrieval-Augmented Generation) evaluation framework. The benchmark measures document chunking quality and retrieval performance across 100 test questions.

Performance Comparison


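| System       | Mean Score | Perfect Score Rate | Ranking |
|--------------|------------|--------------------|---------|
| Kallia       | 4.600      | 81.0%              | 🥇 1st  |
| LlamaIndex   | 4.300      | 71.0%              | 🥈 2nd  |
| PyMuPDF      | 4.060      | 65.0%              | 🥉 3rd  |
| Unstructured | 3.950      | 63.0%              | 4th     |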
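
The two metrics in the table above can be read as follows, assuming the mean score is the arithmetic mean of the per-question ratings and the perfect score rate is the share of questions rated 5. This is a sketch of that interpretation, not the benchmark's actual code; `summarize_scores` and the sample ratings are made up for illustration.

```python
from statistics import mean


def summarize_scores(scores: list[int]) -> tuple[float, float]:
    """Return (mean score, perfect-score rate in percent) for 1-5 ratings."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(score < 1 or score > 5 for score in scores):
        raise ValueError("scores are expected on a 1-5 scale")
    perfect_rate = 100.0 * sum(score == 5 for score in scores) / len(scores)
    return mean(scores), perfect_rate


# Made-up ratings for illustration only; these are not benchmark data.
example_scores = [5, 5, 4, 5, 3, 5, 5, 2, 5, 5]
mean_score, perfect_rate = summarize_scores(example_scores)
print(f"mean={mean_score:.3f} perfect={perfect_rate:.1f}%")
```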

Key Advantages

  • Highest Accuracy: Kallia achieves the highest mean score of 4.6/5.0
  • Superior Perfect Score Rate: 81% of questions received perfect scores vs. 71% for the next best
  • Semantic Chunking: Uses intelligent semantic chunking vs. fixed 500-character chunks with 0 overlap used by competitors

Benchmark Details

  • Evaluation Model: Qwen3 30B A3B Instruct 2507
  • Test Questions: 100 comprehensive questions across various document types
  • Scoring: 1-5 scale (1=Poor, 2=Below Average, 3=Average, 4=Good, 5=Excellent)
  • Chunking Method: Kallia uses semantic chunking with Qwen2.5 VL 32B Instruct
  • Competitor Methods: Fixed 500-character chunks with 0 overlap
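
For reference, a fixed-size baseline like the one described for the competitor methods can be sketched in a few lines. This is a generic illustration of fixed 500-character chunking with 0 overlap, not the code used in the benchmark; the function name `fixed_size_chunks` is an assumption.

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character chunks, ignoring document structure."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    # Each chunk starts `size - overlap` characters after the previous one.
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]


sample = "A" * 1200
chunks = fixed_size_chunks(sample)
print([len(chunk) for chunk in chunks])  # [500, 500, 200]
```

Because the split points depend only on character offsets, a chunk can begin or end mid-sentence, which is the behavior semantic chunking is meant to avoid.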

In this benchmark, Kallia outperformed the three other systems on both mean score and perfect score rate, which makes it a strong candidate for applications that depend on high-quality document understanding and semantic chunking.

For detailed benchmark results and visualizations, see the benchmark/ directory.

🧪 Testing

Run the test suite:

python -m pytest tests/

Available tests:

  • test_pdf_to_markdown.py - Document conversion tests
  • test_markdown_to_chunks.py - Chunking functionality tests
  • test_histories_to_memories.py - Memory creation tests

🔧 Configuration

Environment Variables

Create a .env file based on the provided .env.example template in each directory:

Core Service:

cd kallia/core
cp .env.example .env
# Edit .env with your configuration

Playground:

cd kallia/playground
cp .env.example .env
# Edit .env with your configuration
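
The variables Kallia expects are defined in each .env.example file and are not listed here. As a generic illustration of the file format only, the sketch below parses simple KEY=VALUE lines so a .env file can be sanity-checked before starting a service. `parse_env`, `EXAMPLE_API_KEY`, and `EXAMPLE_MODEL` are placeholders, and how Kallia itself loads the file is not assumed.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


example = """
# Placeholder names, not Kallia's real settings
EXAMPLE_API_KEY="sk-placeholder"
EXAMPLE_MODEL=some-model
"""
print(parse_env(example))
```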

Supported File Formats

Currently supported:

  • PDF documents

The architecture is designed to be extensible for additional formats.

📝 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🔗 Links

👨‍💻 Author

CK - ck@kallia.net

🏷️ Keywords

  • document-processing
  • semantic-chunking
  • document-analysis
  • text-processing
  • machine-learning
  • fastapi
  • chainlit
  • pdf-processing
  • nlp
  • ai

Built with ❤️ for intelligent document processing
