
A lightweight OCR library for Khmer and English documents


Kiri OCR 📄


Kiri OCR is a lightweight OCR library for English and Khmer documents. It provides document-level text detection, recognition, and rendering in a compact package (~13MB model).


✨ Key Features

  • Lightweight: Only ~13MB model size (Lite version).
  • Bilingual: Native support for English, Khmer, and mixed text.
  • Document Processing: Automatic text line and word detection.
  • Robust Detection: Works on both light and dark backgrounds (Dark Mode support).
  • Easy to Use: Simple Python API.
  • Visualizations: Generate annotated images and HTML reports.

📊 Dataset

The model is trained on the mrrtmob/km_en_image_line dataset, which contains 5 million synthetic images of Khmer and English text lines.

📈 Benchmark

Results on synthetic test images (10 popular fonts):

(Benchmark graph and benchmark table images.)

📦 Installation

Install easily via pip:

pip install kiri-ocr

Or install from source:

git clone https://github.com/mrrtmob/kiri-ocr.git
cd kiri-ocr
pip install .

💻 Usage

CLI Tool (Inference)

Run OCR on an image and save results:

kiri-ocr predict path/to/document.jpg --output results/

(Or simply kiri-ocr path/to/document.jpg)

Python API

from kiri_ocr import OCR

# Initialize Lite Model
ocr = OCR()

# Process a document (detection + recognition)
results = ocr.process_document('document.jpg')

# Extract text
text, _ = ocr.extract_text('document.jpg')
print(text)
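Building on the API above, here is a minimal batch-processing sketch. Only `extract_text()` is taken from the example; the helper name, the `.jpg`-only glob, and the output layout are illustrative choices, not part of the library:

```python
from pathlib import Path

def ocr_directory(ocr, image_dir, out_dir):
    """Run extract_text() on every .jpg in image_dir, saving one .txt per image.

    `ocr` is any object with extract_text(path) -> (text, extras),
    e.g. an initialized kiri_ocr.OCR() instance.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    texts = {}
    for image in sorted(Path(image_dir).glob("*.jpg")):
        text, _ = ocr.extract_text(str(image))
        # Mirror each image as <name>.txt next to the others in out_dir
        (out_dir / f"{image.stem}.txt").write_text(text, encoding="utf-8")
        texts[image.name] = text
    return texts
```

With the real library this would be called as `ocr_directory(OCR(), 'scans/', 'results/')`.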

🎓 Training a New Model

Follow this guide to train a custom model from scratch.

Step 1: Generate Training Data

Create synthetic training images from a text file.

  1. Prepare text file: Create data/textlines.txt with your training text (one sentence per line).

  2. Generate dataset:

    kiri-ocr generate \
        --train-file data/textlines.txt \
        --output data \
        --fonts-dir fonts \
        --augment 1 \
        --random-augment
    
    • --fonts-dir: Directory containing .ttf files (Khmer/English fonts).
    • --augment: How many variations to generate per line (e.g., 2).
    • --random-augment: Apply random noise/rotation even if augment is 1.
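The text file from step 1 is plain UTF-8, one sentence per line. A short sketch of preparing it (the sample sentences are illustrative; a real file would contain thousands of Khmer and English lines):

```python
from pathlib import Path

# Illustrative training sentences (a real file would have many more)
lines = [
    "Hello World",
    "This is a test sentence.",
    "ការសាកល្បងអក្សរខ្មែរ",  # Khmer sample line
]

out = Path("data") / "textlines.txt"
out.parent.mkdir(parents=True, exist_ok=True)
# One sentence per line; UTF-8 so Khmer text round-trips correctly
out.write_text("\n".join(lines) + "\n", encoding="utf-8")
```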

Custom Dataset Structure

If you have your own data (not generated), organize it as follows:

data/
  ├── train/
  │   ├── labels.txt       # Tab-separated: filename <tab> text
  │   └── images/          # Image files
  │       ├── img_001.png
  │       ├── img_002.jpg
  │       └── ...
  └── val/
      ├── labels.txt
      └── images/

Format of labels.txt:

img_001.png    Hello World
img_002.jpg    This is a test

Note: Images must be in an images/ subdirectory relative to the labels.txt file.
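A small sketch that parses this labels.txt format and checks that each referenced image exists under images/ (the function and variable names are mine, not part of the library):

```python
from pathlib import Path

def load_labels(labels_path):
    """Parse a tab-separated labels.txt into [(image_path, text), ...].

    Images are resolved against the images/ directory next to labels.txt,
    matching the layout described above.
    """
    labels_path = Path(labels_path)
    images_dir = labels_path.parent / "images"
    samples = []
    for lineno, line in enumerate(
        labels_path.read_text(encoding="utf-8").splitlines(), 1
    ):
        if not line.strip():
            continue  # skip blank lines
        filename, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'filename<TAB>text'")
        image = images_dir / filename
        if not image.is_file():
            raise FileNotFoundError(f"line {lineno}: missing image {image}")
        samples.append((image, text))
    return samples
```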

Step 2: Train the Model

You can train using CLI arguments or a configuration file.

Option A: Using Configuration File (Recommended)

  1. Generate default config:
    kiri-ocr init-config -o config.json
    
  2. Edit config.json to adjust hyperparameters (epochs, batch size, etc.).
  3. Start training:
    kiri-ocr train --config config.json
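Step 2 of Option A can be scripted. A hedged sketch of editing the generated config (the exact key names depend on what `kiri-ocr init-config` emits; `epochs` and `batch_size` here merely mirror the CLI flags and are assumptions):

```python
import json
from pathlib import Path

def update_config(path, **overrides):
    """Load a JSON config, apply hyperparameter overrides, and save it back."""
    path = Path(path)
    cfg = json.loads(path.read_text(encoding="utf-8"))
    cfg.update(overrides)
    path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return cfg

# e.g. update_config("config.json", epochs=100, batch_size=32)
```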
    

Option B: Using CLI Arguments

kiri-ocr train \
    --train-labels data/train/labels.txt \
    --val-labels data/val/labels.txt \
    --epochs 100 \
    --batch-size 32 \
    --device cuda

Option C: Training with Hugging Face Dataset

You can train directly using a dataset from Hugging Face Hub. The dataset should contain image and text columns.

kiri-ocr train \
    --hf-dataset mrrtmob/km_en_image_line \
    --epochs 50 \
    --batch-size 32

Advanced HF Options:

  • --hf-train-split: Specify training split name (default: "train").
  • --hf-val-split: Specify validation split name. If not provided, it tries "validation", "val", "test", or automatically splits the training set.
  • --hf-val-percent: Percentage of training data to use for validation if no validation split is found (default: 0.1 for 10%).
  • --hf-image-col: Column name for images (default: "image").
  • --hf-text-col: Column name for text labels (default: "text").
  • --hf-subset: Dataset configuration/subset name (optional).
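The validation-split fallback described above can be sketched in pure Python (an illustration of the stated behaviour, not the CLI's actual code):

```python
def choose_val_split(available_splits, val_split=None):
    """Pick a validation split per the documented fallback order.

    Returns the split name, or None to signal that a fraction of the
    training split (--hf-val-percent, default 10%) should be carved off.
    """
    if val_split is not None:
        return val_split  # explicit --hf-val-split wins
    for candidate in ("validation", "val", "test"):
        if candidate in available_splits:
            return candidate
    return None  # caller splits the training set instead
```

For example, `choose_val_split(["train", "test"])` falls through to `"test"`, while `choose_val_split(["train"])` returns `None`, meaning 10% of the training data would be held out.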

To use a specific subset/config (if the dataset has multiple):

kiri-ocr train \
    --hf-dataset mrrtmob/km_en_image_line \
    --hf-subset default \
    ...

Fine-Tuning

To fine-tune an existing model on new data:

kiri-ocr train \
    --config config.yaml \
    --from-model models/model.kiri

This loads the weights from models/model.kiri before training starts, which is useful for domain adaptation or adding new languages.

The trained model is saved to models/model.kiri (or to the directory given by output_dir).

☕ Support

If you find this project useful, you can support me here:

⚖️ License

Apache License 2.0.

Download files

Download the file for your platform.

Source Distribution

kiri_ocr-0.1.2.tar.gz (41.5 MB)


Built Distribution


kiri_ocr-0.1.2-py3-none-any.whl (33.4 kB)


File details

Details for the file kiri_ocr-0.1.2.tar.gz.

File metadata

  • Download URL: kiri_ocr-0.1.2.tar.gz
  • Size: 41.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kiri_ocr-0.1.2.tar.gz:

  • SHA256: 0daae750260e52edc2e30aed1dccc6c6ccdbb5076ed34bcc37faf939d15b708b
  • MD5: ff382e35ad3c05c9f5d90c6bb1c534ae
  • BLAKE2b-256: 89c0b443de01c1fc61f511a2a9dab027934225fc655b62730c02ac763eaf7566

File details

Details for the file kiri_ocr-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: kiri_ocr-0.1.2-py3-none-any.whl
  • Size: 33.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kiri_ocr-0.1.2-py3-none-any.whl:

  • SHA256: 6c2429fc4639f4f296aefcca4c16f11384afa4d281fb2019b8e7bf1912ca11a1
  • MD5: 38ba5549fc18718d9cc362ac2309c03f
  • BLAKE2b-256: d19510dd6b6be2d5edc11e3f3a108cbe04710d46f45a877e16f80e7d3060d203
