Kiri OCR 📄
Kiri OCR is a lightweight OCR library for English and Khmer documents. It provides document-level text detection, recognition, and rendering in a compact package (~13MB model).
✨ Key Features
- Lightweight: Only ~13MB model size (Lite version).
- Bi-lingual: Native support for English and Khmer (and mixed).
- Document Processing: Automatic text line and word detection.
- Robust Detection: Works on both light and dark backgrounds (Dark Mode support).
- Easy to Use: Simple Python API.
- Visualizations: Generate annotated images and HTML reports.
📊 Dataset
The model is trained on the mrrtmob/km_en_image_line dataset, which contains 5 million synthetic images of Khmer and English text lines.
📈 Benchmark
Results on synthetic test images (10 popular fonts):
📦 Installation
You can install the package directly from source:
git clone https://github.com/mrrtmob/kiri-ocr.git
cd kiri-ocr
pip install .
💻 Usage
CLI Tool (Inference)
Run OCR on an image and save results:
kiri-ocr predict path/to/document.jpg --output results/
(Or simply kiri-ocr path/to/document.jpg)
Python API
from kiri_ocr import OCR
# Initialize Lite Model
ocr = OCR()
# Process document
results = ocr.process_document('document.jpg')
# Extract text
text, _ = ocr.extract_text('document.jpg')
print(text)
🎓 Training a New Model
Follow this guide to train a custom model from scratch.
Step 1: Generate Training Data
Create synthetic training images from a text file.
- Prepare text file: Create data/textlines.txt with your training text (one sentence per line).
- Generate dataset:
kiri-ocr generate \
  --train-file data/textlines.txt \
  --output data \
  --fonts-dir fonts \
  --augment 1 \
  --random-augment
- --fonts-dir: Directory containing .ttf files (Khmer/English fonts).
- --augment: How many variations to generate per line (e.g., 2).
- --random-augment: Apply random noise/rotation even if --augment is 1.
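As a rough sanity check before generating, the expected dataset size is simply non-empty lines × augment factor. A minimal stdlib sketch (the function name and the assumption of one image per non-empty line are ours, not part of the kiri-ocr API):

```python
from pathlib import Path

def expected_image_count(textlines_path: str, augment: int) -> int:
    """Estimate how many images `kiri-ocr generate` should produce,
    assuming one image per non-empty text line times --augment."""
    lines = [
        line for line in
        Path(textlines_path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return len(lines) * augment
```

For example, a file with 3 non-empty lines and --augment 2 should yield 6 images.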
Custom Dataset Structure
If you have your own data (not generated), organize it as follows:
data/
├── train/
│ ├── labels.txt # Tab-separated: filename <tab> text
│ └── images/ # Image files
│ ├── img_001.png
│ ├── img_002.jpg
│ └── ...
└── val/
├── labels.txt
└── images/
Format of labels.txt:
img_001.png Hello World
img_002.jpg This is a test
Note: Images must be in an images/ subdirectory relative to the labels.txt file.
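The layout above can be sanity-checked with a few lines of stdlib Python before training. This is an illustrative helper (not part of the kiri-ocr API); it enforces the tab-separated format and the images/ subdirectory rule described above:

```python
from pathlib import Path

def check_split(split_dir: str) -> int:
    """Verify a train/ or val/ split: each labels.txt line must be
    'filename<TAB>text', and the file must exist under images/."""
    split = Path(split_dir)
    labels = (split / "labels.txt").read_text(encoding="utf-8").splitlines()
    for line in labels:
        name, text = line.split("\t", 1)  # raises ValueError if malformed
        assert (split / "images" / name).is_file(), f"missing image: {name}"
    return len(labels)  # number of labeled samples
```

Running it on both data/train and data/val before `kiri-ocr train` catches missing images and malformed label lines early.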
Step 2: Train the Model
You can train using CLI arguments or a configuration file.
Option A: Using Configuration File (Recommended)
- Generate default config:
kiri-ocr init-config -o config.json
- Edit config.json to adjust hyperparameters (epochs, batch size, etc.).
- Start training:
kiri-ocr train --config config.json
Option B: Using CLI Arguments
kiri-ocr train \
--train-labels data/train/labels.txt \
--val-labels data/val/labels.txt \
--epochs 100 \
--batch-size 32 \
--device cuda
Option C: Training with Hugging Face Dataset
You can train directly using a dataset from Hugging Face Hub. The dataset should contain image and text columns.
kiri-ocr train \
--hf-dataset mrrtmob/km_en_image_line \
--epochs 50 \
--batch-size 32
Advanced HF Options:
- --hf-train-split: Specify training split name (default: "train").
- --hf-val-split: Specify validation split name. If not provided, it tries "validation", "val", "test", or automatically splits the training set.
- --hf-val-percent: Percentage of training data to use for validation if no validation split is found (default: 0.1 for 10%).
- --hf-image-col: Column name for images (default: "image").
- --hf-text-col: Column name for text labels (default: "text").
- --hf-subset: Dataset configuration/subset name (optional).
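The validation-split fallback described above can be sketched as plain logic. This mirrors the documented behavior only; it is not the library's actual code, and the function name is ours:

```python
def resolve_val_split(available_splits, hf_val_split=None):
    """Pick a validation split the way the docs describe:
    an explicit --hf-val-split wins; otherwise try common names;
    return None to signal 'carve --hf-val-percent off the train split'."""
    if hf_val_split is not None:
        return hf_val_split
    for candidate in ("validation", "val", "test"):
        if candidate in available_splits:
            return candidate
    return None  # caller splits the training set by --hf-val-percent
```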
To use a specific subset/config (if the dataset has multiple):
kiri-ocr train \
--hf-dataset mrrtmob/km_en_image_line \
--hf-subset default \
...
Fine-Tuning
To fine-tune an existing model on new data:
kiri-ocr train \
--config config.json \
--from-model models/model.kiri
This loads the weights from models/model.kiri before starting training. Useful for domain adaptation or adding languages.
The trained model will be saved to models/model.kiri (or specified output_dir).
☕ Support
If you find this project useful, you can support me here:
⚖️ License