Optical Character Recognition Model

Project description

TrorYong OCR Model

TrorYongOCR is an Optical Character Recognition model implemented by KrorngAI.

TrorYong (ត្រយ៉ង) is the Khmer word for the giant ibis, the bird that symbolises Cambodia.

Support My Work

While this work comes truly from the heart, each project represents a significant investment of time -- from deep-dive research and code preparation to the final narrative and editing process. I am incredibly passionate about sharing this knowledge, but maintaining this level of quality is a major undertaking. If you find my work helpful and are in a position to do so, please consider supporting my work with a donation. You can click here to donate or scan the QR code below. Your generosity acts as a huge encouragement and helps ensure that I can continue creating in-depth, valuable content for you.

If you have a Cambodian bank account, you can donate by scanning my ABA QR code here (or click here). Make sure that the receiver's name is 'Khun Kim Ang'.

Installation

You can install tror-yong-ocr with pip:

pip install tror-yong-ocr

Usage

Loading tokenizer

TrorYongOCR is a small optical character recognition model that you can train from scratch. Because of this, you can pair TrorYongOCR with your own tokenizer. Just make sure that the tokenizer used for training and the tokenizer used for inference are the same.

Your tokenizer must contain beginning-of-sequence (bos), end-of-sequence (eos) and padding (pad) tokens. The bos and eos token ids are used in the decoding function. The pad token id is used during training.
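As a rough illustration of these requirements, the sketch below builds a tiny character-level tokenizer with bos, eos and pad tokens. `CharTokenizer`, its special-token strings and its id layout are hypothetical and are not part of tror-yong-ocr; the sketch only shows the kind of interface a compatible tokenizer would need to offer.

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer; not the tror-yong-ocr implementation."""

    def __init__(self, charset):
        # Reserve the first three ids for the special tokens (assumed layout).
        self.specials = ['<pad>', '<s>', '</s>']
        self.pad_id, self.bos_id, self.eos_id = 0, 1, 2
        self.id_to_token = self.specials + sorted(set(charset))
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def __len__(self):
        return len(self.id_to_token)

    def encode(self, text, add_special_tokens=True):
        # One id per character, optionally wrapped in bos ... eos.
        ids = [self.token_to_id[ch] for ch in text]
        if add_special_tokens:
            ids = [self.bos_id] + ids + [self.eos_id]
        return ids

    def decode(self, ids, ignore_special_tokens=True):
        special_ids = {self.pad_id, self.bos_id, self.eos_id}
        tokens = [
            self.id_to_token[i]
            for i in ids
            if not (ignore_special_tokens and i in special_ids)
        ]
        return ''.join(tokens)


tokenizer_sketch = CharTokenizer('abc ')
ids = tokenizer_sketch.encode('a cab')
print(ids)  # [1, 4, 3, 6, 4, 5, 2]
print(tokenizer_sketch.decode(ids, ignore_special_tokens=False))  # <s>a cab</s>
```

Whatever tokenizer you choose, the same vocabulary and id assignment must be reused at inference time, otherwise the predicted ids will decode to the wrong characters.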

I also provide a tokenizer that supports Khmer and English.

from tror_yong_ocr import get_tokenizer

tokenizer = get_tokenizer(charset=None)
print(len(tokenizer)) # you should receive 185
text = 'Amazon បង្កើនការវិនិយោគជិត១'
print(tokenizer.decode(tokenizer.encode(text, add_special_tokens=True), ignore_special_tokens=False))
# this should print <s>Amazon បង្កើនការវិនិយោគជិត១</s>

When preparing a dataset to train TrorYongOCRModel, you just need to transform the text into token ids using the tokenizer:

sentence = 'Cambodia needs peace.'
token_ids = tokenizer.encode(sentence, add_special_tokens=True)

NOTE: My tokenizer works at the character level.
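Since encoded sentences usually differ in length, a training batch typically needs padding, which is where the pad token id comes in. Below is a minimal sketch of right-padding in plain Python. The helper name `pad_batch`, the example ids and the pad id of 0 are assumptions for illustration; take the real pad id from your tokenizer.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every id sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Hypothetical token ids standing in for tokenizer.encode(...) outputs.
batch = [[1, 10, 11, 2], [1, 12, 2]]
padded = pad_batch(batch, pad_id=0)
print(padded)  # [[1, 10, 11, 2], [1, 12, 2, 0]]
```

Padded positions are normally excluded from the training loss; how TrorYongOCRModel handles this internally is not described here.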

Loading TrorYongOCRModel

Get started with the code below:

import torch
from torchvision.transforms import v2 as transforms
from PIL import Image # pip install pillow
from tror_yong_ocr import get_tokenizer, TrorYongOCRModel

img = Image.open("your/file/image").convert('RGB')

processor = transforms.Compose(
    [
        transforms.Resize((32, 128)),
        transforms.ToImage(),
        transforms.ToDtype(torch.float32, scale=True),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

img_tensor = processor(img)

tokenizer = get_tokenizer()
model = TrorYongOCRModel.from_pretrained('KrorngAI/TrorYongOCR')
model.eval()

# img_tensor is the preprocessed image tensor produced above
pred_ids = model.decode(img_tensor, 192, temperature=0.01, top_k=25)
print(tokenizer.decode(pred_ids[0].tolist(), ignore_special_tokens=True))
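The `temperature` and `top_k` arguments passed to `model.decode` above are standard sampling controls. The sketch below shows what such parameters commonly mean, using only the Python standard library. It is not the library's decoding code: `sample_top_k` is a hypothetical helper and the logits are made up.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one index from logits after top-k filtering and temperature scaling."""
    # Keep only the k highest-scoring candidates.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    # Divide by the temperature; small values sharpen the distribution.
    scaled = [logits[i] / temperature for i in indices]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
rng = random.Random(0)
picks = [sample_top_k(logits, temperature=0.01, top_k=25, rng=rng) for _ in range(5)]
print(picks)  # [0, 0, 0, 0, 0]
```

With a temperature as low as 0.01, almost all probability mass lands on the highest-scoring token, so sampling behaves like greedy decoding.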

TrorYongOCR is designed as follows. Given $L$ transformer blocks:

the first $L-1$ blocks are encoding blocks that encode a given image
the last block is a single decoding block without a cross-attention mechanism
each transformer block is implemented in the exclusive self-attention [@zhai2026exclusive] style, with SwiGLU in the MLP

For the single decoding block,

the latent state of an image (the output of the encoding blocks) is concatenated with the input character embedding (token embedding, including the bos token) to create the context, i.e. the key and value vectors (think of it as a prefill prompt)
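One way to picture this concatenation is through the attention mask of the decoding block. The sketch below assumes that only the text positions act as queries, that every text position may attend to the whole image encoding, and that attention among text positions is causal. The function name and the token counts are hypothetical; the real implementation may differ.

```python
def decoder_attention_mask(num_image_tokens, num_text_tokens):
    """Return mask[i][j] = True when text query i may attend to context position j."""
    mask = []
    for i in range(num_text_tokens):
        # Every text position sees the full image encoding (the "prefill").
        image_part = [True] * num_image_tokens
        # Causal rule: text position i sees text positions 0..i only.
        text_part = [j <= i for j in range(num_text_tokens)]
        mask.append(image_part + text_part)
    return mask


mask = decoder_attention_mask(num_image_tokens=4, num_text_tokens=3)
for row in mask:
    print(''.join('1' if allowed else '0' for allowed in row))
# 1111100
# 1111110
# 1111111
```

Each row is one text query and each column is one position of the concatenated context (image tokens first, then text tokens).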

The architecture of TrorYongOCR can be found in Figure 1 below.

Figure 1: TrorYongOCR architecture overview. The input image is transformed into a patch embedding. The image embedding is obtained by adding the patch embedding and the position embedding. The image embedding is passed through L-1 encoder blocks to generate the image encoding (latent state). The image encoding is concatenated with the character embedding (i.e. token embedding plus position embedding) before undergoing the causal self-attention mechanism in the single decoder block to generate the next token.

Compared to PARSeq

PARSeq is an encoder-decoder architecture. Its text decoder uses the position embedding as the query vector, the character embedding (token embedding plus position embedding) as the context vector, and the latent state from the image encoder as memory for the cross-attention mechanism (see Figure 3 of their paper).

Compared to DTrOCR

DTrOCR is a decoder-only architecture. The image embedding (patch embedding plus position embedding) is concatenated with the input character embedding, and a causal self-attention mechanism is applied to the concatenation from layer to layer to generate text autoregressively (see Figure 2 of their paper). A [SEP] token is added at the beginning of the input character embedding to indicate sequence separation; it is equivalent to the bos token in TrorYongOCR.

Fine-tuning TrorYongOCR

You can check out the notebook below to train your own Small OCR Model.

I also have a video about training TrorYongOCR below.

TODO:

implement TrorYongOCRModel with KV cache
Colab notebook for fine-tuning TrorYongOCRModel
benchmarking

Project details

Release history

0.2.6: Jun 15, 2026
0.2.5: May 29, 2026
0.2.4: May 24, 2026
0.2.3: May 23, 2026
0.2.2: May 23, 2026
0.2.1: May 16, 2026
0.2.0: May 16, 2026 (this version)
0.1.1: Feb 20, 2026
0.1.0: Feb 20, 2026
0.0.6: Feb 19, 2026
0.0.5: Feb 19, 2026
0.0.4: Feb 18, 2026
0.0.3: Feb 18, 2026
0.0.2: Feb 17, 2026
0.0.1: Feb 17, 2026

Download files


Source Distribution

tror_yong_ocr-0.2.0.tar.gz (15.6 kB view details)

Uploaded May 16, 2026 Source

File details

Details for the file tror_yong_ocr-0.2.0.tar.gz.

File metadata

Download URL: tror_yong_ocr-0.2.0.tar.gz
Upload date: May 16, 2026
Size: 15.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for tror_yong_ocr-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`944835de1596aa7a6ad15ab22982cab0c0eddfdaba769862fdc27c3bd9b10920`
MD5	`41650ed9f722a8ae14fc344f027df8a2`
BLAKE2b-256	`5606a2351c8db54ef55b3d4981707bb544a896c9b1bd099de9461e4840893206`
