Multilingual handwritten OCR for student notes - production-grade text extraction

🖋️ HandScribe OCR

Production-grade multilingual handwritten OCR for student notes

License: GPL-3.0 · Python 3.9+ · Docker


What is HandScribe?

HandScribe extracts text from handwritten student notes in 80+ languages. It wraps three OCR engines (EasyOCR, PaddleOCR, and TrOCR) behind a single interface with advanced image preprocessing, a CLI, a REST API, and one-command Docker deployment.

Built for Tanzanian students. Designed for everyone.


Quick Start

| Method | Command |
| --- | --- |
| Docker (zero setup) | docker run -p 8000:8000 ronaldgosso/handscribe |
| pip install | pip install handscribe |
| From source | git clone https://github.com/ronaldgosso/handscribe.git && cd handscribe && pip install -e . |

Usage

CLI

handscribe extract notes.jpg -b easyocr -l en,sw     # extract text
handscribe extract notes.jpg --json                   # output as JSON
handscribe extract notes.jpg -o output.txt -c 0.6     # save to file
handscribe batch ./images/ -o ./results/               # batch process
handscribe compare notes.jpg -l en,sw                  # compare all backends

REST API

uvicorn ocr_engine.api:api --port 8000
# Interactive docs → http://localhost:8000/docs

curl -X POST http://localhost:8000/ocr \
  -F "file=@notes.jpg" -F "backend=easyocr" -F "languages=en,sw"
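
The same upload can be made from Python without extra dependencies. The sketch below builds a multipart/form-data request that mirrors the curl call (fields `file`, `backend`, `languages`). The helper names `build_ocr_request` and `post_image` are hypothetical, and the response format is not documented here, so the body is returned as raw text.

```python
import urllib.request
import uuid


def build_ocr_request(image_bytes, filename, backend="easyocr",
                      languages="en,sw", url="http://localhost:8000/ocr"):
    """Build a multipart/form-data POST with the same fields as the curl call."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields.
    for name, value in (("backend", backend), ("languages", languages)):
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    # The image itself, sent as an opaque byte stream.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode()
        + image_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=b"".join(parts),
                                  headers=headers, method="POST")


def post_image(path, **kwargs):
    """Send one image to a running server and return the raw response text."""
    with open(path, "rb") as handle:
        request = build_ocr_request(handle.read(), path, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

With the server from the previous step running, `post_image("notes.jpg")` should return whatever the `/ocr` endpoint responds with.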

Python

from ocr_engine import OCREngine, OCRBackend

engine = OCREngine(backend=OCRBackend.EASYOCR, languages=["en", "sw"])
text = engine.extract_text("student_notes.jpg")
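
To process a whole folder from Python, one option is a small loop around `extract_text`. This is a sketch, not part of the package: `extract_folder` and `IMAGE_SUFFIXES` are hypothetical names, and it only assumes the engine exposes `extract_text(path)` as shown above.

```python
from pathlib import Path

# Assumption: these are the scan formats you care about; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def extract_folder(engine, folder):
    """Call engine.extract_text on every image in folder; return {filename: result}."""
    results = {}
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            results[path.name] = engine.extract_text(str(path))
    return results


# Usage (requires handscribe to be installed):
#   from ocr_engine import OCREngine, OCRBackend
#   engine = OCREngine(backend=OCRBackend.EASYOCR, languages=["en", "sw"])
#   notes = extract_folder(engine, "./images")
```

Passing the engine in as an argument means one engine instance is reused for every file rather than created per image.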

OCR Backends

| Backend | Best For | Languages | Speed | Accuracy |
| --- | --- | --- | --- | --- |
| EasyOCR | Quick setup, mixed scripts | 80+ | ⚡⚡⚡ | ⭐⭐⭐⭐ |
| PaddleOCR | Fast processing, documents | 80+ | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ |
| TrOCR | Handwriting accuracy | English* | ⚡⚡ | ⭐⭐⭐⭐⭐ |

*TrOCR can be fine-tuned for other languages.

Language Codes

| Language | EasyOCR | PaddleOCR |
| --- | --- | --- |
| English | en | en |
| Swahili | sw | en (Latin script) |
| Arabic | ar | arabic |
| Hindi | hi | hi |
| French | fr | french |
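
Because the two backends use different codes for the same language, a small lookup can keep calling code backend-agnostic. The sketch below copies the codes from the table above. `LANGUAGE_CODES` and `codes_for` are hypothetical helpers, not part of HandScribe.

```python
# Codes copied from the language table; extend as needed.
LANGUAGE_CODES = {
    "english": {"easyocr": "en", "paddleocr": "en"},
    # Swahili is written in Latin script, so PaddleOCR uses its "en" model.
    "swahili": {"easyocr": "sw", "paddleocr": "en"},
    "arabic": {"easyocr": "ar", "paddleocr": "arabic"},
    "hindi": {"easyocr": "hi", "paddleocr": "hi"},
    "french": {"easyocr": "fr", "paddleocr": "french"},
}


def codes_for(backend, languages):
    """Return backend-specific codes, de-duplicated and in input order."""
    codes = []
    for language in languages:
        try:
            code = LANGUAGE_CODES[language.lower()][backend.lower()]
        except KeyError:
            raise ValueError(f"no {backend} code known for {language!r}") from None
        if code not in codes:
            codes.append(code)
    return codes
```

For example, English plus Swahili resolves to two codes for EasyOCR but collapses to a single `en` for PaddleOCR.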

CI/CD Pipeline

HandScribe uses three separate GitHub Actions workflows:

| Workflow | File | Triggers | What It Does |
| --- | --- | --- | --- |
| CI | ci.yml | Push, PR | Lint → Test → Build & push Docker image |
| Publish | publish.yml | Tag push (v*), Release, Manual | Build & publish to PyPI |
| Pages | pages.yml | Push to main (docs/) | Deploy landing page to GitHub Pages |

How It Works

push / PR
   │
   ├── lint ─── ruff ─ black ─ mypy
   │
   └── test ─── pytest (41 tests, 60% coverage)
         │
         └── on main ── build & push Docker image to Docker Hub


push tag v0.1.0
   │
   └── publish ─── build ─── twine check ─── upload to PyPI

Publishing a Release

# 1. Bump version in pyproject.toml
# 2. Tag and push
git tag v0.1.0
git push origin v0.1.0

# → GitHub Actions auto-publishes to PyPI
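
Since the tag triggers the upload, it can help to confirm that the tag matches the bumped version before pushing. The following is a minimal sketch of such a check, assuming the version is declared statically as `version = "..."` in pyproject.toml. The function names are hypothetical, and a regular expression is used instead of `tomllib` because that module only exists on Python 3.11+.

```python
import re


def project_version(pyproject_text):
    """Return the value of the first line of the form `version = "..."`, or None."""
    match = re.search(r'^version\s*=\s*["\']([^"\']+)["\']',
                      pyproject_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def tag_matches_version(tag, pyproject_text):
    """True when a tag such as 'v0.1.0' names the version in pyproject.toml."""
    version = project_version(pyproject_text)
    return version is not None and tag == f"v{version}"


# Usage:
#   from pathlib import Path
#   text = Path("pyproject.toml").read_text(encoding="utf-8")
#   assert tag_matches_version("v0.1.0", text), "bump the version before tagging"
```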

Docker

docker run -p 8000:8000 ronaldgosso/handscribe        # run
docker build -t handscribe .                           # build
docker compose up -d                                   # compose

Full instructions in CONTRIBUTING.md.


Architecture

handscribe/
├── ocr_engine/
│   ├── engine.py            # Core OCR engine (3 backends)
│   ├── preprocessing.py     # Denoise, CLAHE, binarize, deskew
│   ├── cli.py               # CLI (Typer): extract, batch, compare, info
│   └── api.py               # REST API (FastAPI): /ocr, /ocr/text, /ocr/batch
├── tests/
│   └── test_engine.py       # 41 tests
├── .github/workflows/
│   ├── ci.yml               # Lint → Test → Docker
│   └── publish.yml          # Build → PyPI
├── Dockerfile               # Optimized single-stage build
├── docker-compose.yml
└── pyproject.toml

🔧 Add Other OCR Backends

EasyOCR is included by default. Add PaddleOCR or TrOCR as optional extras:

pip install "handscribe[paddle]"    # PaddleOCR: faster, document-style
pip install "handscribe[trocr]"     # TrOCR: highest handwriting accuracy
pip install "handscribe[all]"       # Both PaddleOCR + TrOCR

from ocr_engine import OCREngine, OCRBackend

# Switch backend at runtime
engine = OCREngine(backend=OCRBackend.PADDLE, languages=["en"])
engine = OCREngine(backend=OCRBackend.TROCR)
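
Because PaddleOCR and TrOCR are optional, code that switches backends may want to check which ones are importable first. The sketch below probes for the underlying packages with the standard library. The mapping is an assumption about what each extra installs (`easyocr`, `paddleocr`, and Hugging Face `transformers` for TrOCR), and `installed_backends` is a hypothetical helper.

```python
import importlib.util

# Assumed import names of the package behind each backend.
BACKEND_MODULES = {
    "easyocr": "easyocr",
    "paddle": "paddleocr",
    "trocr": "transformers",
}


def installed_backends(find_spec=importlib.util.find_spec):
    """Return the backend names whose underlying package can be located."""
    return [name for name, module in BACKEND_MODULES.items()
            if find_spec(module) is not None]
```

The `find_spec` parameter exists only so the probe can be swapped out in tests. Calling `installed_backends()` with no arguments checks the current environment.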

🧪 Development

git clone https://github.com/ronaldgosso/handscribe.git
cd handscribe
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[all,dev]"
pytest tests/ -v

Full guide → CONTRIBUTING.md


Acknowledgments

  • EasyOCR: Jaided AI
  • PaddleOCR: PaddlePaddle
  • TrOCR: Microsoft
  • Tanzanian Students: the inspiration

Ronald Gosso, ronaldgosso@gmail.com · GitHub

Made with ❤️ for students everywhere

