Convert Sinhala PDF documents into clean Markdown using OCR and text extraction

sinhala-pdf2md

Convert Sinhala PDF documents into clean, readable Markdown — with or without OCR.


Why Does This Exist?

Working with Sinhala text in PDF format is painful. Most tools either ignore Unicode entirely, mangle the script's conjunct consonants, or produce OCR output full of garbled characters.

sinhala-pdf2md was built to solve this specifically for Sinhala documents. It:

  • Knows the difference between a text-based PDF (which has a proper text layer) and a scanned one (which is just an image)
  • Picks the right tool for each page — direct text extraction for digital PDFs, OCR for scanned ones
  • Fixes Unicode issues that appear after OCR — broken ZWJ sequences, misplaced combining marks, control characters
  • Optionally runs an LLM (OpenAI, Gemini, or a local Ollama model) to clean up structure

If you're digitising Sinhala books, government documents, or any scanned Sinhala content, this tool handles the messy bits so you can focus on the content.
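
The Unicode repair step can be pictured with a short standard-library sketch. This is an illustration of the idea, not the library's actual code: the function name `repair_sinhala_text` is hypothetical, and the rule used for stray joiners (a ZWJ is kept only when it sits next to the virama, U+0DCA) is a simplifying assumption.

```python
import re
import unicodedata

ZWJ = "\u200d"     # ZERO WIDTH JOINER
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER
VIRAMA = "\u0dca"  # SINHALA SIGN AL-LAKUNA

# Assumption: a ZWJ is meaningful only directly before or after a virama.
_STRAY_ZWJ = re.compile(f"(?<!{VIRAMA}){ZWJ}(?!{VIRAMA})")
_ZWJ_RUN = re.compile(f"{ZWJ}+")


def repair_sinhala_text(text: str) -> str:
    """Hypothetical post-OCR cleanup pass for Sinhala text."""
    # 1. Canonical composition, so equivalent sequences compare equal.
    text = unicodedata.normalize("NFC", text)

    # 2. Drop control (Cc) and format (Cf) characters, except the ones
    #    that carry meaning: newline, tab, ZWJ and ZWNJ.
    keep = {"\n", "\t", ZWJ, ZWNJ}
    text = "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) not in ("Cc", "Cf")
    )

    # 3. Collapse repeated joiners, then remove joiners with no virama nearby.
    text = _ZWJ_RUN.sub(ZWJ, text)
    return _STRAY_ZWJ.sub("", text)
```

A duplicated joiner inside a conjunct collapses to one, while a joiner between two plain letters is dropped entirely.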


Features

  • Smart page classification — detects whether each page needs OCR or direct extraction
  • Two OCR engines — Tesseract (default, free) or Surya (transformer-based, higher accuracy)
  • Image pre-processing — deskew, denoise, and binarize scanned images before OCR
  • Heading detection — infers headings from font size ratios (threshold set by heading_font_size_ratio, default 1.3)
  • Table support — extracts PDF tables and renders them as GitHub Flavored Markdown
  • List detection — recognises bullets, numbered lists, and common Sinhala bullet characters
  • Unicode repair — NFC normalisation + ZWJ and virama (්‍) sequence fixing
  • AI cleanup — optional post-processing with OpenAI, Gemini, or Ollama
  • Batch conversion — convert an entire directory in one command
  • Python API — use as a library in your own pipelines
  • Env-var config — all settings configurable via PDF2MD_* environment variables
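
As an illustration of the font-size heading heuristic in the list above, here is a minimal sketch. It assumes each extracted line arrives as a `(text, font_size)` pair, treats the median font size as the body size, and ranks the larger sizes into heading levels. The function name `mark_headings` and the level-ranking rule are assumptions made for this example; only the 1.3 ratio comes from the documented default.

```python
from statistics import median


def mark_headings(lines, ratio=1.3):
    """Prefix lines whose font size is >= ratio * body size with '#' marks.

    lines: list of (text, font_size) tuples (hypothetical input shape).
    """
    if not lines:
        return []

    # Assumption: the median font size on the page is the body text size.
    body_size = median(size for _, size in lines)

    # Distinct heading sizes, largest first: largest -> '#', next -> '##', ...
    heading_sizes = sorted(
        {size for _, size in lines if size >= body_size * ratio},
        reverse=True,
    )
    level_of = {size: min(rank + 1, 6) for rank, size in enumerate(heading_sizes)}

    marked = []
    for text, size in lines:
        if size in level_of:
            marked.append("#" * level_of[size] + " " + text)
        else:
            marked.append(text)
    return marked
```

With a body size of 11 pt and the default ratio, any line at 14.3 pt or larger is treated as a heading.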

Architecture Overview

PDF File
   │
   ▼
PageAnalyzer ──────── Classifies each page (text / scanned / mixed)
   │
   ├─── TEXT page ──► PDFExtractor (pdfplumber) ──► MarkdownFormatter
   │
   ├─── SCANNED page ──► PageRenderer (PyMuPDF) ──► Image Preprocessor
   │                                                       │
   │                                                       ▼
   │                                               OCREngine (Tesseract / Surya)
   │                                                       │
   │                                                       ▼
   │                                               MarkdownFormatter
   │
   └─── MIXED page ──► Both paths, combined
           │
           ▼
   MarkdownCleaner (Unicode repair, whitespace)
           │
           ▼
   [Optional] AIFormatter (OpenAI / Gemini / Ollama)
           │
           ▼
   Output .md file

See docs/architecture.md for the full breakdown with Mermaid diagrams.
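
The routing in the diagram can be summarised in a few lines of Python. This is only a sketch of the data flow: the stage names mirror the diagram, but `PageKind`, `stages_for`, and the way a mixed page combines both paths are assumptions for illustration, not the package's real classes.

```python
from enum import Enum


class PageKind(Enum):
    TEXT = "text"
    SCANNED = "scanned"
    MIXED = "mixed"


# Stage names copied from the diagram above.
TEXT_PATH = ["PDFExtractor", "MarkdownFormatter"]
OCR_PATH = ["PageRenderer", "ImagePreprocessor", "OCREngine", "MarkdownFormatter"]


def stages_for(kind: PageKind, ai_cleanup: bool = False) -> list[str]:
    """Return the ordered stages a page of the given kind passes through."""
    if kind is PageKind.TEXT:
        per_page = list(TEXT_PATH)
    elif kind is PageKind.SCANNED:
        per_page = list(OCR_PATH)
    else:
        # Assumption: a mixed page runs text extraction and OCR, then
        # formats the combined result once.
        per_page = TEXT_PATH[:-1] + OCR_PATH

    # Every page ends with the cleaner; the AI formatter is optional.
    tail = ["MarkdownCleaner"]
    if ai_cleanup:
        tail.append("AIFormatter")
    return per_page + tail
```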


Installation

Prerequisites

Tesseract (required for OCR on scanned pages):

# Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-sin

# macOS
brew install tesseract tesseract-lang

# Windows — download from https://github.com/UB-Mannheim/tesseract/wiki
# Then add to PATH. Make sure the "sin" language data is included.
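
To confirm the prerequisite is in place before converting scanned PDFs, a small check like the one below may help. It relies only on the `tesseract --list-langs` command and the Python standard library; the helper names `parse_langs` and `sinhala_ocr_ready` are made up for this example and are not part of sinhala-pdf2md.

```python
import shutil
import subprocess


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    The header line contains spaces; language codes do not.
    """
    return {
        line.strip()
        for line in output.splitlines()
        if line.strip() and " " not in line.strip()
    }


def sinhala_ocr_ready() -> bool:
    """True if a tesseract binary is on PATH and reports the 'sin' language."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False
    proc = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Older Tesseract releases print the list to stderr, so check both streams.
    return "sin" in parse_langs(proc.stdout + proc.stderr)


if __name__ == "__main__":
    print("Sinhala OCR ready:", sinhala_ocr_ready())
```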

Install the Package

pip install sinhala-pdf2md

Optional Extras

# Quote the requirement so shells such as zsh don't interpret the brackets.

# Surya OCR engine (transformer-based, higher accuracy)
# ⚠️ Commercial use is restricted; see https://github.com/VikParuchuri/surya
pip install "sinhala-pdf2md[surya]"

# AI cleanup with OpenAI
pip install "sinhala-pdf2md[ai-openai]"

# AI cleanup with Gemini
pip install "sinhala-pdf2md[ai-gemini]"

# AI cleanup with Ollama (local)
pip install "sinhala-pdf2md[ai-ollama]"

# Everything at once (dev included)
pip install "sinhala-pdf2md[all]"

See docs/installation.md for detailed platform-specific instructions.


Quick Start

Command Line

# Convert a single PDF
pdf2md document.pdf

# Specify output path
pdf2md document.pdf -o output.md

# Use Surya OCR engine
pdf2md document.pdf --ocr-engine surya

# Higher DPI for better OCR quality
pdf2md document.pdf --dpi 400

# Enable AI cleanup (requires OpenAI API key)
pdf2md document.pdf --ai-cleanup openai

# Verbose logging
pdf2md document.pdf --verbose

Batch Conversion

# Convert all PDFs in a directory
pdf2md batch ./documents/

# With output directory and recursive search
pdf2md batch ./documents/ --output-dir ./markdown/ --recursive

Python API

from sinhala_pdf2md import Converter

# Simple conversion
converter = Converter()
output_path = converter.convert("document.pdf")

# Custom configuration
converter = Converter(
    ocr_engine="tesseract",
    ocr_language="si",
    page_render_dpi=400,
    preserve_page_breaks=True,
)
converter.convert("document.pdf", "output.md")

# Get Markdown as a string (don't write a file)
markdown = converter.convert_to_string("document.pdf")

# Batch convert a directory
results = converter.convert_batch("./pdfs/", output_dir="./output/", recursive=True)
print(f"Converted {len(results)} files")

CLI Usage

Usage: pdf2md [COMMAND] [OPTIONS]

Commands:
  convert   Convert a single PDF file to Markdown. (default)
  batch     Convert all PDF files in a directory to Markdown.

Convert Options:
  PDF_PATH              Path to the input PDF file
  -o, --output PATH     Output Markdown file path
  -e, --ocr-engine TEXT OCR engine: tesseract (default) or surya
  -l, --lang TEXT       Language code: si (Sinhala, default), en, ta, hi
  -d, --dpi INT         Render DPI for scanned pages (72–600, default 300)
  -v, --verbose         Enable debug logging
  --ai-cleanup TEXT     AI provider for post-processing: openai, gemini, ollama

Batch Options:
  INPUT_DIR             Directory containing PDF files
  -o, --output-dir DIR  Output directory for .md files
  -r, --recursive       Search subdirectories
  (plus all convert options above)

Examples

# Basic usage — output next to input file
pdf2md report.pdf
# → report.md

# English document
pdf2md letter.pdf --lang en

# High-quality scanned document
pdf2md scanned_book.pdf --dpi 450 --ocr-engine tesseract

# With AI cleanup via local Ollama
pdf2md document.pdf --ai-cleanup ollama

# Batch with verbose output
pdf2md batch ./inbox/ --output-dir ./processed/ --recursive --verbose

Python API Usage

Basic

from sinhala_pdf2md import Converter

converter = Converter()
path = converter.convert("input.pdf", "output.md")
print(f"Saved to: {path}")

Full Configuration via ConverterConfig

from sinhala_pdf2md import Converter, ConverterConfig, OCREngineType, AIProviderType

config = ConverterConfig(
    ocr_engine=OCREngineType.TESSERACT,
    ocr_language="si",
    page_render_dpi=400,
    ocr_confidence_threshold=0.6,
    preserve_page_breaks=True,
    heading_detection_enabled=True,
    table_detection_enabled=True,
    heading_font_size_ratio=1.3,
    image_preprocess_enabled=True,
    ai_provider=AIProviderType.OPENAI,
    ai_model="gpt-4o",
    ai_api_key="sk-...",
)

converter = Converter(config=config)
converter.convert("document.pdf", "output.md")

Batch Processing with Error Handling

from sinhala_pdf2md import Converter
from sinhala_pdf2md.exceptions import BatchConversionError

converter = Converter(ocr_engine="tesseract")

try:
    results = converter.convert_batch("./pdfs/", output_dir="./out/", recursive=True)
    print(f"Successfully converted {len(results)} files")
except BatchConversionError as e:
    print(f"Some files failed: {e.failures}")

Return Markdown Without Writing a File

converter = Converter()
markdown_text = converter.convert_to_string("document.pdf")
# Process the string however you like

Configuration

All settings can be set via constructor arguments, a ConverterConfig object, or environment variables with the PDF2MD_ prefix.

| Setting | Default | Env Var | Description |
|---|---|---|---|
| ocr_engine | tesseract | PDF2MD_OCR_ENGINE | OCR backend (tesseract or surya) |
| ocr_language | si | PDF2MD_OCR_LANGUAGE | ISO 639-1 language code |
| ocr_confidence_threshold | 0.5 | PDF2MD_OCR_CONFIDENCE_THRESHOLD | Min confidence score to keep OCR output |
| page_render_dpi | 300 | PDF2MD_PAGE_RENDER_DPI | DPI for rendering scanned pages (72–600) |
| preserve_page_breaks | true | PDF2MD_PRESERVE_PAGE_BREAKS | Insert <!-- page-break --> between pages |
| heading_detection_enabled | true | PDF2MD_HEADING_DETECTION_ENABLED | Enable font-size heading detection |
| table_detection_enabled | true | PDF2MD_TABLE_DETECTION_ENABLED | Enable table extraction |
| heading_font_size_ratio | 1.3 | PDF2MD_HEADING_FONT_SIZE_RATIO | Font size ratio threshold for headings |
| image_preprocess_enabled | true | PDF2MD_IMAGE_PREPROCESS_ENABLED | Deskew/denoise/binarize before OCR |
| ai_provider | None | PDF2MD_AI_PROVIDER | AI cleanup provider (openai, gemini, ollama) |
| ai_model | None | PDF2MD_AI_MODEL | Model name for the AI provider |
| ai_api_key | None | PDF2MD_AI_API_KEY | API key for the AI provider |
| ai_base_url | None | PDF2MD_AI_BASE_URL | Base URL override (for Ollama or custom endpoints) |
| output_dir | None | PDF2MD_OUTPUT_DIR | Default output directory |
| log_level | INFO | PDF2MD_LOG_LEVEL | Logging level |

Example with environment variables:

export PDF2MD_OCR_ENGINE=surya
export PDF2MD_PAGE_RENDER_DPI=400
export PDF2MD_AI_PROVIDER=openai
export PDF2MD_AI_API_KEY=sk-...

pdf2md document.pdf

See docs/configuration.md for the full reference.
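
For readers wondering how prefix-based settings of this kind are typically resolved, the sketch below shows one possible approach with a plain dataclass. It is not the package's ConverterConfig: the `Settings` class, the `load_settings` helper, the boolean parsing rules, and the precedence order (explicit arguments over environment variables over defaults) are all assumptions made for illustration. Only the setting names, defaults, and the 72–600 DPI range come from the configuration reference above.

```python
import os
from dataclasses import dataclass, fields

PREFIX = "PDF2MD_"


@dataclass
class Settings:
    """Hypothetical subset of the settings documented above."""
    ocr_engine: str = "tesseract"
    ocr_language: str = "si"
    ocr_confidence_threshold: float = 0.5
    page_render_dpi: int = 300
    preserve_page_breaks: bool = True

    def __post_init__(self):
        if not 72 <= self.page_render_dpi <= 600:
            raise ValueError("page_render_dpi must be between 72 and 600")


def _coerce(raw: str, target_type):
    """Convert an environment string to the field's declared type."""
    if target_type is bool:
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return target_type(raw)


def load_settings(env=None, **overrides) -> Settings:
    """Assumed precedence: explicit overrides > environment > defaults."""
    env = os.environ if env is None else env
    values = {}
    for field in fields(Settings):
        key = PREFIX + field.name.upper()
        if key in env:
            values[field.name] = _coerce(env[key], field.type)
    values.update(overrides)
    return Settings(**values)
```

Passing a plain dict as `env` keeps the function easy to test without touching the real process environment.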


Supported OCR Engines

Tesseract (Default)

  • Free, open source, widely available
  • Requires the tesseract binary and language data files
  • Good accuracy for clean, high-DPI scans
  • Supports Sinhala (sin), English (eng), Tamil (tam), Hindi (hin)
  • Install: pip install sinhala-pdf2md (Tesseract binary installed separately)

Surya (Optional)

  • Transformer-based, generally higher accuracy
  • Language-agnostic (handles Sinhala without explicit training data)
  • Requires PyTorch and significant RAM/GPU
  • ⚠️ Commercial use is restricted for organisations above $5M in revenue or funding; check the current Surya license for the exact terms
  • Install: pip install sinhala-pdf2md[surya]
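
Whichever engine is chosen, the ocr_confidence_threshold setting (default 0.5) controls how much low-confidence output is kept. The sketch below shows one way such a filter could work, assuming the engine yields `(text, confidence)` pairs per word. That input shape, the function name `filter_by_confidence`, and the `scale` parameter are assumptions; the package may apply the threshold at a different granularity. The `scale` argument covers engines such as Tesseract, which report confidence on a 0–100 scale and use -1 for non-text boxes.

```python
def filter_by_confidence(words, threshold=0.5, scale=1.0):
    """Keep words whose normalised confidence is at least `threshold`.

    words: iterable of (text, confidence) pairs (hypothetical shape).
    scale: divide raw scores by this to map them onto 0..1
           (use 100.0 for Tesseract-style scores).
    """
    kept = []
    for text, confidence in words:
        if confidence < 0:
            # Negative scores mark boxes that contain no recognised text.
            continue
        if confidence / scale >= threshold:
            kept.append(text)
    return kept
```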

Limitations

  • Scanned page quality matters — very low-resolution or heavily degraded scans will produce poor OCR results regardless of which engine you use. 300+ DPI is recommended.
  • Complex layouts — multi-column documents, footnotes, and sidebar text may not reconstruct perfectly. The formatter works page-by-page and doesn't do global layout analysis.
  • Surya licensing — the Surya engine is not free for commercial use above the license thresholds. Check the Surya license before using it in production.
  • AI cleanup costs money — OpenAI and Gemini API calls are billed per token. Large documents with many pages can accumulate costs quickly.
  • Mixed pages — pages that have both text and images use a heuristic: if the text layer has 100+ characters, OCR is skipped. This works well in practice but isn't perfect.
  • No image extraction — embedded images in PDFs are not extracted or described.
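
The mixed-page heuristic described above fits in a few lines. This sketch assumes the character count ignores whitespace and that OCR is only considered when the page contains an image; both details, along with the name `needs_ocr`, are assumptions rather than documented behaviour.

```python
def needs_ocr(text_layer: str, has_images: bool, min_chars: int = 100) -> bool:
    """Decide whether a page should be sent to OCR.

    A page with at least `min_chars` characters in its text layer is
    treated as digital text and OCR is skipped.
    """
    # Assumption: whitespace does not count towards the threshold.
    visible_chars = sum(1 for ch in text_layer if not ch.isspace())
    if visible_chars >= min_chars:
        return False
    # Little or no text: OCR only helps if there is an image to read.
    return has_images
```

The weak spot is a page with a long caption over a scanned figure: the caption alone can pass the threshold, so the text inside the figure is never read.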

Performance Notes

  • Text-based PDFs are fast — pdfplumber extracts text in milliseconds per page.
  • Scanned pages take longer — rendering + image preprocessing + OCR can take 2–10 seconds per page depending on DPI and hardware.
  • Surya loads a transformer model on first use — there's a cold-start delay of several seconds, but subsequent pages are faster.
  • Image preprocessing (deskew, denoise, binarize) adds ~0.5–2 seconds per page but significantly improves OCR accuracy on noisy scans.
  • The OCR engine is lazily initialised — if your document has no scanned pages, no OCR overhead is incurred at all.
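
Lazy initialisation of this kind is commonly written with `functools.cached_property`. The sketch below is a generic illustration, not the package's internals: `LazyOCRPipeline` and `engine_factory` are hypothetical names, and the engine is modelled as a simple callable.

```python
from functools import cached_property


class LazyOCRPipeline:
    """Builds the (expensive) OCR engine only when a scanned page shows up."""

    def __init__(self, engine_factory):
        # engine_factory: zero-argument callable returning an OCR callable.
        self._engine_factory = engine_factory

    @cached_property
    def engine(self):
        # Runs once, on first access; the result is cached on the instance.
        return self._engine_factory()

    def process(self, pages):
        """pages: iterable of (kind, payload) pairs, kind is 'text' or 'scanned'."""
        results = []
        for kind, payload in pages:
            if kind == "scanned":
                results.append(self.engine(payload))
            else:
                results.append(payload)
        return results
```

A document made up only of text pages never touches `engine`, so the factory is never called and no model is loaded.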

Contributing

Contributions are welcome. If you're fixing a bug, adding a feature, or writing tests, please:

  1. Fork the repository and create a branch from main.
  2. Install dev dependencies: pip install -e ".[dev]"
  3. Set up pre-commit hooks: pre-commit install
  4. Run tests: make test or pytest tests/
  5. Check types: make typecheck
  6. Lint and format: make format
  7. Open a pull request with a clear description.

See docs/developer-guide.md for detailed contribution instructions, including how to add a new OCR engine or AI provider.

Common make Targets

make dev         # Install in editable mode with dev dependencies
make test        # Run the full test suite
make test-unit   # Unit tests only
make lint        # Check code style
make format      # Auto-fix formatting
make typecheck   # mypy static analysis
make clean       # Remove build artifacts

Documentation

| Document | Purpose |
|---|---|
| Architecture | How the system works, data flow, component responsibilities |
| Design Decisions | Why specific libraries and patterns were chosen |
| Project Structure | Folder and file layout explained |
| Workflows | Step-by-step processing flows with diagrams |
| Developer Guide | How to extend the project |
| API Reference | Public classes, methods, and exceptions |
| Testing Guide | How to run tests and contribute test coverage |
| Configuration | Full configuration reference |
| Installation | Platform-specific setup instructions |
| Changelog | Version history |

License

MIT — free to use, modify, and distribute.

Note on Surya: The optional Surya OCR engine uses a modified Open Rail-M license that restricts commercial use. The sinhala-pdf2md library itself is MIT — the restriction only applies if you install and use the [surya] extra. See Surya's license for details.
