Convert Sinhala PDF documents into clean Markdown using OCR and text extraction

These details have not been verified by PyPI

Project description

sinhala-pdf2md

Version: 0.2.1

Convert Sinhala PDF documents into clean, readable Markdown — with or without OCR.

Why Does This Exist?

Working with Sinhala text in PDF format is painful. Most tools either ignore Unicode entirely, mangle the script's conjunct consonants, or produce OCR output full of garbled characters.

sinhala-pdf2md was built to solve this specifically for Sinhala documents. It:

- Knows the difference between a text-based PDF (which has a proper text layer) and a scanned one (which is just an image)
- Picks the right tool for each page: direct text extraction for digital PDFs, OCR for scanned ones
- Fixes Unicode issues that appear after OCR: broken ZWJ sequences, misplaced combining marks, control characters
- Optionally runs an LLM (OpenAI, Gemini, or a local Ollama model) to clean up structure

If you're digitising Sinhala books, government documents, or any scanned Sinhala content, this tool handles the messy bits so you can focus on the content.
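
To make the Unicode repair step more concrete, here is a minimal sketch of the kind of cleanup described above: NFC normalisation, control-character stripping, and removal of zero-width joiners that do not touch a virama (al-lakuna, U+0DCA). The function name `repair_sinhala_text` and the exact rules are assumptions for illustration; the package's own implementation may differ.

```python
import re
import unicodedata

ZWJ = "\u200d"     # zero-width joiner
VIRAMA = "\u0dca"  # Sinhala al-lakuna (virama)

# Hypothetical rule: a ZWJ is meaningful only when it touches a virama.
_STRAY_ZWJ = re.compile(f"(?<!{VIRAMA}){ZWJ}(?!{VIRAMA})")
_ZWJ_RUN = re.compile(f"{ZWJ}{{2,}}")


def repair_sinhala_text(text: str) -> str:
    """Normalise to NFC, drop control characters, and tidy ZWJ usage."""
    text = unicodedata.normalize("NFC", text)
    # Remove control characters (category Cc) but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse repeated joiners into one, then drop joiners with no virama nearby.
    text = _ZWJ_RUN.sub(ZWJ, text)
    return _STRAY_ZWJ.sub("", text)
```

A conjunct such as consonant + virama + ZWJ + consonant passes through unchanged, while a joiner sitting between two plain consonants is removed.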

Features

- Smart page classification: detects whether each page needs OCR or direct extraction
- Two OCR engines: Tesseract (default, free) or Surya (transformer-based, higher accuracy)
- Image pre-processing: deskew, denoise, and binarize scanned images before OCR
- Heading detection: infers headings from font size ratios (not just guessing)
- Table support: extracts PDF tables and renders them as GitHub Flavored Markdown
- List detection: recognises bullets, numbered lists, and common Sinhala bullet characters
- Unicode repair: NFC normalisation + ZWJ and virama (්‍) sequence fixing
- AI cleanup: optional post-processing with OpenAI, Gemini, or Ollama
- Batch conversion: convert an entire directory in one command
- Python API: use as a library in your own pipelines
- Env-var config: all settings configurable via PDF2MD_* environment variables
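
As an illustration of the table support feature, the sketch below renders extracted rows as a GitHub Flavored Markdown table. It assumes rows arrive as lists of cells where an empty cell may be `None` (the shape pdfplumber's table extraction returns); the helper name `table_to_gfm` is hypothetical and not part of the package's documented API.

```python
from typing import Optional, Sequence


def table_to_gfm(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows as a GFM pipe table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells become blank; pipes are escaped so they don't split columns.
        text = "" if cell is None else str(cell)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    # Pad short rows so every row has the same number of columns.
    norm = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = norm
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```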

Architecture Overview

PDF File
   │
   ▼
PageAnalyzer ──────── Classifies each page (text / scanned / mixed)
   │
   ├─── TEXT page ──► PDFExtractor (pdfplumber) ──► MarkdownFormatter
   │
   ├─── SCANNED page ──► PageRenderer (PyMuPDF) ──► Image Preprocessor
   │                                                       │
   │                                                       ▼
   │                                               OCREngine (Tesseract / Surya)
   │                                                       │
   │                                                       ▼
   │                                               MarkdownFormatter
   │
   └─── MIXED page ──► Both paths, combined
           │
           ▼
   MarkdownCleaner (Unicode repair, whitespace)
           │
           ▼
   [Optional] AIFormatter (OpenAI / Gemini / Ollama)
           │
           ▼
   Output .md file

See docs/architecture.md for the full breakdown with Mermaid diagrams.
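
The routing in the diagram can be sketched as a small classifier. This is a simplified illustration, not the actual `PageAnalyzer`: the names `PageKind`, `classify_page`, and `needs_ocr` are hypothetical, and the only threshold taken from this README is the 100-character rule for mixed pages described under Limitations.

```python
from enum import Enum


class PageKind(Enum):
    TEXT = "text"
    SCANNED = "scanned"
    MIXED = "mixed"


# From the Limitations section: mixed pages with 100+ characters skip OCR.
MIN_TEXT_CHARS = 100


def classify_page(text_chars: int, image_count: int) -> PageKind:
    """Classify a page from the size of its text layer and its image count."""
    if image_count == 0:
        return PageKind.TEXT
    if text_chars == 0:
        return PageKind.SCANNED
    return PageKind.MIXED


def needs_ocr(kind: PageKind, text_chars: int) -> bool:
    """Decide whether the page should be rendered and sent to the OCR engine."""
    if kind is PageKind.SCANNED:
        return True
    if kind is PageKind.MIXED:
        return text_chars < MIN_TEXT_CHARS
    return False
```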

Installation

Prerequisites

Tesseract (required for OCR on scanned pages):

# Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-sin

# macOS
brew install tesseract tesseract-lang

# Windows — download from https://github.com/UB-Mannheim/tesseract/wiki
# Then add to PATH. Make sure the "sin" language data is included.
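
To confirm that the Sinhala language data is actually available, one option is to parse the output of `tesseract --list-langs`. The sketch below is a standalone check written for this README; the helper names are hypothetical and it is not part of sinhala-pdf2md.

```python
import shutil
import subprocess


def parse_tesseract_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as: List of available languages in "..." (3):
    return {
        line for line in lines
        if not line.lower().startswith("list of available languages")
    }


def sinhala_data_installed() -> bool:
    """Return True if the tesseract binary is on PATH and lists the `sin` data."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the list to stderr, so check both streams.
    return "sin" in parse_tesseract_langs(result.stdout + result.stderr)
```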

Install the Package

pip install sinhala-pdf2md

Optional Extras

# Surya OCR engine (transformer-based, higher accuracy)
# ⚠️ Non-commercial use only — see https://github.com/VikParuchuri/surya
pip install sinhala-pdf2md[surya]

# AI cleanup with OpenAI
pip install sinhala-pdf2md[ai-openai]

# AI cleanup with Gemini
pip install sinhala-pdf2md[ai-gemini]

# AI cleanup with Ollama (local)
pip install sinhala-pdf2md[ai-ollama]

# Everything at once (dev included)
pip install sinhala-pdf2md[all]

See docs/installation.md for detailed platform-specific instructions.

Quick Start

Command Line

# Convert a single PDF
pdf2md document.pdf

# Specify output path
pdf2md document.pdf -o output.md

# Use Surya OCR engine
pdf2md document.pdf --ocr-engine surya

# Higher DPI for better OCR quality
pdf2md document.pdf --dpi 400

# Enable AI cleanup (requires OpenAI API key)
pdf2md document.pdf --ai-cleanup openai

# Verbose logging
pdf2md document.pdf --verbose

Batch Conversion

# Convert all PDFs in a directory
pdf2md batch ./documents/

# With output directory and recursive search
pdf2md batch ./documents/ --output-dir ./markdown/ --recursive
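
For readers curious how a recursive batch run might map inputs to outputs, here is a rough sketch using `pathlib`. The helpers `find_pdfs` and `output_path_for` are hypothetical, and mirroring the input directory layout under the output directory is an assumption rather than documented behaviour.

```python
from pathlib import Path


def find_pdfs(input_dir, recursive=False):
    """Return PDF files under input_dir, optionally descending into subfolders."""
    root = Path(input_dir)
    pattern = "**/*" if recursive else "*"
    # Match the extension case-insensitively so report.PDF is picked up too.
    return sorted(
        p for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_path_for(pdf, input_dir, output_dir):
    """Mirror the PDF's relative location under output_dir with a .md suffix."""
    relative = Path(pdf).relative_to(input_dir)
    return Path(output_dir) / relative.with_suffix(".md")
```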

Python API

from sinhala_pdf2md import Converter

# Simple conversion
converter = Converter()
output_path = converter.convert("document.pdf")

# Custom configuration
converter = Converter(
    ocr_engine="tesseract",
    ocr_language="si",
    page_render_dpi=400,
    preserve_page_breaks=True,
)
converter.convert("document.pdf", "output.md")

# Get Markdown as a string (don't write a file)
markdown = converter.convert_to_string("document.pdf")

# Batch convert a directory
results = converter.convert_batch("./pdfs/", output_dir="./output/", recursive=True)
print(f"Converted {len(results)} files")

CLI Usage

Usage: pdf2md [COMMAND] [OPTIONS]

Commands:
  convert   Convert a single PDF file to Markdown. (default)
  batch     Convert all PDF files in a directory to Markdown.

Convert Options:
  PDF_PATH              Path to the input PDF file
  -o, --output PATH     Output Markdown file path
  -e, --ocr-engine TEXT OCR engine: tesseract (default) or surya
  -l, --lang TEXT       Language code: si (Sinhala, default), en, ta, hi
  -d, --dpi INT         Render DPI for scanned pages (72–600, default 300)
  -v, --verbose         Enable debug logging
  --ai-cleanup TEXT     AI provider for post-processing: openai, gemini, ollama

Batch Options:
  INPUT_DIR             Directory containing PDF files
  -o, --output-dir DIR  Output directory for .md files
  -r, --recursive       Search subdirectories
  (plus all convert options above)

Examples

# Basic usage — output next to input file
pdf2md report.pdf
# → report.md

# English document
pdf2md letter.pdf --lang en

# High-quality scanned document
pdf2md scanned_book.pdf --dpi 450 --ocr-engine tesseract

# With AI cleanup via local Ollama
pdf2md document.pdf --ai-cleanup ollama

# Batch with verbose output
pdf2md batch ./inbox/ --output-dir ./processed/ --recursive --verbose

Python API Usage

Basic

from sinhala_pdf2md import Converter

converter = Converter()
path = converter.convert("input.pdf", "output.md")
print(f"Saved to: {path}")

Full Configuration via `ConverterConfig`

from sinhala_pdf2md import Converter, ConverterConfig, OCREngineType, AIProviderType

config = ConverterConfig(
    ocr_engine=OCREngineType.TESSERACT,
    ocr_language="si",
    page_render_dpi=400,
    ocr_confidence_threshold=0.6,
    preserve_page_breaks=True,
    heading_detection_enabled=True,
    table_detection_enabled=True,
    heading_font_size_ratio=1.3,
    image_preprocess_enabled=True,
    ai_provider=AIProviderType.OPENAI,
    ai_model="gpt-4o",
    ai_api_key="sk-...",
)

converter = Converter(config=config)
converter.convert("document.pdf", "output.md")
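
To give a feel for what `heading_font_size_ratio` controls, the sketch below compares a line's font size with the typical body size and treats anything at or above the ratio as a heading. The mapping from ratio to heading level is purely an assumption for illustration, and the function names are hypothetical; the library may use different tiers.

```python
from statistics import median


def body_font_size(sizes):
    """Estimate the body text size as the median font size on the page."""
    return median(sizes)


def heading_level(font_size, body_size, ratio=1.3):
    """Return 1-3 for a heading, or 0 when the line looks like body text."""
    if body_size <= 0:
        return 0
    relative = font_size / body_size
    if relative < ratio:
        return 0
    # Assumed tiers: the larger the text relative to the body, the higher the level.
    if relative >= ratio * 1.5:
        return 1
    if relative >= ratio * 1.2:
        return 2
    return 3
```

With a 10 pt body and the default ratio of 1.3, a 12 pt line stays body text while a 20 pt line becomes a top-level heading.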

Batch Processing with Error Handling

from sinhala_pdf2md import Converter
from sinhala_pdf2md.exceptions import BatchConversionError

converter = Converter(ocr_engine="tesseract")

try:
    results = converter.convert_batch("./pdfs/", output_dir="./out/", recursive=True)
    print(f"Successfully converted {len(results)} files")
except BatchConversionError as e:
    print(f"Some files failed: {e.failures}")

Return Markdown Without Writing a File

converter = Converter()
markdown_text = converter.convert_to_string("document.pdf")
# Process the string however you like

Configuration

All settings can be set via constructor arguments, a ConverterConfig object, or environment variables with the PDF2MD_ prefix.

| Setting | Default | Env Var | Description |
| --- | --- | --- | --- |
| `ocr_engine` | `tesseract` | `PDF2MD_OCR_ENGINE` | OCR backend (`tesseract` or `surya`) |
| `ocr_language` | `si` | `PDF2MD_OCR_LANGUAGE` | ISO 639-1 language code |
| `ocr_confidence_threshold` | `0.5` | `PDF2MD_OCR_CONFIDENCE_THRESHOLD` | Min confidence score to keep OCR output |
| `page_render_dpi` | `300` | `PDF2MD_PAGE_RENDER_DPI` | DPI for rendering scanned pages (72–600) |
| `preserve_page_breaks` | `true` | `PDF2MD_PRESERVE_PAGE_BREAKS` | Insert `<!-- page-break -->` between pages |
| `heading_detection_enabled` | `true` | `PDF2MD_HEADING_DETECTION_ENABLED` | Enable font-size heading detection |
| `table_detection_enabled` | `true` | `PDF2MD_TABLE_DETECTION_ENABLED` | Enable table extraction |
| `heading_font_size_ratio` | `1.3` | `PDF2MD_HEADING_FONT_SIZE_RATIO` | Font size ratio threshold for headings |
| `image_preprocess_enabled` | `true` | `PDF2MD_IMAGE_PREPROCESS_ENABLED` | Deskew/denoise/binarize before OCR |
| `ai_provider` | `None` | `PDF2MD_AI_PROVIDER` | AI cleanup provider (`openai`, `gemini`, `ollama`) |
| `ai_model` | `None` | `PDF2MD_AI_MODEL` | Model name for the AI provider |
| `ai_api_key` | `None` | `PDF2MD_AI_API_KEY` | API key for the AI provider |
| `ai_base_url` | `None` | `PDF2MD_AI_BASE_URL` | Base URL override (for Ollama or custom endpoints) |
| `output_dir` | `None` | `PDF2MD_OUTPUT_DIR` | Default output directory |
| `log_level` | `INFO` | `PDF2MD_LOG_LEVEL` | Logging level |

Example with environment variables:

export PDF2MD_OCR_ENGINE=surya
export PDF2MD_PAGE_RENDER_DPI=400
export PDF2MD_AI_PROVIDER=openai
export PDF2MD_AI_API_KEY=sk-...

pdf2md document.pdf

See docs/configuration.md for the full reference.
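
The environment-variable mapping can be pictured roughly as follows. This is a sketch under stated assumptions: it covers only four of the settings listed in the table, uses their documented defaults and the documented 72–600 DPI range, and the `load_settings` function and its boolean parsing rules are hypothetical rather than the library's real loader.

```python
import os

_TRUE_VALUES = {"1", "true", "yes", "on"}

# Setting name -> (type, default), taken from the configuration table.
SETTINGS = {
    "ocr_engine": (str, "tesseract"),
    "page_render_dpi": (int, 300),
    "ocr_confidence_threshold": (float, 0.5),
    "preserve_page_breaks": (bool, True),
}


def load_settings(environ=None, prefix="PDF2MD_"):
    """Build a settings dict from PDF2MD_* variables, falling back to defaults."""
    env = os.environ if environ is None else environ
    result = {}
    for name, (kind, default) in SETTINGS.items():
        raw = env.get(prefix + name.upper())
        if raw is None:
            result[name] = default
        elif kind is bool:
            result[name] = raw.strip().lower() in _TRUE_VALUES
        else:
            result[name] = kind(raw)
    # The documented range for page_render_dpi is 72-600.
    if not 72 <= result["page_render_dpi"] <= 600:
        raise ValueError("page_render_dpi must be between 72 and 600")
    return result
```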

Supported OCR Engines

Tesseract (Default)

- Free, open source, widely available
- Requires the tesseract binary and language data files
- Good accuracy for clean, high-DPI scans
- Supports Sinhala (sin), English (eng), Tamil (tam), Hindi (hin)
- Install: pip install sinhala-pdf2md (Tesseract binary installed separately)

Surya (Optional)

- Transformer-based, generally higher accuracy
- Language-agnostic (handles Sinhala without explicit training data)
- Requires PyTorch and significant RAM/GPU
- ⚠️ Restricted to non-commercial use for companies above $5M in revenue or funding (see the Surya license)
- Install: pip install sinhala-pdf2md[surya]

Limitations

- Scanned page quality matters: very low-resolution or heavily degraded scans will produce poor OCR results regardless of which engine you use. 300+ DPI is recommended.
- Complex layouts: multi-column documents, footnotes, and sidebar text may not reconstruct perfectly. The formatter works page-by-page and doesn't do global layout analysis.
- Surya licensing: the Surya engine is not free for commercial use above the license thresholds. Check the Surya license before using it in production.
- AI cleanup costs money: OpenAI and Gemini API calls are billed per token. Large documents with many pages can accumulate costs quickly.
- Mixed pages: pages that have both text and images use a heuristic: if the text layer has 100+ characters, OCR is skipped. This works well in practice but isn't perfect.
- No image extraction: embedded images in PDFs are not extracted or described.

Performance Notes

- Text-based PDFs are fast: pdfplumber extracts text in milliseconds per page.
- Scanned pages take longer: rendering + image preprocessing + OCR can take 2–10 seconds per page depending on DPI and hardware.
- Surya loads a transformer model on first use, so there's a cold-start delay of several seconds, but subsequent pages are faster.
- Image preprocessing (deskew, denoise, binarize) adds ~0.5–2 seconds per page but significantly improves OCR accuracy on noisy scans.
- The OCR engine is lazily initialised: if your document has no scanned pages, no OCR overhead is incurred at all.
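
Lazy initialisation of this kind is commonly done with `functools.cached_property`. The sketch below shows the general pattern only; the `LazyEngine` class and its `factory` argument are hypothetical and are not how the package necessarily implements it.

```python
from functools import cached_property


class LazyEngine:
    """Defer building an expensive OCR engine until it is first needed."""

    def __init__(self, factory):
        # factory is any zero-argument callable that builds the real engine.
        self._factory = factory

    @cached_property
    def engine(self):
        # Runs once on first access; the result is cached on the instance.
        return self._factory()
```

A document made up only of text pages never touches `engine`, so the factory is never called and no model or binary is loaded.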

Contributing

Contributions are welcome. Whether you're fixing a bug, adding a feature, or writing tests, please:

1. Fork the repository and create a branch from main.
2. Install dev dependencies: pip install -e ".[dev]"
3. Set up pre-commit hooks: pre-commit install
4. Run tests: make test or pytest tests/
5. Check types: make typecheck
6. Lint and format: make format
7. Open a pull request with a clear description.

See docs/developer-guide.md for detailed contribution instructions, including how to add a new OCR engine or AI provider.

Common `make` Targets

make dev         # Install in editable mode with dev dependencies
make test        # Run the full test suite
make test-unit   # Unit tests only
make lint        # Check code style
make format      # Auto-fix formatting
make typecheck   # mypy static analysis
make clean       # Remove build artifacts

Documentation

| Document | Purpose |
| --- | --- |
| Architecture | How the system works, data flow, component responsibilities |
| Design Decisions | Why specific libraries and patterns were chosen |
| Project Structure | Folder and file layout explained |
| Workflows | Step-by-step processing flows with diagrams |
| Developer Guide | How to extend the project |
| API Reference | Public classes, methods, and exceptions |
| Testing Guide | How to run tests and contribute test coverage |
| Configuration | Full configuration reference |
| Installation | Platform-specific setup instructions |
| Changelog | Version history |

License

MIT — free to use, modify, and distribute.

Note on Surya: The optional Surya OCR engine uses a modified Open Rail-M license that restricts commercial use. The sinhala-pdf2md library itself is MIT — the restriction only applies if you install and use the [surya] extra. See Surya's license for details.

Project details

Release history

This version

0.2.1

Jun 1, 2026

0.2.0

May 31, 2026

0.1.0

May 31, 2026

Download files

Source Distribution

sinhala_pdf2md-0.2.1.tar.gz (59.5 kB)

Uploaded Jun 1, 2026 Source

Built Distribution

sinhala_pdf2md-0.2.1-py3-none-any.whl (47.6 kB)

Uploaded Jun 1, 2026 Python 3

File details

Details for the file sinhala_pdf2md-0.2.1.tar.gz.

File metadata

Download URL: sinhala_pdf2md-0.2.1.tar.gz
Upload date: Jun 1, 2026
Size: 59.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sinhala_pdf2md-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`1c85b0219720823caef3eaf63283a53b7816a1cf06cdc1b6944b175678e0ea19`
MD5	`3a8f0eccded468282a49bf2d5b56c824`
BLAKE2b-256	`f9f179c8c0c656435626a6f26bfe7851c479876291fdae3f769726f432c27504`

Provenance

The following attestation bundles were made for sinhala_pdf2md-0.2.1.tar.gz:

Publisher: release.yml on RMCV-Rajapaksha/Sinhala-OCR

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sinhala_pdf2md-0.2.1.tar.gz
- Subject digest: 1c85b0219720823caef3eaf63283a53b7816a1cf06cdc1b6944b175678e0ea19
- Sigstore transparency entry: 1688309429
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: RMCV-Rajapaksha/Sinhala-OCR@37f3e2aff51381ef67238af4e0c12c8441413caf
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/RMCV-Rajapaksha
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@37f3e2aff51381ef67238af4e0c12c8441413caf
- Trigger Event: push

File details

Details for the file sinhala_pdf2md-0.2.1-py3-none-any.whl.

File metadata

Download URL: sinhala_pdf2md-0.2.1-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 47.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sinhala_pdf2md-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`20871d1c4a55a25c4b63ba2e7bb7ef6e24b1281a8f9da589de4dc313f6d344bc`
MD5	`1ab13fe944bf4dbe329f068d2ff25ce5`
BLAKE2b-256	`68da1bbdd1fc934d43c9af9ff07ed6302fe75cf5c9156b3332c400c11e53bca2`

Provenance

The following attestation bundles were made for sinhala_pdf2md-0.2.1-py3-none-any.whl:

Publisher: release.yml on RMCV-Rajapaksha/Sinhala-OCR

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sinhala_pdf2md-0.2.1-py3-none-any.whl
- Subject digest: 20871d1c4a55a25c4b63ba2e7bb7ef6e24b1281a8f9da589de4dc313f6d344bc
- Sigstore transparency entry: 1688309484
- Sigstore integration time: Jun 1, 2026
Source repository:
- Permalink: RMCV-Rajapaksha/Sinhala-OCR@37f3e2aff51381ef67238af4e0c12c8441413caf
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/RMCV-Rajapaksha
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@37f3e2aff51381ef67238af4e0c12c8441413caf
- Trigger Event: push
