Convert Sinhala PDF documents into clean Markdown using OCR and text extraction
sinhala-pdf2md
Version: 0.2.1
Convert Sinhala PDF documents into clean, readable Markdown — with or without OCR.
Why Does This Exist?
Working with Sinhala text in PDF format is painful. Most tools either ignore Unicode entirely, mangle the script's conjunct consonants, or produce OCR output full of garbled characters.
sinhala-pdf2md was built to solve this specifically for Sinhala documents. It:
- Knows the difference between a text-based PDF (which has a proper text layer) and a scanned one (which is just an image)
- Picks the right tool for each page — direct text extraction for digital PDFs, OCR for scanned ones
- Fixes Unicode issues that appear after OCR — broken ZWJ sequences, misplaced combining marks, control characters
- Optionally runs an LLM (OpenAI, Gemini, or a local Ollama model) to clean up structure
If you're digitising Sinhala books, government documents, or any scanned Sinhala content, this tool handles the messy bits so you can focus on the content.
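The Unicode fixes listed above can be illustrated with a minimal sketch. This is not the library's actual cleaner: the function name `repair_sinhala_text` and the specific rules (collapse repeated joiners, drop a ZWJ that does not touch a virama) are assumptions chosen for illustration, and only the Python standard library is used.

```python
import re
import unicodedata

ZWJ = "\u200d"     # zero-width joiner
VIRAMA = "\u0dca"  # Sinhala al-lakuna (virama)

_REPEATED_ZWJ = re.compile(f"{ZWJ}{{2,}}")
_REPEATED_VIRAMA = re.compile(f"{VIRAMA}{{2,}}")
# Assumption: a ZWJ is only meaningful when it sits next to a virama.
_STRAY_ZWJ = re.compile(f"(?<!{VIRAMA}){ZWJ}(?!{VIRAMA})")


def repair_sinhala_text(text: str) -> str:
    """Hypothetical post-OCR cleanup for Sinhala text."""
    # 1. Compose split vowel signs and other sequences into NFC form.
    text = unicodedata.normalize("NFC", text)
    # 2. Drop control characters (category Cc) but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # 3. Collapse duplicated joiners and viramas.
    text = _REPEATED_ZWJ.sub(ZWJ, text)
    text = _REPEATED_VIRAMA.sub(VIRAMA, text)
    # 4. Remove joiners that are not part of a virama sequence.
    return _STRAY_ZWJ.sub("", text)
```

The order matters in this sketch: normalising first means the later rules see composed vowel signs, so a virama that belongs to a split vowel sign is not mistaken for a standalone one.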
Features
- Smart page classification — detects whether each page needs OCR or direct extraction
- Two OCR engines — Tesseract (default, free) or Surya (transformer-based, higher accuracy)
- Image pre-processing — deskew, denoise, and binarize scanned images before OCR
- Heading detection — infers headings from font size ratios (not just guessing)
- Table support — extracts PDF tables and renders them as GitHub Flavored Markdown
- List detection — recognises bullets, numbered lists, and common Sinhala bullet characters
- Unicode repair — NFC normalisation + ZWJ and virama (්) sequence fixing
- AI cleanup — optional post-processing with OpenAI, Gemini, or Ollama
- Batch conversion — convert an entire directory in one command
- Python API — use as a library in your own pipelines
- Env-var config — all settings configurable via PDF2MD_* environment variables
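As a rough illustration of the table support feature, the sketch below renders a list of rows as a GitHub Flavored Markdown table. The function name `render_gfm_table` is hypothetical, and the input shape (rows of cells that may be `None`) is an assumption about what a PDF table extractor typically returns, not a description of this package's internals.

```python
from typing import List, Optional


def render_gfm_table(rows: List[List[Optional[str]]]) -> str:
    """Render rows as a GFM table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        text = "" if cell is None else str(cell)
        # Newlines and unescaped pipes would break a GFM table row.
        return text.replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * width]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Escaping pipes and flattening newlines inside cells is the main design point here: a single unescaped `|` in a cell shifts every column after it.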
Architecture Overview
PDF File
   │
   ▼
PageAnalyzer ──────── Classifies each page (text / scanned / mixed)
   │
   ├─── TEXT page ──► PDFExtractor (pdfplumber) ──► MarkdownFormatter
   │
   ├─── SCANNED page ──► PageRenderer (PyMuPDF) ──► Image Preprocessor
   │                                                        │
   │                                                        ▼
   │                                          OCREngine (Tesseract / Surya)
   │                                                        │
   │                                                        ▼
   │                                                MarkdownFormatter
   │
   └─── MIXED page ──► Both paths, combined
                              │
                              ▼
            MarkdownCleaner (Unicode repair, whitespace)
                              │
                              ▼
          [Optional] AIFormatter (OpenAI / Gemini / Ollama)
                              │
                              ▼
                       Output .md file
See docs/architecture.md for the full breakdown with Mermaid diagrams.
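The routing step in the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not the real PageAnalyzer: the names `PageKind`, `classify_page`, and `needs_ocr` are hypothetical, the inputs (the page's text layer and its image count) are assumed to come from whatever PDF reader is in use, and the 100-character cut-off is the mixed-page heuristic described in the Limitations section.

```python
from enum import Enum


class PageKind(Enum):
    TEXT = "text"
    SCANNED = "scanned"
    MIXED = "mixed"


# Assumption: mirrors the "100+ characters" heuristic for mixed pages.
MIN_TEXT_CHARS = 100


def classify_page(text_layer: str, image_count: int) -> PageKind:
    """Classify a page from its extracted text layer and image count."""
    has_text = bool(text_layer.strip())
    if not has_text:
        return PageKind.SCANNED  # nothing to extract, so OCR is the only option
    if image_count == 0:
        return PageKind.TEXT
    return PageKind.MIXED


def needs_ocr(kind: PageKind, text_layer: str) -> bool:
    """Decide whether the OCR path should run for this page."""
    if kind is PageKind.TEXT:
        return False
    if kind is PageKind.SCANNED:
        return True
    # Mixed page: skip OCR when the text layer already looks substantial.
    return len(text_layer.strip()) < MIN_TEXT_CHARS
```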
Installation
Prerequisites
Tesseract (required for OCR on scanned pages):
# Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-sin
# macOS
brew install tesseract tesseract-lang
# Windows — download from https://github.com/UB-Mannheim/tesseract/wiki
# Then add to PATH. Make sure the "sin" language data is included.
Install the Package
pip install sinhala-pdf2md
Optional Extras
# Surya OCR engine (transformer-based, higher accuracy)
# ⚠️ Non-commercial use only — see https://github.com/VikParuchuri/surya
pip install sinhala-pdf2md[surya]
# AI cleanup with OpenAI
pip install sinhala-pdf2md[ai-openai]
# AI cleanup with Gemini
pip install sinhala-pdf2md[ai-gemini]
# AI cleanup with Ollama (local)
pip install sinhala-pdf2md[ai-ollama]
# Everything at once (dev included)
pip install sinhala-pdf2md[all]
See docs/installation.md for detailed platform-specific instructions.
Quick Start
Command Line
# Convert a single PDF
pdf2md document.pdf
# Specify output path
pdf2md document.pdf -o output.md
# Use Surya OCR engine
pdf2md document.pdf --ocr-engine surya
# Higher DPI for better OCR quality
pdf2md document.pdf --dpi 400
# Enable AI cleanup (requires OpenAI API key)
pdf2md document.pdf --ai-cleanup openai
# Verbose logging
pdf2md document.pdf --verbose
Batch Conversion
# Convert all PDFs in a directory
pdf2md batch ./documents/
# With output directory and recursive search
pdf2md batch ./documents/ --output-dir ./markdown/ --recursive
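To make the batch behaviour concrete, here is a small sketch of how input PDFs might be paired with output paths. The helper `plan_batch` is hypothetical, and mirroring the input directory structure under the output directory is an assumption for illustration; the actual tool may lay out files differently.

```python
from pathlib import Path
from typing import List, Optional, Tuple, Union

PathLike = Union[str, Path]


def plan_batch(
    input_dir: PathLike,
    output_dir: Optional[PathLike] = None,
    recursive: bool = False,
) -> List[Tuple[Path, Path]]:
    """Return (pdf_path, markdown_path) pairs for a batch run."""
    root = Path(input_dir)
    pattern = "**/*.pdf" if recursive else "*.pdf"
    jobs = []
    for pdf in sorted(root.glob(pattern)):
        if output_dir is None:
            # No output directory: write next to the input file.
            target = pdf.with_suffix(".md")
        else:
            # Assumption: keep the sub-directory layout under output_dir.
            target = Path(output_dir) / pdf.relative_to(root).with_suffix(".md")
        jobs.append((pdf, target))
    return jobs
```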
Python API
from sinhala_pdf2md import Converter
# Simple conversion
converter = Converter()
output_path = converter.convert("document.pdf")
# Custom configuration
converter = Converter(
    ocr_engine="tesseract",
    ocr_language="si",
    page_render_dpi=400,
    preserve_page_breaks=True,
)
converter.convert("document.pdf", "output.md")
# Get Markdown as a string (don't write a file)
markdown = converter.convert_to_string("document.pdf")
# Batch convert a directory
results = converter.convert_batch("./pdfs/", output_dir="./output/", recursive=True)
print(f"Converted {len(results)} files")
CLI Usage
Usage: pdf2md [COMMAND] [OPTIONS]
Commands:
  convert                  Convert a single PDF file to Markdown. (default)
  batch                    Convert all PDF files in a directory to Markdown.
Convert Options:
  PDF_PATH                 Path to the input PDF file
  -o, --output PATH        Output Markdown file path
  -e, --ocr-engine TEXT    OCR engine: tesseract (default) or surya
  -l, --lang TEXT          Language code: si (Sinhala, default), en, ta, hi
  -d, --dpi INT            Render DPI for scanned pages (72–600, default 300)
  -v, --verbose            Enable debug logging
  --ai-cleanup TEXT        AI provider for post-processing: openai, gemini, ollama
Batch Options:
  INPUT_DIR                Directory containing PDF files
  -o, --output-dir DIR     Output directory for .md files
  -r, --recursive          Search subdirectories
  (plus all convert options above)
Examples
# Basic usage — output next to input file
pdf2md report.pdf
# → report.md
# English document
pdf2md letter.pdf --lang en
# High-quality scanned document
pdf2md scanned_book.pdf --dpi 450 --ocr-engine tesseract
# With AI cleanup via local Ollama
pdf2md document.pdf --ai-cleanup ollama
# Batch with verbose output
pdf2md batch ./inbox/ --output-dir ./processed/ --recursive --verbose
Python API Usage
Basic
from sinhala_pdf2md import Converter
converter = Converter()
path = converter.convert("input.pdf", "output.md")
print(f"Saved to: {path}")
Full Configuration via ConverterConfig
from sinhala_pdf2md import Converter, ConverterConfig, OCREngineType, AIProviderType
config = ConverterConfig(
    ocr_engine=OCREngineType.TESSERACT,
    ocr_language="si",
    page_render_dpi=400,
    ocr_confidence_threshold=0.6,
    preserve_page_breaks=True,
    heading_detection_enabled=True,
    table_detection_enabled=True,
    heading_font_size_ratio=1.3,
    image_preprocess_enabled=True,
    ai_provider=AIProviderType.OPENAI,
    ai_model="gpt-4o",
    ai_api_key="sk-...",
)
converter = Converter(config=config)
converter.convert("document.pdf", "output.md")
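The `heading_font_size_ratio` setting suggests a comparison between a line's font size and the body font size. A minimal sketch of that idea follows; the function names, the use of the median as the body size, and the 1.6 and 2.0 bands for heading levels are all assumptions made for illustration, not the formatter's documented behaviour.

```python
from statistics import median
from typing import List, Tuple


def heading_level(font_size: float, body_size: float, ratio: float = 1.3) -> int:
    """Return 0 for body text, otherwise a Markdown heading level (1 to 3)."""
    if body_size <= 0:
        return 0
    relative = font_size / body_size
    if relative < ratio:
        return 0
    # Hypothetical bands: the larger the text, the higher the heading.
    if relative >= 2.0:
        return 1
    if relative >= 1.6:
        return 2
    return 3


def format_lines(lines: List[Tuple[str, float]], ratio: float = 1.3) -> List[str]:
    """Prefix lines with '#' markers based on their font size."""
    if not lines:
        return []
    # Assumption: the median font size on the page is the body size.
    body_size = median(size for _, size in lines)
    formatted = []
    for text, size in lines:
        level = heading_level(size, body_size, ratio)
        formatted.append("#" * level + " " + text if level else text)
    return formatted
```

Using the median rather than the mean keeps a single large title from inflating the body size estimate.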
Batch Processing with Error Handling
from sinhala_pdf2md import Converter
from sinhala_pdf2md.exceptions import BatchConversionError
converter = Converter(ocr_engine="tesseract")
try:
    results = converter.convert_batch("./pdfs/", output_dir="./out/", recursive=True)
    print(f"Successfully converted {len(results)} files")
except BatchConversionError as e:
    print(f"Some files failed: {e.failures}")
Return Markdown Without Writing a File
converter = Converter()
markdown_text = converter.convert_to_string("document.pdf")
# Process the string however you like
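One way to process the returned string is to split it back into pages. The sketch below assumes the output contains the `<!-- page-break -->` marker exactly as written in the configuration table, which only applies when `preserve_page_breaks` is enabled; `split_pages` is a hypothetical helper, not part of the package.

```python
from typing import List

# Assumption: the marker text matches the one described for preserve_page_breaks.
PAGE_BREAK = "<!-- page-break -->"


def split_pages(markdown: str) -> List[str]:
    """Split converted Markdown into per-page chunks, dropping empty ones."""
    return [page.strip() for page in markdown.split(PAGE_BREAK) if page.strip()]
```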
Configuration
All settings can be set via constructor arguments, a ConverterConfig object, or environment variables with the PDF2MD_ prefix.
| Setting | Default | Env Var | Description |
|---|---|---|---|
| ocr_engine | tesseract | PDF2MD_OCR_ENGINE | OCR backend (tesseract or surya) |
| ocr_language | si | PDF2MD_OCR_LANGUAGE | ISO 639-1 language code |
| ocr_confidence_threshold | 0.5 | PDF2MD_OCR_CONFIDENCE_THRESHOLD | Min confidence score to keep OCR output |
| page_render_dpi | 300 | PDF2MD_PAGE_RENDER_DPI | DPI for rendering scanned pages (72–600) |
| preserve_page_breaks | true | PDF2MD_PRESERVE_PAGE_BREAKS | Insert <!-- page-break --> between pages |
| heading_detection_enabled | true | PDF2MD_HEADING_DETECTION_ENABLED | Enable font-size heading detection |
| table_detection_enabled | true | PDF2MD_TABLE_DETECTION_ENABLED | Enable table extraction |
| heading_font_size_ratio | 1.3 | PDF2MD_HEADING_FONT_SIZE_RATIO | Font size ratio threshold for headings |
| image_preprocess_enabled | true | PDF2MD_IMAGE_PREPROCESS_ENABLED | Deskew/denoise/binarize before OCR |
| ai_provider | None | PDF2MD_AI_PROVIDER | AI cleanup provider (openai, gemini, ollama) |
| ai_model | None | PDF2MD_AI_MODEL | Model name for the AI provider |
| ai_api_key | None | PDF2MD_AI_API_KEY | API key for the AI provider |
| ai_base_url | None | PDF2MD_AI_BASE_URL | Base URL override (for Ollama or custom endpoints) |
| output_dir | None | PDF2MD_OUTPUT_DIR | Default output directory |
| log_level | INFO | PDF2MD_LOG_LEVEL | Logging level |
Example with environment variables:
export PDF2MD_OCR_ENGINE=surya
export PDF2MD_PAGE_RENDER_DPI=400
export PDF2MD_AI_PROVIDER=openai
export OPENAI_API_KEY=sk-...
pdf2md document.pdf
See docs/configuration.md for the full reference.
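For readers curious how PDF2MD_-prefixed variables could map onto typed settings, here is a minimal sketch. It is not the package's loader: `load_settings` and `DEFAULTS` are hypothetical names, only a handful of settings from the table are included, and the accepted boolean spellings are an assumption.

```python
import os
from typing import Any, Dict, Mapping, Optional

# A subset of the documented settings, with their documented defaults.
DEFAULTS: Dict[str, Any] = {
    "ocr_engine": "tesseract",
    "ocr_language": "si",
    "ocr_confidence_threshold": 0.5,
    "page_render_dpi": 300,
    "preserve_page_breaks": True,
}


def load_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Overlay PDF2MD_* environment variables on top of the defaults."""
    env = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    for name, default in DEFAULTS.items():
        raw = env.get("PDF2MD_" + name.upper())
        if raw is None:
            continue
        # bool must be checked before int because bool is a subclass of int.
        if isinstance(default, bool):
            settings[name] = raw.strip().lower() in {"1", "true", "yes", "on"}
        elif isinstance(default, int):
            settings[name] = int(raw)
        elif isinstance(default, float):
            settings[name] = float(raw)
        else:
            settings[name] = raw
    # The documented DPI range is 72 to 600.
    if not 72 <= settings["page_render_dpi"] <= 600:
        raise ValueError("page_render_dpi must be between 72 and 600")
    return settings
```

Passing a plain dictionary instead of `os.environ` makes the coercion rules easy to try out without touching the real environment.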
Supported OCR Engines
Tesseract (Default)
- Free, open source, widely available
- Requires the tesseract binary and language data files
- Good accuracy for clean, high-DPI scans
- Supports Sinhala (sin), English (eng), Tamil (tam), Hindi (hin)
- Install: pip install sinhala-pdf2md (Tesseract binary installed separately)
Surya (Optional)
- Transformer-based, generally higher accuracy
- Language-agnostic (handles Sinhala without explicit training data)
- Requires PyTorch and significant RAM/GPU
- ⚠️ Commercial use is restricted for organisations above $5M in revenue or funding
- Install: pip install sinhala-pdf2md[surya]
Limitations
- Scanned page quality matters — very low-resolution or heavily degraded scans will produce poor OCR results regardless of which engine you use. 300+ DPI is recommended.
- Complex layouts — multi-column documents, footnotes, and sidebar text may not reconstruct perfectly. The formatter works page-by-page and doesn't do global layout analysis.
- Surya licensing — the Surya engine is not free for commercial use above the license thresholds. Check the Surya license before using it in production.
- AI cleanup costs money — OpenAI and Gemini API calls are billed per token. Large documents with many pages can accumulate costs quickly.
- Mixed pages — pages that have both text and images use a heuristic: if the text layer has 100+ characters, OCR is skipped. This works well in practice but isn't perfect.
- No image extraction — embedded images in PDFs are not extracted or described.
Performance Notes
- Text-based PDFs are fast — pdfplumber extracts text in milliseconds per page.
- Scanned pages take longer — rendering + image preprocessing + OCR can take 2–10 seconds per page depending on DPI and hardware.
- Surya loads a transformer model on first use — there's a cold-start delay of several seconds, but subsequent pages are faster.
- Image preprocessing (deskew, denoise, binarize) adds ~0.5–2 seconds per page but significantly improves OCR accuracy on noisy scans.
- The OCR engine is lazily initialised — if your document has no scanned pages, no OCR overhead is incurred at all.
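Lazy initialisation of the kind described in the last point can be sketched with `functools.cached_property`. The `LazyOCR` class and its `factory` argument are hypothetical stand-ins, not the package's classes; the point is only that the expensive engine is built on first use and reused afterwards.

```python
from functools import cached_property
from typing import Any, Callable


class LazyOCR:
    """Wraps an expensive OCR engine and builds it only on first use."""

    def __init__(self, factory: Callable[[], Callable[[Any], str]]) -> None:
        # factory is a hypothetical callable that loads and returns the engine.
        self._factory = factory

    @cached_property
    def engine(self) -> Callable[[Any], str]:
        # Runs once; the result is cached on the instance afterwards.
        return self._factory()

    def recognise(self, image: Any) -> str:
        return self.engine(image)
```

A document made up entirely of text pages never calls `recognise`, so the factory never runs and no model is loaded.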
Contributing
Contributions are welcome. If you're fixing a bug, adding a feature, or writing tests, please:
- Fork the repository and create a branch from main.
- Install dev dependencies: pip install -e ".[dev]"
- Set up pre-commit hooks: pre-commit install
- Run tests: make test or pytest tests/
- Check types: make typecheck
- Lint and format: make format
- Open a pull request with a clear description.
See docs/developer-guide.md for detailed contribution instructions, including how to add a new OCR engine or AI provider.
Common make Targets
make dev # Install in editable mode with dev dependencies
make test # Run the full test suite
make test-unit # Unit tests only
make lint # Check code style
make format # Auto-fix formatting
make typecheck # mypy static analysis
make clean # Remove build artifacts
Documentation
| Document | Purpose |
|---|---|
| Architecture | How the system works, data flow, component responsibilities |
| Design Decisions | Why specific libraries and patterns were chosen |
| Project Structure | Folder and file layout explained |
| Workflows | Step-by-step processing flows with diagrams |
| Developer Guide | How to extend the project |
| API Reference | Public classes, methods, and exceptions |
| Testing Guide | How to run tests and contribute test coverage |
| Configuration | Full configuration reference |
| Installation | Platform-specific setup instructions |
| Changelog | Version history |
License
MIT — free to use, modify, and distribute.
Note on Surya: The optional Surya OCR engine uses a modified Open Rail-M license that restricts commercial use. The sinhala-pdf2md library itself is MIT; the restriction only applies if you install and use the [surya] extra. See Surya's license for details.