MarkItDown plugin for Nepali PDFs and legacy DOC files
likhit
likhit is Jawafdehi's public MarkItDown plugin for Nepal-specific document support.
It extends MarkItDown with Nepal-specific PDF repair, layout-aware Markdown assembly, optional OCR fallback for image-dominant PDFs, and legacy .doc support. For PDFs, likhit now evaluates multiple extraction paths and returns the best result instead of relying on a single fixed pipeline.
Owned and maintained by Jawafdehi.
Installation
pip install likhit
Project Links
- Website: https://jawafdehi.org/
- GitHub: https://github.com/Jawafdehi/likhit/
- Contact: inquiry@jawafdehi.org
Usage
likhit is primarily used as a MarkItDown plugin.
Python
Once installed, enable plugins when creating a MarkItDown instance:
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=True)
result = md.convert("path/to/nepali-document.pdf")
print(result.text_content)
MarkItDown CLI
You can also use likhit through the standard MarkItDown CLI:
markitdown --use-plugins path/to/nepali-document.pdf
To write the output to a file:
markitdown --use-plugins path/to/nepali-document.pdf -o output.md
To verify the plugin is registered:
markitdown --list-plugins
You should see likhit in the output.
likhit-save CLI
This package also installs a small helper CLI that runs MarkItDown with the likhit plugin enabled and writes Markdown files for you:
likhit-save path/to/nepali-document.pdf --out output.md
Convert multiple files into a directory:
likhit-save samples/pressrelease.pdf samples/kanunpatrika.pdf --out-dir converted/
Extract only one page or a page range from a PDF:
likhit-save path/to/nepali-document.pdf --pages 5 --out page-5.md
likhit-save path/to/nepali-document.pdf --pages 2-4 --out pages-2-4.md
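The two --pages forms above can be read as a single page or an inclusive range. A minimal sketch of such a parser follows; parse_pages is a hypothetical helper, and treating page numbers as 1-based and inclusive is an assumption rather than documented likhit-save behavior.

```python
def parse_pages(spec: str) -> tuple[int, int]:
    """Parse '5' or '2-4' into an inclusive (first, last) pair.

    Hypothetical helper: 1-based, inclusive numbering is assumed.
    """
    first_text, sep, last_text = spec.strip().partition("-")
    first = int(first_text)  # raises ValueError on empty or non-numeric input
    last = int(last_text) if sep else first
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return first, last


if __name__ == "__main__":
    print(parse_pages("5"))
    print(parse_pages("2-4"))
```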
What likhit does
likhit adds behavior beyond MarkItDown in these places:
- PDF: likhit intercepts PDF inputs, runs the default MarkItDown PDF converter first, and then decides whether to keep that result, retry with Nepal-specific extraction, or add an OCR candidate for image-dominant pages. When known Nepali repair fonts are detected, it prefers direct likhit extraction immediately.
- DOC: Legacy Microsoft Word .doc files are handled by likhit's own extraction pipeline.
- DOCX: .docx files are still handled by MarkItDown's built-in Word converter, even when plugins are enabled.
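The division of work by file type can be summarized as a small lookup. This is only an illustration of the routing described above; ROUTES and handler_for are hypothetical names, and the actual dispatch goes through MarkItDown's converter registration rather than a table like this.

```python
from pathlib import Path

# Hypothetical summary table: which side produces the Markdown per extension.
ROUTES = {
    ".pdf": "likhit",       # likhit intercepts PDFs and compares candidates
    ".doc": "likhit",       # legacy Word files use likhit's own pipeline
    ".docx": "markitdown",  # passthrough to MarkItDown's built-in converter
}


def handler_for(path: str) -> str:
    """Return 'likhit' or 'markitdown' for a path, based on its extension."""
    # Assumption: anything not listed above is left to MarkItDown.
    return ROUTES.get(Path(path).suffix.lower(), "markitdown")


if __name__ == "__main__":
    for name in ("notice.pdf", "old-letter.DOC", "report.docx"):
        print(name, "->", handler_for(name))
```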
Supported document types
- PDFs, including Nepal-specific born-digital PDFs and image-dominant PDFs that may need OCR
- Legacy .doc files
- .docx passthrough via MarkItDown
OCR Configuration
For image-dominant or scanned PDFs, likhit can add an OCR extraction candidate through markitdown-ocr when OCR is configured.
Required model configuration:
export MARKITDOWN_OCR_MODEL="your-model-name"
You can also provide the model through OPENAI_MODEL or GEMINI_MODEL.
Authentication options:
- OpenAI-compatible provider with a standard OpenAI key:
export OPENAI_API_KEY="your-api-key"
- OpenAI-compatible provider with a custom base URL:
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://your-provider.example/v1/"
export MARKITDOWN_OCR_MODEL="your-model-name"
- Gemini using the OpenAI compatibility endpoint:
export GEMINI_API_KEY="your-gemini-api-key"
export GEMINI_MODEL="gemini-2.5-flash"
When GEMINI_API_KEY is set, likhit automatically uses Gemini's OpenAI-compatible base URL unless you explicitly override OPENAI_BASE_URL.
Optional variables:
export MARKITDOWN_OCR_PROMPT="Custom OCR instructions"
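The way these variables combine can be sketched as follows. This is a minimal illustration, not likhit's actual code: resolve_ocr_settings is a hypothetical function, and the lookup order among the three model variables and between the two API keys is an assumption. The base-URL rule follows the description above, using Gemini's published OpenAI-compatible endpoint.

```python
from __future__ import annotations

import os
from typing import Mapping

# Gemini's OpenAI-compatible endpoint.
GEMINI_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"


def resolve_ocr_settings(env: Mapping[str, str] | None = None) -> dict[str, str | None]:
    """Resolve model, base URL, and API key from environment-style variables."""
    env = os.environ if env is None else env
    # Assumed precedence: the likhit-specific variable first, then the others.
    model = (
        env.get("MARKITDOWN_OCR_MODEL")
        or env.get("OPENAI_MODEL")
        or env.get("GEMINI_MODEL")
    )
    # An explicit OPENAI_BASE_URL always wins over the Gemini default.
    base_url = env.get("OPENAI_BASE_URL")
    if not base_url and env.get("GEMINI_API_KEY"):
        base_url = GEMINI_OPENAI_BASE_URL
    # Assumed precedence between keys when both are set.
    api_key = env.get("OPENAI_API_KEY") or env.get("GEMINI_API_KEY")
    return {"model": model, "base_url": base_url, "api_key": api_key}


if __name__ == "__main__":
    print(resolve_ocr_settings({"GEMINI_API_KEY": "k", "GEMINI_MODEL": "gemini-2.5-flash"}))
```

Passing a plain dict instead of reading os.environ directly keeps the resolution logic easy to test.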
Architecture
The high-level PDF pipeline is:
- MarkItDown loads the plugin when enable_plugins=True or --use-plugins is used.
- For PDF inputs, likhit reads the file and optionally slices it to the requested page range.
- likhit scans embedded fonts. If it detects known Nepali repair fonts such as Kalimati broken-CMap fonts or legacy remap fonts, it tries the Nepal-specific extraction pipeline immediately.
- likhit also runs the default MarkItDown PDF converter and keeps that result as a candidate.
- likhit analyzes the PDF pages. If the file looks image-dominant with a suspicious text layer and OCR is configured, it adds an OCR candidate.
- If the default Markdown output looks suspicious for Nepali text, likhit retries extraction with its own PDF pipeline.
- The Nepal-specific PDF pipeline can apply:
  - Kalimati broken-CMap repair
  - Devanagari reordering
  - Devanagari spacing normalization
  - Legacy-font remapping through npttf2utf
- After extraction, likhit checks whether the document matches a whole-document semantic structure such as a single-column notice.
- PDF layout ordering is assigned locally while assembling content blocks, so single-column, row-aligned, and two-column regions can coexist in one file.
- If multiple candidate outputs exist, likhit scores them and returns the best one.
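The final step, scoring candidates and keeping the best one, can be sketched as below. The scoring rule is an assumption chosen for illustration (share of Devanagari characters minus a penalty for Unicode replacement characters); likhit's real heuristics are not described in this README, and Candidate, score, and pick_best are hypothetical names.

```python
from dataclasses import dataclass


def is_devanagari(ch: str) -> bool:
    """True for code points in the Devanagari block (U+0900 to U+097F)."""
    return "\u0900" <= ch <= "\u097f"


def devanagari_ratio(text: str) -> float:
    """Share of Devanagari characters among letters and Devanagari signs."""
    relevant = [ch for ch in text if ch.isalpha() or is_devanagari(ch)]
    if not relevant:
        return 0.0
    return sum(is_devanagari(ch) for ch in relevant) / len(relevant)


@dataclass
class Candidate:
    source: str    # e.g. "default", "likhit", "ocr" (labels are illustrative)
    markdown: str


def score(candidate: Candidate) -> float:
    """Assumed heuristic: reward Devanagari text, penalize U+FFFD debris."""
    text = candidate.markdown
    if not text.strip():
        return 0.0
    penalty = text.count("\ufffd") / len(text)
    return devanagari_ratio(text) - penalty


def pick_best(candidates: list[Candidate]) -> Candidate:
    """Return the highest-scoring candidate; the list must be non-empty."""
    return max(candidates, key=score)


if __name__ == "__main__":
    options = [
        Candidate("default", "g]kfn ;/sf/"),  # legacy-font-style ASCII output
        Candidate("likhit", "नेपाल सरकार"),
    ]
    print(pick_best(options).source)
```

A heuristic like this only suits documents expected to be mostly Nepali; a mixed-language file would need a more careful score.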
Project Layout
- src/likhit/_plugin.py: MarkItDown plugin entry point and converter registration
- src/likhit/converters/: plugin converters for PDF and legacy DOC inputs
- src/likhit/nepali_pdf_repair.py: reusable Nepal-specific PDF repair layer
- src/likhit/markdown_assembly.py: generic Markdown assembly for the default conversion path
- src/likhit/extractors/: extraction strategies (PDF, DOC)
  - font_based.py: PDF extraction with Nepali font repair
  - docx_based.py: legacy DOC text extraction
- src/likhit/handlers/: structure-aware handlers and detection logic
- src/likhit/renderers/: Markdown rendering
- tests/: conversion, extraction, and plugin coverage
- tests/integration/: end-to-end integration tests
- tests/integration/test_data/: committed test fixtures (PDF, DOCX, DOC samples)
Testing
Running Tests
Run all tests:
poetry run pytest
References
- MarkItDown: https://github.com/microsoft/markitdown
- MarkItDown sample plugin: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin