MarkItDown plugin for Nepali PDFs and legacy DOC files

Maintainers

notashwinii

likhit

likhit is Jawafdehi's public MarkItDown plugin for Nepal-specific document support.

It extends MarkItDown with Nepal-specific PDF repair, layout-aware Markdown assembly, optional OCR fallback for image-dominant PDFs, and legacy .doc support. For PDFs, likhit now evaluates multiple extraction paths and returns the best result instead of relying on a single fixed pipeline.

Owned and maintained by Jawafdehi.

Installation

pip install likhit

Project Links

Website: https://jawafdehi.org/
GitHub: https://github.com/Jawafdehi/likhit/
Contact: inquiry@jawafdehi.org

Usage

likhit is primarily used as a MarkItDown plugin.

Python

Once installed, enable plugins when creating a MarkItDown instance:

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("path/to/nepali-document.pdf")
print(result.text_content)
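
When several documents need converting, one MarkItDown instance can be reused. The following is a minimal sketch, not part of likhit: the helper names `output_path_for` and `convert_all` are hypothetical, and it relies only on the `MarkItDown(enable_plugins=True)`, `convert`, and `text_content` calls shown above.

```python
from pathlib import Path


def output_path_for(source: Path, out_dir: Path) -> Path:
    """Map an input document to a .md file with the same stem in out_dir."""
    return out_dir / (source.stem + ".md")


def convert_all(sources, out_dir):
    """Convert each source with plugins enabled and write Markdown files.

    Hypothetical helper; returns the list of files written.
    """
    # Imported lazily so the path helper works without markitdown installed.
    from markitdown import MarkItDown

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    md = MarkItDown(enable_plugins=True)  # one instance reused for every file

    written = []
    for source in map(Path, sources):
        result = md.convert(str(source))
        target = output_path_for(source, out_dir)
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written
```

Writing with an explicit UTF-8 encoding matters here because the output is expected to contain Devanagari text.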

MarkItDown CLI

You can also use likhit through the standard MarkItDown CLI:

markitdown --use-plugins path/to/nepali-document.pdf

To write the output to a file:

markitdown --use-plugins path/to/nepali-document.pdf -o output.md

To verify the plugin is registered:

markitdown --list-plugins

You should see likhit in the output.

`likhit-save` CLI

This package also installs a small helper CLI that runs MarkItDown with the likhit plugin enabled and writes Markdown files for you:

likhit-save path/to/nepali-document.pdf --out output.md

Convert multiple files into a directory:

likhit-save samples/pressrelease.pdf samples/kanunpatrika.pdf --out-dir converted/

Extract only one page or a page range from a PDF:

likhit-save path/to/nepali-document.pdf --pages 5 --out page-5.md
likhit-save path/to/nepali-document.pdf --pages 2-4 --out pages-2-4.md
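
The two `--pages` forms above, a single page and an inclusive range, could be parsed along these lines. This is an illustrative sketch only: `parse_pages` is a hypothetical name, likhit-save's real parsing is not shown here, and treating page numbers as 1-based is an assumption.

```python
def parse_pages(spec: str) -> tuple[int, int]:
    """Parse "5" or "2-4" into an inclusive (first, last) pair of page numbers."""
    first_text, sep, last_text = spec.strip().partition("-")
    first = int(first_text)  # raises ValueError on empty or non-numeric input
    last = int(last_text) if sep else first
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return first, last
```

With this sketch, `parse_pages("5")` gives `(5, 5)` and `parse_pages("2-4")` gives `(2, 4)`; a reversed range such as `"4-2"` raises `ValueError`.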

What likhit does

likhit affects these input types as follows:

PDF: likhit intercepts PDF inputs, runs the default MarkItDown PDF converter first, and then decides whether to keep that result, retry with Nepal-specific extraction, or add an OCR candidate for image-dominant pages. When known Nepali repair fonts are detected, it goes straight to likhit's own extraction.
DOC: Legacy Microsoft Word .doc files are handled by likhit's own extraction pipeline.
DOCX: .docx files are still handled by MarkItDown's built-in Word converter, even when plugins are enabled.
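
The division of work described above can be summarized as a small lookup. This is only an illustration: the labels and the `handler_for` helper are hypothetical, and whether likhit matches inputs by file extension, MIME type, or both is not stated here.

```python
from pathlib import Path

# Which component is expected to handle each extension, per the list above.
HANDLERS = {
    ".pdf": "likhit (PDF candidate selection)",
    ".doc": "likhit (legacy DOC pipeline)",
    ".docx": "markitdown (built-in Word converter)",
}


def handler_for(path: str) -> str:
    """Return a label for the component expected to handle this file."""
    suffix = Path(path).suffix.lower()
    return HANDLERS.get(suffix, "markitdown (default converters)")
```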

Supported document types

PDFs, including Nepal-specific born-digital PDFs and image-dominant PDFs that may need OCR
Legacy .doc files
.docx passthrough via MarkItDown

OCR Configuration

For image-dominant or scanned PDFs, likhit can add an OCR extraction candidate through markitdown-ocr when OCR is configured.

Required model configuration:

export MARKITDOWN_OCR_MODEL="your-model-name"

You can also provide the model through OPENAI_MODEL or GEMINI_MODEL.

Authentication options:

OpenAI-compatible provider with a standard OpenAI key:

export OPENAI_API_KEY="your-api-key"

OpenAI-compatible provider with a custom base URL:

export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://your-provider.example/v1/"
export MARKITDOWN_OCR_MODEL="your-model-name"

Gemini using the OpenAI compatibility endpoint:

export GEMINI_API_KEY="your-gemini-api-key"
export GEMINI_MODEL="gemini-2.5-flash"

When GEMINI_API_KEY is set, likhit automatically uses Gemini's OpenAI-compatible base URL unless you explicitly override OPENAI_BASE_URL.

Optional variables:

export MARKITDOWN_OCR_PROMPT="Custom OCR instructions"
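
How these variables combine can be sketched as follows. This is not likhit's code: `resolve_ocr_settings` is a hypothetical helper, the precedence among the three model variables is an assumption, and so is the choice of key when both provider keys are set. The Gemini base URL is the OpenAI compatibility endpoint documented by Google.

```python
import os
from typing import Mapping, Optional

# Gemini's OpenAI-compatible endpoint, as documented by Google.
GEMINI_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"


def resolve_ocr_settings(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return model, api_key, base_url, and prompt, or None if OCR is not configured."""
    # Assumed precedence: the dedicated variable wins over provider-specific ones.
    model = (
        env.get("MARKITDOWN_OCR_MODEL")
        or env.get("OPENAI_MODEL")
        or env.get("GEMINI_MODEL")
    )
    openai_key = env.get("OPENAI_API_KEY")
    gemini_key = env.get("GEMINI_API_KEY")
    base_url = env.get("OPENAI_BASE_URL")

    if gemini_key and not base_url:
        # A Gemini key selects Gemini's endpoint unless OPENAI_BASE_URL overrides it.
        api_key, base_url = gemini_key, GEMINI_OPENAI_BASE_URL
    else:
        api_key = openai_key or gemini_key

    if not model or not api_key:
        return None
    return {
        "model": model,
        "api_key": api_key,
        "base_url": base_url,  # None means the client's default endpoint
        "prompt": env.get("MARKITDOWN_OCR_PROMPT"),
    }
```

Taking the environment as a parameter keeps the sketch easy to exercise with a plain dictionary instead of real environment variables.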

Architecture

The high-level PDF pipeline is:

1. MarkItDown loads the plugin when enable_plugins=True or --use-plugins is used.
2. For PDF inputs, likhit reads the file and optionally slices it to the requested page range.
3. likhit scans embedded fonts. If it detects known Nepali repair fonts such as Kalimati broken-CMap fonts or legacy remap fonts, it tries the Nepal-specific extraction pipeline immediately.
4. likhit also runs the default MarkItDown PDF converter and keeps that result as a candidate.
5. likhit analyzes the PDF pages. If the file looks image-dominant with a suspicious text layer and OCR is configured, it adds an OCR candidate.
6. If the default Markdown output looks suspicious for Nepali text, likhit retries extraction with its own PDF pipeline.
7. The Nepal-specific PDF pipeline can apply:
   - Kalimati broken-CMap repair
   - Devanagari reordering
   - Devanagari spacing normalization
   - Legacy-font remapping through npttf2utf
8. After extraction, likhit checks whether the document matches a whole-document semantic structure such as a single-column notice.
9. PDF layout ordering is assigned locally while assembling content blocks, so single-column, row-aligned, and two-column regions can coexist in one file.
10. If multiple candidate outputs exist, likhit scores them and returns the best one.
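
The final selection step can be pictured with a toy scorer. The sketch below is not likhit's scoring logic, which is not described here. `Candidate`, `score`, and `best_candidate` are hypothetical names, and the heuristic (share of Devanagari characters, minus a penalty for U+FFFD replacement characters) is an assumption chosen only to show the score-then-pick shape.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str      # illustrative labels such as "default", "nepali", "ocr"
    markdown: str


def is_devanagari(ch: str) -> bool:
    """True for code points in the Devanagari Unicode block (U+0900 to U+097F)."""
    return "\u0900" <= ch <= "\u097f"


def score(markdown: str) -> float:
    """Hypothetical quality score: reward Devanagari text, penalize debris."""
    letters = [ch for ch in markdown if ch.isalpha() or is_devanagari(ch)]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if is_devanagari(ch))
    debris = markdown.count("\ufffd")  # U+FFFD replacement characters
    return devanagari / len(letters) - debris / len(markdown)


def best_candidate(candidates: list[Candidate]) -> Candidate:
    """Return the highest-scoring candidate; ties keep the earliest one."""
    return max(candidates, key=lambda c: score(c.markdown))
```

A legacy-font text layer that was extracted without remapping typically comes out as Latin letters and punctuation, so under this toy heuristic it scores lower than a correctly remapped Devanagari candidate.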

Project Layout

src/likhit/_plugin.py: MarkItDown plugin entry point and converter registration
src/likhit/converters/: plugin converters for PDF and legacy DOC inputs
src/likhit/nepali_pdf_repair.py: reusable Nepal-specific PDF repair layer
src/likhit/markdown_assembly.py: generic Markdown assembly for the default conversion path
src/likhit/extractors/: extraction strategies (PDF, DOC)
- font_based.py: PDF extraction with Nepali font repair
- docx_based.py: legacy DOC text extraction
src/likhit/handlers/: structure-aware handlers and detection logic
src/likhit/renderers/: Markdown rendering
tests/: conversion, extraction, and plugin coverage
- tests/integration/: end-to-end integration tests
- tests/integration/test_data/: committed test fixtures (PDF, DOCX, DOC samples)

Testing

Running Tests

Run all tests:

poetry run pytest

References

MarkItDown: https://github.com/microsoft/markitdown
MarkItDown sample plugin: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin

Ownership

likhit is owned and maintained by Jawafdehi.

Organization: Jawafdehi
Website: https://jawafdehi.org/
GitHub: https://github.com/Jawafdehi/likhit/
Contact: inquiry@jawafdehi.org

Release history

0.1.7: May 4, 2026 (this version)
0.1.6: Apr 28, 2026
0.1.1: Mar 25, 2026
0.1.0: Mar 20, 2026

Download files

Source Distribution

likhit-0.1.7.tar.gz (46.7 kB), uploaded May 4, 2026, Source

Built Distribution

likhit-0.1.7-py3-none-any.whl (59.0 kB), uploaded May 4, 2026, Python 3

File details

Details for the file likhit-0.1.7.tar.gz.

File metadata

Download URL: likhit-0.1.7.tar.gz
Upload date: May 4, 2026
Size: 46.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for likhit-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`200e0666a1910aa2c58ddc432c9c2a425fe5cedf69ddb86aa12c091ac3a07847`
MD5	`618569f8f2afddea3c003e516c834939`
BLAKE2b-256	`f3e79bbc5ae68743ac4d3bbbe6be5e754854b0f562d4409f4287dbd8ca7fb178`
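
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch; the helper names `sha256_of` and `matches_published_digest` are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA-256 with a published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For example, `matches_published_digest(Path("likhit-0.1.7.tar.gz"), "200e0666a1910aa2c58ddc432c9c2a425fe5cedf69ddb86aa12c091ac3a07847")` should return True for an intact copy of the source distribution.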

Provenance

The following attestation bundles were made for likhit-0.1.7.tar.gz:

Publisher: pypi-publish.yml on Jawafdehi/likhit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: likhit-0.1.7.tar.gz
- Subject digest: 200e0666a1910aa2c58ddc432c9c2a425fe5cedf69ddb86aa12c091ac3a07847
- Sigstore transparency entry: 1436995306
- Sigstore integration time: May 4, 2026
Source repository:
- Permalink: Jawafdehi/likhit@0637d2f2e6a6d73f8b1b8e35357d44fdf11f2d7c
- Branch / Tag: refs/tags/v0.1.7
- Owner: https://github.com/Jawafdehi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@0637d2f2e6a6d73f8b1b8e35357d44fdf11f2d7c
- Trigger Event: push

File details

Details for the file likhit-0.1.7-py3-none-any.whl.

File metadata

Download URL: likhit-0.1.7-py3-none-any.whl
Upload date: May 4, 2026
Size: 59.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for likhit-0.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4b9cd125103cd622f70bbb3d3985fdc08efb3639e7c940c3f682322645eb2b7b`
MD5	`ee6907dfb7a3d699698b2637d3ba95b2`
BLAKE2b-256	`ac4ecf7e0513cd50711601db7bf84165011254d00b4916f96a234c5096ed6803`

Provenance

The following attestation bundles were made for likhit-0.1.7-py3-none-any.whl:

Publisher: pypi-publish.yml on Jawafdehi/likhit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: likhit-0.1.7-py3-none-any.whl
- Subject digest: 4b9cd125103cd622f70bbb3d3985fdc08efb3639e7c940c3f682322645eb2b7b
- Sigstore transparency entry: 1436995312
- Sigstore integration time: May 4, 2026
Source repository:
- Permalink: Jawafdehi/likhit@0637d2f2e6a6d73f8b1b8e35357d44fdf11f2d7c
- Branch / Tag: refs/tags/v0.1.7
- Owner: https://github.com/Jawafdehi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@0637d2f2e6a6d73f8b1b8e35357d44fdf11f2d7c
- Trigger Event: push
