MarkItDown plugin for Nepali PDFs and legacy DOC files

likhit

likhit is Jawafdehi's public MarkItDown plugin for Nepal-specific document support.

It extends MarkItDown with Nepal-specific PDF repair, layout-aware Markdown assembly, optional OCR fallback for image-dominant PDFs, and legacy .doc support. For PDFs, likhit now evaluates multiple extraction paths and returns the best result instead of relying on a single fixed pipeline.

Owned and maintained by Jawafdehi.

Installation

pip install likhit

Usage

likhit is primarily used as a MarkItDown plugin.

Python

Once installed, enable plugins when creating a MarkItDown instance:

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("path/to/nepali-document.pdf")
print(result.text_content)

MarkItDown CLI

You can also use likhit through the standard MarkItDown CLI:

markitdown --use-plugins path/to/nepali-document.pdf

To write the output to a file:

markitdown --use-plugins path/to/nepali-document.pdf -o output.md

To verify the plugin is registered:

markitdown --list-plugins

You should see likhit in the output.

likhit-save CLI

This package also installs a small helper CLI that runs MarkItDown with the likhit plugin enabled and writes Markdown files for you:

likhit-save path/to/nepali-document.pdf --out output.md

Convert multiple files into a directory:

likhit-save samples/pressrelease.pdf samples/kanunpatrika.pdf --out-dir converted/
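
A rough Python equivalent of the multi-file form, as a minimal sketch: it uses only the MarkItDown calls shown in the Python section above. The helper names `output_path_for` and `convert_all` are hypothetical, and the `<stem>.md` output naming is an assumption rather than likhit-save's documented behavior.

```python
from pathlib import Path


def output_path_for(source: str, out_dir: str) -> Path:
    """Map e.g. samples/pressrelease.pdf to <out_dir>/pressrelease.md (assumed naming)."""
    return Path(out_dir) / (Path(source).stem + ".md")


def convert_all(sources: list[str], out_dir: str) -> list[Path]:
    """Convert each source with plugins enabled and write one Markdown file per input."""
    # Imported lazily so the path helper above works without MarkItDown installed.
    from markitdown import MarkItDown

    md = MarkItDown(enable_plugins=True)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for source in sources:
        target = output_path_for(source, out_dir)
        target.write_text(md.convert(source).text_content, encoding="utf-8")
        written.append(target)
    return written
```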

Extract only one page or a page range from a PDF:

likhit-save path/to/nepali-document.pdf --pages 5 --out page-5.md
likhit-save path/to/nepali-document.pdf --pages 2-4 --out pages-2-4.md
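
For illustration, the two `--pages` forms above can be read as an inclusive, 1-based range. The `parse_pages` helper below is hypothetical and only sketches that interpretation; likhit-save's actual parsing and error handling may differ.

```python
def parse_pages(spec: str) -> tuple[int, int]:
    """Parse '5' or '2-4' into an inclusive, 1-based (first, last) pair."""
    first_text, sep, last_text = spec.strip().partition("-")
    first = int(first_text)
    # A bare number such as "5" selects that single page.
    last = int(last_text) if sep else first
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return first, last
```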

What likhit does

likhit adds behavior beyond MarkItDown in these places:

  • PDF: likhit intercepts PDF inputs, runs the default MarkItDown PDF converter, and then decides whether to keep that result, retry with Nepal-specific extraction, or add an OCR candidate for image-dominant pages. When it detects known Nepali repair fonts, it tries its own Nepal-specific extraction immediately.
  • DOC: Legacy Microsoft Word .doc files are handled by likhit's own extraction pipeline.
  • DOCX: .docx files are still handled by MarkItDown's built-in Word converter, even when plugins are enabled.

Supported document types

  • PDFs, including Nepal-specific born-digital PDFs and image-dominant PDFs that may need OCR
  • Legacy .doc files
  • .docx passthrough via MarkItDown

OCR Configuration

For image-dominant or scanned PDFs, likhit can add an OCR extraction candidate through markitdown-ocr when OCR is configured.

Required model configuration:

export MARKITDOWN_OCR_MODEL="your-model-name"

You can also provide the model through OPENAI_MODEL or GEMINI_MODEL.

Authentication options:

  1. OpenAI-compatible provider with a standard OpenAI key:
export OPENAI_API_KEY="your-api-key"
  2. OpenAI-compatible provider with a custom base URL:
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://your-provider.example/v1/"
export MARKITDOWN_OCR_MODEL="your-model-name"
  3. Gemini using the OpenAI compatibility endpoint:
export GEMINI_API_KEY="your-gemini-api-key"
export GEMINI_MODEL="gemini-2.5-flash"

When GEMINI_API_KEY is set, likhit automatically uses Gemini's OpenAI-compatible base URL unless you explicitly override OPENAI_BASE_URL.
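
The environment rules above can be sketched as a small resolver. This is not likhit's implementation: `resolve_ocr_settings` is a hypothetical helper, and the order in which the model and key variables are consulted is an assumption. The Gemini base URL is the one Google documents for its OpenAI compatibility endpoint.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Google's documented OpenAI-compatible endpoint for the Gemini API.
GEMINI_OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"


def resolve_ocr_settings(env: Mapping[str, str] | None = None) -> dict[str, str | None] | None:
    """Return model, key, and base URL for OCR, or None when OCR is not configured."""
    env = os.environ if env is None else env
    # Assumed precedence: the explicit OCR model wins over provider-specific names.
    model = env.get("MARKITDOWN_OCR_MODEL") or env.get("OPENAI_MODEL") or env.get("GEMINI_MODEL")
    # Assumed precedence: an OpenAI key is preferred when both keys are set.
    api_key = env.get("OPENAI_API_KEY") or env.get("GEMINI_API_KEY")
    if not model or not api_key:
        return None
    # An explicit OPENAI_BASE_URL always overrides the Gemini default.
    base_url = env.get("OPENAI_BASE_URL") or None
    if base_url is None and env.get("GEMINI_API_KEY"):
        base_url = GEMINI_OPENAI_BASE_URL
    return {"model": model, "api_key": api_key, "base_url": base_url}
```

Passing a plain dict instead of reading `os.environ` keeps the sketch easy to try with different combinations of variables.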

Optional variables:

export MARKITDOWN_OCR_PROMPT="Custom OCR instructions"

Architecture

The high-level PDF pipeline is:

  1. MarkItDown loads the plugin when enable_plugins=True or --use-plugins is used.
  2. For PDF inputs, likhit reads the file and optionally slices it to the requested page range.
  3. likhit scans embedded fonts. If it detects known Nepali repair fonts such as Kalimati broken-CMap fonts or legacy remap fonts, it tries the Nepal-specific extraction pipeline immediately.
  4. likhit also runs the default MarkItDown PDF converter and keeps that result as a candidate.
  5. likhit analyzes the PDF pages. If the file looks image-dominant with a suspicious text layer and OCR is configured, it adds an OCR candidate.
  6. If the default Markdown output looks suspicious for Nepali text, likhit retries extraction with its own PDF pipeline.
  7. The Nepal-specific PDF pipeline can apply:
    • Kalimati broken-CMap repair
    • Devanagari reordering
    • Devanagari spacing normalization
    • Legacy-font remapping through npttf2utf
  8. After extraction, likhit checks whether the document matches a known structure such as a single-column notice or a dense two-column layout.
  9. If a known structure is detected, likhit applies structure-aware ordering, block assembly, and Markdown rendering.
  10. If multiple candidate outputs exist, likhit scores them and returns the best one.
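
To make the final scoring step concrete, here is a minimal sketch of choosing among candidates. The heuristic (share of characters in the Devanagari block, minus a penalty for U+FFFD replacement characters) is an assumption chosen for illustration, not likhit's actual scoring; `Candidate`, `devanagari_share`, `score`, and `pick_best` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str      # illustrative labels such as "default", "likhit", "ocr"
    markdown: str


def in_devanagari_block(ch: str) -> bool:
    """True for characters in the Unicode Devanagari block U+0900 to U+097F."""
    return "\u0900" <= ch <= "\u097f"


def devanagari_share(text: str) -> float:
    """Fraction of letters and Devanagari-block characters that are Devanagari."""
    relevant = [ch for ch in text if ch.isalpha() or in_devanagari_block(ch)]
    if not relevant:
        return 0.0
    return sum(in_devanagari_block(ch) for ch in relevant) / len(relevant)


def score(candidate: Candidate) -> float:
    """Higher is better; empty output scores zero."""
    text = candidate.markdown
    if not text.strip():
        return 0.0
    replacement_penalty = text.count("\ufffd") / len(text)
    return devanagari_share(text) - replacement_penalty


def pick_best(candidates: list[Candidate]) -> Candidate:
    """Return the highest-scoring candidate (raises ValueError on an empty list)."""
    return max(candidates, key=score)
```

With this heuristic, an unrepaired legacy-font extraction that comes out as Latin characters scores below a repaired extraction in Devanagari. A real scorer would likely need more signals, since a document that is legitimately in English would score low here.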

Project Layout

  • src/likhit/_plugin.py: MarkItDown plugin entry point and converter registration
  • src/likhit/converters/: plugin converters for PDF and legacy DOC inputs
  • src/likhit/nepali_pdf_repair.py: reusable Nepal-specific PDF repair layer
  • src/likhit/markdown_assembly.py: generic Markdown assembly for the default conversion path
  • src/likhit/extractors/: extraction strategies (PDF, DOC)
    • font_based.py: PDF extraction with Nepali font repair
    • docx_based.py: legacy DOC text extraction
  • src/likhit/handlers/: structure-aware handlers and detection logic
  • src/likhit/renderers/: Markdown rendering
  • tests/: conversion, extraction, and plugin coverage
    • tests/integration/: end-to-end integration tests
    • tests/integration/test_data/: committed test fixtures (PDF, DOCX, DOC samples)

Testing

Running Tests

Run all tests:

poetry run pytest

Ownership

likhit is owned and maintained by Jawafdehi.
