Nepali PDF-to-Markdown converter powered by MarkItDown
likhit
likhit is a public PDF-to-Markdown tool for Nepali government documents.
The default path is powered by MarkItDown, with likhit intercepting born-digital Nepali PDFs that need Nepal-specific repair before Markdown is emitted. That repair layer handles Kalimati broken-CMap fixes, Devanagari reordering and spacing normalization, and legacy Nepali font remapping where applicable.
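As an illustration of what "Devanagari reordering" can involve, the sketch below moves a short-i matra (U+093F) that was extracted in visual order, before its consonant cluster, back to Unicode logical order, after the cluster. This is a hypothetical example and not likhit's actual repair code: the function name `reorder_short_i` and the regular expression are assumptions made for illustration. It should only be applied to text already known to be in visual order, because it would corrupt text that is already correctly ordered.

```python
import re

# Devanagari consonants KA..HA (U+0915-U+0939), virama (U+094D), short-i matra (U+093F).
_CONSONANT = "[\u0915-\u0939]"
_VIRAMA = "\u094D"
_SHORT_I = "\u093F"

# A cluster is zero or more "consonant + virama" pairs followed by one consonant.
# The pattern matches a short-i matra that sits immediately BEFORE such a cluster.
_VISUAL_ORDER = re.compile(
    f"{_SHORT_I}((?:{_CONSONANT}{_VIRAMA})*{_CONSONANT})"
)


def reorder_short_i(text: str) -> str:
    """Move each leading short-i matra to the end of the cluster that follows it.

    Hypothetical helper: assumes `text` is in visual (glyph) order.
    """
    return _VISUAL_ORDER.sub(lambda m: m.group(1) + _SHORT_I, text)
```

The cluster pattern matters for conjuncts: for a sequence such as matra, SA, virama, THA, the matra has to land after THA rather than after SA.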
Installation
With Poetry
poetry install
Run project commands with poetry run:
poetry run pytest
poetry run ruff check .
poetry run black --check .
Recommended Usage
Convert a single PDF to editable Markdown:
poetry run likhit convert path/to/document.pdf --out path/to/document.md
Convert multiple PDFs at once:
poetry run likhit convert path/to/a.pdf path/to/b.pdf --out-dir path/to/output-dir
If --out or --out-dir is omitted, likhit writes Markdown files in the current directory using the input filename stem.
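The output-path rule above can be sketched in a few lines of Python. This is a minimal illustration, not likhit's own code: the function name `resolve_output_path` and its parameters are hypothetical, and the `.md` extension is assumed from the examples above.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(
    pdf_path: str,
    out: Optional[str] = None,
    out_dir: Optional[str] = None,
) -> Path:
    """Pick the Markdown output path for one input PDF (hypothetical helper)."""
    if out is not None:
        # An explicit --out value is used as given.
        return Path(out)
    # "path/to/document.pdf" -> "document"
    stem = Path(pdf_path).stem
    # Fall back to the current directory when no --out-dir is given.
    base = Path(out_dir) if out_dir is not None else Path(".")
    return base / f"{stem}.md"
```

Note that two inputs with the same filename stem in different directories would map to the same output file under this rule; how likhit handles that case is not stated here.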
Usage
convert is the public path.
- Input scope: born-digital PDFs only
- Output: generic editable Markdown
- Engine: MarkItDown by default
- likhit value-add: Nepali PDF repair before Markdown output when needed
- Recognized document layouts such as Kanun Patrika and CIAA-style PDFs are auto-detected internally so likhit can preserve better text order and structure without a --type flag
- No OCR support is included in this branch
Architecture
The new default pipeline is:
- likhit convert opens the PDF and checks whether it matches a known structure-aware document type.
- If the PDF matches a known layout such as Kanun Patrika or a CIAA-style document, likhit reuses its existing structure-aware extraction logic internally.
- Otherwise, MarkItDown handles the default conversion path.
- When the PDF needs Nepali repair, likhit repairs the text first:
  - Kalimati broken-CMap repair
  - Devanagari reordering
  - Devanagari spacing normalization
  - Legacy-font remapping through npttf2utf
- likhit assembles repaired text blocks into Markdown.
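The routing described in the steps above could look roughly like the sketch below. It is one possible reading of the pipeline, not the actual implementation: every name here (`convert_pdf`, `detect_layout`, `extract_structured`, `needs_repair`, `repair_and_assemble`, `markitdown_convert`) is hypothetical, and the individual stages are passed in as callables so the control flow can be shown without inventing likhit or MarkItDown internals.

```python
from typing import Callable, Optional


def convert_pdf(
    pdf_path: str,
    detect_layout: Callable[[str], Optional[str]],
    extract_structured: Callable[[str, str], str],
    needs_repair: Callable[[str], bool],
    repair_and_assemble: Callable[[str], str],
    markitdown_convert: Callable[[str], str],
) -> str:
    """Route one PDF through the pipeline and return Markdown (illustrative only)."""
    # Steps 1-2: known layouts go to the structure-aware extractor.
    layout = detect_layout(pdf_path)
    if layout is not None:
        return extract_structured(pdf_path, layout)
    # Steps 4-5: PDFs that need Nepali repair are repaired, then assembled.
    if needs_repair(pdf_path):
        return repair_and_assemble(pdf_path)
    # Step 3: everything else takes the plain MarkItDown path.
    return markitdown_convert(pdf_path)
```

Passing the stages in as parameters keeps the sketch testable with stubs; a real implementation would more likely call its own modules directly.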
This keeps the public product story simple: likhit is the tool users call, while MarkItDown is embedded infrastructure.
Current Scope
- Supported default input: PDF only
- Supported default output: Markdown only
- Supported default document class: born-digital PDFs
- Unsupported in this branch: OCR, scanned/image-only PDFs, .doc, .docx, and image inputs
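A caller that batches many files may want to skip unsupported inputs before invoking the tool. The sketch below is a hypothetical pre-check, not part of likhit: it accepts only files with a .pdf suffix that start with the `%PDF-` header bytes. It cannot distinguish born-digital PDFs from scanned or image-only ones, which would need inspection of the page contents.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file has a .pdf suffix and begins with the PDF header.

    Hypothetical helper: a cheap filter only, not a validity check.
    """
    p = Path(path)
    if p.suffix.lower() != ".pdf":
        # Rejects .doc, .docx, and image inputs by extension.
        return False
    with p.open("rb") as fh:
        # PDF files begin with the ASCII marker "%PDF-" followed by a version.
        return fh.read(5) == b"%PDF-"
```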
Project Layout
- src/likhit/core.py: public convert and convert_many entry points
- src/likhit/markitdown_integration.py: MarkItDown instance setup and custom PDF converter
- src/likhit/nepali_pdf_repair.py: reusable Nepal-specific PDF repair layer
- src/likhit/markdown_assembly.py: generic Markdown assembly for the default conversion path
- src/likhit/extractors/, src/likhit/handlers/, src/likhit/renderers/: internal Nepali PDF repair and legacy extraction internals
- tests/: conversion, extraction, and CLI coverage
References
- MarkItDown: https://github.com/microsoft/markitdown
- MarkItDown sample plugin: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin
File details
Details for the file likhit-0.1.0.tar.gz.
File metadata
- Download URL: likhit-0.1.0.tar.gz
- Upload date:
- Size: 29.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e89ab19fdbca6b4c02e46be24553acb3900f3198b9b28cabf0d3b4a447396342 |
| MD5 | 0753ec2808107d9df296e9acc208c8d3 |
| BLAKE2b-256 | 043dd37373e26db3803ee446687fdcbdf4fbde663c221942664db21e73349fd1 |
File details
Details for the file likhit-0.1.0-py3-none-any.whl.
File metadata
- Download URL: likhit-0.1.0-py3-none-any.whl
- Upload date:
- Size: 39.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f557b5eca1c460e7a55ca535210f3255dda4048c4a6ed5590a9612210541f281 |
| MD5 | 92fdf725da991a3bac62a8ff65268cf3 |
| BLAKE2b-256 | 0169c66b455fd9353526e3674e9bf67c3bced89ef7efa80b243efb44bf20dcea |
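A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The sketch below is a generic example; the helper names `sha256_hex` and `matches_published_hash` are hypothetical and not part of likhit.

```python
import hashlib


def sha256_hex(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_hex(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("likhit-0.1.0-py3-none-any.whl", "<SHA256 from the table>")` should return True for an intact download of the wheel.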