Nepali PDF-to-Markdown converter powered by MarkItDown
likhit
likhit is a public PDF-to-Markdown tool for Nepali government documents.
The default path is powered by MarkItDown, with likhit intercepting born-digital Nepali PDFs that need Nepal-specific repair before Markdown is emitted. That repair layer handles Kalimati broken-CMap fixes, Devanagari reordering and spacing normalization, and legacy Nepali font remapping where applicable.
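To make "Devanagari reordering" concrete, here is a minimal sketch of one common case. Text extracted in visual order can place the short-i vowel sign (ि) before the consonant cluster it belongs to, whereas Unicode stores it after the cluster. The function name `reorder_short_i` and the regex are illustrative assumptions, not likhit's actual repair code, which covers more cases.

```python
import re

# Devanagari consonants KA..HA (U+0915-U+0939), each optionally followed by a nukta.
_CONSONANT = "[\u0915-\u0939]\u093C?"
_VIRAMA = "\u094D"
_SHORT_I = "\u093F"

# A short-i sign followed by a consonant cluster:
# zero or more consonant+virama pairs, then one final consonant.
_VISUAL_SHORT_I = re.compile(
    f"{_SHORT_I}((?:{_CONSONANT}{_VIRAMA})*{_CONSONANT})"
)


def reorder_short_i(text: str) -> str:
    """Move a leading short-i sign to after its consonant cluster.

    Assumption: `text` is known to be in visual order. Running this on
    text that is already in Unicode logical order would corrupt it.
    """
    return _VISUAL_SHORT_I.sub(lambda m: m.group(1) + _SHORT_I, text)
```

Because the rule cannot tell visual order from logical order by itself, a repair layer would only apply it to text already identified as needing repair.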
Installation
With Poetry
poetry install
Run project commands with poetry run:
poetry run pytest
poetry run ruff check .
poetry run black --check .
Release Process
likhit uses a tag-driven PyPI release flow with GitHub Actions Trusted Publishing.
- Update the package version in pyproject.toml.
- Commit that change.
- Create a matching git tag such as v0.1.1.
- Push the commit and tag to GitHub.
Example:
poetry version patch
git add pyproject.toml poetry.lock
git commit -m "Bump version to 0.1.1"
git tag v0.1.1
git push origin main --follow-tags
The publish workflow verifies that the git tag matches the version in pyproject.toml before uploading to PyPI.
Recommended Usage
Convert a single document to editable Markdown:
# PDF
poetry run likhit convert path/to/document.pdf --out path/to/document.md
# DOCX (all document types)
poetry run likhit convert path/to/document.docx --out path/to/document.md
# DOC (legacy Word format - CIAA documents only, Linux/Mac only)
poetry run likhit convert path/to/ciaa-document.doc --out path/to/document.md
Note: DOC files are only supported for CIAA press releases and require Linux/Mac. For other document types or Windows users, convert DOC to DOCX first.
Convert multiple documents at once:
poetry run likhit convert path/to/a.pdf path/to/b.docx --out-dir path/to/output-dir
If --out or --out-dir is omitted, likhit writes Markdown files in the current directory using the input filename stem.
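The default naming rule can be sketched like this. The function name `default_output_path` is hypothetical and the `.md` extension is an assumption based on the examples above; likhit's own implementation may differ.

```python
from pathlib import Path


def default_output_path(input_path, out=None, out_dir=None) -> Path:
    """Pick the Markdown output path for one input file."""
    # An explicit --out path wins.
    if out is not None:
        return Path(out)
    # Otherwise reuse the input filename stem, inside --out-dir if given,
    # else inside the current working directory.
    base = Path(out_dir) if out_dir is not None else Path.cwd()
    return base / f"{Path(input_path).stem}.md"
```

For example, `default_output_path("path/to/a.pdf", out_dir="build")` gives `build/a.md`.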
Usage
convert is the public path.
- Input scope: born-digital PDFs (DOCX and DOC inputs are covered under Current Scope)
- Output: generic editable Markdown
- Engine: MarkItDown by default
- likhit value-add: Nepali PDF repair before Markdown output when needed
- Recognized document layouts such as Kanun Patrika and CIAA-style PDFs are auto-detected internally so likhit can preserve better text order and structure without a --type flag
- No OCR support is included in this branch
Architecture
The new default pipeline is:
- likhit convert opens the PDF and checks whether it matches a known structure-aware document type.
- If the PDF matches a known layout such as Kanun Patrika or a CIAA-style document, likhit reuses its existing structure-aware extraction logic internally.
- Otherwise, MarkItDown handles the default conversion path.
- When the PDF needs Nepali repair, likhit repairs the text first:
  - Kalimati broken-CMap repair
  - Devanagari reordering
  - Devanagari spacing normalization
  - Legacy-font remapping through npttf2utf
- likhit assembles repaired text blocks into Markdown.
This keeps the public product story simple: likhit is the tool users call, while MarkItDown is embedded infrastructure.
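One way to read the routing described above is sketched below. Every name here is hypothetical: the callables stand in for likhit's layout detection, structure-aware extraction, repair layer, Markdown assembly, and the MarkItDown default path, and the branch order is an interpretation of the steps rather than the project's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pipeline:
    """Hypothetical stand-ins for the stages named in the Architecture list."""
    detect_layout: Callable[[str], Optional[str]]
    extract_structured: Callable[[str, str], str]
    needs_repair: Callable[[str], bool]
    repair: Callable[[str], str]
    assemble: Callable[[str], str]
    default_convert: Callable[[str], str]

    def convert(self, raw_text: str) -> str:
        # Known layouts go to structure-aware extraction.
        layout = self.detect_layout(raw_text)
        if layout is not None:
            return self.extract_structured(raw_text, layout)
        # Text that needs Nepali repair is repaired, then assembled.
        if self.needs_repair(raw_text):
            return self.assemble(self.repair(raw_text))
        # Everything else takes the default conversion path.
        return self.default_convert(raw_text)
```

Passing the stages in as callables keeps the routing testable with simple stubs, independent of any PDF library.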
Current Scope
- Supported input formats:
  - PDF (born-digital, with Nepali text repair)
  - DOCX (Microsoft Word 2007+, text extraction only, all document types)
  - DOC (legacy Microsoft Word, CIAA documents only, Linux/Mac only)
- Supported output: Markdown only
- Supported document types: CIAA press releases, Kanun Patrika journals
- Unsupported in this branch: OCR, scanned/image-only PDFs, image inputs
DOCX/DOC Support Notes
- Text-first extraction approach (no table structure preservation)
- DOCX files: Supported for all document types (CIAA, Kanun Patrika, generic)
- DOC files: Only supported for CIAA press releases
  - Kanun Patrika documents in DOC format are not supported (convert to DOCX or PDF)
  - Generic/unknown DOC documents may work but are not officially supported
- Windows limitation: DOC file extraction does not work on Windows due to antiword binary compatibility
  - Windows users must convert DOC files to DOCX format first
  - Use Microsoft Word, LibreOffice, or online converters
  - Linux/Mac users can process DOC files directly
- Tables are extracted as plain text
- No formatting preservation (bold, italic, etc.)
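The support rules above can be summarized as a small decision function. This is only a sketch: the function name, the document-type labels, and the returned strings are assumptions made for illustration, not part of likhit's API.

```python
import sys


def word_support(suffix: str, document_type: str, platform: str = sys.platform) -> str:
    """Summarize DOCX/DOC support for a file suffix, document type, and platform."""
    suffix = suffix.lower()
    if suffix == ".docx":
        return "supported"  # DOCX works for all document types
    if suffix != ".doc":
        raise ValueError(f"not a Word file: {suffix}")
    if platform.startswith("win"):
        return "convert to DOCX first"  # DOC extraction does not work on Windows
    if document_type == "ciaa":
        return "supported"
    if document_type == "kanun_patrika":
        return "unsupported"  # convert to DOCX or PDF instead
    return "unofficial"  # may work, but not officially supported
```

Taking the platform as a parameter keeps the rules easy to check on any machine.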
Project Layout
- src/likhit/core.py: public convert and convert_many entry points
- src/likhit/markitdown_integration.py: MarkItDown instance setup and custom PDF converter
- src/likhit/nepali_pdf_repair.py: reusable Nepal-specific PDF repair layer
- src/likhit/markdown_assembly.py: generic Markdown assembly for the default conversion path
- src/likhit/extractors/: extraction strategies (PDF, DOCX, DOC)
  - font_based.py: PDF extraction with Nepali font repair
  - docx_based.py: DOCX/DOC text extraction
- src/likhit/handlers/: document type handlers (CIAA, Kanun Patrika)
- src/likhit/renderers/: Markdown rendering
- tests/: conversion, extraction, and CLI coverage
- tests/integration/: end-to-end integration tests with real document fixtures
- tests/integration/test_data/: committed test fixtures (PDF, DOCX, DOC samples)
Testing
Running Tests
Run all tests (unit + integration):
poetry run pytest
Run only integration tests:
poetry run pytest tests/integration -v
Run with coverage:
poetry run pytest --cov=likhit
Integration Test Fixtures
Integration tests use real document fixtures stored in tests/integration/test_data/:
- Size policy: Total fixture size kept under 50 MB (currently ~2.35 MB)
- Formats: PDF, DOCX, DOC samples covering CIAA and Kanun Patrika documents
- Platform notes: DOC tests automatically skip on Windows (requires antiword)
See tests/integration/README.md for fixture governance and how to add new samples.
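A size-policy check along these lines could guard the fixture directory. It is a sketch only: the helper names are hypothetical, and it assumes "50 MB" means 50 MiB. The project's real governance rules live in the integration README.

```python
from pathlib import Path

# Assumption: the 50 MB policy is interpreted as 50 MiB.
MAX_FIXTURE_BYTES = 50 * 1024 * 1024


def total_fixture_bytes(root) -> int:
    """Sum the sizes of all regular files under the fixture directory."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def within_size_policy(root, limit: int = MAX_FIXTURE_BYTES) -> bool:
    """Return True when the fixtures stay strictly under the size limit."""
    return total_fixture_bytes(root) < limit
```

Calling `within_size_policy("tests/integration/test_data")` from a test would flag an oversized fixture before it is committed.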
References
- MarkItDown: https://github.com/microsoft/markitdown
- MarkItDown sample plugin: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin