Nepali PDF-to-Markdown converter powered by MarkItDown

likhit

likhit is a public PDF-to-Markdown tool for Nepali government documents.

The default path is powered by MarkItDown, with likhit intercepting born-digital Nepali PDFs that need Nepal-specific repair before Markdown is emitted. That repair layer handles Kalimati broken-CMap fixes, Devanagari reordering and spacing normalization, and legacy Nepali font remapping where applicable.
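To make the idea of spacing normalization concrete, here is a minimal sketch of one common repair: removing stray spaces that text extraction leaves in front of Devanagari combining marks. It is an illustration only and is not taken from likhit's repair layer; the function names are hypothetical.

```python
import unicodedata


def is_devanagari_mark(ch: str) -> bool:
    """True for combining marks (matras, virama, anusvara) in the Devanagari block."""
    return "\u0900" <= ch <= "\u097F" and unicodedata.category(ch) in ("Mn", "Mc")


def strip_spaces_before_marks(text: str) -> str:
    """Drop spaces that directly precede a Devanagari combining mark.

    A combining mark cannot start a word, so a space in front of one is
    treated here as an extraction artifact. Hypothetical helper, not likhit's API.
    """
    out = []
    for ch in text:
        if is_devanagari_mark(ch):
            # Remove any run of spaces sitting between the base letter and the mark.
            while out and out[-1] == " ":
                out.pop()
        out.append(ch)
    return "".join(out)


# KA + space + AA sign is rejoined into a single syllable.
repaired = strip_spaces_before_marks("\u0915 \u093e")
```

Spaces between ordinary words are left alone because the character after them is a letter, not a combining mark.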

Installation

With Poetry

poetry install

Run project commands with poetry run:

poetry run pytest
poetry run ruff check .
poetry run black --check .

Release Process

likhit uses a tag-driven PyPI release flow with GitHub Actions Trusted Publishing.

1. Update the package version in pyproject.toml.
2. Commit that change.
3. Create a matching git tag such as v0.1.1.
4. Push the commit and tag to GitHub.

Example:

poetry version patch
git add pyproject.toml poetry.lock
git commit -m "Bump version to 0.1.1"
git tag v0.1.1
git push origin main --follow-tags

The publish workflow verifies that the git tag matches the version in pyproject.toml before uploading to PyPI.

Recommended Usage

Convert a single document to editable Markdown:

# PDF
poetry run likhit convert path/to/document.pdf --out path/to/document.md

# DOCX (all document types)
poetry run likhit convert path/to/document.docx --out path/to/document.md

# DOC (legacy Word format - CIAA documents only, Linux/Mac only)
poetry run likhit convert path/to/ciaa-document.doc --out path/to/document.md

Note: DOC files are only supported for CIAA press releases and require Linux/Mac. For other document types or Windows users, convert DOC to DOCX first.

Convert multiple documents at once:

poetry run likhit convert path/to/a.pdf path/to/b.docx --out-dir path/to/output-dir

If --out or --out-dir is omitted, likhit writes Markdown files in the current directory using the input filename stem.
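The output naming rule can be sketched as follows. This is a hypothetical helper written for illustration, not likhit's code; the `.md` suffix and the precedence of `--out` over `--out-dir` are assumptions.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(
    input_path: str,
    out: Optional[str] = None,
    out_dir: Optional[str] = None,
) -> Path:
    """Pick the Markdown path for one input file (hypothetical helper)."""
    if out is not None:
        # An explicit --out value is used as given.
        return Path(out)
    # Otherwise fall back to --out-dir, or the current directory if it is omitted.
    target_dir = Path(out_dir) if out_dir is not None else Path.cwd()
    # The file name is the input stem; the ".md" suffix is an assumption.
    return target_dir / (Path(input_path).stem + ".md")


example = resolve_output_path("path/to/a.pdf", out_dir="path/to/output-dir")
```

Under these assumptions, `path/to/a.pdf` with `--out-dir path/to/output-dir` maps to `path/to/output-dir/a.md`.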

Usage

The convert command is the public entry point.

- Input scope: born-digital PDFs, plus the DOCX and DOC inputs described under Current Scope
- Output: generic editable Markdown
- Engine: MarkItDown by default
- likhit value-add: Nepali PDF repair before Markdown output when needed
- Layout detection: recognized document layouts such as Kanun Patrika and CIAA-style PDFs are auto-detected internally, so likhit can preserve better text order and structure without a --type flag
- OCR: not included in this branch

Architecture

The default pipeline is:

1. likhit convert opens the PDF and checks whether it matches a known structure-aware document type.
2. If the PDF matches a known layout such as Kanun Patrika or a CIAA-style document, likhit reuses its existing structure-aware extraction logic internally.
3. Otherwise, MarkItDown handles the default conversion path.
4. When the PDF needs Nepali repair, likhit repairs the text first:
   - Kalimati broken-CMap repair
   - Devanagari reordering
   - Devanagari spacing normalization
   - Legacy-font remapping through npttf2utf
5. likhit assembles the repaired text blocks into Markdown.

This keeps the public product story simple: likhit is the tool users call, while MarkItDown is embedded infrastructure.
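The routing described in this section can be pictured with a small dispatcher. Everything in this sketch is hypothetical: the class, its field names, and the exact order of the checks are assumptions made for illustration, and each callable stands in for logic that lives elsewhere in likhit or MarkItDown.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PdfRouter:
    """Hypothetical router; every callable is an illustrative stand-in."""

    detect_layout: Callable[[str], Optional[str]]  # e.g. "kanun_patrika", "ciaa", or None
    structured_extract: Callable[[str, str], str]  # structure-aware extraction
    needs_repair: Callable[[str], bool]            # does the PDF need Nepali repair?
    repair_and_assemble: Callable[[str], str]      # repair text, then build Markdown
    markitdown_convert: Callable[[str], str]       # default MarkItDown conversion

    def convert(self, pdf_path: str) -> str:
        layout = self.detect_layout(pdf_path)
        if layout is not None:
            # Known layout: reuse the structure-aware path.
            return self.structured_extract(pdf_path, layout)
        if self.needs_repair(pdf_path):
            # Repair happens before any Markdown is emitted.
            return self.repair_and_assemble(pdf_path)
        # Everything else goes through the default engine.
        return self.markitdown_convert(pdf_path)


# Toy wiring with lambdas in place of real extractors.
router = PdfRouter(
    detect_layout=lambda p: "ciaa" if "ciaa" in p else None,
    structured_extract=lambda p, layout: f"structured:{layout}",
    needs_repair=lambda p: "kalimati" in p,
    repair_and_assemble=lambda p: "repaired",
    markitdown_convert=lambda p: "markitdown",
)
```

Passing the steps in as callables keeps the sketch independent of any real extractor and makes the branch order easy to test in isolation.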

Current Scope

- Supported input formats:
  - PDF (born-digital, with Nepali text repair)
  - DOCX (Microsoft Word 2007+, text extraction only, all document types)
  - DOC (legacy Microsoft Word, CIAA documents only, Linux/Mac only)
- Supported output: Markdown only
- Supported document types: CIAA press releases, Kanun Patrika journals
- Unsupported in this branch: OCR, scanned/image-only PDFs, image inputs

DOCX/DOC Support Notes

- Text-first extraction approach (no table structure preservation)
- DOCX files: supported for all document types (CIAA, Kanun Patrika, generic)
- DOC files: supported only for CIAA press releases
  - Kanun Patrika documents in DOC format are not supported (convert to DOCX or PDF)
  - Generic/unknown DOC documents may work but are not officially supported
- Windows limitation: DOC extraction does not work on Windows because it relies on the antiword binary, which is not compatible with Windows
  - Windows users must convert DOC files to DOCX first, using Microsoft Word, LibreOffice, or an online converter
  - Linux/Mac users can process DOC files directly
- Tables are extracted as plain text
- Formatting such as bold and italic is not preserved
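These rules can be checked before a conversion is attempted. The helper below is a hypothetical pre-flight check that encodes the notes above; the function name and the document type labels ("ciaa", "kanun_patrika", "generic") are assumptions, and it treats "not officially supported" as unsupported.

```python
import sys
from pathlib import Path


def check_input_supported(path, doc_type="generic", platform=sys.platform):
    """Return (supported, reason) for an input file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix in (".pdf", ".docx"):
        return True, "supported"
    if suffix == ".doc":
        if platform.startswith("win"):
            # DOC extraction relies on antiword, which does not work on Windows.
            return False, "convert the DOC file to DOCX first"
        if doc_type != "ciaa":
            # Only CIAA press releases are officially supported in DOC format.
            return False, "convert the DOC file to DOCX or PDF"
        return True, "supported"
    return False, f"unsupported input format: {suffix or 'no extension'}"


ok, reason = check_input_supported("notice.doc", doc_type="ciaa", platform="linux")
```

Taking the platform as a parameter makes the Windows branch testable on any machine.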

Project Layout

- src/likhit/core.py: public convert and convert_many entry points
- src/likhit/markitdown_integration.py: MarkItDown instance setup and custom PDF converter
- src/likhit/nepali_pdf_repair.py: reusable Nepal-specific PDF repair layer
- src/likhit/markdown_assembly.py: generic Markdown assembly for the default conversion path
- src/likhit/extractors/: extraction strategies (PDF, DOCX, DOC)
  - font_based.py: PDF extraction with Nepali font repair
  - docx_based.py: DOCX/DOC text extraction
- src/likhit/handlers/: document type handlers (CIAA, Kanun Patrika)
- src/likhit/renderers/: Markdown rendering
- tests/: conversion, extraction, and CLI coverage
  - tests/integration/: end-to-end integration tests with real document fixtures
  - tests/integration/test_data/: committed test fixtures (PDF, DOCX, DOC samples)

Testing

Running Tests

Run all tests (unit + integration):

poetry run pytest

Run only integration tests:

poetry run pytest tests/integration -v

Run with coverage:

poetry run pytest --cov=likhit

Integration Test Fixtures

Integration tests use real document fixtures stored in tests/integration/test_data/:

- Size policy: the total fixture size is kept under 50 MB (currently ~2.35 MB)
- Formats: PDF, DOCX, and DOC samples covering CIAA and Kanun Patrika documents
- Platform notes: DOC tests are skipped automatically on Windows because they require antiword

See tests/integration/README.md for fixture governance and how to add new samples.
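A size budget like this is easy to check in a script or a test. The sketch below is an assumption about how that could be done, not part of likhit's test suite; the helper names are hypothetical, and reading "50 MB" as 50 * 1024 * 1024 bytes is also an assumption.

```python
from pathlib import Path

# 50 MB budget; binary megabytes are an assumption.
MAX_FIXTURE_BYTES = 50 * 1024 * 1024


def total_fixture_size(root) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def fixtures_within_budget(root, limit: int = MAX_FIXTURE_BYTES) -> bool:
    """True when the fixture tree is strictly under the size limit."""
    return total_fixture_size(root) < limit
```

A test could then assert `fixtures_within_budget("tests/integration/test_data")` so that an oversized sample fails fast.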

References

MarkItDown: https://github.com/microsoft/markitdown
MarkItDown sample plugin: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin

Release history

- 0.1.7: May 4, 2026
- 0.1.6: Apr 28, 2026
- 0.1.1 (this version): Mar 25, 2026
- 0.1.0: Mar 20, 2026

Download files

Source Distribution

likhit-0.1.1.tar.gz (34.4 kB), uploaded Mar 25, 2026, Source

Built Distribution

likhit-0.1.1-py3-none-any.whl (44.0 kB), uploaded Mar 25, 2026, Python 3

File details

Details for the file likhit-0.1.1.tar.gz.

File metadata

Download URL: likhit-0.1.1.tar.gz
Upload date: Mar 25, 2026
Size: 34.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for likhit-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`ce87c97b853df0cf1541c58e7bd9140a784e9f0a54424c1a6784cf5aff9f6c0c`
MD5	`75e479163baac8248e91879eb2ba119a`
BLAKE2b-256	`f7e716d350996d9f78345ec9f199cb42a4dffa50a38c803fd47e6bd437a30849`
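
A downloaded file can be compared against the published SHA256 digest with the standard library. This is a minimal sketch; the helper names are hypothetical, and the file is assumed to have been downloaded locally already.

```python
import hashlib


def sha256_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 with a published hex digest, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("likhit-0.1.1.tar.gz", "ce87c97b853df0cf1541c58e7bd9140a784e9f0a54424c1a6784cf5aff9f6c0c")` checks the source distribution against the digest listed above.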

Provenance

The following attestation bundles were made for likhit-0.1.1.tar.gz:

Publisher: pypi-publish.yml on NewNepal-org/likhit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: likhit-0.1.1.tar.gz
- Subject digest: ce87c97b853df0cf1541c58e7bd9140a784e9f0a54424c1a6784cf5aff9f6c0c
- Sigstore transparency entry: 1180073010
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: NewNepal-org/likhit@40cbbf525603afc523a7cd56152274468c4486ee
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/NewNepal-org
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@40cbbf525603afc523a7cd56152274468c4486ee
- Trigger Event: push

File details

Details for the file likhit-0.1.1-py3-none-any.whl.

File metadata

Download URL: likhit-0.1.1-py3-none-any.whl
Upload date: Mar 25, 2026
Size: 44.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for likhit-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bb5958d48586d5f2c660dcffae2c35708de9eb6c5e7e7a3ec59f21592e31659f`
MD5	`ad9198c56021da9e162c11c6d5d04c40`
BLAKE2b-256	`d6449965894d653687be0b6c40594953a17ad050cf79cc8d65a1d429ed16d9f1`

Provenance

The following attestation bundles were made for likhit-0.1.1-py3-none-any.whl:

Publisher: pypi-publish.yml on NewNepal-org/likhit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: likhit-0.1.1-py3-none-any.whl
- Subject digest: bb5958d48586d5f2c660dcffae2c35708de9eb6c5e7e7a3ec59f21592e31659f
- Sigstore transparency entry: 1180073016
- Sigstore integration time: Mar 25, 2026
Source repository:
- Permalink: NewNepal-org/likhit@40cbbf525603afc523a7cd56152274468c4486ee
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/NewNepal-org
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@40cbbf525603afc523a7cd56152274468c4486ee
- Trigger Event: push
