
Nepali PDF-to-Markdown converter powered by MarkItDown

likhit

likhit is a public PDF-to-Markdown tool for Nepali government documents.

The default conversion path is powered by MarkItDown. likhit intercepts born-digital Nepali PDFs that need Nepal-specific repair and fixes their text before any Markdown is emitted. That repair layer handles Kalimati broken-CMap fixes, Devanagari reordering, Devanagari spacing normalization, and, where applicable, legacy Nepali font remapping.
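To give a feel for what reordering and spacing normalization can mean in practice, here is a minimal sketch of two such passes. It is not likhit's actual implementation: the function names, the regular expressions, and the choice of problems to fix are all assumptions made for illustration. It assumes two artifacts that commonly show up in text extracted from PDFs: a short-i vowel sign (U+093F) emitted before its consonant because that is where it is drawn, and stray spaces wedged in front of dependent signs.

```python
import re

CONSONANT = "[\u0915-\u0939]"   # Devanagari letters KA..HA
VIRAMA = "\u094D"               # halant, joins consonants into a conjunct
SHORT_I = "\u093F"              # vowel sign I, drawn to the left of its consonant
# Dependent signs that must attach to a preceding base character.
DEPENDENT = "[\u0900-\u0903\u093A-\u093C\u093E-\u094F]"

# A short-i sign that is NOT preceded by a consonant is treated as stranded
# in visual order; the group captures the consonant cluster it belongs to.
MISPLACED_I = re.compile(
    f"(?<!{CONSONANT}){SHORT_I}((?:{CONSONANT}{VIRAMA})*{CONSONANT})"
)
# Spaces or tabs between a Devanagari character and a dependent sign.
STRAY_SPACE = re.compile(f"(?<=[\u0900-\u097F])[ \t]+(?={DEPENDENT})")


def reorder_short_i(text: str) -> str:
    """Hypothetical pass: move a stranded short-i sign after its cluster."""
    return MISPLACED_I.sub(lambda m: m.group(1) + SHORT_I, text)


def normalize_spacing(text: str) -> str:
    """Hypothetical pass: drop spaces that detach a sign from its base."""
    return STRAY_SPACE.sub("", text)


print(reorder_short_i("\u093F\u0915"))                      # visual-order "ki"
print(normalize_spacing("\u0928 \u0947\u092A\u093E\u0932"))  # "Nepal" with a stray space
```

The lookbehind keeps the first pass from touching text that is already in logical order, at the cost of missing a stranded sign that happens to follow a consonant; a real repair layer would need more context than a regular expression has.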

Installation

With Poetry

poetry install

Run project commands with poetry run:

poetry run pytest
poetry run ruff check .
poetry run black --check .

Recommended Usage

Convert a single PDF to editable Markdown:

poetry run likhit convert path/to/document.pdf --out path/to/document.md

Convert multiple PDFs at once:

poetry run likhit convert path/to/a.pdf path/to/b.pdf --out-dir path/to/output-dir

If --out or --out-dir is omitted, likhit writes Markdown files in the current directory using the input filename stem.
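The default naming rule can be sketched in a few lines. The helper name `default_output_path` is hypothetical, and the `.md` extension is an assumption based on the `--out` example above; the sketch only illustrates what "using the input filename stem" implies.

```python
from pathlib import Path
from typing import Optional


def default_output_path(pdf_path: str, out_dir: Optional[str] = None) -> Path:
    """Hypothetical helper: build the Markdown path from the input's stem."""
    stem = Path(pdf_path).stem                      # "document" for "document.pdf"
    base = Path(out_dir) if out_dir else Path(".")  # current directory by default
    return base / f"{stem}.md"                      # assumed ".md" extension


print(default_output_path("path/to/document.pdf"))
print(default_output_path("path/to/a.pdf", "path/to/output-dir"))
```

Under this reading, two inputs with the same stem from different directories would map to the same output file, so it is worth passing --out or --out-dir explicitly in that case.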

Usage

convert is the public entry point.

  • Input scope: born-digital PDFs only
  • Output: generic editable Markdown
  • Engine: MarkItDown by default
  • likhit value-add: Nepali PDF repair before Markdown output when needed
  • Recognized document layouts such as Kanun Patrika and CIAA-style PDFs are auto-detected internally so likhit can preserve better text order and structure without a --type flag
  • No OCR support is included in this branch

Architecture

The new default pipeline is:

  1. likhit convert opens the PDF and checks whether it matches a known structure-aware document type.
  2. If the PDF matches a known layout such as Kanun Patrika or a CIAA-style document, likhit reuses its existing structure-aware extraction logic internally.
  3. Otherwise, MarkItDown handles the default conversion path.
  4. When the PDF needs Nepali repair, likhit repairs the text first:
    • Kalimati broken-CMap repair
    • Devanagari reordering
    • Devanagari spacing normalization
    • Legacy-font remapping through npttf2utf
  5. likhit assembles repaired text blocks into Markdown.

This keeps the public product story simple: likhit is the tool users call, while MarkItDown is embedded infrastructure.
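One way to read the pipeline steps is as a small routing decision followed by repair and assembly. The sketch below is an illustration under that reading, not likhit's code: `OpenedPdf`, `choose_route`, `repair_block`, and `assemble_markdown` are hypothetical names, the layout labels are made up, and the repair step is a placeholder that only trims whitespace.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OpenedPdf:
    """Hypothetical summary of what is known once the PDF has been opened."""
    text_blocks: List[str]
    layout: Optional[str] = None        # e.g. "kanun_patrika" or "ciaa"; None if unrecognized
    needs_nepali_repair: bool = False


def choose_route(pdf: OpenedPdf) -> str:
    """Pick a conversion route, following the numbered steps above."""
    if pdf.layout is not None:
        return "structure-aware"        # step 2: known layout
    if pdf.needs_nepali_repair:
        return "repair-then-assemble"   # steps 4 and 5
    return "markitdown"                 # step 3: default path


def repair_block(block: str) -> str:
    """Placeholder for the four repair passes; here it only trims whitespace."""
    return block.strip()


def assemble_markdown(blocks: List[str]) -> str:
    """Join non-empty text blocks as Markdown paragraphs."""
    return "\n\n".join(b for b in blocks if b)


pdf = OpenedPdf(text_blocks=["  शीर्षक ", "", "अनुच्छेद"], needs_nepali_repair=True)
if choose_route(pdf) == "repair-then-assemble":
    print(assemble_markdown([repair_block(b) for b in pdf.text_blocks]))
```

Checking the layout before the repair flag is an assumption; the steps above do not say which test takes priority when a PDF both matches a known layout and needs repair.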

Current Scope

  • Supported default input: PDF only
  • Supported default output: Markdown only
  • Supported default document class: born-digital PDFs
  • Unsupported in this branch: OCR, scanned/image-only PDFs, .doc, .docx, and image inputs
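For callers that batch files, the scope above can be enforced with a cheap check before conversion. This is a sketch only: `check_input` is a hypothetical helper, and raising `ValueError` is an assumption rather than likhit's documented behavior. A suffix check cannot tell a scanned PDF from a born-digital one, so it covers only the file-type part of the list.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf"}  # per the scope list: PDF input only


def check_input(path: str) -> Path:
    """Hypothetical guard: reject anything that is not a .pdf by suffix."""
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(
            f"unsupported input type {p.suffix!r}: only PDF input is supported"
        )
    return p


for candidate in ["report.pdf", "notes.docx"]:
    try:
        print("ok:", check_input(candidate))
    except ValueError as err:
        print("skipped:", err)
```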

Project Layout

  • src/likhit/core.py: public convert and convert_many entry points
  • src/likhit/markitdown_integration.py: MarkItDown instance setup and custom PDF converter
  • src/likhit/nepali_pdf_repair.py: reusable Nepal-specific PDF repair layer
  • src/likhit/markdown_assembly.py: generic Markdown assembly for the default conversion path
  • src/likhit/extractors/, src/likhit/handlers/, src/likhit/renderers/: internals for Nepali PDF repair and legacy extraction
  • tests/: conversion, extraction, and CLI coverage



