
Lipi

Part of the Aparsoft open-source EdTech toolchain. Built for the Apar Academy Hindi PDF content ingestion pipeline and open-sourced for the Indian EdTech community.



Decode legacy Hindi/Indic PDFs. KrutiDev, Chanakya → Unicode.

What this does

  1. Split PDFs by page range - extract chapters, lectures, or units out of a large PDF into separate files, with optional batch processing via a JSON config.

  2. Extract Unicode text from legacy Hindi-font PDFs - detect KrutiDev / Chanakya encoded PDFs and convert the extracted text to proper Unicode Devanagari, making it searchable, copy-pasteable, and usable in NLP pipelines.

  3. Optional second-stage lexicon correction - a conservative, heuristic pass that catches noisy tokens the character-level mapping misses, using bounded Levenshtein distance against a built-in Hindi word list.

Why this exists

Legacy Hindi textbooks, state board materials, government circulars, and Hindi newspapers were typeset in glyph-substitution fonts like KrutiDev and Chanakya before Unicode became the standard. These PDFs look correct in a viewer, but the underlying bytes are ASCII, not Devanagari. When you extract text with any standard library (pypdf, pdfplumber, pdfminer), you get gibberish like osQ kjk Fk Hk.

This toolkit detects that situation and applies a character-level reverse-mapping to give you usable Hindi text.
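
For intuition, here is a minimal sketch of the longest-match-first substitution the mapping tables rely on. The toy table below is NOT the real KrutiDev table (that lives in lipi/mappings/); its entries are inferred from the osQ kjk Fk example used in this README:

import re

# Illustrative entries only - the real table is much larger.
TOY_MAP = {"osQ": "के", "Fk": "थ", "j": "र", "k": "ा"}

# Sort keys longest-first so the ligature "osQ" wins over the single "k".
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(TOY_MAP, key=len, reverse=True))
)

def toy_convert(text: str) -> str:
    return _PATTERN.sub(lambda m: TOY_MAP[m.group(0)], text)

print(toy_convert("osQ kjk Fk"))  # के ारा थ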


Known Limitations

  • Conversion is ~85-92% accurate: KrutiDev glyph mapping is context-free. Some characters (e.g. k) can be either a matra or part of a consonant cluster, so perfect accuracy requires a context-aware parser or an LLM correction pass.
  • PDF fonts are NOT re-encoded: split_pdf() copies pages byte-for-byte. The output PDFs still render correctly in viewers, but the underlying bytes remain in the legacy encoding. Use extract_unicode_text() when you need the text, not the file.
  • Chanakya support is partial: the Chanakya mapping covers the most common characters. Documents using uncommon ligatures or regional variants may need manual review.
  • Second-stage correction is heuristic: the optional lexicon pass is off by default and only runs on legacy-detected extraction paths. It can improve noisy KrutiDev output, but it is still a heuristic layer and should be reviewed on important documents (see the sketch after this list).
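
To make the last point concrete, here is a rough, self-contained sketch of the bounded-Levenshtein idea. It is illustrative only: the shipped HindiLexiconCorrector also runs suspicious-token heuristics before attempting any replacement, and uses the bundled word list rather than this toy one.

def levenshtein(a: str, b: str) -> int:
    # Classic row-by-row edit-distance DP.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

TOY_LEXICON = {"हैरान", "होकर", "देखा"}  # stand-in for the bundled word list

def correct_token(token: str, max_dist: int = 2) -> str:
    # Replace a noisy token only when a lexicon word is close enough.
    best = min(TOY_LEXICON, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_dist else token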

Installation

# Core (PDF splitting + text extraction)
pip install lipi-aparsoft

# With Flask web UI
pip install "lipi-aparsoft[flask]"

# Development
pip install "lipi-aparsoft[dev]"

Or clone and install in editable mode:

git clone https://github.com/aparsoft/lipi.git
cd lipi
pip install -e ".[dev]"

Note: The PyPI distribution name is lipi-aparsoft, but the Python import name remains lipi:

from lipi import HindiPreprocessor  # import name is always 'lipi'

Quick Start

Extract Unicode text from a Hindi PDF

from lipi import HindiPreprocessor

# Convert raw KrutiDev text
unicode_text = HindiPreprocessor.convert("osQ kjk Fk", font_type="krutidev")
print(unicode_text)  # के ारा थ

# Auto-detect and convert
result = HindiPreprocessor.correct_hindi_text("eSaus gSjku gksdj ns[kk")

Extract from a PDF

from lipi.extractor import extract_unicode_text

result = extract_unicode_text("old_hindi_textbook.pdf")
print(result["has_encoding_issues"])   # True
print(result["detected_font_type"])    # "krutidev"
print(result["full_text"][:500])       # Clean Devanagari Unicode

# Optional second-stage lexicon correction for legacy-detected PDFs
improved = extract_unicode_text(
        "old_hindi_textbook.pdf",
        second_stage="lexicon",
)
print(improved["correction_stats"])

Run the regression harness over real samples

from lipi.regression import run_regression_harness

report = run_regression_harness([
        "temp/jhkr102.pdf",
        "temp/ihkr101.pdf",
])
print(report["improved_pages"])
print(report["average_quality_delta"])

Split a PDF

from lipi.splitter import PDFSplitter

PDFSplitter.split_pdf(
    input_file  = "hindi_science_class10.pdf",
    output_dir  = "chapters/",
    page_ranges = [
        (1,  18, "Chapter1_ChemicalReactions"),
        (19, 40, "Chapter2_Acids"),
        (41, 65, "Chapter3_Metals"),
    ],
    prefix    = "HindiPDF_Sci10",
    unit_name = "Science",
)

Detect encoding

from lipi import HindiPreprocessor

has_issues, font_type = HindiPreprocessor.detect_encoding(raw_text)
# → (True, "krutidev")
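
The detected font type feeds straight into convert(). A minimal continuation of the snippet above, assuming raw_text came from your own extraction step:

if has_issues:
    unicode_text = HindiPreprocessor.convert(raw_text, font_type=font_type)
else:
    unicode_text = raw_text  # already proper Unicode Devanagari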

CLI

# Extract text from a PDF
lipi extract hindi.pdf

# Extract with optional second-stage lexicon correction
lipi extract hindi.pdf --second-stage lexicon

# Extract with JSON output
lipi extract hindi.pdf --json

# Extract specific pages
lipi extract hindi.pdf --page-range 1-10

# Split a PDF
lipi split book.pdf --ranges "1-20:Ch1,21-45:Ch2" --output-dir chapters/

# Show PDF info
lipi info hindi.pdf

# Benchmark one or more PDFs page-by-page
lipi regress temp/jhkr102.pdf temp/ihkr101.pdf

# Opt in to a more aggressive contextual lexicon built from repeated clean tokens
lipi regress temp/jhkr102.pdf --bootstrap-lexicon
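
The --ranges value is plain text ("start-end:Name" pairs, comma-separated), so it is easy to build or parse from a script. A hypothetical parser for illustration - the real one lives in lipi/splitter.py and is not guaranteed to match this exactly:

def parse_ranges(spec: str) -> list[tuple[int, int, str]]:
    # "1-20:Ch1,21-45:Ch2" -> [(1, 20, "Ch1"), (21, 45, "Ch2")]
    ranges = []
    for part in spec.split(","):
        span, name = part.split(":", 1)
        start, end = span.split("-", 1)
        ranges.append((int(start), int(end), name))
    return ranges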

Flask Web UI

pip install "lipi-aparsoft[flask]"
python web/flask_app.py
# → http://localhost:5000

Features:

  • Upload & preview PDF info (page count, size, encoding detection)
  • Single PDF splitting with named ranges
  • Batch directory processing with JSON config
  • Hindi text extraction with before/after comparison (raw pypdf vs lipi-aparsoft output)
  • Conversion summary badges (legacy detected, text changed, etc.)
  • JSON config editor
  • Output file browser with download/delete

Project structure

lipi/
├── src/lipi/
│   ├── __init__.py              # Public API: HindiPreprocessor, HindiLexiconCorrector, run_regression_harness
│   ├── preprocessor.py          # Convert + detect + post-process
│   ├── extractor.py             # PDF text extraction (pypdf) + optional lexicon stage
│   ├── correction.py            # HindiLexiconCorrector (bounded Levenshtein, suspicious-token heuristics)
│   ├── regression.py            # Page-by-page quality harness with quality metrics
│   ├── splitter.py              # PDF splitting + batch processing
│   ├── cli.py                   # Command-line interface (extract, split, info, regress)
│   ├── _quality.py              # Garbage text detection (character ratio analysis)
│   ├── _lexicon.py              # Bundled Hindi word list (300+ words)
│   └── mappings/
│       ├── __init__.py          # FONT_MAPPINGS merged dict
│       ├── krutidev.py          # KrutiDev → Unicode base table
│       ├── chanakya.py          # Chanakya → Unicode table
│       └── walkman_chanakya.py  # Walkman-Chanakya905 overrides
├── web/
│   ├── flask_app.py             # Flask web UI (dual extraction + comparison)
│   └── templates/               # HTML templates
├── tests/
│   ├── test_mappings.py         # Mapping tables: loading, merging, value validation
│   ├── test_preprocessor.py     # Detection, conversion, i-matra reorder, post-process repairs
│   ├── test_extractor.py        # Quality gate, file-not-found, generic cleanup on non-legacy PDFs
│   ├── test_correction.py       # Lexicon corrector: token correction, suspicious token detection
│   ├── test_regression.py       # Quality metrics: quality_index, lexicon_hit_rate, artifact counts
│   ├── test_splitter.py         # Parse ranges, config validation, split files, PDF info
│   └── test_flask_app.py        # Flask route tests
├── pyproject.toml
└── README.md

How the Hindi encoding fix works

PDF file (KrutiDev font)
        |
        v
pypdf.extract_text()   <- returns garbled ASCII: "osQ kjk Fk dj jgk gS"
        |
        v
detect_encoding()      <- heuristic: low Devanagari ratio + KrutiDev fingerprints
        |
        v
convert()              <- longest-match-first substitution using char mapping table
        |
        v
post_process()         <- generic Unicode cleanup:
                           - remove doubled matras (ाा→ा)
                           - fix mark-spacing (consonant SPACE matra → consonant+matra)
                           - fix halant-spacing (् SPACE consonant → ्consonant)
                           - fix duplicate consonant+i-matra (कक→कि, ववक→विक)
                           - fix श्श्ि → श्चि
                           - fix decomposed nukta+i (डड़→ड़ि)
                           - fix common words (अौर→और, अार→आर)
        |
        v
lexicon second stage   <- optional, only on legacy-detected paths:
  (HindiLexiconCorrector)
                           - split text into tokens
                           - detect suspicious tokens (nonstandard nukta, duplicate marks)
                           - find closest lexicon match via bounded Levenshtein
                           - only replace if distance ≤ 2 and match is strong
        |
        v
Unicode text: "के ारा थ कर रहा है"  <- ~85-92% accuracy (improves with lexicon stage)
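
For intuition, here is a toy version of the detection heuristic and one post-process repair from the diagram above. The threshold and fingerprints are invented for illustration (the fingerprints are lifted from the garbled samples in this README); the shipped logic in preprocessor.py is more thorough:

import re

def devanagari_ratio(text: str) -> float:
    # Fraction of letters that fall in the Devanagari block (U+0900-U+097F).
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= c <= "\u097F" for c in letters) / len(letters)

# Hypothetical KrutiDev fingerprints, drawn from the samples above.
KRUTIDEV_HINTS = ("kjk", "gS", "[k")

def looks_like_krutidev(text: str) -> bool:
    # Mostly non-Devanagari letters plus at least one known fingerprint.
    return devanagari_ratio(text) < 0.2 and any(h in text for h in KRUTIDEV_HINTS)

def collapse_doubled_matras(text: str) -> str:
    # One of the post_process repairs: ाा -> ा
    return re.sub("ाा+", "ा", text)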

Contributing

See CONTRIBUTING.md for guidelines on adding font mappings and contributing code.

Development setup

git clone https://github.com/aparsoft/lipi.git
cd lipi
pip install -e ".[dev]"
pytest



Acknowledgements

  • Built on pypdf for PDF manipulation
  • KrutiDev mapping tables cross-referenced against community resources at rajbhasha.net
  • Inspired by countless developers who hit the "Hindi PDF gibberish" problem on GitHub Issues and Stack Overflow

License

MIT © Aparsoft Private Limited


Aparsoft builds AI-powered EdTech tools for Indian schools and students. Our flagship product Apar AI LMS delivers Hindi curriculum-aligned content to schools across India. This toolkit is part of our internal content processing pipeline, open-sourced for the community.
