# Lipi

Legacy Hindi font (KrutiDev/Chanakya) to Unicode Devanagari toolkit with PDF splitting.

Part of the Aparsoft open-source EdTech toolchain. Built for the Apar Academy Hindi PDF content ingestion pipeline and open-sourced for the Indian EdTech community.

Decode legacy Hindi/Indic PDFs: KrutiDev, Chanakya → Unicode.
## What this does

Two things:

- **Split PDFs by page range** - extract chapters, lectures, or units from a large PDF into separate files, with optional batch processing via a JSON config.
- **Extract Unicode text from legacy Hindi-font PDFs** - detect KrutiDev/Chanakya encoded PDFs and convert the extracted text to proper Unicode Devanagari, making it searchable, copy-pasteable, and usable in NLP pipelines.
## Why this exists

Hindi textbooks, state board materials, government circulars, and Hindi newspapers were typeset in glyph-substitution fonts like KrutiDev and Chanakya before Unicode became the standard. These PDFs render correctly in a viewer, but the underlying bytes are ASCII, not Devanagari. Extract the text with any standard library (pypdf, pdfplumber, pdfminer) and you get gibberish like `osQ kjk Fk`.
This toolkit detects that situation and applies a character-level reverse-mapping to give you usable Hindi text.
## Known Limitations

| Limitation | Detail |
|---|---|
| Conversion is ~85-92% accurate | KrutiDev glyph mapping is context-free. Some characters (e.g. `k`) can be the ा matra or part of a consonant cluster. Perfect accuracy requires a context-aware parser or an LLM correction pass. |
| PDF fonts are NOT re-encoded | `split_pdf()` copies pages byte-for-byte. The output PDFs still render correctly in viewers, but the underlying bytes remain in the legacy encoding. Use `extract_unicode_text()` when you need the text, not the file. |
| Chanakya support is partial | The Chanakya mapping covers the most common characters. Documents using uncommon ligatures or regional variants may need manual review. |
## Installation

```bash
# Core (PDF splitting + text extraction)
pip install lipi-aparsoft

# With Flask web UI
pip install "lipi-aparsoft[flask]"

# Development
pip install "lipi-aparsoft[dev]"
```

Or clone and install in editable mode:

```bash
git clone https://github.com/aparsoft/lipi.git
cd lipi
pip install -e ".[dev]"
```

Note: the PyPI distribution name is `lipi-aparsoft`, but the Python import name remains `lipi`:

```python
from lipi import HindiPreprocessor  # import name is always 'lipi'
```
## Quick Start

### Extract Unicode text from a Hindi PDF

```python
from lipi import HindiPreprocessor

# Convert raw KrutiDev text
unicode_text = HindiPreprocessor.convert("osQ kjk Fk", font_type="krutidev")
print(unicode_text)  # के ारा थ

# Auto-detect and convert
result = HindiPreprocessor.correct_hindi_text("eSaus gSjku gksdj ns[kk")
```
### Extract from a PDF

```python
from lipi.extractor import extract_unicode_text

result = extract_unicode_text("old_hindi_textbook.pdf")
print(result["has_encoding_issues"])  # True
print(result["detected_font_type"])   # "krutidev"
print(result["full_text"][:500])      # Clean Devanagari Unicode
```
### Split a PDF

```python
from lipi.splitter import PDFSplitter

PDFSplitter.split_pdf(
    input_file="hindi_science_class10.pdf",
    output_dir="chapters/",
    page_ranges=[
        (1, 18, "Chapter1_ChemicalReactions"),
        (19, 40, "Chapter2_Acids"),
        (41, 65, "Chapter3_Metals"),
    ],
    prefix="HindiPDF_Sci10",
    unit_name="Science",
)
```
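The batch mode mentioned above drives the same call from a JSON config. The config layout below is purely illustrative (the project's real schema may differ); the sketch only shows parsing such a file into `split_pdf()` keyword arguments:

```python
import json

# Hypothetical JSON layout for batch splitting; check the project's
# config docs for the real schema.
CONFIG = """
{
  "jobs": [
    {"input_file": "hindi_science_class10.pdf",
     "output_dir": "chapters/",
     "prefix": "HindiPDF_Sci10",
     "unit_name": "Science",
     "page_ranges": [[1, 18, "Chapter1_ChemicalReactions"],
                     [19, 40, "Chapter2_Acids"]]}
  ]
}
"""

def load_jobs(raw: str) -> list[dict]:
    # Normalise page_ranges into the (start, end, name) tuples
    # that PDFSplitter.split_pdf() expects.
    jobs = json.loads(raw)["jobs"]
    for job in jobs:
        job["page_ranges"] = [tuple(r) for r in job["page_ranges"]]
    return jobs

jobs = load_jobs(CONFIG)
# for job in jobs:
#     PDFSplitter.split_pdf(**job)
print(jobs[0]["page_ranges"][0])  # → (1, 18, 'Chapter1_ChemicalReactions')
```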
### Detect encoding

```python
from lipi import HindiPreprocessor

has_issues, font_type = HindiPreprocessor.detect_encoding(raw_text)
# → (True, "krutidev")
```
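The core idea behind this detection can be approximated by measuring how many letters fall in the Devanagari block: legacy-font extractions are almost entirely ASCII. This is a standalone sketch of the heuristic, not the library's actual implementation:

```python
def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    # Legacy-font extractions are mostly ASCII letters; genuine
    # Devanagari text sits in the U+0900-U+097F block.
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    devanagari = sum(1 for c in letters if "\u0900" <= c <= "\u097f")
    return devanagari / len(letters) < threshold

print(looks_garbled("osQ kjk Fk dj jgk gS"))  # → True
print(looks_garbled("मैंने हैरान होकर देखा"))        # → False
```

The library additionally checks for KrutiDev-specific character fingerprints before committing to a font type; a plain ratio check like this cannot distinguish KrutiDev from Chanakya.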
## CLI

```bash
# Extract text from a PDF
lipi extract hindi.pdf

# Extract with JSON output
lipi extract hindi.pdf --json

# Extract specific pages
lipi extract hindi.pdf --page-range 1-10

# Split a PDF
lipi split book.pdf --ranges "1-20:Ch1,21-45:Ch2" --output-dir chapters/

# Show PDF info
lipi info hindi.pdf
```
## Flask Web UI

```bash
pip install "lipi-aparsoft[flask]"
python web/flask_app.py
# → http://localhost:5000
```

Features:
- Upload & preview PDF info (page count, size, encoding detection)
- Single PDF splitting with named ranges
- Batch directory processing with JSON config
- Hindi text extraction with before/after preview
- JSON config editor
- Output file browser with download/delete
## Project structure

```text
lipi/
├── src/lipi/
│   ├── __init__.py              # Public API (HindiPreprocessor)
│   ├── preprocessor.py          # Convert + detect + post-process
│   ├── extractor.py             # PDF text extraction (pypdf)
│   ├── splitter.py              # PDF splitting + batch processing
│   ├── cli.py                   # Command-line interface
│   ├── _quality.py              # Garbage text detection
│   └── mappings/
│       ├── __init__.py          # FONT_MAPPINGS merged dict
│       ├── krutidev.py          # KrutiDev → Unicode base table
│       ├── chanakya.py          # Chanakya → Unicode table
│       └── walkman_chanakya.py  # Walkman-Chanakya905 overrides
├── web/
│   ├── flask_app.py             # Flask web UI
│   └── templates/               # HTML templates
├── tests/
│   ├── test_mappings.py
│   ├── test_preprocessor.py
│   ├── test_extractor.py
│   └── test_splitter.py
├── pyproject.toml
└── README.md
```
## How the Hindi encoding fix works

```text
PDF file (KrutiDev font)
          |
          v
pypdf.extract_text()   <- returns garbled ASCII: "osQ kjk Fk dj jgk gS"
          |
          v
detect_encoding()      <- heuristic: low Devanagari ratio + KrutiDev fingerprints
          |
          v
convert()              <- longest-match-first substitution using char mapping table
          |
          v
post_process()         <- removes doubled matras, fixes common word errors
          |
          v
Unicode text: "के ारा थ कर रहा है"   <- ~85-92% accuracy
```
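The `convert()` step in the pipeline above can be sketched as a greedy, longest-match-first substitution over a mapping table. This is a minimal illustration with a toy table, not the real KrutiDev mapping (which has hundreds of entries, plus the post-processing pass):

```python
# Toy glyph-to-Unicode table for illustration only.
TOY_MAP = {
    "osQ": "के",
    "jgk": "रहा",
    "Fk": "थ",
    "dj": "कर",
    "gS": "है",
    "k": "ा",
}

def convert(text: str, mapping: dict[str, str]) -> str:
    # Try the longest keys first so multi-character glyph sequences
    # win over their single-character prefixes (e.g. "jgk" before "k").
    keys = sorted(mapping, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # pass unmapped characters through
            i += 1
    return "".join(out)

print(convert("osQ dj jgk gS", TOY_MAP))  # → के कर रहा है
```

The context-free nature of this substitution is exactly why accuracy tops out around 85-92%: a key like `k` maps the same way regardless of whether it should be a matra or part of a cluster in that position.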
## Contributing

See CONTRIBUTING.md for guidelines on adding font mappings and contributing code.

### Development setup

```bash
git clone https://github.com/aparsoft/lipi.git
cd lipi
pip install -e ".[dev]"
pytest
```
## Acknowledgements

- Built on `pypdf` for PDF manipulation
- KrutiDev mapping tables cross-referenced against community resources at rajbhasha.net
- Inspired by the countless developers who hit the "Hindi PDF gibberish" problem on GitHub Issues and Stack Overflow
## License

MIT © Aparsoft Private Limited

Aparsoft builds AI-powered EdTech tools for Indian schools and students. Our flagship product, Apar AI LMS, delivers Hindi curriculum-aligned content to schools across India. This toolkit is part of our internal content processing pipeline, open-sourced for the community.