A smart Python tool to extract Khmer text from PDF and image files, using OCR for scanned documents and direct extraction for native PDFs.
Khmer Document Parser v0.3.0
khmerdocparser is a smart, all-in-one command-line tool to extract Khmer text from both PDF and image files.
For PDFs, it first attempts a fast, direct text extraction. If that yields no usable text (as with a scanned document), it automatically falls back to Tesseract OCR, preprocessing each page image first to improve recognition accuracy.
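The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function names, the `min_chars` threshold, and the use of `pdf2image` and `pytesseract` as the bridge to Poppler and Tesseract are all assumptions. Only `pdfplumber`, Tesseract, and Poppler are named in this README.

```python
from typing import Callable, Tuple


def extract_with_fallback(
    pdf_path: str,
    direct: Callable[[str], str],
    ocr: Callable[[str], str],
    min_chars: int = 20,  # hypothetical threshold for "usable" text
) -> Tuple[str, str]:
    """Try direct extraction first; use OCR if it fails or yields too little text.

    Returns the extracted text and the method that produced it.
    """
    try:
        text = direct(pdf_path)
    except Exception:
        # Treat any direct-extraction failure as "no text found".
        text = ""
    if len(text.strip()) >= min_chars:
        return text, "direct"
    return ocr(pdf_path), "ocr"


def direct_with_pdfplumber(pdf_path: str) -> str:
    """Direct extraction for native PDFs (requires the pdfplumber package)."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def ocr_with_tesseract(pdf_path: str) -> str:
    """OCR for scanned PDFs (assumes pdf2image and pytesseract are installed)."""
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(p, lang="khm") for p in pages)
```

With the real extractors, a call would look like `extract_with_fallback("document.pdf", direct_with_pdfplumber, ocr_with_tesseract)`. Passing the extractors in as arguments keeps the decision logic testable without any external tools installed.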
Features
- Universal Support: Handles both PDF and common image files (.png, .jpg, etc.).
- Smart PDF Parsing: Uses pdfplumber for native PDFs and falls back to Tesseract OCR for scanned PDFs.
- Advanced OCR: Applies image preprocessing (grayscaling, binarization, noise removal) for high accuracy on scanned documents.
- User-Friendly: Provides progress bars and detailed logging.
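To make the three preprocessing steps concrete, here is a small pure-Python sketch that runs them on an image held as nested lists of RGB tuples. It only illustrates the ideas: the function names, the fixed threshold of 128, and the 3x3 median filter are assumptions, and the package itself presumably uses an image library instead.

```python
from statistics import median
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Gray = List[List[int]]


def to_grayscale(rgb: List[List[Pixel]]) -> Gray:
    """Collapse RGB to one channel using the ITU-R BT.601 luma weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb
    ]


def binarize(gray: Gray, threshold: int = 128) -> Gray:
    """Map every pixel to pure white (255) or pure black (0)."""
    return [[255 if v >= threshold else 0 for v in row] for row in gray]


def remove_noise(gray: Gray) -> Gray:
    """Apply a 3x3 median filter; coordinates are clamped at the borders."""
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                gray[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


def preprocess(rgb: List[List[Pixel]]) -> Gray:
    """Grayscale, then binarize, then remove isolated specks."""
    return remove_noise(binarize(to_grayscale(rgb)))
```

A single dark speck on a white page is removed by the median filter, while larger connected strokes such as glyphs survive it.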
Prerequisites
This package requires two crucial external dependencies: Poppler and Tesseract OCR.
1. Tesseract OCR Installation
You must install the Tesseract engine and the Khmer (khm) language pack.
- Windows: Download and run the installer from UB-Mannheim's GitHub. Ensure you select the Khmer language pack during installation. Add Tesseract to your system's PATH.
- macOS:
  brew install tesseract tesseract-lang
- Linux (Ubuntu/Debian):
  sudo apt-get install tesseract-ocr tesseract-ocr-khm
2. Poppler Installation
- Windows: Download the latest binary from here, extract it, and add the bin folder to your system's PATH.
- macOS:
  brew install poppler
- Linux (Ubuntu/Debian):
  sudo apt-get install poppler-utils
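Before installing the package, it can help to confirm that both prerequisites are reachable. The sketch below is one way to do that from Python. The helper names are hypothetical, and checking for `pdftoppm` assumes that this Poppler utility is the one the tool relies on. `tesseract --list-langs` is a standard Tesseract option.

```python
import shutil
import subprocess
from typing import Dict, Optional


def find_prerequisites() -> Dict[str, Optional[str]]:
    """Return the resolved path of each external binary, or None if not on PATH."""
    return {
        "tesseract": shutil.which("tesseract"),
        "pdftoppm": shutil.which("pdftoppm"),  # shipped with Poppler
    }


def has_khmer_language(tesseract_cmd: str = "tesseract") -> bool:
    """Check whether the Khmer (khm) language pack is listed by Tesseract."""
    try:
        result = subprocess.run(
            [tesseract_cmd, "--list-langs"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # The binary is missing or cannot be executed.
        return False
    # Some Tesseract versions print the list to stderr, so check both streams.
    return "khm" in (result.stdout + result.stderr).split()


if __name__ == "__main__":
    for name, path in find_prerequisites().items():
        print(f"{name}: {path or 'NOT FOUND'}")
    print("Khmer language pack:", "found" if has_khmer_language() else "missing")
```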
Installation
Once Poppler and Tesseract are installed, you can install or upgrade the package from PyPI:
pip install --upgrade khmerdocparser
Usage
The command is the same for any supported file type.
Extract from a PDF or Image
# Process a PDF
khmerdocparser /path/to/your/document.pdf
# Process an image
khmerdocparser /path/to/your/scanned_image.png
Save Output to a File
This is the recommended way to view Khmer text correctly.
khmerdocparser my_document.pdf -o my_document_text.txt
Specifying Paths Manually (if not in PATH)
khmerdocparser doc.pdf --tesseract_path "C:\Tesseract\tesseract.exe" --poppler_path "C:\Poppler\bin"
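When calling the tool from a script, the same options can be assembled programmatically. The sketch below wraps the command line with `subprocess`. The helper names are hypothetical, and it uses only the flags shown above (`-o`, `--tesseract_path`, `--poppler_path`). It does not assume the package exposes a Python API.

```python
import subprocess
from typing import List, Optional


def build_command(
    input_path: str,
    output_path: Optional[str] = None,
    tesseract_path: Optional[str] = None,
    poppler_path: Optional[str] = None,
) -> List[str]:
    """Assemble the khmerdocparser argument list, adding only the options given."""
    cmd = ["khmerdocparser", input_path]
    if output_path:
        cmd += ["-o", output_path]
    if tesseract_path:
        cmd += ["--tesseract_path", tesseract_path]
    if poppler_path:
        cmd += ["--poppler_path", poppler_path]
    return cmd


def run_parser(input_path: str, **options: Optional[str]) -> subprocess.CompletedProcess:
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_command(input_path, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with Windows paths that contain spaces or backslashes. If the output is written with `-o`, reading it back with `open(path, encoding="utf-8")` is a reasonable default for Khmer text, assuming the tool writes UTF-8.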