A Python tool to extract Khmer text from PDF documents using Tesseract OCR.
Khmer Document Parser v0.2.0
khmerdocparser is a command-line tool to extract Khmer text from PDF files. It works by converting each page of a PDF into an image and then using Google's Tesseract OCR engine to extract the text.
This tool uses the Pytesseract library as a wrapper for Tesseract.
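The two-step pipeline described above (rasterize each page, then OCR it) can be sketched roughly as follows. This is an illustration, not the package's actual source: the use of `pdf2image` for rasterizing, the `khm+eng` language string, the 300 DPI default, and the function names `ocr_pages` and `pdf_to_text` are all assumptions made for the example.

```python
from typing import Callable, Iterable


def ocr_pages(images: Iterable, ocr: Callable[[object], str]) -> str:
    """Run `ocr` on each page image and join the page texts with blank lines."""
    return "\n\n".join(ocr(image).strip() for image in images)


def pdf_to_text(pdf_path: str, lang: str = "khm+eng", dpi: int = 300) -> str:
    """Rasterize a PDF via Poppler, then OCR every page with Tesseract."""
    # Imported lazily so ocr_pages() stays usable without these packages.
    # pdf2image is an assumed choice of rasterizer; it relies on Poppler.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_pages(
        images,
        lambda image: pytesseract.image_to_string(image, lang=lang),
    )
```

Keeping the OCR step as a plain callable makes the joining logic easy to exercise without Tesseract installed.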
Features
- Extracts both Khmer and English text from PDFs using Tesseract.
- Simple command-line interface.
- Option to save extracted text to a file.
- Can be used as a library in your own Python projects.
Prerequisites
This package requires two crucial external dependencies: Poppler (for handling PDFs) and Tesseract OCR (for recognizing text). You must install both on your system.
1. Tesseract OCR Installation
You must install the Tesseract engine and the Khmer language pack.
- Windows:
  - Download and run the Tesseract installer from UB-Mannheim's GitHub.
  - During installation, make sure to check the box for the Khmer language pack to include it.
  - Important: Add the Tesseract installation directory (e.g., C:\Program Files\Tesseract-OCR) to your system's PATH environment variable.
- macOS (using Homebrew):

      # Install Tesseract engine
      brew install tesseract
      # Install all available language packs, including Khmer
      brew install tesseract-lang

- Linux (Debian/Ubuntu):

      # Install Tesseract engine
      sudo apt-get update
      sudo apt-get install tesseract-ocr
      # Install the Khmer language pack
      sudo apt-get install tesseract-ocr-khm
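After installing, it can be worth confirming that the Khmer language pack is actually visible to Tesseract. The sketch below parses the output of `tesseract --list-langs`; the helper names are hypothetical, and the header-line handling assumes the usual "List of available languages ..." first line.

```python
from __future__ import annotations

import subprocess


def parse_languages(list_langs_output: str) -> set[str]:
    """Turn the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # Skip the first line, assumed to be the "List of available languages" header.
    return {line for line in lines[1:] if line}


def installed_languages(tesseract_cmd: str = "tesseract") -> set[str]:
    """Ask the Tesseract binary which language packs it can find."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract releases print the list to stderr instead of stdout.
    return parse_languages(result.stdout or result.stderr)
```

With Tesseract on the PATH, `"khm" in installed_languages()` should be true once the Khmer pack is installed.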
2. Poppler Installation
- Windows:
  - Download the latest Poppler binary from here.
  - Extract the archive and add its bin directory to your system's PATH.

- macOS (using Homebrew):

      brew install poppler

- Linux (Debian/Ubuntu):

      sudo apt-get install poppler-utils
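A quick way to check that both prerequisites are reachable on the PATH is sketched below. It looks for the `tesseract` executable and for `pdftoppm`, one of the Poppler utilities. Using `pdftoppm` as the indicator for Poppler is an assumption of this example, as is the helper name `missing_tools`.

```python
import shutil

# Executables used as indicators: `pdftoppm` ships with Poppler (assumed proxy).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list from `missing_tools()` suggests both tools are installed. Otherwise, the returned names show which PATH entry still needs attention.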
Installation
Once Poppler and Tesseract are installed, you can install this package from PyPI:
pip install --upgrade khmerdocparser
Usage
As a Command-Line Tool
To extract text and print it to the console:
khmerdocparser /path/to/your/document.pdf
To save the extracted text to a file:
khmerdocparser /path/to/your/document.pdf -o extracted_text.txt
If Tesseract or Poppler is not on your system's PATH, you can specify their locations explicitly:
khmerdocparser doc.pdf --tesseract_path "C:\Tesseract\tesseract.exe" --poppler_path "C:\Poppler\bin"
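When calling the command-line tool from a script, passing the arguments as a list avoids quoting problems with paths that contain spaces, such as C:\Program Files. The sketch below uses only the options shown above (`-o`, `--tesseract_path`, `--poppler_path`); the helper names `build_command` and `run` are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_command(
    pdf_path: str,
    output: Optional[str] = None,
    tesseract_path: Optional[str] = None,
    poppler_path: Optional[str] = None,
) -> List[str]:
    """Assemble a khmerdocparser command line as an argument list."""
    cmd = ["khmerdocparser", pdf_path]
    if output:
        cmd += ["-o", output]
    if tesseract_path:
        cmd += ["--tesseract_path", tesseract_path]
    if poppler_path:
        cmd += ["--poppler_path", poppler_path]
    return cmd


def run(pdf_path: str, **options) -> subprocess.CompletedProcess:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(pdf_path, **options), check=True)
```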
As a Python Library
from khmerdocparser.main import extract_text_from_pdf
pdf_path = "/path/to/your/document.pdf"
text = extract_text_from_pdf(pdf_path)
print(text)
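For more than one document, the library call can be wrapped in a small batch helper. The sketch below is one possible approach, not part of the package: `extract_folder` is a hypothetical name, and the extractor is passed in as a callable so the example does not assume anything about `extract_text_from_pdf` beyond the single-argument call shown above.

```python
from pathlib import Path
from typing import Callable, List


def extract_folder(folder: str, extract: Callable[[str], str]) -> List[Path]:
    """OCR every PDF in `folder` and write a .txt file next to each one."""
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        out = pdf.with_suffix(".txt")
        # Explicit UTF-8 keeps Khmer characters intact whatever the
        # platform's default encoding is.
        out.write_text(extract(str(pdf)), encoding="utf-8")
        written.append(out)
    return written


# Example usage (the "scans" folder name is a placeholder):
#   from khmerdocparser.main import extract_text_from_pdf
#   extract_folder("scans", extract_text_from_pdf)
```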