A Python tool to extract Khmer text from PDF documents by converting pages to images and using OCR.
Khmer Document Parser
khmerdocparser is a command-line tool to extract Khmer text from PDF files. It works by converting each page of a PDF into an image and then using optical character recognition (OCR) to extract the text.
OCR is performed with the EasyOCR library.
Features
- Extracts both Khmer and English text from PDFs.
- Simple command-line interface.
- Option to save extracted text to a file.
- Can be used as a library in your own Python projects.
Prerequisites
This package depends on Poppler, an external PDF rendering toolkit that pip does not install for you. You must install it on your system before using this tool.
Poppler Installation
- Windows:
  - Download the latest Poppler binary for Windows.
  - Extract the archive (e.g., to C:\Program Files\poppler-23.11.0).
  - Add the bin directory inside the extracted folder (e.g., C:\Program Files\poppler-23.11.0\bin) to your system's PATH environment variable.
  - Alternatively, you can use the --poppler_path argument when running the script to point to this bin directory.
- macOS (using Homebrew):
  brew install poppler
- Linux (Debian/Ubuntu):
  sudo apt-get update
  sudo apt-get install poppler-utils
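Before running the tool, it can help to confirm that Poppler's executables are actually reachable. The following is a minimal sketch, assuming that pdftoppm (one of the command-line utilities shipped with Poppler) is a reasonable proxy for a working installation. The helper name find_poppler is hypothetical and is not part of khmerdocparser.

```python
import shutil
from typing import Optional


def find_poppler(poppler_path: Optional[str] = None) -> Optional[str]:
    """Return the full path to Poppler's pdftoppm executable, or None.

    poppler_path is an optional directory to search instead of the system
    PATH, for example the bin directory of an extracted Windows archive.
    This helper is illustrative and not part of khmerdocparser.
    """
    # shutil.which searches the given directory (or PATH when path is None)
    # and applies the PATHEXT suffixes such as .exe on Windows.
    return shutil.which("pdftoppm", path=poppler_path)


if __name__ == "__main__":
    location = find_poppler()
    if location is None:
        print("Poppler not found on PATH; install it or pass --poppler_path.")
    else:
        print(f"Poppler found: {location}")
```

If the check returns None on Windows, pass the bin directory to find_poppler to verify that the folder you intend to give to --poppler_path really contains the Poppler executables.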
Installation
Once Poppler is installed, you can install this package from PyPI:
pip install khmerdocparser
Usage
As a Command-Line Tool
To extract text from a PDF and print it to the console:
khmerdocparser /path/to/your/document.pdf
To save the extracted text to a file:
khmerdocparser /path/to/your/document.pdf --output extracted_text.txt
If you are on Windows and did not add Poppler to your PATH:
khmerdocparser C:\Users\You\doc.pdf --poppler_path "C:\path\to\poppler\bin"
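The three invocations above imply one positional PDF path and two optional flags. As a rough illustration of that interface only, and not the tool's actual source, an argparse parser with the same shape might look like the sketch below. The function name build_parser, the attribute name pdf_path, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command-line shape (illustrative)."""
    parser = argparse.ArgumentParser(
        prog="khmerdocparser",
        description="Extract Khmer text from a PDF via page images and OCR.",
    )
    # Positional argument: the PDF to read. The name pdf_path is an assumption.
    parser.add_argument("pdf_path", help="Path to the PDF file.")
    # Optional: write the extracted text to a file instead of the console.
    parser.add_argument("--output", default=None, help="File to save the text to.")
    # Optional: location of Poppler's bin directory when it is not on PATH.
    parser.add_argument("--poppler_path", default=None, help="Poppler bin directory.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["doc.pdf", "--output", "extracted_text.txt"])
    print(args.pdf_path, args.output, args.poppler_path)
```

Leaving both flags defaulted to None matches the documented behavior: without --output the text goes to the console, and without --poppler_path Poppler is looked up on PATH.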
As a Python Library
You can also import and use the function directly in your code.
from khmerdocparser.main import extract_text_from_pdf
pdf_path = "/path/to/your/document.pdf"
# For Windows, if Poppler is not in PATH (the raw string keeps backslashes literal)
# poppler_bin_path = r"C:\path\to\poppler\bin"
# text = extract_text_from_pdf(pdf_path, poppler_path=poppler_bin_path)
# For macOS and Linux
text = extract_text_from_pdf(pdf_path)
print(text)
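When several PDFs need processing, the library call can be wrapped in a small loop. The sketch below is one possible approach under stated assumptions: extract_folder is a hypothetical helper, and the extraction function is passed in as a parameter (any callable that takes a PDF path and returns a string) so that the loop does not depend on how khmerdocparser is installed.

```python
from pathlib import Path
from typing import Callable, List


def extract_folder(folder: str, extract: Callable[[str], str]) -> List[Path]:
    """Run extract on every .pdf file in folder and save each result beside it.

    Returns the list of text files written. Illustrative helper, not part of
    khmerdocparser.
    """
    written: List[Path] = []
    # Only files ending in lowercase .pdf are matched on case-sensitive systems.
    for pdf in sorted(Path(folder).glob("*.pdf")):
        text = extract(str(pdf))
        out = pdf.with_suffix(".txt")
        # Khmer script is outside ASCII, so write UTF-8 explicitly rather than
        # relying on the platform's default encoding.
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written
```

Passing extract_text_from_pdf as the extract argument would process each file with the library function shown above. On Windows without Poppler on PATH, functools.partial can bind the poppler_path keyword first.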
How to Publish (for Developers)
- Build the package:
  pip install build twine
  python -m build
- Upload to PyPI:
  twine upload dist/*
  You will need a PyPI account and an API token for this step.
File details
Details for the file khmerdocparser-0.1.0.tar.gz.
File metadata
- Download URL: khmerdocparser-0.1.0.tar.gz
- Upload date:
- Size: 4.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1d224afcf68766404d8f6adb04ce2891a701174566a34a109db56f81ad18ff37 |
| MD5 | 6819153799f471c467a2161391e1400f |
| BLAKE2b-256 | 803d64b9a22ea0ebad9e6bf142dcf3f44c496f0d0f3f064a0efc49828f06fca3 |
File details
Details for the file khmerdocparser-0.1.0-py3-none-any.whl.
File metadata
- Download URL: khmerdocparser-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2eac7145a295b555cce98a8e2e47364e09ae05760aa5929a5065e3891059939d |
| MD5 | f8292fc2224b3e7f9836b5d1b0e1d993 |
| BLAKE2b-256 | 1b98aaa701b66c602dda39eb74902b5c379b88e80b23d44c450b2851327b5358 |