
A Python tool to extract Khmer text from PDF documents using Tesseract OCR.


Khmer Document Parser v0.2.1

khmerdocparser is a command-line tool to extract Khmer text from PDF files. It works by converting each page of a PDF into an image and then using Google's Tesseract OCR engine to extract the text.

This tool uses the Pytesseract library as a wrapper for Tesseract.
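
The two-step flow described above (render each page to an image, then run OCR on it) can be sketched roughly as follows. This is an illustrative outline, not the package's actual source: it assumes the pdf2image library is used for the Poppler step, and the names `ocr_pdf`, `_render_pages`, and `_recognize` are hypothetical.

```python
from typing import Callable, Iterable


def _render_pages(pdf_path: str) -> Iterable:
    # Assumption: pdf2image (a Poppler wrapper) performs the PDF-to-image step.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=300)


def _recognize(image) -> str:
    # "khm+eng" asks Tesseract to load both the Khmer and English models.
    import pytesseract
    return pytesseract.image_to_string(image, lang="khm+eng")


def ocr_pdf(
    pdf_path: str,
    render: Callable[[str], Iterable] = _render_pages,
    recognize: Callable[[object], str] = _recognize,
) -> str:
    """Hypothetical helper: OCR every page and join the page texts."""
    page_texts = [recognize(image).strip() for image in render(pdf_path)]
    # Separate pages with a blank line so page boundaries stay visible.
    return "\n\n".join(page_texts)
```

The third-party imports sit inside the helper functions so that the rendering and recognition steps can be swapped out, for example with stubs when neither Poppler nor Tesseract is installed.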

Features

  • Extracts both Khmer and English text from PDFs using Tesseract.
  • Simple command-line interface.
  • Option to save extracted text to a file.
  • Can be used as a library in your own Python projects.

Prerequisites

This package requires two crucial external dependencies: Poppler (for handling PDFs) and Tesseract OCR (for recognizing text). You must install both on your system.

1. Tesseract OCR Installation

You must install the Tesseract engine and the Khmer language pack.

  • Windows:

    1. Download and run the Tesseract installer from UB-Mannheim's GitHub.
    2. During installation, check the box for the Khmer language pack so that it is installed along with the engine.
    3. Important: Add the Tesseract installation directory (e.g., C:\Program Files\Tesseract-OCR) to your system's PATH environment variable.
  • macOS (using Homebrew):

    # Install Tesseract engine
    brew install tesseract
    
    # Install all available language packs, including Khmer
    brew install tesseract-lang
    
  • Linux (Debian/Ubuntu):

    # Install Tesseract engine
    sudo apt-get update
    sudo apt-get install tesseract-ocr
    
    # Install the Khmer language pack
    sudo apt-get install tesseract-ocr-khm
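
After installing, it can be worth confirming that Tesseract actually sees the Khmer model (language code `khm`). The sketch below shells out to `tesseract --list-langs` and parses the listing; the function names `parse_languages` and `khmer_available` are hypothetical and not part of khmerdocparser.

```python
import shutil
import subprocess


def parse_languages(listing: str) -> set:
    """Turn the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in listing.splitlines()]
    # The first line is a header such as 'List of available languages ... (3):'.
    return {line for line in lines if line and not line.startswith("List of")}


def khmer_available() -> bool:
    """Return True if a `tesseract` executable on PATH reports the khm model."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some Tesseract releases print the listing to stderr, so check both streams.
    return "khm" in parse_languages(result.stdout + "\n" + result.stderr)
```

If `khmer_available()` returns `False`, either Tesseract is not on the PATH or the Khmer language pack is missing.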
    

2. Poppler Installation

  • Windows:

    1. Download the latest Poppler binary release for Windows.
    2. Extract the archive and add its bin directory to your system's PATH.
  • macOS (using Homebrew):

    brew install poppler
    
  • Linux (Debian/Ubuntu):

    sudo apt-get install poppler-utils
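
A quick way to check that the Poppler command-line tools are reachable is to look them up on the PATH. The sketch below checks for `pdftoppm` and `pdfinfo`, which ship with Poppler; whether khmerdocparser calls exactly these two programs is an assumption, and `missing_tools` is a hypothetical helper name.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str] = ("pdftoppm", "pdfinfo"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler tools found.")
```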
    

Installation

Once Poppler and Tesseract are installed, you can install this package from PyPI:

pip install --upgrade khmerdocparser

Usage

As a Command-Line Tool

To extract text and print it to the console:

khmerdocparser /path/to/your/document.pdf

To save the extracted text to a file:

khmerdocparser /path/to/your/document.pdf -o extracted_text.txt

If Tesseract or Poppler is not on your system's PATH, you can specify the locations explicitly:

khmerdocparser doc.pdf --tesseract_path "C:\Tesseract\tesseract.exe" --poppler_path "C:\Poppler\bin"

As a Python Library

from khmerdocparser.main import extract_text_from_pdf

pdf_path = "/path/to/your/document.pdf"
text = extract_text_from_pdf(pdf_path)
print(text)
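
Building on the library call above, a folder of PDFs could be processed in one pass. This is a minimal sketch: `extract_folder` is a hypothetical helper, and it assumes only that the extractor takes a path and returns a string, as in the example above. The output is written as UTF-8 explicitly so that Khmer characters survive on platforms whose default encoding differs.

```python
from pathlib import Path
from typing import Callable, List


def extract_folder(src_dir, out_dir, extract: Callable[[str], str]) -> List[Path]:
    """Run `extract` on every *.pdf in src_dir and write one .txt per PDF."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        target = out / (pdf.stem + ".txt")
        # Write UTF-8 explicitly; the platform default may not handle Khmer.
        target.write_text(extract(str(pdf)), encoding="utf-8")
        written.append(target)
    return written


# Possible usage with this package (requires khmerdocparser to be installed):
#   from khmerdocparser.main import extract_text_from_pdf
#   extract_folder("pdfs", "texts", extract_text_from_pdf)
```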
