A Python tool to extract Khmer text from PDF documents using Tesseract OCR.

Khmer Document Parser v0.2.0

khmerdocparser is a command-line tool and Python library that extracts Khmer text from PDF files. It converts each page of a PDF into an image, then runs Google's Tesseract OCR engine on each image to extract the text.

This tool uses the Pytesseract library as a wrapper for Tesseract.
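
The page-to-image-to-text flow described above can be sketched roughly as follows. This is an illustration, not the package's actual source: it assumes pdf2image (a Poppler wrapper) does the rendering, and the function name `ocr_pdf`, its `rasterize`/`recognize` parameters, and the `khm+eng` language string are hypothetical choices made for this example.

```python
import functools
from typing import Callable, List, Optional


def ocr_pdf(
    pdf_path: str,
    lang: str = "khm+eng",
    rasterize: Optional[Callable[[str], List[object]]] = None,
    recognize: Optional[Callable[[object], str]] = None,
) -> str:
    """Render each PDF page to an image, OCR it, and join the page texts."""
    if rasterize is None:
        # Assumption: pdf2image is the library that calls Poppler.
        from pdf2image import convert_from_path
        rasterize = convert_from_path
    if recognize is None:
        # Pytesseract is the Tesseract wrapper named in this README.
        import pytesseract
        recognize = functools.partial(pytesseract.image_to_string, lang=lang)

    pages = rasterize(pdf_path)  # one image per page
    # Trim whitespace around each page and separate pages with a blank line.
    return "\n\n".join(recognize(page).strip() for page in pages)
```

The two callables are injectable so the flow can be exercised without Poppler or Tesseract installed; with both installed, `ocr_pdf("document.pdf")` falls back to the real libraries.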

Features

  • Extracts both Khmer and English text from PDFs using Tesseract.
  • Simple command-line interface.
  • Option to save extracted text to a file.
  • Can be used as a library in your own Python projects.

Prerequisites

This package requires two external dependencies: Poppler (for handling PDFs) and Tesseract OCR (for recognizing text). You must install both on your system.

1. Tesseract OCR Installation

You must install the Tesseract engine and the Khmer language pack.

  • Windows:

    1. Download and run the Tesseract installer from UB-Mannheim's GitHub.
    2. During installation, check the box for the Khmer language pack so that it is installed along with the engine.
    3. Important: Add the Tesseract installation directory (e.g., C:\Program Files\Tesseract-OCR) to your system's PATH environment variable.
  • macOS (using Homebrew):

    # Install Tesseract engine
    brew install tesseract
    
    # Install all available language packs, including Khmer
    brew install tesseract-lang
    
  • Linux (Debian/Ubuntu):

    # Install Tesseract engine
    sudo apt-get update
    sudo apt-get install tesseract-ocr
    
    # Install the Khmer language pack
    sudo apt-get install tesseract-ocr-khm
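
On any platform, one way to confirm that the Khmer language pack is visible to Tesseract is to parse the output of `tesseract --list-langs`. The sketch below assumes the `tesseract` binary is on PATH; the helper names `parse_langs` and `khmer_installed` are made up for this example and are not part of khmerdocparser.

```python
import subprocess
from typing import List


def parse_langs(output: str) -> List[str]:
    """Parse the output of `tesseract --list-langs` into language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, which starts with "List of available languages".
    return [line for line in lines if not line.lower().startswith("list of")]


def khmer_installed(tesseract_cmd: str = "tesseract") -> bool:
    """Return True if the 'khm' language data is listed by Tesseract."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Combine both streams in case the list is written to stderr.
    return "khm" in parse_langs(result.stdout + "\n" + result.stderr)
```

If `khmer_installed()` returns `False`, the engine is installed but the Khmer data is missing, so revisit the language pack step for your platform.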
    

2. Poppler Installation

  • Windows:

    1. Download the latest Poppler binary from here.
    2. Extract the archive and add its bin directory to your system's PATH.
  • macOS (using Homebrew):

    brew install poppler
    
  • Linux (Debian/Ubuntu):

    sudo apt-get install poppler-utils
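
Before installing the package, it can help to verify that both prerequisites are reachable on PATH. The following sketch uses only the standard library. It assumes `pdftoppm` (one of the Poppler command-line utilities) is a reasonable stand-in for "Poppler is installed"; the names `REQUIRED_TOOLS`, `find_tools`, and `missing_tools` are hypothetical.

```python
import shutil
from typing import Callable, Dict, Iterable, List, Optional

# Assumption: checking for `pdftoppm` is enough to detect Poppler.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_tools(
    names: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in names}


def missing_tools(
    names: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of tools that could not be found on PATH."""
    return [name for name, path in find_tools(names, which).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("All prerequisites found." if not missing else f"Missing: {missing}")
```

On Windows, `shutil.which` also resolves `.exe` files, so the same check applies once the installation directories have been added to PATH.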
    

Installation

Once Poppler and Tesseract are installed, you can install this package from PyPI:

pip install --upgrade khmerdocparser

Usage

As a Command-Line Tool

To extract text and print it to the console:

khmerdocparser /path/to/your/document.pdf

To save the extracted text to a file:

khmerdocparser /path/to/your/document.pdf -o extracted_text.txt

If Tesseract or Poppler is not on your system's PATH, you can specify their locations:

khmerdocparser doc.pdf --tesseract_path "C:\Tesseract\tesseract.exe" --poppler_path "C:\Poppler\bin"
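
When calling the tool from a script, the same flags can be assembled programmatically. This is a minimal sketch that uses only the options shown above (`-o`, `--tesseract_path`, `--poppler_path`); the helper `build_command` is hypothetical, and it assumes the three flags can be combined in one invocation.

```python
import subprocess
from typing import List, Optional


def build_command(
    pdf_path: str,
    output: Optional[str] = None,
    tesseract_path: Optional[str] = None,
    poppler_path: Optional[str] = None,
) -> List[str]:
    """Build the khmerdocparser argument list from optional settings."""
    cmd = ["khmerdocparser", pdf_path]
    if output:
        cmd += ["-o", output]
    if tesseract_path:
        cmd += ["--tesseract_path", tesseract_path]
    if poppler_path:
        cmd += ["--poppler_path", poppler_path]
    return cmd


if __name__ == "__main__":
    # Passing a list (not a shell string) avoids quoting issues with
    # Windows paths that contain spaces or backslashes.
    subprocess.run(build_command("doc.pdf", output="extracted_text.txt"), check=True)
```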

As a Python Library

from khmerdocparser.main import extract_text_from_pdf

pdf_path = "/path/to/your/document.pdf"
text = extract_text_from_pdf(pdf_path)
print(text)
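
Building on the library call above, a folder of PDFs could be processed in one pass. The sketch below assumes `extract_text_from_pdf` returns the extracted text as a `str` (as the example suggests); the wrapper `extract_folder` and its `extract` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Callable, List, Optional


def extract_folder(
    folder: str,
    extract: Optional[Callable[[str], str]] = None,
) -> List[Path]:
    """Write a .txt file next to every .pdf in `folder`; return the new paths."""
    if extract is None:
        # Imported lazily so the helper can be tested with a stand-in.
        from khmerdocparser.main import extract_text_from_pdf
        extract = extract_text_from_pdf

    written: List[Path] = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        text = extract(str(pdf))
        out_path = pdf.with_suffix(".txt")
        # UTF-8 is set explicitly so Khmer text survives on every platform.
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

Setting the encoding explicitly matters mostly on Windows, where the default text encoding may not be able to represent Khmer characters.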

Download files

Source Distribution

khmerdocparser-0.2.0.tar.gz (4.8 kB view details)

Uploaded Source

Built Distribution

khmerdocparser-0.2.0-py3-none-any.whl (5.5 kB view details)

Uploaded Python 3

File details

Details for the file khmerdocparser-0.2.0.tar.gz.

File metadata

  • Download URL: khmerdocparser-0.2.0.tar.gz
  • Upload date:
  • Size: 4.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for khmerdocparser-0.2.0.tar.gz
Algorithm Hash digest
SHA256 b496e6f8b7bf89d33d33855482382e83b20b944a91c14e6a3e616315a806523e
MD5 0c55378c95cb5c9d6f3070b31995f4aa
BLAKE2b-256 dc434dd7dbb2f8a9e123454ceff1a098ad2ebcf3f02ab2ea3fbabd684e42b290

File details

Details for the file khmerdocparser-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: khmerdocparser-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 5.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for khmerdocparser-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9327389552d606e9ddf6e558fde33a8dacef67101b8433d7e432a4b979a5ce90
MD5 2b998711145760706755ce01d51a89a6
BLAKE2b-256 c13e03c4297a23c15dd2c84783b04d4e07d033dd256baf1f4eababbc580a82ad
