Batch OCR Processing for Thai Documents using Typhoon OCR API

PTM-OCR

Installation

pip install ptm-ocr

Quick Start

1. Set Environment Variables

Create a .env file or export environment variables:

export OCR_BASE_URL=https://api.opentyphoon.ai/v1
export OCR_API_KEY=your_api_key_here
export OCR_MODEL=typhoon-ocr-7b
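Before starting a long batch, it can help to confirm that all three variables are present. The following is a minimal sketch, not part of ptm-ocr: the helper name `missing_ocr_vars` is hypothetical, and it only checks that the variables named above are non-empty.

```python
import os

# The three variables listed in the Quick Start above.
REQUIRED_VARS = ("OCR_BASE_URL", "OCR_API_KEY", "OCR_MODEL")


def missing_ocr_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; ptm-ocr itself may validate
    its configuration differently.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "OCR_BASE_URL": "https://api.opentyphoon.ai/v1",
    "OCR_API_KEY": "",  # empty values count as missing
}
print(missing_ocr_vars(example_env))  # ['OCR_API_KEY', 'OCR_MODEL']
```

Calling `missing_ocr_vars()` with no argument checks the current process environment.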

2. Run OCR

Command Line:

# Basic usage
ptm-ocr data/pdfs

# With options
ptm-ocr data/pdfs -o output -w 8 -t structure

Python API:

from ptm_ocr import process_folder, process_pdf_file

# Process entire folder
summary = process_folder(
    input_folder="data/pdfs",
    output_folder="output",
    task_type="default",
    max_workers=4
)

# Process single PDF
summary = process_pdf_file(
    pdf_path="document.pdf",
    output_path="document.jsonl"
)

CLI Options

Option       Short  Default  Description
--output     -o     output   Output folder for JSONL files
--task-type  -t     default  OCR task type: default or structure
--workers    -w     4        Number of parallel workers
--base-url          env      Override OCR_BASE_URL
--api-key           env      Override OCR_API_KEY
--model             env      Override OCR_MODEL
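When the CLI is driven from a script, the options above can be assembled into an argument list. This is a sketch under assumptions: the helper `build_ptm_ocr_command` is hypothetical, it uses only the long flags listed in the table, and it assumes each flag takes its value as the next argument.

```python
import shlex


def build_ptm_ocr_command(input_folder, output=None, task_type=None,
                          workers=None, base_url=None, api_key=None,
                          model=None):
    """Assemble an argv list for the ptm-ocr CLI (hypothetical helper).

    Options left as None are omitted, so the CLI falls back to its
    defaults or to the OCR_* environment variables.
    """
    if task_type is not None and task_type not in ("default", "structure"):
        raise ValueError(f"unknown task type: {task_type!r}")

    cmd = ["ptm-ocr", input_folder]
    options = [
        ("--output", output),
        ("--task-type", task_type),
        ("--workers", workers),
        ("--base-url", base_url),
        ("--api-key", api_key),
        ("--model", model),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_ptm_ocr_command("data/pdfs", output="output",
                            workers=8, task_type="structure")
# Prints: ptm-ocr data/pdfs --output output --task-type structure --workers 8
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with folder names that contain spaces.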

Output Format (JSONL)

Each line in the output file is a JSON object:

{"page": 1, "text": "ข้อความที่ OCR ได้", "status": "success", "error": null}
{"page": 2, "text": "ข้อความหน้า 2", "status": "success", "error": null}
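For downstream code, each line can be mapped onto a small typed record. The sketch below is illustrative only: `PageResult` and `parse_page_line` are hypothetical names, and the assumption that `text` and `error` may be null is inferred from the sample lines above rather than from a documented schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResult:
    """One page of OCR output, mirroring the four fields shown above."""
    page: int
    text: Optional[str]   # assumed to be nullable
    status: str
    error: Optional[str]  # null in the successful samples


def parse_page_line(line: str) -> PageResult:
    """Parse one JSONL line; raises KeyError if a field is missing."""
    record = json.loads(line)
    return PageResult(
        page=int(record["page"]),
        text=record["text"],
        status=record["status"],
        error=record["error"],
    )


sample = '{"page": 1, "text": "ข้อความที่ OCR ได้", "status": "success", "error": null}'
result = parse_page_line(sample)
print(result.page, result.status, result.error)  # 1 success None
```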

Task Types

  • default: General documents, infographics
  • structure: Complex layouts (tables, forms, mixed content)

Reading Output

import json

with open('output/document.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        page = json.loads(line)
        text = page['text'] or ''  # guard in case text is null for a failed page
        print(f"Page {page['page']}: {text[:100]}...")
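A common next step is to join the recognized text in page order and set aside any pages that did not succeed. The sketch below assumes that a failed page carries a `status` other than "success" and a message in `error`; the source only shows successful records, so the "error" status value and the "timeout" message in the sample are hypothetical.

```python
import json


def collect_pages(lines):
    """Return (full_text, failed) from an iterable of JSONL lines.

    full_text joins successful pages in page order, separated by blank
    lines; failed is a list of (page, error) tuples in input order.
    """
    pages, failed = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("status") == "success":
            pages.append((record["page"], record.get("text") or ""))
        else:
            failed.append((record["page"], record.get("error")))
    pages.sort(key=lambda item: item[0])
    full_text = "\n\n".join(text for _, text in pages)
    return full_text, failed


# Sample records; the failed record's values are assumptions.
sample_lines = [
    '{"page": 2, "text": "second", "status": "success", "error": null}',
    '{"page": 1, "text": "first", "status": "success", "error": null}',
    '{"page": 3, "text": null, "status": "error", "error": "timeout"}',
]
full_text, failed = collect_pages(sample_lines)
print(failed)  # [(3, 'timeout')]
```

Because an open file is itself an iterable of lines, a handle opened on `output/document.jsonl` with `encoding='utf-8'` can be passed directly to `collect_pages`.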

License

MIT
