
A package for extracting structured content from PDFs and images using Typhoon OCR models

Typhoon OCR

Typhoon OCR is a model for extracting structured markdown from images or PDFs. It supports document layout analysis and table extraction, returning results in markdown or HTML. This package provides utilities to convert images and PDFs to the format supported by the Typhoon OCR model.

Languages Supported

The Typhoon OCR model supports:

  • English
  • Thai

Features

  • Convert images to PDFs for unified processing
  • Extract text and layout information from PDFs and images
  • Generate OCR-ready messages for API processing with the Typhoon OCR model
  • Built-in prompt templates for different document processing tasks
  • Process specific pages from multi-page PDF documents

Installation

pip install typhoon-ocr

System Requirements

The package requires the Poppler utilities to be installed on your system:

For macOS:

brew install poppler

For Linux (Debian/Ubuntu):

sudo apt-get update
sudo apt-get install poppler-utils

The following binaries are required:

  • pdfinfo
  • pdftoppm
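
As a quick sanity check before running OCR, the sketch below looks for the two binaries listed above on the current PATH. The helper name `missing_poppler_binaries` is hypothetical and not part of typhoon-ocr; it relies only on the standard library.

```python
import shutil

# Binaries named in the system requirements above.
REQUIRED_BINARIES = ("pdfinfo", "pdftoppm")


def missing_poppler_binaries(required=REQUIRED_BINARIES):
    """Return the names of required binaries that are not found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_poppler_binaries()
    if missing:
        print("Missing Poppler binaries:", ", ".join(missing))
    else:
        print("All required Poppler binaries were found.")
```

If the list is non-empty, install Poppler with the commands above and run the check again.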

Usage

Core functionality

The package provides two main functions:

from typhoon_ocr import ocr_document, prepare_ocr_messages

  • ocr_document: Full OCR pipeline for the Typhoon OCR model via opentyphoon.ai or any OpenAI-compatible API (such as vLLM)
  • prepare_ocr_messages: Generate complete OCR-ready messages for the Typhoon OCR model

Complete OCR workflow

Use the simplified API to OCR a document in one call, or prepare the messages yourself and send them to an OpenAI-compatible API such as opentyphoon.ai:

from typhoon_ocr import ocr_document

markdown = ocr_document(
    pdf_or_image_path="document.pdf",  # Works with PDFs or images
    task_type="default",    # Choose between "default" or "structure"
    page_num=2              # Process page 2 of a PDF (default is 1, always 1 for images)
)

# Or with an image
markdown = ocr_document(
    pdf_or_image_path="scan.jpg",  # Works with PDFs or images
    task_type="default",    # Choose between "default" or "structure"
)
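
Because `page_num` selects a single page per call, processing several pages means calling the function once per page. The sketch below shows one way to do that. `ocr_pages` is a hypothetical helper, not part of the package; it takes the OCR function as an argument so it can be tried out without API access, and it assumes only the keyword arguments shown in the example above.

```python
def ocr_pages(path, page_numbers, ocr_fn, task_type="default"):
    """Run OCR on each requested page and return {page_number: markdown}.

    ocr_fn is expected to accept the keyword arguments used in the
    ocr_document example: pdf_or_image_path, task_type and page_num.
    """
    results = {}
    for page_num in page_numbers:
        results[page_num] = ocr_fn(
            pdf_or_image_path=path,
            task_type=task_type,
            page_num=page_num,
        )
    return results


# Typical use (requires the typhoon-ocr package and API access):
# from typhoon_ocr import ocr_document
# pages = ocr_pages("document.pdf", range(1, 4), ocr_document)
# full_markdown = "\n\n".join(pages[n] for n in sorted(pages))
```

Page numbers are 1-based here, matching the documented default of `page_num=1`.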

Alternatively, prepare the messages manually and send them with the OpenAI client:

import json

from typhoon_ocr import prepare_ocr_messages
from openai import OpenAI

# Prepare messages for OCR processing
messages = prepare_ocr_messages(
    pdf_or_image_path="document.pdf",  # Works with PDFs or images
    task_type="default",    # Choose between "default" or "structure"
    page_num=2              # Process page 2 of a PDF (default is 1, always 1 for images)
)

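# Use with the https://opentyphoon.ai/ API or a self-hosted model served via vLLM
# See model list at https://huggingface.co/collections/scb10x/typhoon-ocr-682713483cb934ab0cf069bd
# The client reads the API key from the OPENAI_API_KEY environment variable unless api_key is passed
client = OpenAI(base_url='https://api.opentyphoon.ai/v1')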
response = client.chat.completions.create(
    model="typhoon-ocr-preview",
    messages=messages,
    max_tokens=16000,
    extra_body={
        "repetition_penalty": 1.2,
        "temperature": 0.1,
        "top_p": 0.6,
    },
)

# Parse the JSON response
text_output = response.choices[0].message.content
markdown = json.loads(text_output)['natural_text']
print(markdown)
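
The example above assumes the model always returns valid JSON with a `natural_text` field. A slightly more defensive version of the parsing step might look like the sketch below. `extract_natural_text` is a hypothetical helper, and falling back to the raw text is an assumption about what is useful here, not documented package behavior.

```python
import json


def extract_natural_text(text_output):
    """Return the 'natural_text' field, or the raw text if it cannot be parsed."""
    try:
        payload = json.loads(text_output)
    except (TypeError, ValueError):
        # Not valid JSON (json.JSONDecodeError is a ValueError) or not a string.
        return text_output
    if isinstance(payload, dict) and "natural_text" in payload:
        return payload["natural_text"]
    # Valid JSON, but not in the expected shape: hand back the raw text.
    return text_output
```

With this helper, the last lines of the example could become `markdown = extract_natural_text(response.choices[0].message.content)`.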

Available task types

The package comes with built-in prompt templates for different OCR tasks:

  • default: Extracts markdown representation of the document with tables in markdown format
  • structure: Provides more structured output with HTML tables and image analysis placeholders
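
Since only two task types are documented, a typo such as "structured" is easy to make. The sketch below validates the value before it is passed on. `check_task_type` is a hypothetical helper; whether the package performs its own validation is not stated here.

```python
# The two task types documented above.
TASK_TYPES = ("default", "structure")


def check_task_type(task_type):
    """Return task_type unchanged if it is one of the documented values."""
    if task_type not in TASK_TYPES:
        raise ValueError(
            f"Unknown task_type {task_type!r}; expected one of {TASK_TYPES}"
        )
    return task_type
```

It can be used inline, for example `task_type=check_task_type("structure")`.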

Document Extraction Capabilities

The Typhoon OCR model, when used with this package, can extract:

  • Structured text with proper layout preservation
  • Tables (in markdown or HTML format)
  • Document hierarchy (headings, paragraphs, lists)
  • Text with positional information
  • Basic image analysis and placement

License

The code in this project is licensed under the Apache 2.0 License.

Acknowledgments

The code is based on work from olmOCR, which is released under the Apache 2.0 license.

