doc page extractor can identify text and format in images and return structured data.

Project description

doc page extractor

Introduction

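doc page extractor recognizes the text and layout of a document page image and returns the result as structured data: a list of layouts, each holding text fragments with their bounding rectangles.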

Installation

pip install doc-page-extractor

pip install onnxruntime==1.21.0

Using CUDA

Refer to the PyTorch installation guide and choose the install command that matches your operating system.

In addition, replace the onnxruntime install command from the previous section with the following:

pip install onnxruntime-gpu==1.21.0
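
Before passing device="cuda", it can help to confirm that both runtimes actually see a GPU. The following is a minimal sketch, not part of this package: `pick_device` is a hypothetical helper, and the assumption that both PyTorch and ONNX Runtime must report CUDA support is mine. It uses only `torch.cuda.is_available()` and `onnxruntime.get_available_providers()`, and falls back to "cpu" when either package is missing.

```python
def pick_device():
    """Return "cuda" only if PyTorch and ONNX Runtime both report CUDA support.

    Hypothetical helper: the requirement that both runtimes need CUDA
    is an assumption, not something stated by doc page extractor.
    """
    try:
        import torch
        import onnxruntime
    except ImportError:
        # One of the runtimes is not installed, so stay on the CPU.
        return "cpu"

    torch_has_cuda = torch.cuda.is_available()
    ort_has_cuda = "CUDAExecutionProvider" in onnxruntime.get_available_providers()
    return "cuda" if torch_has_cuda and ort_has_cuda else "cpu"


device = pick_device()
print(device)
```

The returned string can then be passed as the device argument when constructing the extractor.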

Example

from PIL import Image
from doc_page_extractor import DocExtractor

# Folder where the AI models are downloaded and stored (placeholder path)
model_path = "/path/to/your/models"

extractor = DocExtractor(
  model_dir_path=model_path,
  device="cpu", # To use CUDA, change this to device="cuda"
)
with Image.open("/path/to/your/image.png") as image:
  result = extractor.extract(
    image=image,
    lang="ch", # Language of the text in the image
  )
for layout in result.layouts:
  for fragment in layout.fragments:
    print(fragment.rect, fragment.text)
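
The nested loop above can also be used to flatten the result into plain records, for example to save it as JSON. This is a sketch under assumptions: `fragments_to_records` is a hypothetical helper, it relies only on the `layouts`, `fragments`, `rect` and `text` attributes shown in the example, and it stores `rect` as a string because the concrete type of `rect` is not documented here. Stand-in objects replace a real extraction result so the snippet runs without models.

```python
import json
from types import SimpleNamespace


def fragments_to_records(result):
    """Flatten result.layouts[*].fragments[*] into a list of plain dicts."""
    records = []
    for layout_index, layout in enumerate(result.layouts):
        for fragment in layout.fragments:
            records.append({
                "layout": layout_index,      # position of the parent layout
                "rect": str(fragment.rect),  # rect type is unknown, so keep its text form
                "text": fragment.text,
            })
    return records


# Stand-in for a real result (assumption: only the attributes used above matter).
fake_result = SimpleNamespace(layouts=[
    SimpleNamespace(fragments=[
        SimpleNamespace(rect=(0, 0, 100, 20), text="Title"),
        SimpleNamespace(rect=(0, 30, 100, 50), text="Body text"),
    ]),
])

records = fragments_to_records(fake_result)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

With a real extraction, pass the `result` returned by `extractor.extract(...)` instead of `fake_result`.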

Acknowledgements

The code of doc_page_extractor/onnxocr in this repo comes from OnnxOCR.

Project details

Release history

0.2.3: May 9, 2025
0.2.2: May 9, 2025
0.2.1: May 9, 2025
0.2.0: May 9, 2025 (this version)
0.1.7: May 9, 2025
0.1.6: May 9, 2025
0.1.5: May 8, 2025
0.1.4: May 8, 2025
0.1.3: May 8, 2025

Download files

Source Distribution

doc-page-extractor-test-0.2.0.tar.gz (62.2 kB)

Uploaded May 9, 2025 Source

Built Distribution

doc_page_extractor_test-0.2.0-py3-none-any.whl (132.0 kB)

Uploaded May 9, 2025 Python 3

File details

Details for the file doc-page-extractor-test-0.2.0.tar.gz.

File metadata

Download URL: doc-page-extractor-test-0.2.0.tar.gz
Upload date: May 9, 2025
Size: 62.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for doc-page-extractor-test-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`399bab94e967d3d498afdb25e6a27deb11fa7741ba8f803dd8038a8228a8f1f4`
MD5	`03a524610fb51ff704d14c470998fe29`
BLAKE2b-256	`48f6ac412806913fbf79a518e102996bb59838db79e0e97693fe730e674560eb`
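
A downloaded file can be checked against the SHA256 digest listed above with the standard library. The sketch below is illustrative only: `sha256_of` and `matches_sha256` are hypothetical helper names, and the demo hashes a small temporary file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a temporary file so the snippet runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    demo_ok = matches_sha256(sample, hashlib.sha256(b"hello").hexdigest())

print(demo_ok)
```

For the real archive, call `matches_sha256` with the path of the downloaded doc-page-extractor-test-0.2.0.tar.gz and the SHA256 value from the table.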

File details

Details for the file doc_page_extractor_test-0.2.0-py3-none-any.whl.

File metadata

Download URL: doc_page_extractor_test-0.2.0-py3-none-any.whl
Upload date: May 9, 2025
Size: 132.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.13

File hashes

Hashes for doc_page_extractor_test-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`506eb6bba7fcf352c26d1e0bcc75892b6db4da3c47c5ec5f46ef95ca210ab455`
MD5	`b31e874350f239da74f3c74b1ab10537`
BLAKE2b-256	`1d6f4612fc9ce1de57aadde33a87ffea6b28ea0bf6758644fb06d1b35b88eb94`

doc-page-extractor-test 0.2.0
