A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. The tree mirrors the visual layout of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
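To make the tree idea concrete, here is a minimal, self-contained sketch of such a hierarchy. It does not use Docstruct's API: the `BoundingBox` and `Node` classes, their fields, and the methods `add`, `walk`, `get_text`, and `compute_bbox` are all hypothetical names chosen for illustration, and the coordinates are made up. See the documentation linked on this page for the package's real classes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class BoundingBox:
    """Axis-aligned box (hypothetical; units are whatever the OCR engine reports)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """Hypothetical tree node; NOT Docstruct's actual class."""
    level: str  # "document", "page", "paragraph", "line", "word" or "character"
    text: str = ""
    bbox: Optional[BoundingBox] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Node"]:
        # Pre-order traversal: the node itself, then every descendant.
        yield self
        for child in self.children:
            yield from child.walk()

    def get_text(self) -> str:
        # Leaves hold text; inner nodes join the text of their children.
        if not self.children:
            return self.text
        sep = "" if self.level == "word" else " "
        return sep.join(child.get_text() for child in self.children)

    def compute_bbox(self) -> Optional[BoundingBox]:
        # An inner node's box is the union of its descendants' boxes.
        for child in self.children:
            box = child.compute_bbox()
            if box is not None:
                self.bbox = box if self.bbox is None else self.bbox.union(box)
        return self.bbox


# Build a tiny document: one page, one paragraph, one line, two words.
line = Node("line")
line.add(Node("word", "Hello", BoundingBox(10, 10, 50, 20)))
line.add(Node("word", "world", BoundingBox(55, 10, 95, 20)))
document = Node("document", children=[Node("page", children=[Node("paragraph", children=[line])])])
document.compute_bbox()
```

In this sketch only the words carry boxes from the (imaginary) OCR output; every ancestor derives its box from its children, which is one common way to keep the hierarchy consistent.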

Documentation

For more information, read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution

docstruct-1.0.232.tar.gz (18.5 kB view details)

Uploaded Source

Built Distribution

docstruct-1.0.232-py3-none-any.whl (42.9 kB view details)

Uploaded Python 3

File details

Details for the file docstruct-1.0.232.tar.gz.

File metadata

  • Download URL: docstruct-1.0.232.tar.gz
  • Upload date:
  • Size: 18.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.232.tar.gz
Algorithm Hash digest
SHA256 64d103bf32234032a7a23e3227c8a3a569dc441d4e91232455b79b80e1db1cf8
MD5 6d7df6360db00b71a1876cda71c96051
BLAKE2b-256 695083273b29fe2eac80b4b4f0a16a513115cb88042274bc00a3035a9a85d9d3
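
A downloaded file can be checked against the SHA256 digest listed above using only the Python standard library. This is a minimal sketch: the helper names `sha256_of` and `matches` are hypothetical, and the path in the commented usage line assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 digest equals the expected hex string (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest published for docstruct-1.0.232.tar.gz on this page.
EXPECTED_SDIST_SHA256 = "64d103bf32234032a7a23e3227c8a3a569dc441d4e91232455b79b80e1db1cf8"

# Usage, assuming the archive is in the current directory:
# print(matches("docstruct-1.0.232.tar.gz", EXPECTED_SDIST_SHA256))
```

pip can enforce the same check at install time through hash-checking mode, where a requirements file pins each package with a `--hash=sha256:...` option.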

File details

Details for the file docstruct-1.0.232-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.232-py3-none-any.whl
  • Upload date:
  • Size: 42.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.232-py3-none-any.whl
Algorithm Hash digest
SHA256 66cb80fce51fdf67fad45dc606c2b1d3e8c9b4fb9eb34e53e5d0baaf0eedb26d
MD5 2a6f1e3a5e7662448f87f9db60c0c21b
BLAKE2b-256 16d384587a65bdad00be54d8baab0095fa5dd01c44fa8e7044ee4fb7ab56e807
