
A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes.

Project description


Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree preserves the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
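To make the idea of a bounding-box tree concrete, here is a minimal sketch in plain Python. It does not use Docstruct's API: the names `BoundingBox`, `Node`, `fit_bbox`, `get_text`, and `build_example` are hypothetical and chosen only for illustration, and the character level is omitted so that words are the leaves.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, List, Optional

# Separator used when joining the text of a node's children.
# These choices are assumptions made for this sketch, not Docstruct behavior.
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " "}


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box given by its (left, top) and (right, bottom) corners."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        """Smallest box that contains both boxes."""
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """Hypothetical tree node: a document, page, paragraph, line, or word."""
    kind: str
    bbox: Optional[BoundingBox] = None
    text: str = ""  # only meaningful on leaf nodes
    children: List[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        self.children.append(child)
        return child

    def leaves(self) -> Iterator[Node]:
        """Yield leaf nodes in reading order."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.leaves()

    def fit_bbox(self) -> Optional[BoundingBox]:
        """Set each inner node's box to the union of its children's boxes."""
        boxes = [b for b in (c.fit_bbox() for c in self.children) if b is not None]
        if boxes:
            merged = boxes[0]
            for box in boxes[1:]:
                merged = merged.union(box)
            self.bbox = merged
        return self.bbox

    def get_text(self) -> str:
        """Join leaf text bottom-up using the separator for this node's kind."""
        if not self.children:
            return self.text
        return SEPARATORS[self.kind].join(c.get_text() for c in self.children)


def build_example() -> Node:
    """Build a one-page, one-paragraph document with two lines of words."""
    doc = Node("document")
    page = doc.add(Node("page"))
    paragraph = page.add(Node("paragraph"))
    rows = [
        [("Hello", BoundingBox(0, 0, 50, 10)), ("world", BoundingBox(55, 0, 100, 10))],
        [("Second", BoundingBox(0, 12, 60, 22)), ("line", BoundingBox(65, 12, 95, 22))],
    ]
    for row in rows:
        line = paragraph.add(Node("line"))
        for text, box in row:
            line.add(Node("word", bbox=box, text=text))
    doc.fit_bbox()
    return doc


if __name__ == "__main__":
    document = build_example()
    print(document.get_text())
    print(document.bbox)
```

Deriving each parent's box from its children keeps the tree consistent with the word-level boxes that OCR engines typically report. Docstruct's own classes and methods are described in its documentation and may differ from this sketch.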

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.


Download files


Source Distribution

docstruct-1.0.227.tar.gz (18.4 kB)

Uploaded Source

Built Distribution

docstruct-1.0.227-py3-none-any.whl (42.7 kB)

Uploaded Python 3

File details

Details for the file docstruct-1.0.227.tar.gz.

File metadata

  • Download URL: docstruct-1.0.227.tar.gz
  • Size: 18.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.227.tar.gz
Algorithm Hash digest
SHA256 7ac762ddc2bcd0791c1e6b7906098d843ac16a2aba822339eb8ea63e262348d5
MD5 2ec56757ee377dccbe948af57eff8d5b
BLAKE2b-256 23821e5a4a79502111857173ceab5a77ddbe5d5e93ff46a56886334522b37374
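A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `sha256_of_file` and `matches_digest` are hypothetical, and the sketch assumes the archive was saved in the current directory under its published file name.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for docstruct-1.0.227.tar.gz.
EXPECTED_SHA256 = "7ac762ddc2bcd0791c1e6b7906098d843ac16a2aba822339eb8ea63e262348d5"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    archive = Path("docstruct-1.0.227.tar.gz")  # assumed local download location
    if archive.exists():
        print("digest matches:", matches_digest(archive, EXPECTED_SHA256))
    else:
        print("archive not found:", archive)
```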


File details

Details for the file docstruct-1.0.227-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.227-py3-none-any.whl
  • Size: 42.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.227-py3-none-any.whl
Algorithm Hash digest
SHA256 6fa357897c2e6b7822d56d54e222a22581d584afac2760d052aafa58804ac8b0
MD5 67029d92d464b34171a4cae760ced4fa
BLAKE2b-256 2b73ebc59aab9ad202657d9b61e30416c6c139338de491b7b03a8f3efcd61ff1

