A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or Amazon Textract (AWS), into a tree structure. The tree captures the visual layout of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
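
To make the idea of a document tree with bounding boxes concrete, here is a minimal sketch in plain Python. It does not use Docstruct's API: the names `BoundingBox`, `Node`, `SEPARATORS`, and `build_example` are hypothetical and exist only for this illustration, and the coordinates are made up. The sketch assumes that leaf nodes carry their own box and that a parent's box is the union of its children's boxes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box, in whatever units the OCR engine reports."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Hypothetical choice of how child texts are joined at each level.
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass
class Node:
    level: str                          # "document", "page", ..., "character"
    text: str = ""                      # only meaningful on leaf nodes
    bbox: Optional[BoundingBox] = None  # only set on leaf nodes
    children: List["Node"] = field(default_factory=list)

    def bounding_box(self) -> BoundingBox:
        # A leaf returns its own box; a parent returns the union of its children.
        if not self.children:
            if self.bbox is None:
                raise ValueError(f"leaf {self.level!r} node has no bounding box")
            return self.bbox
        boxes = [child.bounding_box() for child in self.children]
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result

    def get_text(self) -> str:
        # Join the children's text with the separator for this level.
        if not self.children:
            return self.text
        return SEPARATORS.get(self.level, "").join(c.get_text() for c in self.children)


def build_example() -> Node:
    # Two words on one line, treated as leaves for brevity (no character nodes).
    hello = Node("word", text="Hello", bbox=BoundingBox(0, 0, 50, 10))
    world = Node("word", text="world", bbox=BoundingBox(55, 0, 100, 10))
    line = Node("line", children=[hello, world])
    paragraph = Node("paragraph", children=[line])
    page = Node("page", children=[paragraph])
    return Node("document", children=[page])


document = build_example()
print(document.get_text())
print(document.bounding_box())
```

Computing parent boxes on demand keeps the sketch short. A real parser might cache them or read them directly from the OCR output.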

Documentation

For more information, read the docs at: https://smrt-co.github.io/docstruct/

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
