A package for representing a document as a tree of pages, paragraphs, lines, words, and characters.

Docstruct

Overview

Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or Amazon Textract, into a tree. Each node represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree preserves the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
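To make the idea of a tree of nodes with bounding boxes concrete, here is a minimal sketch that uses only the Python standard library. It does not use Docstruct itself: the `BoundingBox` and `Node` classes, the `build` helper, and the sample markup are all hypothetical and exist only for illustration. The sketch assumes a simplified, well-formed hOCR fragment in which each element has one of the standard hOCR classes (`ocr_page`, `ocr_par`, `ocr_line`, `ocrx_word`) and a `title` attribute containing `bbox x0 y0 x1 y1`.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


# Hypothetical types for illustration only; these are NOT Docstruct's classes.
@dataclass
class BoundingBox:
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Node:
    kind: str  # "page", "paragraph", "line", or "word"
    bbox: BoundingBox
    text: str = ""
    children: list["Node"] = field(default_factory=list)

    def get_text(self) -> str:
        """Join the text of all descendants, keeping line and paragraph breaks."""
        if not self.children:
            return self.text
        separator = {"page": "\n\n", "paragraph": "\n", "line": " "}[self.kind]
        return separator.join(child.get_text() for child in self.children)


# Map hOCR class names to the node kinds used in this sketch.
HOCR_KINDS = {
    "ocr_page": "page",
    "ocr_par": "paragraph",
    "ocr_line": "line",
    "ocrx_word": "word",
}


def parse_bbox(title: str) -> BoundingBox:
    """Extract 'bbox x0 y0 x1 y1' from an hOCR title attribute."""
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        raise ValueError(f"no bbox in title: {title!r}")
    return BoundingBox(*map(int, match.groups()))


def build(element: ET.Element) -> Node:
    """Recursively convert an hOCR element into a Node (direct children only)."""
    kind = HOCR_KINDS[element.get("class")]
    node = Node(kind, parse_bbox(element.get("title", "")))
    if kind == "word":
        node.text = (element.text or "").strip()
    else:
        node.children = [
            build(child) for child in element if child.get("class") in HOCR_KINDS
        ]
    return node


# Simplified sample; real hOCR output is a full HTML page with extra wrappers.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 200 100">
  <p class="ocr_par" title="bbox 10 10 190 60">
    <span class="ocr_line" title="bbox 10 10 190 30">
      <span class="ocrx_word" title="bbox 10 10 60 30">Hello</span>
      <span class="ocrx_word" title="bbox 70 10 130 30">world</span>
    </span>
    <span class="ocr_line" title="bbox 10 40 190 60">
      <span class="ocrx_word" title="bbox 10 40 80 60">again</span>
    </span>
  </p>
</div>
"""

page = build(ET.fromstring(SAMPLE.strip()))
print(page.get_text())
```

Because every node keeps its own bounding box, a consumer can walk the tree at whatever level it needs (for example, lines for layout analysis or words for highlighting) without re-parsing the OCR output. For Docstruct's actual classes and parsers, refer to the package documentation.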

Documentation

For more information, read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source distribution: docstruct-1.0.196.tar.gz (26.7 kB)

Built distribution: docstruct-1.0.196-py3-none-any.whl (33.5 kB, Python 3)

File details

docstruct-1.0.196.tar.gz

  • Size: 26.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9
  • SHA256: ebff86ef0eb96b95192012ebafbda9b88b2bfb1139e2c21376ca5b9312b78803
  • MD5: 99783dd94209822b62832942c40b1fcd
  • BLAKE2b-256: d35ebff589dffb7b1cfbdbde13a902436d6092a51717b7701abdbd7c77593e02

docstruct-1.0.196-py3-none-any.whl

  • Size: 33.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9
  • SHA256: c59bf78ef3b37077361c0fd2ff4b6614686d2ecb6c1b1ee11dc4ee0f79c93e86
  • MD5: 70504c1c4244f026fe5bdd7e817dd4cb
  • BLAKE2b-256: d762f9305e14843b04feda35c93dc7305bc9b115c5236ae0402755fff5182456