A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or Amazon Textract, into a tree structure. The tree mirrors the visual layout of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
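To make the hierarchy concrete, here is a minimal, self-contained sketch of such a tree. It does not use Docstruct's actual API: the `BoundingBox` and `Node` classes, their fields, and the coordinates are hypothetical and chosen only for illustration. Words are treated as leaves to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical, not Docstruct's class)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Separator used when joining the text of a node's children (an assumption).
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " "}


@dataclass
class Node:
    """One tree node: a document, page, paragraph, line, or word in this sketch."""
    kind: str
    text: str = ""                      # only set on leaf nodes
    bbox: Optional[BoundingBox] = None  # only set on leaf nodes
    children: List["Node"] = field(default_factory=list)

    def get_bbox(self) -> Optional[BoundingBox]:
        # A leaf returns its own box; an inner node returns the union of its children's boxes.
        if self.bbox is not None:
            return self.bbox
        boxes = [b for b in (c.get_bbox() for c in self.children) if b is not None]
        if not boxes:
            return None
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result

    def get_text(self) -> str:
        # Join child text with a separator that depends on the node level.
        if not self.children:
            return self.text
        return SEPARATORS.get(self.kind, " ").join(c.get_text() for c in self.children)

    def iter_kind(self, kind: str) -> Iterator["Node"]:
        # Depth-first traversal yielding every node of the requested kind.
        if self.kind == kind:
            yield self
        for child in self.children:
            yield from child.iter_kind(kind)


def word(text: str, left: float, top: float, right: float, bottom: float) -> Node:
    return Node("word", text=text, bbox=BoundingBox(left, top, right, bottom))


line1 = Node("line", children=[word("Hello", 10, 10, 50, 20), word("world", 55, 10, 95, 20)])
line2 = Node("line", children=[word("again", 10, 25, 48, 35)])
paragraph = Node("paragraph", children=[line1, line2])
document = Node("document", children=[Node("page", children=[paragraph])])

print(document.get_text())
print(document.get_bbox())
```

In this sketch the bounding box of an inner node is derived from its children, so only the leaves need coordinates from the OCR output. Whether Docstruct stores or derives the boxes of inner nodes is not stated here; consult the package documentation for its real classes.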

Documentation

For more information, read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution

docstruct-1.0.18.tar.gz (25.3 kB)

Uploaded Source

Built Distribution

docstruct-1.0.18-py3-none-any.whl (31.5 kB)

Uploaded Python 3

File details

Details for the file docstruct-1.0.18.tar.gz.

File metadata

  • Download URL: docstruct-1.0.18.tar.gz
  • Upload date:
  • Size: 25.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.10

File hashes

Hashes for docstruct-1.0.18.tar.gz
Algorithm Hash digest
SHA256 10093b4075b6a92489eeea2d92165f851e64230080b0a0d3b0f7cf1a251d3ec8
MD5 8a1e4de1bbc2cd480351e4b1064c6708
BLAKE2b-256 5b52045be479cd6ce4a1f78d71260506aa0bf522fecf9914124b012dad55a4cd
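A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `matches_published_hash` are hypothetical, and the local path assumes the file was saved under its original name in the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the archive was saved.
    archive = Path("docstruct-1.0.18.tar.gz")
    expected = "10093b4075b6a92489eeea2d92165f851e64230080b0a0d3b0f7cf1a251d3ec8"
    if archive.exists():
        print("OK" if matches_published_hash(archive, expected) else "MISMATCH")
```

The same check applies to the wheel by substituting its file name and SHA256 digest.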

File details

Details for the file docstruct-1.0.18-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.18-py3-none-any.whl
  • Upload date:
  • Size: 31.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.10

File hashes

Hashes for docstruct-1.0.18-py3-none-any.whl
Algorithm Hash digest
SHA256 c0087c3dc7b4eb601fbbf3145cfc72c54ba38c34d20b1b0dca4c3b56c5146a48
MD5 7e1f988e27b5d1f4e5e0a004b3846afe
BLAKE2b-256 b8a4130b54651106e289e598eb8c33410b1e655c5d9fc21f6f1d3e751fadd74a
