A package for representing a document as a tree of pages, paragraphs, lines, words, and characters.

Docstruct

Overview

Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or Amazon Textract (AWS), into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree preserves the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
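To make the idea of a document tree with bounding boxes concrete, here is a minimal sketch in plain Python. It is not Docstruct's API: the names `BoundingBox`, `Node`, `SEPARATORS`, and `make_word` are hypothetical and chosen only for illustration, as is the choice of separators between levels. Consult the documentation for the package's actual classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Separator used when joining the text of a node's children.
# These values are an illustrative assumption, not Docstruct's behavior.
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box given by its (left, top) and (right, bottom) corners."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        """Smallest box that contains both boxes."""
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """Hypothetical tree node: kind is 'document', 'page', ..., 'character'."""
    kind: str
    bbox: BoundingBox | None = None
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        """Leaf nodes return their own text; inner nodes join their children."""
        if not self.children:
            return self.text
        separator = SEPARATORS.get(self.kind, "")
        return separator.join(child.get_text() for child in self.children)

    def get_bbox(self) -> BoundingBox | None:
        """Return the node's own box if set, else the union of its children's boxes."""
        if self.bbox is not None:
            return self.bbox
        boxes = [b for b in (child.get_bbox() for child in self.children) if b is not None]
        if not boxes:
            return None
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result


def make_word(text: str, left: float, top: float, char_width: float, height: float) -> Node:
    """Build a word node from fixed-width character nodes (toy layout for the demo)."""
    chars = [
        Node(
            "character",
            BoundingBox(left + i * char_width, top, left + (i + 1) * char_width, top + height),
            ch,
        )
        for i, ch in enumerate(text)
    ]
    return Node("word", children=chars)


line = Node("line", children=[make_word("Hello", 0, 0, 10, 20), make_word("world", 60, 0, 10, 20)])
page = Node("page", children=[Node("paragraph", children=[line])])
document = Node("document", children=[page])

print(document.get_text())
print(document.get_bbox())
```

In this sketch only the characters store a box; every higher level derives its box from its children, which keeps the tree consistent when nodes are regrouped, for example during paragraph detection.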

Documentation

For more information read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

Install the package from PyPI:

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
