
A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree. Each node represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree preserves the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
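
To make the tree idea concrete, here is a minimal sketch that builds such a tree from a small hOCR fragment using only the Python standard library. It is not Docstruct's API: the `Node` and `HocrTreeBuilder` names and the `get_text` method are hypothetical and exist only for illustration. The one thing taken from hOCR itself is the convention of `ocr_page`, `ocr_par`, `ocr_line`, and `ocrx_word` classes with a `bbox x0 y0 x1 y1` entry in the `title` attribute. The sketch assumes well-formed markup and stops at word level.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

# hOCR class name -> level name used in this sketch (hypothetical naming).
LEVELS = {"ocr_page": "page", "ocr_par": "paragraph",
          "ocr_line": "line", "ocrx_word": "word"}
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
# Separator used when joining the text of a node's children.
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " "}


@dataclass
class Node:
    """One tree node: a level name, an optional bounding box, and children."""
    level: str
    bbox: tuple[int, int, int, int] | None = None
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        if not self.children:
            return self.text
        sep = SEPARATORS.get(self.level, "")
        return sep.join(child.get_text() for child in self.children)


class HocrTreeBuilder(HTMLParser):
    """Builds a Node tree from hOCR markup (assumes well-formed nesting)."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("document")
        self._nodes = [self.root]      # stack of currently open tree nodes
        self._created: list[bool] = [] # per open div/p/span: did it add a node?

    def handle_starttag(self, tag, attrs):
        if tag not in ("div", "p", "span"):
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        level = next((LEVELS[c] for c in classes if c in LEVELS), None)
        if level is None:              # e.g. ocr_carea: keep nesting, add no node
            self._created.append(False)
            return
        match = BBOX.search(attributes.get("title") or "")
        bbox = tuple(map(int, match.groups())) if match else None
        node = Node(level, bbox)
        self._nodes[-1].children.append(node)
        self._nodes.append(node)
        self._created.append(True)

    def handle_endtag(self, tag):
        if tag in ("div", "p", "span") and self._created:
            if self._created.pop():
                self._nodes.pop()

    def handle_data(self, data):
        if self._nodes[-1].level == "word":  # only words carry text here
            self._nodes[-1].text += data.strip()


SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 800'>
 <p class='ocr_par' title='bbox 50 40 400 100'>
  <span class='ocr_line' title='bbox 50 40 400 65'>
   <span class='ocrx_word' title='bbox 50 40 120 65; x_wconf 96'>Hello</span>
   <span class='ocrx_word' title='bbox 130 40 210 65; x_wconf 93'>world</span>
  </span>
 </p>
</div>
"""

builder = HocrTreeBuilder()
builder.feed(SAMPLE)
builder.close()
document = builder.root
```

Walking `document.children` visits the page, then its paragraphs, lines, and words, each with its own bounding box. Joining text bottom-up in `get_text` is one simple way to keep lines and paragraphs together as logical units.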

Documentation

For more information, read the docs: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution

docstruct-1.0.251.tar.gz (25.7 kB)

Uploaded Source

Built Distribution

docstruct-1.0.251-py3-none-any.whl (51.6 kB)

Uploaded Python 3

File details

Details for the file docstruct-1.0.251.tar.gz.

File metadata

  • Download URL: docstruct-1.0.251.tar.gz
  • Size: 25.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.251.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 0ab1afff07d08d8452d4f1e7bfd846a5146f57506378a9171f980b24f00cf4d7 |
| MD5 | 9a66742e6d09a79727226549be7a3eaa |
| BLAKE2b-256 | ea21f9569c180e3b54aa0a26d166a44cc458447f2ba745a217064a2a0b5c78f3 |


File details

Details for the file docstruct-1.0.251-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.251-py3-none-any.whl
  • Size: 51.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.251-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 3860bb7c6f697a61fe82a56d16e9e6f68d55204c2f2a2f517023d5f4eacea884 |
| MD5 | cfaeae3e30af858a7eb0364eae23e344 |
| BLAKE2b-256 | 9761eb6e7b08c8cefcd7bb4880ce63f1671f6aaeb3595383072e47fc2fbefc6c |
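
The SHA256 digests listed above can be checked against a downloaded file with Python's standard `hashlib` module. The following is a minimal sketch: the `sha256_of` and `verify` helpers are illustrative names, not part of Docstruct, and it assumes the downloaded file keeps its original name.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "docstruct-1.0.251.tar.gz":
        "0ab1afff07d08d8452d4f1e7bfd846a5146f57506378a9171f980b24f00cf4d7",
    "docstruct-1.0.251-py3-none-any.whl":
        "3860bb7c6f697a61fe82a56d16e9e6f68d55204c2f2a2f517023d5f4eacea884",
}


def sha256_of(path, chunk_size=1 << 16):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """True only if the file name is known and its digest matches."""
    path = Path(path)
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected
```

For example, `verify("docstruct-1.0.251.tar.gz")` returns `True` only when a file with that name exists in the working directory and its digest matches the one listed.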
