A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (using its hOCR output) or Textract (AWS), into a tree structure. The tree supports visual representation of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
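To make the idea of a document tree with bounding boxes concrete, here is a minimal, self-contained sketch. It does not use the Docstruct API: `Node`, `BoundingBox`, `bounding_box`, `iter_level`, and `get_text` are hypothetical names invented for illustration, and the coordinates are made up. See the documentation for the package's actual classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box (hypothetical; units are illustrative, e.g. pixels)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        """Smallest box that contains both boxes."""
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """Hypothetical tree node, not a Docstruct class.

    level is one of: document, page, paragraph, line, word, character.
    """
    level: str
    text: str = ""
    bbox: Optional[BoundingBox] = None
    children: list[Node] = field(default_factory=list)


# Separator used when joining the text of children at a given level.
JOINERS = {"character": "", "word": " ", "line": "\n",
           "paragraph": "\n\n", "page": "\n\n"}


def bounding_box(node: Node) -> Optional[BoundingBox]:
    """Return the node's own box, or the union of its descendants' boxes."""
    if node.bbox is not None:
        return node.bbox
    result: Optional[BoundingBox] = None
    for child in node.children:
        box = bounding_box(child)
        if box is not None:
            result = box if result is None else result.union(box)
    return result


def iter_level(node: Node, level: str) -> Iterator[Node]:
    """Yield every node at the requested level, in reading order."""
    if node.level == level:
        yield node
    for child in node.children:
        yield from iter_level(child, level)


def get_text(node: Node) -> str:
    """Rebuild text bottom-up, joining children with a level-specific separator."""
    if not node.children:
        return node.text
    separator = JOINERS[node.children[0].level]
    return separator.join(get_text(child) for child in node.children)


# Build a tiny tree by hand; a real parser would fill this from OCR output.
word1 = Node("word", "Hello", BoundingBox(10, 10, 50, 20))
word2 = Node("word", "world", BoundingBox(55, 10, 95, 22))
line = Node("line", children=[word1, word2])
paragraph = Node("paragraph", children=[line])
page = Node("page", children=[paragraph])
document = Node("document", children=[page])

print(get_text(document))      # Hello world
print(bounding_box(document))  # box spanning both words
```

Only the leaves in this sketch store a box; parent boxes are derived on demand as the union of their children, which keeps the tree consistent if a word is moved or removed.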

Documentation

For more information, read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

To install the package from PyPI, run:

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
