A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. The tree allows for a visual representation of the document: each node represents a document, page, paragraph, line, word, or character, along with its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
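To make the tree idea concrete, here is a minimal, self-contained sketch of such a structure. It is not Docstruct's actual API: the names `BoundingBox`, `Node`, `SEPARATORS`, and `parse_hocr_bbox` are hypothetical and used only for illustration. The sketch assumes word boxes arrive in the hOCR `title` attribute form (`bbox x0 y0 x1 y1`) and that a parent's box is the union of its children's boxes.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Assumed separators used when joining the text of child nodes.
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass
class Node:
    """One node of the tree: document, page, paragraph, line, word, or character."""
    kind: str
    bbox: BoundingBox | None = None
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        self.children.append(child)
        return child

    def get_text(self) -> str:
        # Leaves hold their own text; inner nodes join their children's text.
        if not self.children:
            return self.text
        separator = SEPARATORS.get(self.kind, "")
        return separator.join(child.get_text() for child in self.children)

    def bounding_box(self) -> BoundingBox | None:
        # Use the explicit box if present, else the union of the children's boxes.
        if self.bbox is not None:
            return self.bbox
        boxes = [b for b in (c.bounding_box() for c in self.children) if b is not None]
        if not boxes:
            return None
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result


def parse_hocr_bbox(title: str) -> BoundingBox:
    """Extract 'bbox x0 y0 x1 y1' from an hOCR title attribute."""
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        raise ValueError(f"no bbox found in {title!r}")
    left, top, right, bottom = map(int, match.groups())
    return BoundingBox(left, top, right, bottom)


# Build a tiny paragraph -> line -> word tree from two hOCR-style titles.
line = Node("line")
line.add(Node("word", parse_hocr_bbox("bbox 36 92 96 116; x_wconf 95"), "Hello"))
line.add(Node("word", parse_hocr_bbox("bbox 104 92 180 118; x_wconf 91"), "world"))
paragraph = Node("paragraph", children=[line])

print(paragraph.get_text())
print(paragraph.bounding_box())
```

Deriving a parent's box from its children keeps the boxes consistent at every level, so only the leaf nodes need coordinates from the OCR engine. The package's real classes and method names are described in its documentation.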

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

Installation

Install the package from PyPI with pip:

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to open a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
