A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree mirrors the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
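To make the idea of a document tree with bounding boxes concrete, here is a minimal, self-contained sketch. It is not Docstruct's API: the names `BoundingBox`, `Node`, `get_bbox`, and `get_text` are hypothetical and chosen only for illustration. See the linked docs for the actual classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Separator used when joining the text of a node's children (an assumption).
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass
class Node:
    """One node of the tree: document, page, paragraph, line, word, or character."""
    kind: str
    text: str = ""
    bbox: Optional[BoundingBox] = None
    children: List["Node"] = field(default_factory=list)

    def get_bbox(self) -> Optional[BoundingBox]:
        # A node's own box wins; otherwise take the union of its children's boxes.
        if self.bbox is not None:
            return self.bbox
        boxes = [b for b in (c.get_bbox() for c in self.children) if b is not None]
        if not boxes:
            return None
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result

    def get_text(self) -> str:
        # Leaves return their own text; inner nodes join their children's text.
        if not self.children:
            return self.text
        return SEPARATORS.get(self.kind, "").join(c.get_text() for c in self.children)


# Build a tiny tree by hand: document -> page -> paragraph -> line -> words.
words = [
    Node("word", text="Hello", bbox=BoundingBox(10, 10, 50, 20)),
    Node("word", text="world", bbox=BoundingBox(55, 10, 95, 20)),
]
line = Node("line", children=words)
paragraph = Node("paragraph", children=[line])
page = Node("page", children=[paragraph])
document = Node("document", children=[page])

print(document.get_text())  # Hello world
print(document.get_bbox())
```

In this sketch only the words carry explicit boxes, and every higher-level node derives its box from its descendants. An OCR parser would typically fill the leaves from the engine's output and let the tree compute the rest.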

Documentation

For more information, read the docs: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
