A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or Amazon Textract, into a tree structure. The tree mirrors the visual layout of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
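
To make the tree idea concrete, here is a minimal, self-contained sketch of such a structure. It is not Docstruct's API: the `Node` and `BoundingBox` classes, the `parse_hocr_bbox` helper, and the separator table are hypothetical names chosen for illustration. The only external fact it relies on is that hOCR stores coordinates in a `title` attribute of the form `bbox x0 y0 x1 y1`.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from functools import reduce
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Assumed separators used when joining the text of child nodes.
SEPARATORS = {"word": "", "line": " ", "paragraph": "\n",
              "page": "\n\n", "document": "\n\n"}


@dataclass
class Node:
    """One level of the tree: document, page, paragraph, line, word, or character."""
    kind: str
    bbox: Optional[BoundingBox] = None
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        # Leaves hold their own text; inner nodes join their children.
        if not self.children:
            return self.text
        return SEPARATORS[self.kind].join(c.get_text() for c in self.children)

    def bounding_box(self) -> Optional[BoundingBox]:
        # Use the stored box if present, otherwise the union of the children's boxes.
        if self.bbox is not None:
            return self.bbox
        boxes = [b for b in (c.bounding_box() for c in self.children) if b is not None]
        return reduce(BoundingBox.union, boxes) if boxes else None


def parse_hocr_bbox(title: str) -> Optional[BoundingBox]:
    """Extract 'bbox x0 y0 x1 y1' from an hOCR title attribute."""
    match = re.search(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", title)
    if match is None:
        return None
    x0, y0, x1, y1 = (int(v) for v in match.groups())
    return BoundingBox(x0, y0, x1, y1)


# Build a one-line tree from two hOCR-style word entries.
hello = Node("word", parse_hocr_bbox("bbox 10 10 50 20; x_wconf 96"), "Hello")
world = Node("word", parse_hocr_bbox("bbox 55 10 100 22; x_wconf 93"), "world")
line = Node("line", children=[hello, world])
```

In this sketch a parent that has no stored box derives one from its children, so the line's box covers both words without any extra bookkeeping.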

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

Installation

pip install docstruct
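
After installing, one way to check which version is present is to query the package metadata with the standard library. This is a small sketch: `installed_version` is a hypothetical helper, and it assumes the distribution name is `docstruct`, as in the pip command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string if docstruct is installed, otherwise None.
    print(installed_version("docstruct"))
```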

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

