A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes.

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the document can be represented visually. The package also supports paragraph detection and text splitting that preserves logical units.
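
To make the idea of a bounding-box tree concrete, here is a minimal, self-contained sketch. It is not Docstruct's API: the names BoundingBox, Node, SEPARATORS, and parse_hocr_bbox are hypothetical and exist only for this illustration. The one external convention it relies on is the hOCR "bbox x0 y0 x1 y1" property stored in an element's title attribute.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box (hypothetical type, not part of Docstruct)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left), min(self.top, other.top),
            max(self.right, other.right), max(self.bottom, other.bottom),
        )


# Text used to join the children of each node kind (an assumption of this sketch).
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass
class Node:
    """One tree level: document, page, paragraph, line, word, or character."""
    kind: str
    text: str = ""
    bbox: Optional[BoundingBox] = None
    children: List["Node"] = field(default_factory=list)

    def bounding_box(self) -> Optional[BoundingBox]:
        # A leaf reports its own box; an inner node reports the union of its children.
        if not self.children:
            return self.bbox
        boxes = [b for b in (c.bounding_box() for c in self.children) if b is not None]
        if not boxes:
            return self.bbox
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result

    def get_text(self) -> str:
        # Leaves hold text; inner nodes join their children's text.
        if not self.children:
            return self.text
        return SEPARATORS.get(self.kind, "").join(c.get_text() for c in self.children)

    def walk(self, kind: str) -> Iterator["Node"]:
        # Depth-first traversal yielding every node of the requested kind.
        if self.kind == kind:
            yield self
        for child in self.children:
            yield from child.walk(kind)


def parse_hocr_bbox(title: str) -> Optional[BoundingBox]:
    """Extract the 'bbox x0 y0 x1 y1' property from an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            x0, y0, x1, y1 = (float(p) for p in parts[1:])
            return BoundingBox(x0, y0, x1, y1)
    return None


line = Node("line", children=[
    Node("word", "Hello", parse_hocr_bbox("bbox 10 10 50 20; x_wconf 95")),
    Node("word", "world", parse_hocr_bbox("bbox 55 10 95 22; x_wconf 93")),
])
document = Node("document", children=[
    Node("page", children=[Node("paragraph", children=[line])]),
])

print(document.get_text())      # Hello world
print(document.bounding_box())  # BoundingBox(left=10.0, top=10.0, right=95.0, bottom=22.0)
```

Computing each inner node's box from its children, rather than storing it, keeps the tree consistent when words are added or removed. A real parser may well store boxes directly as reported by the OCR engine.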

Documentation

For more information read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution

docstruct-1.0.210.tar.gz (32.0 kB)

Uploaded Source

Built Distribution

docstruct-1.0.210-py3-none-any.whl (40.0 kB)

Uploaded Python 3

File details

Details for the file docstruct-1.0.210.tar.gz.

File metadata

  • Download URL: docstruct-1.0.210.tar.gz
  • Size: 32.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.210.tar.gz

  • SHA256: b2225cc3345012bb4e1e7b5ae0b525918c644c0dc8874d4ea473c21f1fd69e1d
  • MD5: d1da58d2a4a4d9cde3f77c62a6b7efb1
  • BLAKE2b-256: dfbe19ca2b43b7eb93edc4d91965f39d7f6e48d16f8e79314b87cdbff8a4bea5

File details

Details for the file docstruct-1.0.210-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.210-py3-none-any.whl
  • Size: 40.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.210-py3-none-any.whl

  • SHA256: f2bc318fce5241a3b9b06829cdd551a604dcf5d0120753d0b3b3c92d8054675a
  • MD5: e291cabb96bd6aa7ee581f68eedc75b4
  • BLAKE2b-256: 01f6ee7f9c2c3e71dd657c92ed1a9e3b4c3d42cafb7c8bd68cc884269e258fb9
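
A downloaded file can be checked against the SHA256 digest listed for it. The sketch below uses only the Python standard library; the helper names sha256_of_file and verify are hypothetical, and the expected digest is the one listed for the wheel. The file path is whatever location the file was saved to.

```python
import hashlib

# SHA256 digest listed for docstruct-1.0.210-py3-none-any.whl.
EXPECTED_SHA256 = "f2bc318fce5241a3b9b06829cdd551a604dcf5d0120753d0b3b3c92d8054675a"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True if the file's SHA256 digest matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


# Example (assumes the wheel was saved in the current directory):
# verify("docstruct-1.0.210-py3-none-any.whl")
```

Reading in chunks keeps memory use constant regardless of file size. pip can perform the same check at install time through its hash-checking mode.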
