A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Project description

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character, together with its bounding box, so the tree mirrors the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
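To make the idea of a bounding-box tree concrete, here is a minimal sketch in plain Python. It is not Docstruct's actual API: the names `BoundingBox`, `Node`, `get_bbox`, and `get_text` are hypothetical and chosen only for illustration, and the sketch stops at word level for brevity. It assumes that leaf nodes carry their own text and box, while a parent's box is the union of its children's boxes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from functools import reduce


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Separator used when joining the text of a node's children (an assumption).
SEPARATORS = {"line": " ", "paragraph": "\n", "page": "\n\n", "document": "\n\n"}


@dataclass
class Node:
    """One node of the tree: document, page, paragraph, line, or word."""
    kind: str
    bbox: BoundingBox | None = None
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_bbox(self) -> BoundingBox:
        # Leaves store their box; parents derive it from their children.
        if self.bbox is not None:
            return self.bbox
        if not self.children:
            raise ValueError(f"{self.kind} node has neither a box nor children")
        return reduce(BoundingBox.union, (c.get_bbox() for c in self.children))

    def get_text(self) -> str:
        # Leaves return their own text; parents join their children's text.
        if not self.children:
            return self.text
        sep = SEPARATORS.get(self.kind, " ")
        return sep.join(c.get_text() for c in self.children)


def build_example() -> Node:
    """Build a tiny document: one page, one paragraph, two lines."""
    line1 = Node("line", children=[
        Node("word", BoundingBox(10, 10, 50, 20), "Hello"),
        Node("word", BoundingBox(55, 10, 100, 20), "world"),
    ])
    line2 = Node("line", children=[
        Node("word", BoundingBox(10, 25, 40, 35), "Bye"),
    ])
    paragraph = Node("paragraph", children=[line1, line2])
    page = Node("page", children=[paragraph])
    return Node("document", children=[page])
```

Calling `build_example().get_text()` joins the words of each line with spaces and the lines with newlines, and `get_bbox()` on the document returns the box enclosing every word. For the package's real classes and method names, refer to the Docstruct documentation.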

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

To install the package from PyPI, run:

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution

docstruct-1.0.204.tar.gz (27.4 kB)

Built Distribution

docstruct-1.0.204-py3-none-any.whl (34.3 kB)

File details

Details for the file docstruct-1.0.204.tar.gz.

File metadata

  • Download URL: docstruct-1.0.204.tar.gz
  • Upload date:
  • Size: 27.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.204.tar.gz:

  • SHA256: 1acddcb72f64449d9dfad31472418932bf3fef58695673297ff0e8d1ec6272b3
  • MD5: 6f59d6515d41be54ef0bb3e9cf33e8fd
  • BLAKE2b-256: f022fae1241e59c25307c0bc5351ab48e820d2939637eef2c27d5cb477d87bc9

File details

Details for the file docstruct-1.0.204-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.204-py3-none-any.whl
  • Upload date:
  • Size: 34.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.204-py3-none-any.whl:

  • SHA256: 39e2a5e0594b9d9b070242b92af996b395981e2f7effec0053e4dc84f79f4f4a
  • MD5: bc0a2b8178e41f6e8bec62ca2a4cba7c
  • BLAKE2b-256: fb0d38f81fc1eac1d45b039c46324a5628789eee57f60e18621a3d90572e3df6
