A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the layout of the document can be visualized. The package also supports paragraph detection and text splitting that preserves logical units.
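To make the idea of a document tree with bounding boxes concrete, here is a minimal, self-contained sketch. It does not use Docstruct and is not its API: the `BoundingBox` and `Node` classes, their fields, and the `get_text` and `get_bbox` methods are hypothetical names chosen for illustration only. See the linked documentation for the package's actual classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical, not Docstruct's class)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """One node of the tree (hypothetical, not Docstruct's class)."""
    kind: str  # "document", "page", "paragraph", "line", "word", or "character"
    text: str = ""
    bbox: BoundingBox | None = None
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        # Leaves carry their own text; inner nodes join their children.
        if not self.children:
            return self.text
        # Words in a line are space-separated, characters in a word are
        # concatenated, and every other level is joined with newlines.
        separator = {"line": " ", "word": ""}.get(self.kind, "\n")
        return separator.join(child.get_text() for child in self.children)

    def get_bbox(self) -> BoundingBox | None:
        # Use the stored box if present, otherwise merge the children's boxes.
        if self.bbox is not None:
            return self.bbox
        boxes = [b for b in (c.get_bbox() for c in self.children) if b is not None]
        if not boxes:
            return None
        merged = boxes[0]
        for box in boxes[1:]:
            merged = merged.union(box)
        return merged


# Build a one-line tree from two recognized words.
hello = Node("word", "Hello", BoundingBox(10, 10, 50, 20))
world = Node("word", "world", BoundingBox(55, 10, 95, 20))
line = Node("line", children=[hello, world])
print(line.get_text())
print(line.get_bbox())
```

In this sketch only the leaves need explicit boxes; a parent's box is derived as the union of its children's boxes, which is one simple way to keep every level of the tree consistent with the OCR output.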

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
