A package for representing a document as a tree of pages, paragraphs, lines, words, and characters.

Docstruct

Overview

Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the document's layout can be represented visually. The package also supports paragraph detection and text splitting that preserves logical units.
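To make the idea of a document tree with bounding boxes concrete, here is a minimal, self-contained sketch. It is not Docstruct's API: the names `BoundingBox`, `Node`, `bounding_box`, `walk`, and `get_text` are hypothetical and chosen only for illustration. It assumes that a parent's box is the union of its children's boxes and that text is joined with a separator that depends on the node level.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box; the (left, top, right, bottom) layout is an assumption."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Separator used when joining the text of a node's children (illustrative choice).
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass
class Node:
    """Hypothetical tree node: document, page, paragraph, line, word, or character."""
    kind: str
    text: str = ""
    bbox: Optional[BoundingBox] = None
    children: List[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        self.children.append(child)
        return child

    def bounding_box(self) -> Optional[BoundingBox]:
        # Use the node's own box if it has one, else the union of its children's boxes.
        if self.bbox is not None:
            return self.bbox
        boxes = [b for b in (c.bounding_box() for c in self.children) if b is not None]
        if not boxes:
            return None
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result

    def walk(self, kind: str) -> Iterator[Node]:
        # Depth-first traversal yielding every node of the requested kind.
        if self.kind == kind:
            yield self
        for child in self.children:
            yield from child.walk(kind)

    def get_text(self) -> str:
        # Leaves return their own text; inner nodes join their children's text.
        if not self.children:
            return self.text
        return SEPARATORS.get(self.kind, " ").join(c.get_text() for c in self.children)


# Build a one-line document by hand; a real parser would fill this from OCR output.
line = Node("line")
line.add(Node("word", "Hello", BoundingBox(10, 10, 50, 20)))
line.add(Node("word", "world", BoundingBox(55, 10, 95, 22)))
document = Node("document", children=[Node("page", children=[Node("paragraph", children=[line])])])

print(document.get_text())
print(document.bounding_box())
```

Deriving a parent's box from its children means only the lowest-level nodes need coordinates from the OCR engine; whether Docstruct itself stores or derives parent boxes is not stated here.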

Documentation

For more information read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
