
A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes


Docstruct

Overview

Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the document's layout can be reconstructed and visualized. The package also supports paragraph detection and text splitting that preserves logical units.
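
To make the tree idea concrete, here is a minimal sketch in plain Python. It does not use the Docstruct API: `BoundingBox`, `Node`, `compute_bbox`, `get_text`, and `split_by_paragraph` are hypothetical names chosen for illustration, and the real package may model these concepts differently. The sketch assumes that an inner node's bounding box is the union of its children's boxes and that splitting "preserves logical units" by never cutting inside a paragraph.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in arbitrary page units (hypothetical, not Docstruct's class)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Separator used when joining the text of a node's children (an assumption).
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " "}


@dataclass
class Node:
    """Hypothetical tree node; `kind` is e.g. 'page', 'line', or 'word'."""
    kind: str
    text: str = ""
    bbox: Optional[BoundingBox] = None
    children: List["Node"] = field(default_factory=list)

    def compute_bbox(self) -> Optional[BoundingBox]:
        """Set each inner node's box to the union of its children's boxes."""
        boxes = [c.compute_bbox() for c in self.children]
        boxes = [b for b in boxes if b is not None]
        if boxes:
            merged = boxes[0]
            for other in boxes[1:]:
                merged = merged.union(other)
            self.bbox = merged
        return self.bbox

    def get_text(self) -> str:
        """Leaves return their own text; inner nodes join their children's text."""
        if not self.children:
            return self.text
        sep = SEPARATORS.get(self.kind, "")
        return sep.join(c.get_text() for c in self.children)


def split_by_paragraph(paragraphs: List[str], max_chars: int) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A paragraph longer than max_chars becomes its own oversized chunk
    rather than being cut in the middle.
    """
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = para if not current else current + "\n\n" + para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Two words form a line; the line's box is derived from the words.
line = Node("line", children=[
    Node("word", "Hello", BoundingBox(0, 0, 40, 10)),
    Node("word", "world", BoundingBox(45, 0, 90, 12)),
])
line.compute_bbox()
```

In this sketch only the leaves need boxes from the OCR engine; every level above them is derived, which keeps the tree consistent when nodes are regrouped, for example after paragraph detection.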

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
