A package for representing a document as a tree of document, page, paragraph, line, word, and character nodes

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree preserves the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
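
To make the tree idea concrete, here is a minimal sketch of such a structure in plain Python. It is not Docstruct's actual API: the `BoundingBox` and `Node` classes, their field names, and the `enclosing_box` helper are hypothetical and exist only to illustrate how nodes, children, and bounding boxes could fit together. See the documentation linked below for the real classes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box (hypothetical; units are whatever the OCR engine reports)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """Hypothetical tree node: document, page, paragraph, line, word, or character."""
    kind: str
    bbox: BoundingBox
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal: the node itself, then its descendants.
        yield self
        for child in self.children:
            yield from child.walk()

    def get_text(self) -> str:
        # Leaves hold text; inner nodes join their children's text.
        if not self.children:
            return self.text
        separator = "" if self.kind == "word" else " "
        return separator.join(child.get_text() for child in self.children)


def enclosing_box(nodes: List[Node]) -> BoundingBox:
    """Bounding box of a parent, computed from its (non-empty) list of children."""
    box = nodes[0].bbox
    for node in nodes[1:]:
        box = box.union(node.bbox)
    return box


# Build one line out of two words; the line's box encloses both word boxes.
words = [
    Node("word", BoundingBox(10, 10, 40, 20), "Hello"),
    Node("word", BoundingBox(45, 10, 80, 20), "world"),
]
line = Node("line", enclosing_box(words), children=words)
```

In this sketch a parent's box is derived from its children, which is one reasonable design choice when the OCR engine only reports boxes at the word level. An engine that reports boxes at every level would simply store them directly.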

Documentation

For more information, read the docs: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution

docstruct-1.0.197.tar.gz (26.7 kB)

Uploaded Source

Built Distribution

docstruct-1.0.197-py3-none-any.whl (33.5 kB)

Uploaded Python 3

File details

Details for the file docstruct-1.0.197.tar.gz.

File metadata

  • Download URL: docstruct-1.0.197.tar.gz
  • Upload date:
  • Size: 26.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for docstruct-1.0.197.tar.gz
  • SHA256: e0fc31a61f7bbdb130ab8ce7d8578ff2a59b9e396b8350178f55733dee5f68d7
  • MD5: 60aff78cecae978e9d57ec1fb91c5f96
  • BLAKE2b-256: 9d3f9757ca450f4edc2edbf2dd39a830b9599f261c0a8fc2ed5f389f72c35585
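
The SHA256 digest listed above can be checked against a downloaded copy of the source distribution. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `matches_published_hash` are illustrative, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest published for docstruct-1.0.197.tar.gz.
EXPECTED_SDIST_SHA256 = (
    "e0fc31a61f7bbdb130ab8ce7d8578ff2a59b9e396b8350178f55733dee5f68d7"
)


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Assumed location of the downloaded archive; adjust to your download directory.
archive = Path("docstruct-1.0.197.tar.gz")
if archive.exists():
    print("hash OK" if matches_published_hash(archive, EXPECTED_SDIST_SHA256) else "hash MISMATCH")
```

The same check works for the wheel by substituting its file name and its own SHA256 digest.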

File details

Details for the file docstruct-1.0.197-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.197-py3-none-any.whl
  • Upload date:
  • Size: 33.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for docstruct-1.0.197-py3-none-any.whl
  • SHA256: a98f09fa9e008bcc5bbc3c2515b38ce65d48fef53e0fc3246526264c4bfa3c31
  • MD5: 4ae208131ee85efd1f03927fa368571f
  • BLAKE2b-256: c317484a0202bf0a3fececce9e0cd58de02fde3c702c4d8622ab837f5d45ec9a
