A package for representing documents as a tree of document, page, paragraph, line, word, and character nodes

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. The tree supports a visual representation of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
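To make the tree idea concrete, here is a minimal sketch of a document tree in which every node carries a bounding box. The `Node` and `BoundingBox` classes, their fields, and their methods are hypothetical and written only for illustration; they are not Docstruct's API, so consult the docs for the real class names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical helper)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Separator used when joining the text of a node's children.
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass
class Node:
    """One node of the tree (hypothetical, not the Docstruct class)."""
    kind: str  # "document", "page", "paragraph", "line", "word", or "character"
    bbox: Optional[BoundingBox] = None
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def get_text(self) -> str:
        # Leaves hold text directly; inner nodes join their children's text.
        if not self.children:
            return self.text
        return SEPARATORS.get(self.kind, "").join(c.get_text() for c in self.children)

    def compute_bbox(self) -> Optional[BoundingBox]:
        # An inner node's box is the union of its children's boxes.
        boxes = [b for b in (c.compute_bbox() for c in self.children) if b is not None]
        if boxes:
            merged = boxes[0]
            for b in boxes[1:]:
                merged = merged.union(b)
            self.bbox = merged
        return self.bbox


# Build a tiny tree by hand: document > page > paragraph > line > two words.
line = Node("line", children=[
    Node("word", BoundingBox(10, 10, 50, 20), "Hello"),
    Node("word", BoundingBox(55, 10, 100, 20), "world"),
])
document = Node("document", children=[
    Node("page", children=[Node("paragraph", children=[line])]),
])
document.compute_bbox()
print(document.get_text())
print(document.bbox)
```

In this sketch only the word leaves are given boxes, and every ancestor derives its box from its children, which is one simple way to keep the boxes consistent across levels.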

Documentation

For more information, read the docs: [Docstruct](https://smrt-co.github.io/docstruct/)

To install the package with pip, run:

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution

docstruct-1.0.194.tar.gz (26.6 kB)

Uploaded Source

Built Distribution

docstruct-1.0.194-py3-none-any.whl (33.4 kB)

Uploaded Python 3

File details

Details for the file docstruct-1.0.194.tar.gz.

File metadata

  • Download URL: docstruct-1.0.194.tar.gz
  • Upload date:
  • Size: 26.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.194.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | b1277f3b500d4a8073dd08aa202e80258aa26278b9a9c46ab96c67e7778b9e7d |
| MD5 | bec34511e49a710e34e3c758d47c43a6 |
| BLAKE2b-256 | 01a26db10bba5969d3f150f9b01f72e34a200aae527e4b4d8d8a7e78ec09e771 |
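A downloaded file can be checked against the published SHA256 digest before installing it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `sha256_of` and `matches_published_hash` are made up for this example, and the local file path is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed location of the downloaded source distribution.
    archive = Path("docstruct-1.0.194.tar.gz")
    published = "b1277f3b500d4a8073dd08aa202e80258aa26278b9a9c46ab96c67e7778b9e7d"
    if archive.exists():
        print("hash matches:", matches_published_hash(archive, published))
    else:
        print(f"{archive} not found; download it first")
```

pip can also enforce digests itself through its hash-checking mode, which avoids a separate manual step.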

File details

Details for the file docstruct-1.0.194-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.194-py3-none-any.whl
  • Upload date:
  • Size: 33.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.194-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 78383bc810048047de1655d131cfd138f9c12c9a26786b233ec545fcd07b4112 |
| MD5 | fabd933ec3c56931e355193d5e3f1c55 |
| BLAKE2b-256 | 70086df0ec7243b7e3690b9f3e15d65a1c62cce7ef16182acba70c964e4826f1 |
