A package for representing a document as a tree of pages, paragraphs, lines, words, and characters.

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character, together with its bounding box, so the visual layout of the document can be reconstructed from the tree. The package also supports paragraph detection and text splitting that preserves logical units.
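To make the idea of a document tree with bounding boxes concrete, here is a minimal, self-contained sketch in plain Python. It does not use Docstruct and does not reflect its actual API: the names `BoundingBox`, `Node`, `SEPARATORS`, and `word` are hypothetical and chosen only for illustration, as are the coordinates in the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical helper, not Docstruct's)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        """Smallest box that contains both self and other."""
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Assumed separators used when joining the text of a node's children.
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass
class Node:
    """One node of the tree: document, page, paragraph, line, word, or character."""
    kind: str
    bbox: BoundingBox | None = None  # set on leaves; derived for inner nodes
    text: str = ""                   # set on leaves only
    children: list[Node] = field(default_factory=list)

    def bounding_box(self) -> BoundingBox | None:
        """Return the leaf's own box, or the union of all descendant boxes."""
        if not self.children:
            return self.bbox
        boxes = [b for b in (c.bounding_box() for c in self.children) if b is not None]
        if not boxes:
            return None
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result

    def get_text(self) -> str:
        """Join the text of the children using the separator for this level."""
        if not self.children:
            return self.text
        return SEPARATORS.get(self.kind, "").join(c.get_text() for c in self.children)


def word(text: str, left: float, top: float, right: float, bottom: float) -> Node:
    """Build a leaf word node (characters are omitted to keep the sketch short)."""
    return Node("word", bbox=BoundingBox(left, top, right, bottom), text=text)


# Build a tiny document bottom-up: two lines in one paragraph on one page.
line1 = Node("line", children=[word("Hello", 10, 10, 50, 20), word("world", 55, 10, 100, 20)])
line2 = Node("line", children=[word("again", 10, 25, 48, 35)])
paragraph = Node("paragraph", children=[line1, line2])
page = Node("page", children=[paragraph])
doc = Node("document", children=[page])

print(doc.get_text())
print(doc.bounding_box())
```

In this sketch only the leaves store a box, and every higher level derives its box from its descendants, so the boxes of lines, paragraphs, and pages stay consistent with the words they contain.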

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

