A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree. Each node represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree preserves the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
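To make the tree shape concrete, here is a minimal sketch of such a structure. It is not Docstruct's API: the names `BoundingBox`, `Node`, and `make_parent` are hypothetical, and characters are omitted so that words are the leaves. It only illustrates how each node can carry a bounding box derived from its children.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical, not Docstruct's type)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        """Smallest box that contains both boxes."""
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# How the text of child nodes is joined at each level (an assumption).
SEPARATORS = {"line": " ", "paragraph": "\n", "page": "\n\n", "document": "\n\n"}


@dataclass
class Node:
    """One node of the tree: document, page, paragraph, line, or word."""
    kind: str
    bbox: BoundingBox
    text: str = ""  # only set on leaves (words in this sketch)
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        """Return the leaf text, or the joined text of all descendants."""
        if not self.children:
            return self.text
        separator = SEPARATORS.get(self.kind, "\n")
        return separator.join(child.get_text() for child in self.children)


def make_parent(kind: str, children: list[Node]) -> Node:
    """Build a parent node whose box is the union of its children's boxes."""
    if not children:
        raise ValueError("a parent node needs at least one child")
    bbox = children[0].bbox
    for child in children[1:]:
        bbox = bbox.union(child.bbox)
    return Node(kind=kind, bbox=bbox, children=children)


# Usage: two lines of words grouped into one paragraph.
first_line = make_parent("line", [
    Node("word", BoundingBox(0, 0, 10, 5), "Hello"),
    Node("word", BoundingBox(12, 0, 30, 6), "world"),
])
second_line = make_parent("line", [Node("word", BoundingBox(0, 8, 20, 13), "again")])
paragraph = make_parent("paragraph", [first_line, second_line])
print(paragraph.bbox)
print(paragraph.get_text())
```

Deriving each parent box from its children keeps the boxes consistent at every level of the tree, which is what allows the layout to be drawn or queried at any granularity.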

Documentation

For more information, read the documentation: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source distribution: docstruct-1.0.195.tar.gz (26.6 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2
  • SHA256: f31b589227af4f7b795fb0c7d20e9716e0c4a7614b2e31fa37aa8e0a59bb6c10
  • MD5: a918f04e6172bac3e73db6518dfa3c83
  • BLAKE2b-256: ccb66d945fac2292c8769198c620dc84b495be31fdb153a5483d0290d81da400

Built distribution: docstruct-1.0.195-py3-none-any.whl (33.4 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2
  • SHA256: f8909e3853c90e94c90d4a390a12d50f7d73d988a5106d2946eeee2d263d47f3
  • MD5: 07f4ebe6f1a4b338daa6be13657a14b1
  • BLAKE2b-256: 19d21d591dd7ea4ca4bca4c9238ec54acc0a9b0ce8bbb44bb721205d73f3491d
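The SHA256 digests listed above can be checked against a downloaded file before installing it. The following is a minimal sketch using only the Python standard library; the helper name `sha256_of` and the local file location are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published for the source distribution.
EXPECTED_SHA256 = "f31b589227af4f7b795fb0c7d20e9716e0c4a7614b2e31fa37aa8e0a59bb6c10"

# Assumed location of the downloaded archive; adjust as needed.
archive = Path("docstruct-1.0.195.tar.gz")
if archive.exists():
    if sha256_of(archive) == EXPECTED_SHA256:
        print("SHA256 digest matches")
    else:
        print("SHA256 digest does NOT match; do not install this file")
```

Reading in chunks keeps memory use constant regardless of file size.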
