
A package for representing a document as a tree of pages, paragraphs, lines, words, and characters


Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. The tree mirrors the visual layout of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
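To make the idea of a bounding-box tree concrete, here is a minimal, standalone sketch that builds such a tree from a small hOCR snippet using only the Python standard library. It does not use Docstruct and is not its API: the names `Node`, `HocrTreeBuilder`, and `parse_hocr` are hypothetical, the sample markup is invented for illustration, and only a subset of hOCR classes (`ocr_page`, `ocr_par`, `ocr_line`, `ocrx_word`) is handled. It also assumes well-formed markup, as Tesseract normally emits.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

# hOCR class name -> node kind (illustrative subset, not Docstruct's mapping).
KINDS = {
    "ocr_page": "page",
    "ocr_par": "paragraph",
    "ocr_line": "line",
    "ocrx_word": "word",
}
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
VOID_TAGS = {"meta", "br", "img", "link", "hr", "input"}  # never closed


@dataclass
class Node:
    """Hypothetical tree node: a kind, a bounding box, and child nodes."""
    kind: str
    bbox: tuple[int, ...] | None = None  # (left, top, right, bottom)
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        """Join words with spaces inside a line, and lines with newlines."""
        if self.kind == "word":
            return self.text
        sep = " " if self.kind == "line" else "\n"
        return sep.join(child.get_text() for child in self.children)


class HocrTreeBuilder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.root = Node("document")
        self._path = [self.root]  # chain of nodes from root to current node
        self._open: list[tuple[str, Node | None]] = []  # every open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        kind = next((KINDS[c] for c in classes if c in KINDS), None)
        node = None
        if kind is not None:
            match = BBOX.search(attr.get("title") or "")
            bbox = tuple(int(g) for g in match.groups()) if match else None
            node = Node(kind, bbox)
            self._path[-1].children.append(node)
            self._path.append(node)
        self._open.append((tag, node))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open or self._open[-1][0] != tag:
            return
        _, node = self._open.pop()
        if node is not None:
            self._path.pop()

    def handle_data(self, data):
        # Only text inside a word element is kept.
        if self._path[-1].kind == "word":
            self._path[-1].text += data.strip()


def parse_hocr(hocr: str) -> Node:
    builder = HocrTreeBuilder()
    builder.feed(hocr)
    builder.close()
    return builder.root


SAMPLE = """
<div class='ocr_page' title='bbox 0 0 200 100'>
 <p class='ocr_par' title='bbox 10 10 190 60'>
  <span class='ocr_line' title='bbox 10 10 190 30'>
   <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 96'>Hello</span>
   <span class='ocrx_word' title='bbox 70 10 130 30; x_wconf 93'>world</span>
  </span>
  <span class='ocr_line' title='bbox 10 40 80 60'>
   <span class='ocrx_word' title='bbox 10 40 80 60; x_wconf 91'>again</span>
  </span>
 </p>
</div>
"""

document = parse_hocr(SAMPLE)
page = document.children[0]
print(page.kind, page.bbox)
print(document.get_text())
```

Because every node keeps its own bounding box, the same tree can serve both text extraction (walking the words in order) and layout-aware tasks such as drawing boxes over the page image or grouping lines into paragraphs.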

Documentation

For more information read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution: docstruct-1.0.211.tar.gz (32.0 kB)

Built Distribution: docstruct-1.0.211-py3-none-any.whl (40.0 kB, Python 3)

