A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree. Each node represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree preserves the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
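
To make the tree idea concrete, here is a minimal sketch that builds such a tree from a small hOCR fragment using only the Python standard library. It is not Docstruct's API: the names `Node`, `HocrTreeBuilder`, and `get_text` are hypothetical and chosen for illustration only. The sketch relies on the hOCR conventions that elements carry classes such as `ocr_page`, `ocr_par`, `ocr_line`, and `ocrx_word`, and that the `title` attribute holds `bbox x0 y0 x1 y1`. The character level is omitted for brevity.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

# hOCR class name -> tree level (level names are illustrative, not Docstruct's).
HOCR_LEVELS = {
    "ocr_page": "page",
    "ocr_par": "paragraph",
    "ocr_line": "line",
    "ocrx_word": "word",
}
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


@dataclass
class Node:
    """One node of the document tree (hypothetical class)."""
    level: str
    bbox: tuple[int, ...] | None = None
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        # Leaves (words) hold text; inner nodes join their children.
        if not self.children:
            return self.text
        sep = " " if self.level == "line" else "\n"
        return sep.join(child.get_text() for child in self.children)


class HocrTreeBuilder(HTMLParser):
    """Builds a Node tree from hOCR markup (illustrative sketch)."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("document")
        # Open tags as (tag name, node created by that tag or None).
        self._open: list[tuple[str, Node | None]] = []

    def _current(self) -> Node:
        # Innermost open element that produced a tree node.
        for _, node in reversed(self._open):
            if node is not None:
                return node
        return self.root

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        level = next((HOCR_LEVELS[c] for c in classes if c in HOCR_LEVELS), None)
        node = None
        if level is not None:
            match = BBOX_RE.search(attr.get("title") or "")
            bbox = tuple(int(v) for v in match.groups()) if match else None
            node = Node(level, bbox)
            self._current().children.append(node)
        self._open.append((tag, node))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self._open):
            return  # stray closing tag, ignore it
        # Pop until the matching tag; this also discards unclosed void tags.
        while self._open:
            open_tag, _ = self._open.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        node = self._current()
        if node.level == "word":
            node.text += data.strip()


SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 800'>
 <p class='ocr_par' title='bbox 10 10 300 60'>
  <span class='ocr_line' title='bbox 10 10 300 30'>
   <span class='ocrx_word' title='bbox 10 10 80 30; x_wconf 95'>Hello</span>
   <span class='ocrx_word' title='bbox 90 10 170 30; x_wconf 93'>world</span>
  </span>
 </p>
</div>
"""

builder = HocrTreeBuilder()
builder.feed(SAMPLE)
document = builder.root
print(document.get_text())
```

Keeping the bounding box on every node is what lets a tree like this be drawn back over the page image or searched by position; the actual classes and methods Docstruct provides are described in its documentation.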

Documentation

For more information, read the documentation: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
