A package for representing documents as a tree of document, pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. The tree mirrors the visual layout of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
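To make the idea of a layout tree concrete, here is a minimal Python sketch. The names in it (`BoundingBox`, `Node`, `make_parent`, `SEPARATORS`) are hypothetical, chosen for illustration only; they are not Docstruct's actual API, which is covered in the project documentation. The sketch assumes each node stores a kind, a bounding box, and its children, and that a parent's box is the union of its children's boxes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    """Axis-aligned rectangle in page coordinates (illustrative, not Docstruct's class)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Assumed separators used when joining the text of a node's children.
SEPARATORS = {"word": "", "line": " ", "paragraph": "\n", "page": "\n\n", "document": "\n\n"}


@dataclass
class Node:
    """One node of the tree: document, page, paragraph, line, word, or character."""
    kind: str
    bbox: BoundingBox
    text: str = ""
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        # Leaves hold their own text; inner nodes join their children's text.
        if not self.children:
            return self.text
        separator = SEPARATORS.get(self.kind, "")
        return separator.join(child.get_text() for child in self.children)


def make_parent(kind: str, children: list[Node]) -> Node:
    """Build a parent node whose box is the union of its children's boxes."""
    bbox = children[0].bbox
    for child in children[1:]:
        bbox = bbox.union(child.bbox)
    return Node(kind=kind, bbox=bbox, children=children)


# Two words on the same line, as an OCR engine might report them.
hello = Node("word", BoundingBox(10, 10, 50, 20), text="Hello")
world = Node("word", BoundingBox(55, 10, 100, 20), text="world")
line = make_parent("line", [hello, world])
paragraph = make_parent("paragraph", [line])

print(paragraph.get_text())
print(line.bbox)
```

Deriving each parent's box from its children keeps geometry consistent at every level, so a paragraph or page can be located on the image without storing extra coordinates.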

Documentation

For more information read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
