A package for representing a document as a tree of document, page, paragraph, line, word, and character nodes

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character and carries its bounding box, so the tree preserves the visual layout of the document as well as its text. The package also supports paragraph detection and text splitting that preserves logical units.
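To make the tree idea concrete, here is a minimal, self-contained sketch of such a structure. It is not Docstruct's actual API: the names `BoundingBox`, `Node`, `SEPARATORS`, and `word` are hypothetical and exist only for this illustration, and the separator choices are an assumption. See the documentation for the real classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in page coordinates (hypothetical, for illustration)."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


# Assumed separators used when joining the text of a node's children.
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": "\n", "line": " ", "word": ""}


@dataclass
class Node:
    """One tree node: document, page, paragraph, line, word, or character."""
    kind: str
    text: str = ""                      # only leaf nodes carry their own text
    bbox: Optional[BoundingBox] = None  # only leaf nodes carry their own box
    children: List[Node] = field(default_factory=list)

    def bounding_box(self) -> Optional[BoundingBox]:
        # A parent's box is the union of its own box (if any) and its children's.
        boxes = [self.bbox] if self.bbox is not None else []
        for child in self.children:
            child_box = child.bounding_box()
            if child_box is not None:
                boxes.append(child_box)
        if not boxes:
            return None
        result = boxes[0]
        for box in boxes[1:]:
            result = result.union(box)
        return result

    def get_text(self) -> str:
        # Leaves return their text; parents join their children's text.
        if not self.children:
            return self.text
        return SEPARATORS.get(self.kind, "").join(c.get_text() for c in self.children)


def word(text: str, left: float, top: float, right: float, bottom: float) -> Node:
    """Helper that builds a word leaf from OCR-style output."""
    return Node("word", text=text, bbox=BoundingBox(left, top, right, bottom))


line1 = Node("line", children=[word("Hello", 0, 0, 50, 10), word("world", 55, 0, 100, 10)])
line2 = Node("line", children=[word("Bye", 0, 12, 30, 22)])
paragraph = Node("paragraph", children=[line1, line2])

print(paragraph.get_text())      # prints "Hello world" and "Bye" on two lines
print(paragraph.bounding_box())  # BoundingBox(left=0, top=0, right=100, bottom=22)
```

Deriving each parent's box from its children means only the leaves need coordinates from the OCR engine, which is one plausible design, not necessarily the one Docstruct uses.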

Documentation

For more information, read the docs: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
