A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character, together with its bounding box, which makes it possible to represent the document visually. The package also supports paragraph detection and text splitting that preserves logical units.
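To make the tree idea concrete, here is a minimal sketch of such a hierarchy in plain Python. It is not Docstruct's API: the `Node` class, the `SEPARATORS` table, and the `(left, top, right, bottom)` box convention are all assumptions made for illustration. It shows one plausible way an inner node can derive its bounding box and text from its children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical types for illustration only; these are NOT Docstruct's classes.
BBox = Tuple[float, float, float, float]  # assumed order: (left, top, right, bottom)

# Assumed text separators used when joining the children of each node kind.
SEPARATORS: Dict[str, str] = {
    "document": "\n\n",
    "page": "\n\n",
    "paragraph": "\n",
    "line": " ",
    "word": "",
}


@dataclass
class Node:
    kind: str                     # "document", "page", "paragraph", "line", "word", "character"
    bbox: Optional[BBox] = None   # set explicitly on leaves, derived for inner nodes
    text: str = ""                # only meaningful on leaves
    children: List["Node"] = field(default_factory=list)

    def bounding_box(self) -> BBox:
        """Return the explicit box, or the union of the children's boxes."""
        if self.bbox is not None:
            return self.bbox
        if not self.children:
            raise ValueError(f"{self.kind} node has neither a box nor children")
        boxes = [child.bounding_box() for child in self.children]
        return (
            min(box[0] for box in boxes),
            min(box[1] for box in boxes),
            max(box[2] for box in boxes),
            max(box[3] for box in boxes),
        )

    def get_text(self) -> str:
        """Return leaf text, or the children's text joined by the kind's separator."""
        if not self.children:
            return self.text
        separator = SEPARATORS.get(self.kind, "")
        return separator.join(child.get_text() for child in self.children)


# Two words on one line, treated here as leaves for brevity.
hello = Node("word", bbox=(10, 10, 50, 20), text="Hello")
world = Node("word", bbox=(55, 10, 100, 22), text="world")
line = Node("line", children=[hello, world])

print(line.bounding_box())  # (10, 10, 100, 22)
print(line.get_text())      # Hello world
```

In this sketch only leaves store coordinates, so a parent's box can never drift out of sync with its children. A real parser may instead keep the boxes reported by the OCR engine at every level.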

Documentation

For more information, read the docs at [Docstruct](https://smrt-co.github.io/docstruct/).

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.

Download files

Source Distribution

docstruct-1.0.237.tar.gz (18.3 kB)

Uploaded Source

Built Distribution

docstruct-1.0.237-py3-none-any.whl (42.7 kB)

Uploaded Python 3

File details

Details for the file docstruct-1.0.237.tar.gz.

File metadata

  • Download URL: docstruct-1.0.237.tar.gz
  • Upload date:
  • Size: 18.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.237.tar.gz
Algorithm    Hash digest
SHA256       dd12220f80ef5ee1a3154617ac7185f15f513bb9e9625ec6af29afcad7a604ef
MD5          c05f7cb00bd9abb7b3b789b44519b5d7
BLAKE2b-256  4de598a81dad815a9dbc6ffc70564ea4019ae34f7eb0704fcb9f3033054522e4
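A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using only Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_digest` are hypothetical, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published one, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the sdist was saved to the current directory):
# expected = "dd12220f80ef5ee1a3154617ac7185f15f513bb9e9625ec6af29afcad7a604ef"
# print(matches_digest("docstruct-1.0.237.tar.gz", expected))
```

Reading in chunks keeps memory use constant regardless of file size. pip can perform the same check automatically when a requirements file pins hashes and is installed with `--require-hashes`.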

File details

Details for the file docstruct-1.0.237-py3-none-any.whl.

File metadata

  • Download URL: docstruct-1.0.237-py3-none-any.whl
  • Upload date:
  • Size: 42.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.2

File hashes

Hashes for docstruct-1.0.237-py3-none-any.whl
Algorithm    Hash digest
SHA256       eb2fd7fef3e3818dc3024ef0ccb026de4d43266158e3971b5a31168a626003ed
MD5          a6659452fe1faf6fe69aa3a7a83f2cfb
BLAKE2b-256  a5cb7f5d7c619607d4d80774b5428570c53473f354547c6843d2d27fb57127fa
