A package for representing a document as a tree of pages, paragraphs, lines, words, and characters.

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. Each node in the tree represents a document, page, paragraph, line, word, or character, along with its bounding box, so the tree mirrors the visual layout of the document. The package also supports paragraph detection and text splitting that preserves logical units.
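To make the idea concrete, here is a minimal, self-contained sketch of such a tree. It does not use Docstruct's API: the names `BBox`, `Node`, `word`, and `split_text` are hypothetical and exist only for illustration. The sketch assumes that a parent's bounding box is the union of its children's boxes, and that "preserving logical units" means never cutting a paragraph in the middle when splitting text into chunks.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BBox:
    """Axis-aligned box given by its (left, top) and (right, bottom) corners."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BBox) -> BBox:
        # Smallest box that covers both boxes.
        return BBox(min(self.left, other.left), min(self.top, other.top),
                    max(self.right, other.right), max(self.bottom, other.bottom))


# Separator used when joining the text of a node's children (an assumption).
SEPARATORS = {"document": "\n\n", "page": "\n\n", "paragraph": " ", "line": " "}


@dataclass
class Node:
    """Hypothetical tree node; NOT Docstruct's actual class."""
    kind: str                       # "document", "page", "paragraph", "line", "word"
    bbox: Optional[BBox] = None     # set explicitly on leaves only
    text: str = ""                  # set on leaves only
    children: List[Node] = field(default_factory=list)

    def bounding_box(self) -> Optional[BBox]:
        # A leaf returns its own box; an inner node returns the union of its children.
        if not self.children:
            return self.bbox
        result = None
        for child in self.children:
            box = child.bounding_box()
            if box is not None:
                result = box if result is None else result.union(box)
        return result

    def get_text(self) -> str:
        if not self.children:
            return self.text
        sep = SEPARATORS.get(self.kind, "")
        return sep.join(child.get_text() for child in self.children)


def word(text: str, left: float, top: float, right: float, bottom: float) -> Node:
    return Node("word", BBox(left, top, right, bottom), text)


def split_text(paragraphs: List[str], max_chars: int) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Two paragraphs on one page, built by hand instead of from OCR output.
line1 = Node("line", children=[word("Hello", 10, 10, 50, 20), word("world", 55, 10, 100, 20)])
line2 = Node("line", children=[word("Second", 10, 30, 60, 40)])
page = Node("page", children=[Node("paragraph", children=[line1]),
                              Node("paragraph", children=[line2])])
doc = Node("document", children=[page])

print(doc.bounding_box())
print(repr(doc.get_text()))
print(split_text([p.get_text() for p in page.children], max_chars=12))
```

Computing the bounding box on demand keeps the sketch short; a real implementation may well cache boxes or store them on every node, as OCR engines report them at each level.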

Documentation

For more information read the docs at: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
