A package for representing a document as a tree of pages, paragraphs, lines, words, and characters

Docstruct

Overview

Docstruct is a package that parses the output of optical character recognition (OCR) engines, such as Tesseract (via its hOCR output) or AWS Textract, into a tree structure. The tree mirrors the visual layout of the document: each node represents a document, page, paragraph, line, word, or character and carries its bounding box. The package also supports paragraph detection and text splitting that preserves logical units.
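To make the tree idea concrete, here is a minimal, self-contained sketch of a document tree whose nodes carry bounding boxes. It is not Docstruct's API: the names `BoundingBox`, `Node`, `get_text`, and `get_bbox` are hypothetical and chosen only for illustration. The sketch assumes that a parent's bounding box is the union of its children's boxes and that words are joined with single spaces.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box given by its (left, top) and (right, bottom) corners."""
    left: float
    top: float
    right: float
    bottom: float

    def union(self, other: BoundingBox) -> BoundingBox:
        # Smallest box that contains both boxes.
        return BoundingBox(
            min(self.left, other.left),
            min(self.top, other.top),
            max(self.right, other.right),
            max(self.bottom, other.bottom),
        )


@dataclass
class Node:
    """Hypothetical tree node; `kind` is e.g. 'document', 'page', 'line', 'word'."""
    kind: str
    text: str = ""
    bbox: BoundingBox | None = None
    children: list[Node] = field(default_factory=list)

    def get_text(self) -> str:
        # Leaves hold text; inner nodes join their children's text with spaces.
        if not self.children:
            return self.text
        return " ".join(child.get_text() for child in self.children)

    def get_bbox(self) -> BoundingBox | None:
        # Leaves hold a box; inner nodes take the union of their children's boxes.
        if not self.children:
            return self.bbox
        boxes = [b for b in (c.get_bbox() for c in self.children) if b is not None]
        if not boxes:
            return self.bbox
        merged = boxes[0]
        for box in boxes[1:]:
            merged = merged.union(box)
        return merged


# Build a tiny page: one paragraph, one line, two words.
line = Node("line", children=[
    Node("word", "Hello", BoundingBox(10, 10, 50, 20)),
    Node("word", "world", BoundingBox(55, 9, 95, 21)),
])
page = Node("page", children=[Node("paragraph", children=[line])])

print(page.get_text())  # Hello world
print(page.get_bbox())  # BoundingBox(left=10, top=9, right=95, bottom=21)
```

Deriving parent boxes from the leaves keeps the tree consistent when words are moved or regrouped, for example during paragraph detection. A real OCR parser may instead store the boxes reported by the OCR engine at every level.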

Documentation

For more information, read the docs: [Docstruct](https://smrt-co.github.io/docstruct/)

Installation

pip install docstruct

Contributions

Contributions to the Docstruct package are always welcome. If you have a bug fix or a new feature, feel free to create a pull request on the GitHub repository.

License

The Docstruct package is licensed under the MIT License.
