Document aesthetics and text extractor

# docSilhouette :tada:

## What is it?

This library wraps pytesseract and adds some useful features for text processing. Specifically, it takes the bounding boxes produced by Tesseract and extracts coherent information from the text layout, such as the page and document position of each text block.

We also apply a greedy algorithm to organize the words into blocks: it first processes lines, then the groups of words exposed by the Tesseract dataframe.
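The greedy grouping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the library's actual implementation: the `Word` class, the `group_into_lines` and `group_into_blocks` helpers, and the `tolerance` and `max_gap` thresholds are all hypothetical. The field names `left`, `top`, `width` and `height` mirror the columns that pytesseract's `image_to_data` reports for each word.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container; fields mirror pytesseract's image_to_data columns.
    text: str
    left: int
    top: int
    width: int
    height: int


def group_into_lines(words, tolerance=0.5):
    """Greedily put each word on the first line whose vertical centre is close."""
    lines = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        centre = word.top + word.height / 2
        for line in lines:
            ref = line[0]
            ref_centre = ref.top + ref.height / 2
            if abs(centre - ref_centre) <= tolerance * ref.height:
                line.append(word)
                break
        else:
            # No existing line was close enough, so start a new one.
            lines.append([word])
    # Order the words of each line from left to right.
    return [sorted(line, key=lambda w: w.left) for line in lines]


def group_into_blocks(lines, max_gap=1.0):
    """Greedily merge consecutive lines separated by a small vertical gap."""
    blocks = []
    for line in lines:
        top = min(w.top for w in line)
        if blocks:
            prev = blocks[-1][-1]
            prev_bottom = max(w.top + w.height for w in prev)
            prev_height = max(w.height for w in prev)
            if top - prev_bottom <= max_gap * prev_height:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


# Made-up bounding boxes for two visually separate lines of text.
words = [
    Word("Universal", 100, 50, 80, 20),
    Word("Language", 190, 52, 80, 20),
    Word("Model", 280, 49, 50, 20),
    Word("1", 100, 200, 10, 20),
    Word("Introduction", 120, 201, 100, 20),
]
lines = group_into_lines(words)
blocks = group_into_blocks(lines)
for block in blocks:
    print(" / ".join(" ".join(w.text for w in line) for line in block))
```

Both passes are greedy in the sense that each word or line is assigned once, to the first group that fits, and is never revisited.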

## How to use

We recommend installing the library with pip:

```shell
pip install docSilhouette
```

Then you can use it:

```python
from docSilhouette.docSilhouette import docSilhouette

doc = docSilhouette('./tests/assets/single_page.pdf')
doc.setup()
print(doc.get_text(True))
```

You might find output like the following:

```shell
xxP001 xxQ00_00 xxbob Universal Language Model Fine-tuning for Text Classification xxeob xxQ00_03
```
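If you want to post-process this output, a small parser along the following lines may help. It is a sketch, not part of docSilhouette: the `parse_blocks` helper and the dictionary layout are hypothetical, and it assumes that quadrants are enabled, so every block is preceded by one `xxQ` token and followed by another (the tokens themselves are described under Special Tokens).

```python
import re

PAGE = re.compile(r"xxP(\d+)")
QUADRANT = re.compile(r"xxQ(\d+)_(\d+)")


def parse_blocks(stream):
    """Collect the xxbob ... xxeob blocks of a token stream (hypothetical helper)."""
    blocks = []
    page = None
    start = None          # quadrant seen just before xxbob
    words = None          # word list while inside a block, else None
    awaiting_end = False  # True right after xxeob, until the closing xxQ
    for token in stream.split():
        page_match = PAGE.fullmatch(token)
        quad_match = QUADRANT.fullmatch(token)
        if words is not None and token != "xxeob":
            # Inside a block everything is kept as text, including
            # markers such as xxbcet and xxecet.
            words.append(token)
        elif page_match:
            page = int(page_match.group(1))
        elif quad_match:
            quadrant = (int(quad_match.group(1)), int(quad_match.group(2)))
            if awaiting_end:
                blocks[-1]["end"] = quadrant
                awaiting_end = False
            else:
                start = quadrant
        elif token == "xxbob":
            words = []
            awaiting_end = False
        elif token == "xxeob" and words is not None:
            blocks.append(
                {"page": page, "start": start, "end": None, "text": " ".join(words)}
            )
            words, start, awaiting_end = None, None, True
    return blocks


sample = (
    "xxP001 xxQ00_00 xxbob Universal Language Model Fine-tuning "
    "for Text Classification xxeob xxQ00_03"
)
parsed = parse_blocks(sample)
print(parsed)
```

Each parsed block carries its page number, its start and end quadrants as `(row, column)` tuples, and the block text.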

## Special Tokens

  • xxP001: Page number

  • xxbob: Begin of block

  • xxeob: End of block

  • xxQ01_00: Quadrant of the block, where 01 is the row of the page matrix and 00 is the column. Check out the image below, which shows a page with the matrix plotted on it. When quadrants are enabled, every block has one xxQ token for its beginning and another for its end. The following example highlights the quadrants of the block 1 Introduction, which starts at row 3, column 0 and ends at row 3, column 1.

```shell
xxQ03_00 xxbob 1 Introduction xxeob xxQ03_01
```

  • xxbcet: Begin of centered text line

  • xxecet: End of centered text line

![](imgs/2022-04-23-15-08-27.png)
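One way to picture how a block's bounding box could map to its quadrant tokens is the sketch below. It is not taken from the library: the `quadrant_token` and `block_quadrants` helpers, the 8 x 4 grid, the page size, and the choice of using the box's top-left and bottom-right corners are all assumptions made for illustration.

```python
def quadrant_token(x, y, page_width, page_height, rows=8, cols=4):
    """Return the xxQ token of the grid cell containing the point (x, y).

    The grid size (rows x cols) is an assumed value, not the library's.
    """
    col = min(x * cols // page_width, cols - 1)
    row = min(y * rows // page_height, rows - 1)
    return f"xxQ{row:02d}_{col:02d}"


def block_quadrants(left, top, width, height, page_width, page_height):
    """Start and end tokens from the top-left and bottom-right corners (assumed rule)."""
    start = quadrant_token(left, top, page_width, page_height)
    end = quadrant_token(left + width, top + height, page_width, page_height)
    return start, end


# Made-up box for the "1 Introduction" block on a 1000 x 1600 pixel page.
start, end = block_quadrants(100, 650, 300, 40, page_width=1000, page_height=1600)
print(start, "xxbob 1 Introduction xxeob", end)
```

With these made-up numbers the block starts in row 3, column 0 and ends in row 3, column 1, matching the token pair shown in the example above.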

## License

MIT
