Document aesthetics and text extractor
# docSilhouette :tada:
## What is it?
This library wraps pytesseract and adds some useful features for text processing. Concretely, it takes the bounding boxes produced by tesseract and extracts coherent layout information from the text's appearance, such as the page and the position within the document of each text block.
It also applies a greedy algorithm to organize the words into blocks: first it processes lines, then it processes the groups of words as exposed by the tesseract dataframe.
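As a rough illustration of the line-then-block order described above, the sketch below first collects words into lines and then attaches the lines to blocks. It uses the `block_num`, `par_num`, `line_num`, `left` and `text` columns found in tesseract's TSV output. It is not docSilhouette's actual implementation: it relies on tesseract's own line and block numbers instead of reproducing the library's greedy heuristic, and the function name `group_words` and the sample rows are hypothetical.

```python
from collections import OrderedDict


def group_words(rows):
    """Group tesseract-style word rows into blocks of text lines.

    `rows` is an iterable of dicts with the keys block_num, par_num,
    line_num, left and text (hypothetical input shaped like the
    tesseract TSV output).
    """
    # Pass 1: collect words into lines keyed by (block, paragraph, line).
    lines = OrderedDict()
    for row in rows:
        text = str(row["text"]).strip()
        if not text:  # skip rows that carry no word text
            continue
        key = (row["block_num"], row["par_num"], row["line_num"])
        lines.setdefault(key, []).append((row["left"], text))

    # Pass 2: join each line left to right, then attach it to its block.
    blocks = OrderedDict()
    for (block_num, _par, _line), words in lines.items():
        line_text = " ".join(word for _left, word in sorted(words))
        blocks.setdefault(block_num, []).append(line_text)
    return list(blocks.values())


# Made-up rows for illustration only.
sample_rows = [
    {"block_num": 1, "par_num": 1, "line_num": 1, "left": 120, "text": "Language"},
    {"block_num": 1, "par_num": 1, "line_num": 1, "left": 10, "text": "Universal"},
    {"block_num": 1, "par_num": 1, "line_num": 2, "left": 10, "text": "Fine-tuning"},
    {"block_num": 2, "par_num": 1, "line_num": 1, "left": 10, "text": "1"},
    {"block_num": 2, "par_num": 1, "line_num": 1, "left": 30, "text": "Introduction"},
    {"block_num": 2, "par_num": 1, "line_num": 1, "left": 0, "text": ""},
]

blocks = group_words(sample_rows)
print(blocks)
```

Sorting by `left` inside each line restores reading order even when the rows arrive out of order.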
## How to use
Install the library with pip:
```shell
pip install docSilhouette
```
Then you can use it:
```python
from docSilhouette.docSilhouette import docSilhouette

doc = docSilhouette('./tests/assets/single_page.pdf')
doc.setup()
print(doc.get_text(True))
```
The output looks like the following:
```shell
xxP001 xxQ00_00 xxbob Universal Language Model Fine-tuning for Text Classification xxeob xxQ00_03
```
## Special Tokens
xxP001: Page number
xxbob: Beginning of block
xxeob: End of block
xxQ01_00: Quadrant of the block, where the first number (01) is the row of the page matrix and the second number (00) is the column. When quadrant output is enabled, every block gets one xxQ token marking where the block begins and another marking where it ends. The following example shows the quadrants of the block "1 Introduction", which starts at row 3, column 0 and ends at row 3, column 1. Refer to the image below for a page with the matrix plotted on it.
```shell
xxQ03_00 xxbob 1 Introduction xxeob xxQ03_01
```
xxbcet: Beginning of a centered text line
xxecet: End of a centered text line
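To make the quadrant tokens more concrete, here is a minimal sketch of how a bounding box could be mapped onto a page matrix. It assumes the start quadrant comes from the top-left corner of the box and the end quadrant from the bottom-right corner. The function name `quadrant_tokens` and the default grid of 10 rows by 4 columns are assumptions made for illustration, not values taken from the library.

```python
def quadrant_tokens(box, page_size, grid=(10, 4)):
    """Return (start_token, end_token) for a bounding box.

    box: (left, top, width, height) in pixels.
    page_size: (page_width, page_height) in pixels.
    grid: (rows, cols) of the page matrix; the default is an assumption.
    """
    left, top, width, height = box
    page_width, page_height = page_size
    n_rows, n_cols = grid

    def cell(x, y):
        # Clamp so points on the right or bottom edge fall in the last cell.
        col = min(int(x * n_cols / page_width), n_cols - 1)
        row = min(int(y * n_rows / page_height), n_rows - 1)
        return row, col

    start_row, start_col = cell(left, top)
    end_row, end_col = cell(left + width, top + height)
    return (
        f"xxQ{start_row:02d}_{start_col:02d}",
        f"xxQ{end_row:02d}_{end_col:02d}",
    )


# A made-up box on a 1000 x 1000 pixel page.
tokens = quadrant_tokens((100, 320, 300, 40), (1000, 1000))
print(tokens)  # ('xxQ03_00', 'xxQ03_01')
```

With these made-up numbers the box spans two columns of the same row, which reproduces the token pair from the "1 Introduction" example.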
![](imgs/2022-04-23-15-08-27.png)
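If you need to consume the tokenized output downstream, a small parser along the following lines may help. It is a sketch that assumes the token order shown in the examples (an optional xxQ token, xxbob, the block text, xxeob, an optional closing xxQ token). The function name `parse_blocks` is hypothetical, and the sample string simply joins the two example outputs from this README.

```python
import re


def parse_blocks(text):
    """Parse docSilhouette-style output into a list of block dicts."""
    blocks = []
    page = None
    start = None          # quadrant seen just before xxbob
    words = None          # words of the block being read, if any
    awaiting_end = False  # True right after xxeob, until a quadrant arrives
    for token in text.split():
        page_match = re.fullmatch(r"xxP(\d+)", token)
        quad_match = re.fullmatch(r"xxQ(\d+)_(\d+)", token)
        if page_match:
            page = int(page_match.group(1))
            awaiting_end = False
        elif quad_match:
            quadrant = (int(quad_match.group(1)), int(quad_match.group(2)))
            if awaiting_end:
                # A quadrant right after xxeob closes the previous block.
                blocks[-1]["end"] = quadrant
                awaiting_end = False
            else:
                start = quadrant
        elif token == "xxbob":
            words = []
            awaiting_end = False
        elif token == "xxeob":
            if words is not None:
                blocks.append({"page": page, "start": start,
                               "end": None, "text": " ".join(words)})
                awaiting_end = True
            words, start = None, None
        elif token in ("xxbcet", "xxecet"):
            continue  # centered-line markers are dropped in this sketch
        elif words is not None:
            words.append(token)
    return blocks


sample = (
    "xxP001 xxQ00_00 xxbob Universal Language Model Fine-tuning "
    "for Text Classification xxeob xxQ00_03 "
    "xxQ03_00 xxbob 1 Introduction xxeob xxQ03_01"
)
parsed = parse_blocks(sample)
for block in parsed:
    print(block)
```

Each returned dict carries the page number, the start and end quadrants as `(row, column)` tuples (or `None` when quadrant output is disabled), and the block text.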
## License
MIT