Your solution to cleansing PDF documents for preprocessing for NLP
Project description
doc_intel
This package is still awaiting several potential fixes; until then, any benefits derived from using it are very much intended.
doc_intel is your solution for getting a largely cleansed and intact text extract from a PDF.
change 0.0.1 (8/23/2021):
- Line breaks used to split full words into smaller, potentially non-dictionary fragments; changes have been made to fix that.
- A dictionary is used to identify how to reconstruct the broken words.
change 0.0.2 (8/28/2021):
- Updated the dictionary; inverse document frequency may later be used instead to fix the line breaks.
- Inverse document frequency is used to precisely split and reconstruct stitched or meaninglessly spaced words.
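The dictionary-based repair of words broken across line breaks could look roughly like the sketch below. It is not the package's implementation: the function name `rejoin_line_breaks` and the `vocabulary` set are hypothetical, and the rule used (join two fragments when the joined form is a known word and the fragments are not both known words) is an assumption made for illustration.

```python
def rejoin_line_breaks(text, vocabulary):
    """Join the last token of a line with the first token of the next line
    when the joined form is in the vocabulary and the halves are not both
    vocabulary words. Hypothetical helper, not part of doc_intel."""
    words = []
    for line in text.splitlines():
        tokens = line.split()
        if words and tokens:
            left, right = words[-1], tokens[0]
            # Drop a trailing hyphen left behind by the line break.
            joined = left.rstrip("-") + right
            halves_known = left.lower() in vocabulary and right.lower() in vocabulary
            if joined.lower() in vocabulary and not halves_known:
                words[-1] = joined
                tokens = tokens[1:]
        words.extend(tokens)
    return " ".join(words)


vocabulary = {"the", "extraction", "is", "clean", "and", "fast"}
print(rejoin_line_breaks("the extrac\ntion is clean\nand fast", vocabulary))
```

With this vocabulary the fragments "extrac" and "tion" are merged, while "clean" and "and" stay separate because "cleanand" is not a known word.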
Feature instructions:
- remove header and footer terms in your document :
from doc_intel import text_laundry
file_path = "/your_path/your_file.pdf"
texts = text_laundry.head_foot(file_path).remove()
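How head_foot detects headers and footers is not documented here. One common approach, sketched below purely as an assumption, is to treat a line as a header or footer when it recurs near the top or bottom of most pages. The function `strip_repeated_margins`, its parameters `depth` and `min_share`, and the page representation (a list of lines per page) are all hypothetical.

```python
import re
from collections import Counter


def strip_repeated_margins(pages, depth=2, min_share=0.6):
    """Remove lines that recur in the top/bottom `depth` lines of at least
    `min_share` of the pages. Hypothetical sketch, not doc_intel's code."""

    def key(line):
        # Mask digits so "Page 1" and "Page 2" count as the same line.
        return re.sub(r"\d+", "#", line.strip().lower())

    counts = Counter()
    for page in pages:
        margin = page[:depth] + page[-depth:]
        counts.update({key(line) for line in margin})

    threshold = min_share * len(pages)
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = []
        for i, line in enumerate(page):
            in_margin = i < depth or i >= len(page) - depth
            if in_margin and key(line) in repeated:
                continue
            kept.append(line)
        cleaned.append(kept)
    return cleaned


pages = [
    ["ACME Annual Report", "Revenue grew.", "Costs fell.", "Page 1"],
    ["ACME Annual Report", "Hiring slowed.", "Outlook is stable.", "Page 2"],
    ["ACME Annual Report", "Risks remain.", "Thank you.", "Page 3"],
]
print(strip_repeated_margins(pages))
```

Only the repeated title and the page-number lines are dropped; body lines survive even when they fall inside the margin window, because they do not recur across pages.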
- scrub off textual noise from your texts:
Arguments:
  - serial numerical noise [bool]: noise like the run of numbers in
    some textual piece but then 101 234 384 927 so all these numbers will be removed if needed
    will be removed.
  - sentences or words interrupted with special characters [bool]: interruptions inside cohesive words and sentences, like
    co -hesive
    or
    sol **utions
    will be removed.
  - lower case [bool]: toggle between bool values for lower or upper casing.
  - substrings which are attached, like
    s u b s t r i n g s w h i c h a r e a t t a c h e d w i l l b e s e p a r a t e d
    will be separated to
    substrings which are attached will be separated
    based on the frequency of the constituent words in the document.
text_object = text_laundry.load_text([input str], remove_serial, sents_or_word_breaks, lower)
cleaned_text = text_object.launder()
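To make the three boolean arguments concrete, here is a minimal regex-based sketch of what such scrubbing might do. It is not the code behind load_text or launder: the function `scrub`, both regular expressions, and the reading that lower=True means lower case while lower=False means upper case are assumptions. The patterns are deliberately simple and would also strip legitimate number runs or spaced dashes.

```python
import re

# Two or more whitespace-separated groups of digits, e.g. "101 234 384 927".
SERIAL_NUMBERS = re.compile(r"\b\d+(?:\s+\d+)+\b")
# Special characters wedged into a word with whitespace on at least one side,
# e.g. "co -hesive" or "sol **utions"; "well-known" is left untouched.
WORD_BREAK = re.compile(r"(?<=[A-Za-z])(?:\s+[-*_#~]+\s*|[-*_#~]+\s+)(?=[A-Za-z])")


def scrub(text, remove_serial=True, sents_or_word_breaks=True, lower=True):
    """Hypothetical stand-in that mirrors the documented argument names."""
    if remove_serial:
        text = SERIAL_NUMBERS.sub(" ", text)
    if sents_or_word_breaks:
        text = WORD_BREAK.sub("", text)
    # Collapse the whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lower else text.upper()


print(scrub("Some piece 101 234 384 927 with co -hesive sol **utions"))
```

For the sample string this prints "some piece with cohesive solutions".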
- restructure stitched words into separate dictionary words:
from doc_intel import despace
texts = "your text str"
text_stats = despace.deSpace(texts)
piece = "constituent stitched text string"
fixed_texts = text_stats.infer_spaces(piece)
Credits to the original author of the infer_spaces function: https://stackoverflow.com/a/11642687/13115158
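The idea of splitting stitched text by word frequency can be sketched with dynamic programming: every known word gets a cost that grows as the word gets rarer in the document, and the cheapest segmentation wins. The class below is a hypothetical stand-in, not the package's deSpace class; the name `Despacer`, the cost formula, and the penalty for unknown characters are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


class Despacer:
    """Hypothetical sketch of frequency-based word splitting."""

    def __init__(self, text):
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        total = sum(counts.values())
        # Rarer words cost more; the constant discourages over-splitting.
        self.cost = {w: math.log(total / c) + 1.0 for w, c in counts.items()}
        self.max_len = max((len(w) for w in counts), default=1)
        # Characters that fit no known word are emitted alone at a high price.
        self.unknown_cost = math.log(total + 1) + 10.0

    def _word_cost(self, chunk):
        if chunk in self.cost:
            return self.cost[chunk]
        return self.unknown_cost if len(chunk) == 1 else math.inf

    def infer_spaces(self, piece):
        # Drop existing whitespace so meaninglessly spaced input works too.
        piece = re.sub(r"\s+", "", piece.lower())
        # best[i] holds (cheapest cost of piece[:i], start of its last word).
        best = [(0.0, 0)]
        for end in range(1, len(piece) + 1):
            first = max(0, end - self.max_len)
            best.append(min(
                (best[start][0] + self._word_cost(piece[start:end]), start)
                for start in range(first, end)
            ))
        # Walk back through the recorded starts to recover the words.
        words, end = [], len(piece)
        while end > 0:
            start = best[end][1]
            words.append(piece[start:end])
            end = start
        return " ".join(reversed(words))


document = "Substrings which are attached will be separated. " * 3
stats = Despacer(document)
print(stats.infer_spaces("substringswhichareattachedwillbeseparated"))
```

Because the costs come from the document itself, the sketch can only recover words that already occur somewhere in the text passed to the constructor; anything else falls back to single characters.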
Hashes for doc_intel-0.0.9-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | a2ce09f499fb177ed97347de37ec00d4e0e1b919df7254ad47777ed0f143ef5b
MD5 | 9a7422d970ed0f04954c87370b332d95
BLAKE2b-256 | fc22fb9e35325af0268471695108915c6a7ccae39182453802086c726c9372e1