Your solution to cleansing PDF documents for preprocessing for NLP
Project description
doc_intel
pip install doc-intel
This package is still subject to several potential fixes; until they land, any benefit you derive from using it is very much intended.
doc_intel is your solution for extracting largely cleansed and intact text from a PDF.
change 0.0.1 (8 / 23 / 2021) :
- Line breaks used to split full words into smaller, potentially non-dictionary fragments; changes have been made to fix that.
- A dictionary has been used to identify how to reconstruct the broken words.
change 0.0.2 (8 / 28 / 2021) :
- Updated dictionary, maybe inverse document frequency will later be used instead to fix the line breaks.
- Inverse document frequency has been used to precisely split and reconstruct stitched or meaninglessly spaced words.
change 0.0.3 (9 / 10 / 2021) :
- added and removed words in word.txt, fixed dot spacing, and kept numbers detached from words after processing.
change 0.0.4 (9 / 14 / 2021) :
- added and removed words in word.txt, maintained the positions of apostrophes.
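The dictionary-based repair mentioned under change 0.0.1 can be illustrated roughly as follows. This is a minimal sketch and not doc_intel's actual implementation: the function name rejoin_broken_words and the vocabulary set are hypothetical, and the real package reads its word list from word.txt.

```python
import re


def rejoin_broken_words(text, vocabulary):
    """Rejoin a word split across a line break when the joined form is known.

    `vocabulary` is a hypothetical set of lowercase dictionary words.
    """
    lines = text.split("\n")
    result = lines[0]
    for nxt in lines[1:]:
        # Trailing word fragment on the text so far (optional hyphen at the end).
        head = re.search(r"([A-Za-z]+)-?$", result)
        # Leading word fragment on the next line.
        tail = re.match(r"[A-Za-z]+", nxt)
        if head and tail:
            joined = head.group(1) + tail.group(0)
            if joined.lower() in vocabulary:
                # Stitch the two fragments into one dictionary word.
                result = result[:head.start()] + joined + nxt[tail.end():]
                continue
        # Otherwise treat the line break as an ordinary space.
        result = result + " " + nxt
    return result
```

Fragments are only merged when the joined form is in the vocabulary, so ordinary line breaks between separate words are left as spaces.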
Feature instruction:
- remove header and footer terms in your document :
from doc_intel import text_laundry
file_path = "/your_path/your_file.pdf"
texts = text_laundry.head_foot(file_path).remove()
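How head_foot detects headers and footers is not documented here. One common approach, shown below purely as an assumption, is to drop lines that recur at the top or bottom of most pages. The function strip_repeated_edges and its parameters are hypothetical, and the sketch expects the page texts to have been extracted already.

```python
from collections import Counter


def strip_repeated_edges(pages, edge_lines=1, min_share=0.6):
    """Drop first/last lines that recur on at least `min_share` of the pages.

    `pages` is a list of page texts (str). Hypothetical helper, not part of
    doc_intel.
    """
    if len(pages) < 2:
        # Repetition cannot be judged from a single page.
        return list(pages)
    split_pages = [
        [ln.strip() for ln in page.splitlines() if ln.strip()] for page in pages
    ]
    counts = Counter()
    for lines in split_pages:
        # Count each edge line once per page.
        counts.update(set(lines[:edge_lines] + lines[-edge_lines:]))
    threshold = min_share * len(pages)
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    cleaned = []
    for lines in split_pages:
        kept = [
            ln
            for i, ln in enumerate(lines)
            if not (
                ln in repeated
                and (i < edge_lines or i >= len(lines) - edge_lines)
            )
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

Edge lines that change from page to page, such as page numbers, are not caught by this exact-match comparison.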
- scrub off textual noise from your texts:
Arguments :
- serial numerical noise [bool] : runs of stray numbers, as in "some textual piece but then 101 234 384 927", will be removed.
- sentences or words interrupted with special characters [bool] : interruptions inside otherwise cohesive words and sentences, like "co -hesive" or "sol **utions", will be removed.
- lower case [bool] : toggle between bool values for lower or upper casing.
- "substringswhichareattachedwillbeseparated" will be separated to "substrings which are attached will be separated" based on the frequency of the constituent words in the document.
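The first two kinds of noise can be approximated with regular expressions. The sketch below is an assumption about what such scrubbing might look like, not the package's own rules. The names SERIAL_RUN, BROKEN_WORD and scrub are hypothetical.

```python
import re

# Three or more whitespace-separated numbers in a row, e.g. "101 234 384 927".
SERIAL_RUN = re.compile(r"\b\d+(?:\s+\d+){2,}\b")
# Whitespace plus a run of '-' or '*' wedged between two letters, e.g. "co -hesive".
BROKEN_WORD = re.compile(r"(?<=[A-Za-z])\s+[-*]+(?=[A-Za-z])")


def scrub(text):
    """Remove serial number runs and symbol interruptions inside words."""
    text = SERIAL_RUN.sub(" ", text)
    text = BROKEN_WORD.sub("", text)
    # Collapse the extra spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

Ordinary hyphenated words such as "well-known" are left alone because the pattern requires whitespace before the symbol. The number rule is blunt, though: it would also strip a legitimate run such as three consecutive years.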
text_object = text_laundry.load_text(input_text, remove_serial, sents_or_word_breaks, lower)  # input_text is the input str
cleaned_text = text_object.launder()
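Separating attached substrings by word frequency can be sketched with a small dynamic program that picks the most probable split under unigram counts taken from the document itself. This illustrates the general idea only. The helpers build_counts and segment are hypothetical and are not part of the doc_intel API.

```python
import math
import re
from collections import Counter


def build_counts(text):
    """Count lowercase alphabetic words in the document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def segment(token, counts, max_len=20):
    """Split `token` into the most probable sequence of known words.

    Returns [token] unchanged when no full split into known words exists.
    """
    total = sum(counts.values())
    n = len(token)
    # best[i] holds (log-probability, words) for the prefix token[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = token[start:end]
            if piece not in counts or best[start][1] is None:
                continue
            score = best[start][0] + math.log(counts[piece] / total)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n][1] or [token]
```

Text with a space between every letter can be handled the same way by deleting the spaces first, for example with "".join(text.split()), and then segmenting the result.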
Authored & Maintained By : Vishak Arudhra
File details
Details for the file doc_intel-0.0.10.tar.gz.
File metadata
- Download URL: doc_intel-0.0.10.tar.gz
- Upload date:
- Size: 879.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | f84672747ecae220f39399811aee3c88e41d141abfa8058a3b333a6f01dde9ec
MD5 | da9aefffbd9b64147a9dc9409d1d54b7
BLAKE2b-256 | 5bd150d88a984755e090baad9c3c9b8927895872656a8bc227d40eab30c75d2e
File details
Details for the file doc_intel-0.0.10-py3-none-any.whl.
File metadata
- Download URL: doc_intel-0.0.10-py3-none-any.whl
- Upload date:
- Size: 878.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.6.1 requests/2.24.0 requests-toolbelt/0.9.1 tqdm/4.50.2 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | e33f72698e5d227a90a14e00e636e59b39a7c1a42be33d7de6653d7d02d7b8c3
MD5 | da5200ad8cb16d804ab5e8ff9231ec25
BLAKE2b-256 | bfdbcab3ae3e0abbab11f7f53df838b485500a9dac1b947f541da26a4e85b0d0