
Your solution to cleansing PDF documents for preprocessing for NLP

Project description

doc_intel

pip install doc-intel

This package is still subject to several potential fixes; until then, any benefit you derive from using it is very much intended.

doc_intel gives you a largely cleansed and intact text extract from a PDF.

change 0.0.1 (8 / 23 / 2021) :
  • Line breaks used to split whole words into smaller, potentially non-dictionary fragments; changes have been made to fix that.
  • A dictionary is used to identify how to reconstruct the broken words.
change 0.0.2 (8 / 28 / 2021) :
  • Updated the dictionary; inverse document frequency may later be used instead to fix the line breaks.
  • Inverse document frequency is used to precisely split and reconstruct stitched or meaninglessly spaced words.
change 0.0.3 (9 / 10 / 2021) :
  • Added and removed words in word.txt, fixed dot spacing, and preserved the separation between numbers and words during post-processing.
change 0.0.4 (9 / 14 / 2021) :
  • Added and removed words in word.txt; maintained the positions of apostrophes.

Major Version Update 1.0.0 (10 / 19 / 2021) :

  • A new module has been added for deleting and adding words in the words text file.
  • A usage demonstration can be found under the feature instruction section.
change 1.0.1 (10 / 21 / 2021) :
  • Fixed an issue where decimal numbers were separated from the decimal point by a space.
  • The words text file is now closed in the main code block to prevent unclosed-file issues.

Feature instruction:

REMOVE THE HEADER AND FOOTER TERMS FROM YOUR DOCUMENT:

from doc_intel import text_laundry

file_path = "/your_path/your_file.pdf"

texts = text_laundry.head_foot(file_path).remove()
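As a rough picture of what header and footer removal involves, the sketch below drops first and last lines that recur on most pages. It is not doc_intel's implementation: the function name `strip_repeated_edges`, its parameters, and the sample pages are all hypothetical, and it assumes each page has already been extracted into a list of lines.

```python
from collections import Counter

def strip_repeated_edges(pages, edge_lines=1, min_share=0.6):
    """Drop top/bottom lines that recur on at least min_share of the pages.

    Hypothetical helper for illustration only; not part of doc_intel.
    """
    counts = Counter()
    for lines in pages:
        # Use a set so a line is counted once per page.
        edges = set(lines[:edge_lines]) | set(lines[-edge_lines:])
        counts.update(edges)
    threshold = min_share * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = [
            line
            for i, line in enumerate(lines)
            # Only remove a repeated line when it sits at the page edge.
            if not (line in repeated
                    and (i < edge_lines or i >= len(lines) - edge_lines))
        ]
        cleaned.append(kept)
    return cleaned

# Made-up sample pages, each a list of extracted lines.
pages = [
    ["ACME Annual Report", "Revenue grew in 2020.", "Confidential"],
    ["ACME Annual Report", "Costs were stable.", "Confidential"],
    ["ACME Annual Report", "Outlook is positive.", "Confidential"],
]
cleaned = strip_repeated_edges(pages)
print(cleaned)
```

Footers that change from page to page, such as page numbers, would not be caught by exact matching like this and would need a looser comparison.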

SCRUB OFF TEXTUAL NOISE FROM YOUR TEXTS:

Arguments :
  • serial numerical noise [bool]: noise of the form (text) + (random numbers, spaced or continuous) will be removed.

  • sentences or words interrupted with special characters [bool]: interruptions inside otherwise cohesive words and sentences, such as co -hesive or sol **utions, will be removed.

  • lower case [bool]: toggle this boolean to switch between lower and upper casing.

  • s u b s t r i n g s w h i c h a r e a t t a c h e d w i l l b e s e p a r a t e d --> substrings which are attached will be separated (based on the frequency of the constituent words in the document).
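To make the "interrupted with special characters" case concrete, here is a minimal regex sketch that rejoins fragments such as co -hesive. The rule it encodes is an assumption made for illustration, not doc_intel's actual logic: a run of `-` or `*` with whitespace on exactly one side, sitting between two letters, is treated as an interruption. The names `INTERRUPTION` and `join_interrupted` are hypothetical.

```python
import re

# Assumed rule (not taken from doc_intel): a run of '-' or '*' that has
# whitespace on exactly one side and sits between a letter and a lowercase
# letter is an interruption and is removed.
INTERRUPTION = re.compile(r"(?<=[A-Za-z])(?:\s+[-*]+|[-*]+\s+)(?=[a-z])")

def join_interrupted(text):
    """Rejoin word fragments separated by stray '-' or '*' runs."""
    return INTERRUPTION.sub("", text)

sample = "co -hesive sol **utions for well-known a - b cases"
repaired = join_interrupted(sample)
print(repaired)
```

Requiring whitespace on exactly one side leaves ordinary hyphenated words (well-known) and free-standing dashes (a - b) untouched.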

text_object = text_laundry.load_text([input str], remove_serial, sents_or_word_breaks, lower)
cleaned_text = text_object.launder()
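The separation of attached substrings listed among the arguments can be pictured with a small dynamic-programming sketch that picks the split whose pieces are most frequent in the document. This is an illustrative assumption about the approach, using plain word frequency rather than whatever weighting doc_intel applies internally; `build_costs` and `segment` are hypothetical names.

```python
import math
import re
from collections import Counter

def build_costs(document_text):
    """Map each word in the document to a cost; frequent words cost less."""
    words = re.findall(r"[a-z]+", document_text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: -math.log(n / total) for word, n in counts.items()}

def segment(stitched, costs, max_len=20):
    """Split a run-together string into known words with minimal total cost."""
    n = len(stitched)
    best = [0.0] + [math.inf] * n   # best[i]: cheapest split of stitched[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = stitched[start:end]
            if piece in costs and best[start] + costs[piece] < best[end]:
                best[end] = best[start] + costs[piece]
                back[end] = start
    if math.isinf(best[n]):
        return [stitched]           # no split into known words was found
    pieces = []
    end = n
    while end > 0:
        pieces.append(stitched[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Made-up document text used only to supply word frequencies.
document = ("substrings which are attached will be separated. "
            "attached substrings are common.")
costs = build_costs(document)
spaced = "s u b s t r i n g s w h i c h a r e a t t a c h e d"
result = segment(spaced.replace(" ", ""), costs)
print(" ".join(result))
```

Words that never occur in the document cannot be recovered this way, which is why the sketch returns the input unchanged when no full split exists.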

ADD AND DELETE WORDS FROM IN-BUILT DICTIONARY:

  • PDF documents are not written with text extraction in mind, so many words that continue across a line break end up broken by ordinary text extraction.
  • doc-intel's in-built dictionary supports identification of many thousands of words and is now editable. Add and remove words as you require for smooth text extraction.
  • do-cument --> document
from doc_intel import manage_diction

word_list = " your list of words to either add or delete "
manage_diction.register_words(word_list).Add()  # or
manage_diction.register_words(word_list).Delete()
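To show why the dictionary contents matter, the sketch below rejoins a fragment ending in a hyphen with the following fragment only when the joined word is in a word set. It is a simplified illustration, not the package's code: `rejoin_broken_words` is a hypothetical function, the word set stands in for the in-built dictionary, and the input is assumed to contain the break as a hyphen followed by whitespace.

```python
def rejoin_broken_words(text, dictionary):
    """Join 'do-' + 'cument' into 'document' when the result is a known word.

    Hypothetical helper; note that it also collapses runs of whitespace.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.endswith("-") and i + 1 < len(tokens):
            joined = token[:-1] + tokens[i + 1]
            # Ignore trailing punctuation when looking the word up.
            if joined.strip(".,;:!?").lower() in dictionary:
                out.append(joined)
                i += 2
                continue
        out.append(token)
        i += 1
    return " ".join(out)

# Stand-in for the in-built dictionary.
dictionary = {"document", "extraction"}
fixed = rejoin_broken_words("the do-\ncument ex- traction is well- known.", dictionary)
print(fixed)
```

Here "well- known." stays broken because "wellknown" is not in the word set, which is the kind of gap that adding words to the dictionary is meant to close.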
Authored & Maintained By : Vishak Arudhra
