Your solution for cleansing PDF documents for NLP preprocessing

Project description

doc_intel

This package is still subject to several pending fixes; until then, any benefit you derive from using it is very much intended.

doc_intel is your solution for getting a largely cleansed and intact text extract from a PDF.

change 0.0.1 (8/23/2021):
  • Line breaks were splitting full words into smaller, potentially non-dictionary words; changes have been made to fix that.
  • A dictionary is used to identify how to reconstruct the broken words.
change 0.0.2 (8/28/2021):
  • Updated the dictionary; inverse document frequency may later replace it for fixing line breaks.
  • Inverse document frequency is now used to precisely split and reconstruct stitched or meaninglessly spaced words.

Feature instructions:

  • remove header and footer terms in your document (a sketch of the general idea follows the example):
from doc_intel import text_laundry

file_path = "/your_path/your_file.pdf"

texts = text_laundry.head_foot(file_path).remove()
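
The package does not document how head_foot works internally. For intuition only, here is a minimal sketch of the usual approach to this problem: lines that repeat at the top or bottom of most pages are treated as running headers or footers and dropped. It assumes the pypdf library for page-wise extraction; strip_repeated_edges and min_ratio are hypothetical names, not part of doc_intel.

from collections import Counter
from pypdf import PdfReader

def strip_repeated_edges(pdf_path, min_ratio=0.6):
    # Extract each page's text as a list of lines, skipping empty pages.
    pages = [page.extract_text().splitlines() for page in PdfReader(pdf_path).pages]
    pages = [lines for lines in pages if lines]
    # Count how often each first/last line recurs across pages.
    edge_counts = Counter()
    for lines in pages:
        edge_counts.update({lines[0], lines[-1]})
    # A line appearing at a page edge on most pages is a header/footer.
    boilerplate = {line for line, n in edge_counts.items() if n >= min_ratio * len(pages)}
    return "\n".join(line for lines in pages for line in lines if line not in boilerplate)
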
  • scrub off textual noise from your texts (a rough illustration of the targeted noise follows the example):
Arguments:
  • remove_serial [bool]: serial numerical noise, i.e. a textual piece followed by number runs like 101 234 384 927; if enabled, all such numbers are removed.

  • sents_or_word_breaks [bool]: interruptions by special characters inside cohesive words and sentences, like co -hesive or sol **utions, are removed.

  • lower [bool]: toggle between lower and original casing.

  • Meaninglessly spaced text such as s u b s t r i n g s w h i c h a r e a t t a c h e d is reconstructed into substrings which are attached, based on the frequency of the constituent words in the document.

text_object = text_laundry.load_text(input_str, remove_serial, sents_or_word_breaks, lower)
cleaned_text = text_object.launder()
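
For intuition, here is a rough, hypothetical illustration of the kinds of noise these flags target; it is not doc_intel's internal code, just plain regular expressions over a toy string.

import re

noisy = "Results were co -hesive and sol **utions followed 101 234 384 927"
# remove_serial: drop runs of two or more standalone numbers
no_serial = re.sub(r"(?:\b\d+\b ?){2,}", "", noisy)
# sents_or_word_breaks: stitch words interrupted by spaces and special characters
no_breaks = re.sub(r"(\w+)\s*[-*]+\s*(\w+)", r"\1\2", no_serial)
# lower: toggle lower casing
print(no_breaks.lower())  # -> "results were cohesive and solutions followed "
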
  • restructure stitched words into separate dictionary words:
from doc_intel import despace

texts = "your text string"

text_stats = despace.deSpace(texts)

piece = "a constituent stitched text string"

fixed_texts = text_stats.infer_spaces(piece)

Credits to the original author of the infer_spaces function: https://stackoverflow.com/a/11642687/13115158
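
The linked Stack Overflow answer frames de-spacing as dynamic programming: pick the segmentation of the stitched string that minimizes total word cost, where a word's cost falls with its frequency (per Zipf's law). Below is a condensed sketch of that idea using a toy frequency-ranked word list; the real implementation loads a large word-frequency file, and the names here are illustrative only.

from math import log

# Toy word list, ordered from most to least frequent; cost grows with rank.
words = ["the", "of", "your", "to", "solution", "text"]
cost = {w: log((i + 1) * log(len(words) + 1)) for i, w in enumerate(words)}

def infer_spaces(s):
    # best[i] = (minimal cost, last word length) for segmenting s[:i]
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        best.append(min(
            (best[j][0] + cost.get(s[j:i], 9e99), i - j)
            for j in range(max(0, i - 12), i)  # cap candidate word length at 12
        ))
    # Walk back through the chosen word lengths to recover the split.
    out, i = [], len(s)
    while i > 0:
        out.append(s[i - best[i][1]:i])
        i -= best[i][1]
    return " ".join(reversed(out))

print(infer_spaces("yoursolutiontothetext"))  # -> "your solution to the text"

Per the 0.0.2 change note, doc_intel derives these frequencies from the document itself (inverse document frequency) rather than from a fixed dictionary.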

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

doc_intel-0.0.9.tar.gz (879.2 kB)

Uploaded Source

Built Distribution

doc_intel-0.0.9-py3-none-any.whl (877.7 kB)

Uploaded Python 3

File details

Details for the file doc_intel-0.0.9.tar.gz.

File metadata

  • Download URL: doc_intel-0.0.9.tar.gz
  • Upload date:
  • Size: 879.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.8

File hashes

Hashes for doc_intel-0.0.9.tar.gz:
  • SHA256: 1010c9e44381720abdeaa9e2d241100fe917c1f623e2af8f19b7b6f99debdf59
  • MD5: 8e1506d55a0908fdbd6a49804c0bcdd0
  • BLAKE2b-256: 1982b0f3ebfa99d2c5646eef6f9a55dc80f5fe56edb0b19e9029021f26dae535


File details

Details for the file doc_intel-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: doc_intel-0.0.9-py3-none-any.whl
  • Upload date:
  • Size: 877.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.7.3 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.8

File hashes

Hashes for doc_intel-0.0.9-py3-none-any.whl:
  • SHA256: a2ce09f499fb177ed97347de37ec00d4e0e1b919df7254ad47777ed0f143ef5b
  • MD5: 9a7422d970ed0f04954c87370b332d95
  • BLAKE2b-256: fc22fb9e35325af0268471695108915c6a7ccae39182453802086c726c9372e1

