Preprocess textual data for NLP and machine learning applications

Project description

Use this library as an out-of-the-box solution for common text preprocessing problems. It already includes a wide range of text processing methods, and I will keep adding more. The library is simple and intuitive to use and makes it easy to clean data for your NLP/NLU/machine learning pipelines.

Prerequisite steps to complete before use:

  1. Download the NLTK stopwords by running the following in your Python interpreter:
>>> import nltk
>>> nltk.download('stopwords')
  2. Install the spaCy en_core_web_sm model by running the following in your machine's terminal:
$ python3 -m spacy download en
  3. Install the pathlib package with the command:
$ pip install pathlib
  4. Make sure that you are using Python 3.x.
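
The prerequisites above can be sanity-checked before importing the library. The following is a minimal sketch, not part of dataPreprocess: the helper names `missing_prerequisites` and `python_version_ok` are hypothetical, and the module names are assumed from the steps listed above. It only checks that the packages are importable; it does not confirm that the NLTK stopwords corpus has been downloaded.

```python
import importlib.util
import sys

# Module names assumed from the prerequisite list; adjust them if your setup differs.
REQUIRED_MODULES = ("nltk", "spacy", "en_core_web_sm")


def missing_prerequisites(modules=REQUIRED_MODULES):
    """Return the names of the given top-level modules that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def python_version_ok(minimum=(3, 0)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


print("Python version OK:", python_version_ok())
print("Missing modules:", missing_prerequisites() or "none")
```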

Sample demonstration of some of the methods in the library:

>>> from dataPreprocess.preprocess import Preprocess
>>> text = "<br> This is   the firt     line. And this    is the 23   secodn lie. </br>"
>>> Data_preprocessor = Preprocess()
>>> clean_text = Data_preprocessor.strip_html_tags(text)
>>> clean_text = Data_preprocessor.text_lowercase(clean_text)
>>> clean_text = Data_preprocessor.correct_spellings(clean_text)
>>> clean_text = Data_preprocessor.remove_stopwords(clean_text)
>>> clean_text = Data_preprocessor.remove_whitespace(clean_text)
>>> clean_text = Data_preprocessor.remove_numbers(clean_text)
>>> clean_text = Data_preprocessor.correct_spellings(clean_text)

As demonstrated above, the library's methods can be chained in series without any hassle. This also spares you the headache of matching the differing input format requirements of the libraries otherwise available online from different contributors. PS: I wrote the code from scratch and did not copy code from other contributors.
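
Chaining the methods in series can also be written as a small pipeline helper. This is a rough sketch under one assumption taken from the demonstration above: every step accepts a string and returns a string. The helper `run_pipeline` and the stand-in step `collapse_whitespace` are hypothetical names used so the example runs without the library installed.

```python
import re
from functools import reduce


def run_pipeline(text, steps):
    """Apply each step to the output of the previous one, in order."""
    return reduce(lambda current, step: step(current), steps, text)


# Stand-in step (not the library's implementation) so the sketch is self-contained.
def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


steps = [str.lower, collapse_whitespace]
print(run_pipeline("  This is   the FIRST   line.  ", steps))
# prints: this is the first line.

# With the library installed, the steps could instead be bound methods,
# assuming they behave as in the demonstration above:
#   p = Preprocess()
#   steps = [p.strip_html_tags, p.text_lowercase, p.remove_whitespace]
```

Keeping the steps in a list makes the order explicit and easy to change, which matters because some steps (for example spelling correction) can give different results depending on what ran before them.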

Note that despite supporting so many functions, this library is very fast, so adding it to your production pipeline will not hold you back at all ;)

Right now this library supports 21 different functions to clean your text right out of the box. The list of functions is as follows:

  1. text_lowercase
  2. text_uppercase
  3. remove_numbers
  4. remove_punctuation
  5. remove_whitespace
  6. remove_stopwords
  7. stem_words
  8. lemmatize_words
  9. pos_tagging
  10. NER
  11. remove_emoji
  12. remove_emoticons
  13. emoticon_to_words
  14. remove_urls
  15. remove_html
  16. correct_spellings
  17. Remove_special_char
  18. Expand_contractions
  19. remove_accented_chars
  20. convert_number_towords
  21. remove_freqwords

The functionality of the methods listed above is largely self-explanatory.
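
For readers new to text cleaning, here is a rough idea of what a few of these operations typically do. The functions below are standard-library stand-ins written for illustration only; they reuse some of the names from the list but are not the library's implementations, and the library's actual behavior may differ.

```python
import re
import string


def remove_numbers(text):
    """Delete runs of digits."""
    return re.sub(r"\d+", "", text)


def remove_punctuation(text):
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_whitespace(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def remove_urls(text):
    """Delete http(s) and www-style URLs."""
    return re.sub(r"(https?://|www\.)\S+", "", text)


sample = "Visit https://example.com now!  It has 23   pages."
cleaned = remove_whitespace(remove_punctuation(remove_numbers(remove_urls(sample))))
print(cleaned)
# prints: Visit now It has pages
```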

This library is still in the development phase, and I will keep adding functions to it beyond text cleaning.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataPreprocess-0.2.7.tar.gz (28.7 kB)

Uploaded Source

Built Distribution

dataPreprocess-0.2.7-py3-none-any.whl (29.5 kB)

Uploaded Python 3

File details

Details for the file dataPreprocess-0.2.7.tar.gz.

File metadata

  • Download URL: dataPreprocess-0.2.7.tar.gz
  • Upload date:
  • Size: 28.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.1 setuptools/49.6.0.post20201009 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.6.12

File hashes

Hashes for dataPreprocess-0.2.7.tar.gz
Algorithm Hash digest
SHA256 2557590ecce7abaca3a333191f227de06897859c582fae105e3e093087e592f3
MD5 101697c9d29600e42cab7aed0fa0c1ad
BLAKE2b-256 c6a02da71173d76c9cb5acc0e30276690a4e9769ea532d1d3347874c681c606c
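
A downloaded file can be checked against the SHA256 digest listed above. The sketch below uses only Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path in the commented example is an assumption about where the download was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest listed above for dataPreprocess-0.2.7.tar.gz.
EXPECTED_SHA256 = "2557590ecce7abaca3a333191f227de06897859c582fae105e3e093087e592f3"

# Example (the path is an assumption; point it at your own download):
# matches_published_hash("dataPreprocess-0.2.7.tar.gz", EXPECTED_SHA256)
```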

File details

Details for the file dataPreprocess-0.2.7-py3-none-any.whl.

File metadata

  • Download URL: dataPreprocess-0.2.7-py3-none-any.whl
  • Upload date:
  • Size: 29.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.25.1 setuptools/49.6.0.post20201009 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.6.12

File hashes

Hashes for dataPreprocess-0.2.7-py3-none-any.whl
Algorithm Hash digest
SHA256 6dfbc3aea97e28937df7aff98e37f81d7f4094196a3ad188a920e666c681f6d4
MD5 63e12df0d073d2c7a34e923620865892
BLAKE2b-256 6b655c9f5e40f44fceccb3ce264231ec712e6e180f98e887faf118b89511310f
