Skip to main content

Neattext - a simple NLP package for cleaning text

Project description

neattext

NeatText:a simple NLP package for cleaning textual data and text preprocessing

Build Status

GitHub license

Problem

  • Cleaning of unstructured text data
  • Reduce noise [special characters,stopwords]
  • Reducing repetition of using the same code for text preprocessing

Solution

  • convert the already known solution for cleaning text into a reuseable package

Installation

pip install neattext

Usage

  • The OOP Way(Object Oriented Way)

Clean Text

  • Clean text by removing emails,numbers,stopwords,emojis,etc
>>> from neattext import TextCleaner
>>> docx = TextCleaner()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.clean_text()

Remove Emails,Numbers,Phone Numbers

>>> docx.remove_emails()
>>> 'This is the mail  ,our WEBSITE is https://example.com 😊.'
>>>
>>> docx.remove_stopwords()
>>> 'This mail example@gmail.com ,our WEBSITE https://example.com 😊.'
>>>
>>> docx.remove_numbers()
>>> docx.remove_phone_numbers()

Remove Special Characters

>>> docx.remove_special_characters()

Remove Emojis

>>> docx.remove_emojis()
>>> 'This is the mail example@gmail.com ,our WEBSITE is https://example.com .'

Replace Emails,Numbers,Phone Numbers

>>> docx.replace_emails()
>>> docx.replace_numbers()
>>> docx.replace_phone_numbers()

Using TextExtractor

  • To Extract emails,phone numbers,numbers,urls,emojis from text
>>> from neattext import TextExtractor
>>> docx = TextExtractor()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.extract_emails()
>>> ['example@gmail.com']
>>>
>>> docx.extract_emojis()
>>> ['😊']

Using TextMetrics

  • To Find the Words Stats such as counts of vowels,consonants,stopwords,word-stats
>>> from neattext import TextMetrics
>>> docx = TextMetrics()
>>> docx.text = "This is the mail example@gmail.com ,our WEBSITE is https://example.com 😊."
>>> docx.count_vowels()
>>> docx.count_consonants()
>>> docx.count_stopwords()
>>> docx.word_stats()

Usage

  • The MOP(method/function oriented way) Way
>>> from neattext.neattext import clean_text,extract_emails
>>> t1 = "This is the mail example@gmail.com ,our WEBSITE is https://example.com ."
>>> clean_text(t1,True)
>>>'this is the mail <email> ,our website is <url> .'
>>> extract_emails(t1)
>>> ['example@gmail.com']

Documentation

Please read the documentation for more information on what neattext does and how to use is for your needs.

More Features To Add

  • unicode explainer
  • currency normalizer

Acknowledgements

  • Inspired by packages like clean-text from Johannes Fillter and textify by JCharisTech

NB

  • Contributions Are Welcomed
  • Notice a bug, please let us know.
  • Thanks A lot

By

  • Jesse E.Agbe(JCharis)
  • Jesus Saves @JCharisTech

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

neattext-0.0.2.tar.gz (6.3 kB view hashes)

Uploaded Source

Built Distribution

neattext-0.0.2-py3-none-any.whl (6.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page