An open-source Python package to process text data

========== processtext ==========

processtext is an open-source Python package for cleaning raw text data.

Installation

processtext requires Python 3, NLTK, and Autocorrect to execute.

To install using pip, run:

pip install processtext

Features

The processtext package provides the following functions:

  • degroup_num: Removes commas (,) between digits inside a string
  • remove_hyphen: Removes hyphens (-) between pieces of text
  • int_to_en: Returns a whole number as English text, e.g. 25 -> twenty-five
  • num_to_en: Returns the English word for each digit of a number, read left to right
  • float_to_en: Returns a floating-point number as English text
  • int_to_text: Replaces all whole numbers inside a string with English text
  • float_to_text: Replaces all positive rational numbers inside a string with English text
  • decontract_strings: Expands contractions, e.g. I'm -> I am
  • remove_emoji: Removes emoji
  • clean_text: Deep-cleans text by stripping special symbols
  • lowercase: Converts text to lowercase
  • autocorrect: Corrects spelling mistakes
  • lemmatize: Lemmatizes the input text
  • remove_sw: Removes stop words
  • clean: Cleans raw text and returns the cleaned text as a string
  • clean_l: Cleans raw text and returns a list of clean words

The processtext.clean() and processtext.clean_l() functions can apply all, or a selected combination, of the following cleaning operations:
  • Remove special symbols/characters
  • Remove digits from the text
  • Remove punctuation from the text
  • Remove extra white spaces
  • Remove or replace parts of the text using a custom regex
  • Convert the entire text to lowercase
  • Lemmatize the words
  • Remove stop words, with a choice of stop-word language
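
To make the idea of switchable cleaning steps concrete, here is a minimal standard-library sketch. It is not processtext's implementation: the names sketch_clean and sketch_clean_l are hypothetical, the keyword names merely mirror the options listed above, and lemmatization and stop-word removal are left out because they need NLTK.

```python
import re
import string


def sketch_clean(text, extra_spaces=True, lowercase=True, numbers=True, punct=True):
    """Hypothetical stand-in for a clean()-style pipeline; not processtext's code."""
    if lowercase:
        text = text.lower()
    if numbers:
        text = re.sub(r"\d+", "", text)  # drop every run of digits
    if punct:
        # string.punctuation covers ASCII punctuation and symbols only
        text = text.translate(str.maketrans("", "", string.punctuation))
    if extra_spaces:
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs
    return text


def sketch_clean_l(text, **options):
    """List-returning variant: split the cleaned string on whitespace."""
    return sketch_clean(text, **options).split()


print(sketch_clean("  Hello,   World 42!  "))        # hello world
print(sketch_clean_l("Keep  THIS, drop 7 that."))    # ['keep', 'this', 'drop', 'that']
```

Each step is an independent flag, so any combination can be turned on or off without affecting the others.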

Usage

  • Import the library:
import processtext as pt
  • Choose a method:

To return the cleaned text as a string:

pt.clean("your_raw_text_here")

To return a list of words from the text:

pt.clean_l("your_raw_text_here")

To choose a specific set of cleaning operations:

pt.clean_l("your_raw_text_here",
    clean_all=False,  # Execute all cleaning operations
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String that replaces the text matched by reg
    stp_lang='english'  # Language for stop words
)

Examples

import processtext as pt
pt.degroup_num('111,222,333')

returns,

'111222333'
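
For readers curious how this kind of degrouping can be done, the following is a small sketch using Python's re module. The helper name sketch_degroup_num is hypothetical and the regex is an assumption about one reasonable approach, not the package's actual code.

```python
import re


def sketch_degroup_num(text):
    """Hypothetical helper: delete commas that sit directly between two digits."""
    # Lookbehind/lookahead keep the digits and remove only the comma itself.
    return re.sub(r"(?<=\d),(?=\d)", "", text)


print(sketch_degroup_num("111,222,333"))                # 111222333
print(sketch_degroup_num("Paid 1,250 on May 3, 2023"))  # Paid 1250 on May 3, 2023
```

Commas followed by a space, as in ordinary prose, are left alone because the lookahead requires a digit.
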
import processtext as pt
pt.remove_hyphen('2022-2023')

returns,

'2022 2023'
import processtext as pt
print(pt.int_to_en(1998))
print(pt.int_to_en('9123456789'))

returns,

one thousand nine hundred and ninety-eight

nine billion one hundred and twenty-three million four hundred and fifty-six thousand seven hundred and eighty-nine
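
The wording above can be reproduced with a short group-of-three conversion. The sketch below is an illustration only: sketch_int_to_en is a hypothetical name, it handles non-negative integers below 10^15, and it inserts "and" only inside a three-digit group, which may differ from what processtext does for inputs such as 1001.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ("", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety")
SCALES = ("", "thousand", "million", "billion", "trillion")


def _below_thousand(n):
    """Spell out 1..999, joining hundreds and the remainder with 'and'."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        if rest < 20:
            parts.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            parts.append(TENS[tens] + ("-" + ONES[ones] if ones else ""))
    return " and ".join(parts)


def sketch_int_to_en(number):
    """Hypothetical helper: accepts an int or a digit string."""
    n = int(number)
    if n < 0 or n >= 1000 ** len(SCALES):
        raise ValueError("sketch supports 0 <= n < 10**15 only")
    if n == 0:
        return ONES[0]
    groups = []
    scale = 0
    while n:
        n, chunk = divmod(n, 1000)  # peel off the lowest three digits
        if chunk:
            words = _below_thousand(chunk)
            groups.append(words + (" " + SCALES[scale] if SCALES[scale] else ""))
        scale += 1
    return " ".join(reversed(groups))


print(sketch_int_to_en(1998))
print(sketch_int_to_en("9123456789"))
```
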
import processtext as pt
print(pt.num_to_en(12345))
print(pt.num_to_en('09876'))

returns,

one two three four five

zero nine eight seven six
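
Digit-by-digit reading is simple enough to sketch in a few lines. The helper below is hypothetical (sketch_num_to_en is not part of the package) and assumes the input is a non-negative integer or a string of ASCII digits.

```python
DIGIT_WORDS = ("zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine")


def sketch_num_to_en(number):
    """Hypothetical helper: spell each digit, left to right."""
    digits = str(number)
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-negative integer or a digit string")
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


print(sketch_num_to_en(12345))    # one two three four five
print(sketch_num_to_en("09876"))  # zero nine eight seven six
```

Passing the value as a string is what preserves a leading zero; an integer literal cannot carry one.
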
import processtext as pt
print(pt.float_to_en(12.345))
print(pt.float_to_en('456.09876'))

returns,

twelve point three four five

four hundred and fifty-six point zero nine eight seven six
import processtext as pt
pt.int_to_text('First 100 twin primes have values between 3 & 5 and 3821 & 3823')

returns,

First one hundred twin primes have values between three & five and three thousand eight hundred and twenty-one & three thousand eight hundred and twenty-three
import processtext as pt
pt.float_to_text('The first 10 digits of pi are 3.141592653')

returns,

The first ten point zero digits of pi are three point one four one five nine two six five three
import processtext as pt
pt.decontract_strings("I can't believe he'll be graduating from college in just a few months.")

returns,

I can not believe he will be graduating from college in just a few months.
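
One common way to expand contractions is an ordered list of regex rules. The sketch below assumes a tiny, hand-picked rule set and a hypothetical function name; the rules processtext actually applies are not documented here and may be broader.

```python
import re

# Order matters: whole-word exceptions come before the generic suffix rules.
# Lowercase forms only; "'s" is skipped because it is ambiguous (is / has / possessive).
SKETCH_RULES = [
    (r"\bcan't\b", "can not"),
    (r"\bwon't\b", "will not"),
    (r"n't\b", " not"),
    (r"'ll\b", " will"),
    (r"'re\b", " are"),
    (r"'ve\b", " have"),
    (r"'m\b", " am"),
]


def sketch_decontract(text):
    """Hypothetical helper: apply each expansion rule in turn."""
    for pattern, replacement in SKETCH_RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(sketch_decontract("I can't believe he'll be graduating from college in just a few months."))
```
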
import processtext as pt
pt.remove_emoji("🌞🌊☀️ Just spent an amazing day at the beach with my friends! 🏖️👭👬 We built sandcastles 🏰, played beach volleyball 🏐, and even went for a swim 🏊‍♀️🏊‍♂️. The sun was shining ☀️ and the water was so refreshing 💦. Can't wait to do it again! 🤩👍")

returns,

 Just spent an amazing day at the beach with my friends!  We built sandcastles , played beach volleyball , and even went for a swim . The sun was shining  and the water was so refreshing . Can't wait to do it again! 
import processtext as pt
pt.clean_text('The password must contain at least one symbol such as !,^,*,+,=,%,$,~,?,/,<>,|@, #, or %.')

returns,

The password must contain at least one symbol such as                               or   
import processtext as pt
pt.lowercase('THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.')

returns,

the quick brown fox jumped over the lazy dog.
import processtext as pt
pt.autocorrect("I haven't receeved the package yet, but I think it should arrive somtime tomoro.")

returns,

I haven't received the package yet, but I think it should arrive sometime tomorrow.
import processtext as pt
pt.lemmatize('they were playing in the garden.')

returns,

they be play in the garden.
import processtext as pt
pt.remove_sw('I went to the store and bought some milk, bread, and eggs.')

returns,

went store bought milk, bread, eggs.
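
Stop-word removal boils down to filtering tokens against a word list. The sketch below uses a tiny hand-picked set purely for illustration; processtext relies on NLTK, whose English list is much longer, and the names here are hypothetical.

```python
# Illustrative subset only; a real stop-word list is far larger.
SKETCH_STOP_WORDS = frozenset({"i", "to", "the", "and", "some", "a", "an", "of"})


def sketch_remove_sw(text, stop_words=SKETCH_STOP_WORDS):
    """Hypothetical helper: drop whitespace-separated tokens found in stop_words."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(sketch_remove_sw("I went to the store and bought some milk, bread, and eggs."))
```

Because tokens are split on whitespace only, a stop word with punctuation attached (for example "and,") would not be matched by this sketch.
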
import processtext as pt
pt.clean("TH@@#e Q!@#UicK bR0owN f*#!@)(O000000X JUmp100ED 000oV###3eR Th77777#$$e..........                 L@a/\|z+Y d==OG.", extra_spaces=True, lowercase=True, numbers=True, punct=True)

returns,

'the quick brown fox jumped over the lazy dog'

import processtext as pt
pt.clean_l('TH@@#e Q!@#UicK bR0owN f*#!@)(O000000X JUmp100ED 000oV###3eR Th77777#$$e..........                 L@a/\|z+Y d==OG.', 
           extra_spaces=True, lowercase=True, numbers=True, punct=True)

returns,

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

from processtext import clean
text = "my email id: ujjwal@rkmvu.ac.in and your's: mili@rnlk.ed"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='********', clean_all=False)

returns,

"my email id: ******** and your's: ********"
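
The same substitution can be tried with Python's re module alone, which is handy for testing a pattern before passing it as reg. The function name sketch_mask is hypothetical; the pattern is the one from the example above.

```python
import re

# Pattern copied from the example; it matches lowercase addresses only.
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def sketch_mask(text, pattern=EMAIL_PATTERN, replacement="********"):
    """Hypothetical helper: replace every match of pattern with replacement."""
    return re.sub(pattern, replacement, text)


text = "my email id: ujjwal@rkmvu.ac.in and your's: mili@rnlk.ed"
print(sketch_mask(text))  # my email id: ******** and your's: ********
```

Addresses containing uppercase letters would need a wider character class or a case-insensitive flag.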

License

MIT
