
An open-source Python package to process text data

========== processtext ==========

processtext is an open-source Python package for cleaning raw text data.

Installation

processtext requires Python 3, NLTK, and Autocorrect to execute.

To install with pip, run:

pip install processtext

Features

The processtext package includes functions such as:

  • degroup_num: Removes commas (,) between numbers inside a string
  • remove_hyphen: Removes hyphens (-) between pieces of text, e.g. 2022-2023 -> 2022 2023
  • int_to_en: Returns a whole number as English text, e.g. 25 -> twenty-five
  • num_to_en: Returns the English words for the digits of a number, one by one from left to right
  • float_to_en: Returns a floating-point number as English text
  • int_to_text: Replaces all whole numbers inside a string with English text
  • float_to_text: Replaces all positive rational numbers inside a string with English text
  • decontract_strings: Expands contractions, e.g. I'm -> I am
  • remove_emoji: Removes emoji
  • clean_text: Deep-cleans text
  • lowercase: Converts text to lowercase
  • autocorrect: Corrects spelling mistakes
  • lemmatize: Lemmatizes the input text
  • remove_sw: Removes stop words
  • clean: Cleans raw text and returns the cleaned text
  • clean_l: Cleans raw text and returns a list of clean words

The processtext.clean() and processtext.clean_l() functions can apply all, or a selected combination, of the following cleaning operations:
  • Remove special symbols/characters
  • Remove digits from the text
  • Remove punctuation from the text
  • Remove extra white spaces
  • Remove or replace the part of text with custom regex
  • Convert the entire text into a uniform lowercase
  • Lemmatize the words
  • Remove stop words, and choose a language for stop words
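As a rough illustration of how switchable cleaning steps like these can be combined, the sketch below applies a few of them using only the Python standard library. The function simple_clean and its parameters are hypothetical and are not part of processtext; the library's actual implementation may differ.

```python
import re
import string


def simple_clean(text, lowercase=True, numbers=True, punct=True, extra_spaces=True):
    """Hypothetical helper for illustration; not the processtext implementation."""
    if lowercase:
        text = text.lower()
    if numbers:
        # Drop every run of digits.
        text = re.sub(r"\d+", "", text)
    if punct:
        # Drop ASCII punctuation and symbols.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if extra_spaces:
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
    return text


print(simple_clean("Hello,   World! 123"))
```

Each flag switches one step on or off, so a caller can, for example, keep the original casing by passing lowercase=False.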

Usage

  • Import the library:
import processtext as pt
  • Choose a method:

To return the text in a string format,

pt.clean("your_raw_text_here") 

To return a list of words from the text,

pt.clean_l("your_raw_text_here") 

To choose a specific set of cleaning operations,

pt.clean_l("your_raw_text_here",
    clean_all=False,  # Whether to execute all cleaning operations
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String to replace the regex used in reg
    stp_lang='english'  # Language for stop words
)

Examples

import processtext as pt
pt.degroup_num('111,222,333')

returns,

'111222333'
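One way to get this behaviour is a regular expression with lookarounds, so that only commas sitting directly between two digits are removed. This is a minimal sketch; degroup_sketch is a hypothetical name, and processtext may implement degroup_num differently.

```python
import re


def degroup_sketch(text):
    # Hypothetical stand-in for degroup_num: delete a comma only when it
    # has a digit immediately before and immediately after it.
    return re.sub(r"(?<=\d),(?=\d)", "", text)


print(degroup_sketch("Population: 1,234,567, give or take"))
```

Because the lookarounds do not consume the digits, consecutive groups such as 1,234,567 are handled in a single pass, while a comma followed by a space is left alone.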
import processtext as pt
pt.remove_hyphen('2022-2023')

returns,

'2022 2023'

import processtext as pt
print(pt.int_to_en(1998))
print(pt.int_to_en('9123456789'))

returns,

one thousand nine hundred and ninety-eight

nine billion one hundred and twenty-three million four hundred and fifty-six thousand seven hundred and eighty-nine
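The conversion above can be sketched by splitting the number into groups of three digits and naming each group. The code below is an illustrative sketch only: int_to_words and its helpers are hypothetical names, it covers 0 up to 999,999,999,999, and it is not taken from the processtext source.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = ["", "thousand", "million", "billion"]


def below_thousand(n):
    """Spell out 1..999, inserting 'and' after the hundreds."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        if hundreds:
            parts.append("and")
        if rest < 20:
            parts.append(ONES[rest])
        else:
            tens, ones = divmod(rest, 10)
            parts.append(TENS[tens] + ("-" + ONES[ones] if ones else ""))
    return " ".join(parts)


def int_to_words(number):
    n = int(number)  # accepts ints and digit strings
    if not 0 <= n < 10 ** 12:
        raise ValueError("this sketch only covers 0 up to 999,999,999,999")
    if n == 0:
        return "zero"
    groups = []
    scale = 0
    while n:
        # Peel off the lowest three digits and name them with their scale.
        n, chunk = divmod(n, 1000)
        if chunk:
            words = below_thousand(chunk)
            if SCALES[scale]:
                words += " " + SCALES[scale]
            groups.append(words)
        scale += 1
    return " ".join(reversed(groups))


print(int_to_words(1998))
print(int_to_words("9123456789"))
```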
import processtext as pt
print(pt.num_to_en(12345))
print(pt.num_to_en('09876'))

returns,

one two three four five

zero nine eight seven six
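Reading a number digit by digit amounts to a table lookup per character. The following sketch assumes the input contains only the digits 0 to 9; digits_to_words is a hypothetical name and not the library's own code.

```python
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def digits_to_words(number):
    # str() lets the function accept both ints and digit strings,
    # and keeps leading zeros when a string is passed.
    return " ".join(DIGIT_WORDS[int(ch)] for ch in str(number))


print(digits_to_words(12345))
print(digits_to_words("09876"))
```

Passing the value as a string matters when leading zeros must be preserved, since an int literal cannot carry them.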
import processtext as pt
print(pt.float_to_en(12.345))
print(pt.float_to_en('456.09876'))

returns,

twelve point three four five

four hundred and fifty-six point zero nine eight seven six

import processtext as pt
pt.int_to_text('First 100 twin primes have values between 3 & 5 and 3821 & 3823')

returns,

First one hundred twin primes have values between three & five and three thousand eight hundred and twenty-one & three thousand eight hundred and twenty-three

import processtext as pt
pt.float_to_text('The first 10 digits of pi are 3.141592653')

returns,

The first ten point zero digits of pi are three point one four one five nine two six five three

import processtext as pt
pt.decontract_strings("I can't believe he'll be graduating from college in just a few months.")

returns,

I can not believe he will be graduating from college in just a few months.
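Contraction expansion of this kind is commonly done with an ordered list of substitutions. The sketch below is an assumption about the general technique, not the processtext implementation; decontract_sketch and the CONTRACTIONS table are hypothetical, handle only straight apostrophes, and leave out ambiguous endings such as 's and 'd.

```python
import re

# Order matters: irregular forms must be handled before the generic "n't".
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]


def decontract_sketch(text):
    # Apply each substitution in turn over the whole string.
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    return text


print(decontract_sketch("I can't believe he'll be graduating."))
```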
import processtext as pt
pt.remove_emoji("🌞🌊☀️ Just spent an amazing day at the beach with my friends! 🏖️👭👬 We built sandcastles 🏰, played beach volleyball 🏐, and even went for a swim 🏊‍♀️🏊‍♂️. The sun was shining ☀️ and the water was so refreshing 💦. Can't wait to do it again! 🤩👍")

returns,

 Just spent an amazing day at the beach with my friends!  We built sandcastles , played beach volleyball , and even went for a swim . The sun was shining  and the water was so refreshing . Can't wait to do it again! 
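A common way to strip emoji is a character class over the relevant Unicode blocks. The sketch below is not exhaustive (flag sequences, for instance, are not covered) and is only an assumption about how such a filter can work; strip_emoji and EMOJI_PATTERN are hypothetical names, not processtext internals.

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero-width joiner
    "]+"
)


def strip_emoji(text):
    # Remove every run of characters that falls in the ranges above.
    return EMOJI_PATTERN.sub("", text)


print(strip_emoji("Beach day 🏖️ with friends 👭!"))
```

As in the output above, the surrounding spaces are left in place, so a follow-up whitespace cleanup may be needed.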
import processtext as pt
pt.clean_text('The password must contain at least one symbol such as !,^,*,+,=,%,$,~,?,/,<>,|@, #, or %.')

returns,

The password must contain at least one symbol such as                               or   
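The output above suggests that each symbol is replaced by a single space rather than deleted. Under that assumption, the effect can be approximated with one regular expression; blank_out_symbols is a hypothetical name, and the real clean_text may apply further rules.

```python
import re


def blank_out_symbols(text):
    # Replace each character that is not a letter, digit, underscore
    # or whitespace with exactly one space, so the length is unchanged.
    return re.sub(r"[^\w\s]", " ", text)


print(repr(blank_out_symbols("a+b=c.")))
```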
import processtext as pt
pt.lowercase('THE QUICK BROWN FOX JUMPED OVER THE LAZY DOG.')

returns,

the quick brown fox jumped over the lazy dog.

import processtext as pt
pt.autocorrect("I haven't receeved the package yet, but I think it should arrive somtime tomoro.")

returns,

I haven't received the package yet, but I think it should arrive sometime tomorrow.

import processtext as pt
pt.lemmatize('they were playing in the garden.')

returns,

they be play in the garden.

import processtext as pt
pt.remove_sw('I went to the store and bought some milk, bread, and eggs.')

returns,

went store bought milk, bread, eggs.
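Stop-word removal boils down to filtering tokens against a word list. The sketch below uses a tiny hand-written list purely for illustration; drop_stop_words and STOP_WORDS are hypothetical, and a real list (for example, the one shipped with NLTK) is much longer.

```python
import string

# Tiny illustrative stop-word set, not the list processtext uses.
STOP_WORDS = {"i", "to", "the", "and", "some", "a", "an", "of"}


def drop_stop_words(text):
    kept = [
        word for word in text.split()
        # Compare case-insensitively and ignore attached punctuation,
        # but keep the original token (with its punctuation) if it stays.
        if word.strip(string.punctuation).lower() not in STOP_WORDS
    ]
    return " ".join(kept)


print(drop_stop_words("I went to the store and bought some milk, bread, and eggs."))
```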
import processtext as pt
pt.clean("TH@@#e Q!@#UicK bR0owN f*#!@)(O000000X JUmp100ED 000oV###3eR Th77777#$$e..........                 L@a/\|z+Y d==OG.", extra_spaces=True, lowercase=True, numbers=True, punct=True)

returns,

'the quick brown fox jumped over the lazy dog'

import processtext as pt
pt.clean_l('TH@@#e Q!@#UicK bR0owN f*#!@)(O000000X JUmp100ED 000oV###3eR Th77777#$$e..........                 L@a/\|z+Y d==OG.', 
           extra_spaces=True, lowercase=True, numbers=True, punct=True)

returns,

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

from processtext import clean
text = "my email id: ujjwal@rkmvu.ac.in and your's: mili@rnlk.ed"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='********', clean_all=False)

returns,

"my email id: ******** and your's: ********"
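The same masking can be reproduced with the standard library's re.sub, which is presumably close to what the reg and reg_replace arguments do internally; that mapping is an assumption, not something the package documents here.

```python
import re

# Same pattern as in the example above: lowercase local part, "@", domain.
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

text = "my email id: ujjwal@rkmvu.ac.in and your's: mili@rnlk.ed"
masked = re.sub(EMAIL_PATTERN, "********", text)
print(masked)
```

The pattern only matches lowercase addresses; passing flags=re.IGNORECASE to re.sub would extend it to mixed-case input.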

License

MIT


Download files

Source Distribution

processtext-0.1.7.tar.gz (11.5 kB)

Uploaded Source

Built Distribution

processtext-0.1.7-py3-none-any.whl (9.4 kB)

Uploaded Python 3

File details

Details for the file processtext-0.1.7.tar.gz.

File metadata

  • Download URL: processtext-0.1.7.tar.gz
  • Size: 11.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.10

File hashes

Hashes for processtext-0.1.7.tar.gz
Algorithm    Hash digest
SHA256       9a36e4f6b2539358d36414f66adc8f783ac5effc18f7ff04988fccdfe801eef3
MD5          ae219148ce8bd2d5ad41cc3f04b462e1
BLAKE2b-256  ba9a8b19658f99485f60d2fd66818eae275b77d778d878648f890d252ed7b7f9

File details

Details for the file processtext-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: processtext-0.1.7-py3-none-any.whl
  • Size: 9.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.10

File hashes

Hashes for processtext-0.1.7-py3-none-any.whl
Algorithm    Hash digest
SHA256       2bef682f283ae20c78a22d9946066f8b1a186a82eda8a0cf8b643ceeab7640c5
MD5          a0f3a2805097a968d821f4cba97b2114
BLAKE2b-256  becff3b88983afaf306b5bd58c4d459a8e60c334daf68b7f635d8cb64b04322a
