
A set of methods to help with general text pre-processing, such as stemming, tokenization, and more.


# preprocessingtext

A short but very useful tool to help with pre-processing text data.



## How to Install


>> pip install --user preprocessingtext



## Usage

#### Using stem_sentence()

>> from preprocessingtext import CleanSentence

>> cleaner = CleanSentence(idiom='portuguese')

>> cleaner.stem_sentence(sentence="String", remove_stop_words=True, remove_punctuation=True, normalize_text=True, replace_garbage=True)

To initialize the class, pass the idiom (language) you want to work with. The default value is "portuguese".

Then, instantiate a new object from CleanSentence and call the stem_sentence method. You can choose whether to use
"remove_stop_words" on the string (pass True or False) and "remove_punctuation" on the string (True or False),
"replace_garbage" (True or False) to remove unwanted values from the data, and "normalize_text" (True or False) to normalize the text.

#### Usage of list_to_replace
You can customize what gets replaced (cleaned) in your data. You can append a single value with "cleaner.list_to_replace.append('what_you_need_to_add')",
or you can assign a whole new list of values: cleaner.list_to_replace = ['item1', 'item2', 'item3']

# Setting a custom value for list_to_replace
>> cleaner.list_to_replace = ['https://', 'http://', '$']

# Adding new values
>> cleaner.list_to_replace.append('item1')
>> ['https://', 'http://', '$', 'item1']

# Replacing all values at once
>> cleaner.list_to_replace = ['item1', 'item2', 'item3']
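
For instance, a hypothetical call combining a customized list_to_replace with replace_garbage=True might look like this (the sentence and the appended 'R$' value are made up for illustration, and the exact output depends on the package's stemmer, so none is shown):

```python
# Hypothetical usage: strip URL prefixes and currency signs before stemming.
cleaner.list_to_replace = ['https://', 'http://', '$']
cleaner.list_to_replace.append('R$')  # e.g. also drop the Brazilian currency sign

result = cleaner.stem_sentence(
    sentence='Ganhe R$ 100 acessando https://easymoney.com.net',
    remove_stop_words=True,
    remove_punctuation=True,
    normalize_text=True,
    replace_garbage=True,
)
```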


#### Using tokenizer()

>> cleaner.tokenizer('Um exemplo de tokens.')

>> ['Um', 'exemplo', 'de', 'tokens']
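
If you are curious how an equivalent tokenizer could work, here is a minimal sketch (an assumption, not the package's code) that reproduces the output above by keeping runs of word characters and dropping punctuation-only tokens:

```python
# Illustrative sketch only -- NOT the package's implementation.
import re

def tokenizer_sketch(sentence):
    # \w+ matches runs of word characters, so punctuation like '.' is dropped
    return re.findall(r'\w+', sentence)

print(tokenizer_sketch('Um exemplo de tokens.'))
# ['Um', 'exemplo', 'de', 'tokens']
```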

## Examples

## Using all parameters of stem_sentence()
>> string = "Eu sou uma sentença comum. Serei pré-processada com este módulo, veremos a seguir usando os métodos disponíveis"
>> cleaner.stem_sentence(sentence=string,
                         remove_stop_words=True,
                         remove_punctuation=True,
                         normalize_text=True,
                         replace_garbage=True
                         )
>> sentenc comum pre-process modul ver segu us metod disponi

## Without remove_stop_words
>> print(cleaner.stem_sentence(sentence=string,
                               remove_stop_words=False,
                               remove_punctuation=True,
                               normalize_text=True,
                               replace_garbage=True
                               )
         )
>> eu sou uma sentenc comum ser pre-process com est modul ver a segu us os metod disponi

## Tokenizer
>> print(cleaner.tokenizer('Um exemplo de tokens.'))
>> ['Um', 'exemplo', 'de', 'tokens']

## Cleaning garbage words
>> string_web = 'Acesse esses links para ganhar dinheiro: https://easymoney.com.net and http://falselink.com'
>> cleaner.stem_sentence(sentence=string_web,
                         remove_stop_words=False,
                         remove_punctuation=True,
                         replace_garbage=True
                         )
>> acess ess link par ganh dinh easymoney.com.net and falselink.com

## English example
>> en_cleaner = CleanSentence(idiom='english')

>> string_web = 'Access these links to gain money: https://easymoney.com.net and http://falselink.com'
>> print(en_cleaner.stem_sentence(sentence=string_web,
                                  remove_stop_words=True,
                                  remove_punctuation=True,
                                  replace_garbage=True
                                  )
         )
>> acc link gain money easymoney.com.net falselink.com
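
Since the examples above process one sentence at a time, a simple way to clean a whole corpus is to apply stem_sentence() in a list comprehension. This is a hypothetical usage pattern built on the documented API, not a feature of the package itself, and the sample documents are made up:

```python
# Hypothetical batch usage of the documented stem_sentence() API.
documents = [
    'Acesse https://easymoney.com.net e ganhe dinheiro',
    'Um exemplo de tokens.',
]

processed = [
    cleaner.stem_sentence(
        sentence=doc,
        remove_stop_words=True,
        remove_punctuation=True,
        normalize_text=True,
        replace_garbage=True,
    )
    for doc in documents
]
```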


# Author
{
    'name': 'Everton Tomalok',
    'email': 'evertontomalok123@gmail.com'
}
