
# preprocessingtext

A short but very useful tool to help with pre-processing text data.



## How to Install


>> pip install --user preprocessingtext



## Usage

#### Using stem_sentence()

>> from preprocessingtext import CleanSentence

>> cleaner = CleanSentence(idiom='portuguese')

>> cleaner.stem_sentence(sentence="String", remove_stop_words=True, remove_punctuation=True, normalize_text=True, replace_garbage=True)

To initialize the class, you need to pass the idiom (language) you want to work with. The default value is "portuguese".

After instantiating a CleanSentence object, you can call the stem_sentence method. You can choose whether to remove stop words from the string ("remove_stop_words", True or False), remove punctuation from the string ("remove_punctuation", True or False), remove garbage values from the data ("replace_garbage", True or False), and normalize the text ("normalize_text", True or False).
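The "normalize_text" step strips accents, which is why "pré-processada" shows up accent-free in the stemmed examples below. As a rough illustration (a plain-Python sketch, not the module's actual implementation), accent removal can be done with the standard library:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters (NFKD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Eu sou uma sentença pré-processada"))
# → Eu sou uma sentenca pre-processada
```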

#### Usage of list_to_replace
You can customize what you need to replace (clean) in your data. You can use "cleaner.list_to_replace.append('what_you_need_to_add')" to add an item,
or you can assign a new list of values: cleaner.list_to_replace = ['item1', 'item2', 'item3']

# Default value of list_to_replace
>> cleaner.list_to_replace
>> ['https://', 'http://', '$']

# Adding new values
>> cleaner.list_to_replace.append('item1')
>> ['https://', 'http://', '$', 'item1']

# Replacing the whole list
>> cleaner.list_to_replace = ['item1', 'item2', 'item3']
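Conceptually, the "replace_garbage" step can be pictured as deleting every entry of list_to_replace from the sentence. A hypothetical plain-Python sketch of that idea (for illustration only; the real logic lives inside CleanSentence):

```python
def replace_garbage(sentence: str, list_to_replace: list) -> str:
    # Remove every garbage marker from the sentence, then squeeze spaces.
    for garbage in list_to_replace:
        sentence = sentence.replace(garbage, "")
    return " ".join(sentence.split())

print(replace_garbage("Ganhe dinheiro: https://easymoney.com $100",
                      ["https://", "http://", "$"]))
# → Ganhe dinheiro: easymoney.com 100
```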


#### Using tokenizer()

>> cleaner.tokenizer('Um exemplo de tokens.')

>> ['Um', 'exemplo', 'de', 'tokens']
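As the output shows, the tokenizer splits on word boundaries and drops punctuation such as the final period. A rough standard-library equivalent (an illustration under that assumption, not the package's actual code):

```python
import re

def tokenize(sentence: str) -> list:
    # \w+ keeps runs of word characters, discarding punctuation such as '.'
    return re.findall(r"\w+", sentence)

print(tokenize("Um exemplo de tokens."))
# → ['Um', 'exemplo', 'de', 'tokens']
```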

## Example

#### Using all parameters of stem_sentence()
>> string = "Eu sou uma sentença comum. Serei pré-processada com este modulo, veremos a serguir usando os métodos disponiveis"
>> cleaner.stem_sentence(sentence=string,
remove_stop_words=True,
remove_punctuation=True,
normalize_text=True,
replace_garbage=True
)
>> sentenc comum pre-process modul ver segu us metod disponi

#### Without remove_stop_words
>> print(cleaner.stem_sentence(sentence=string,
remove_stop_words=False,
remove_punctuation=True,
normalize_text=True,
replace_garbage=True
)
)
>> eu sou uma sentenc comum ser pre-process com est modul ver a segu us os metod disponi

#### Tokenizer
>> print(cleaner.tokenizer('Um exemplo de tokens.'))
>> ['Um', 'exemplo', 'de', 'tokens']

#### Cleaning garbage words
>> string_web = 'Acesse esses links para ganhar dinheiro: https://easymoney.com.net and http://falselink.com'
>> cleaner.stem_sentence(sentence=string_web,
remove_stop_words=False,
remove_punctuation=True,
replace_garbage=True
)
>> acess ess link par ganh dinh easymoney.com.net and falselink.com

#### English example
>> en_cleaner = CleanSentence(idiom='english')

>> string_web = 'Access these links to gain money: https://easymoney.com.net and http://falselink.com'
>> print(en_cleaner.stem_sentence(sentence=string_web,
remove_stop_words=True,
remove_punctuation=True,
replace_garbage=True
)
)
>> acc link gain money easymoney.com.net falselink.com


# Author
{
    'name': 'Everton Tomalok',
    'email': 'evertontomalok123@gmail.com'
}
