A set of methods to help with general text pre-processing, such as stemming, tokenizing and more.
# preprocessingtext
A short but very useful tool to help with pre-processing text data.
## How to Install
>> pip install --user preprocessingtext
## Usage
#### Using stem_sentence()
>> from preprocessingtext import CleanSentence
>> cleaner = CleanSentence(idiom='portuguese')
>> cleaner.stem_sentence(sentence="String", remove_stop_words=True, remove_punctuation=True, normalize_text=True, replace_garbage=True)
To initialize the class, pass the idiom (language) you want to work with; the default value is "portuguese".
After instantiating a CleanSentence object, call the stem_sentence method. You can choose whether to apply
"remove_stop_words" (True or False) and "remove_punctuation" (True or False) to the string,
"replace_garbage" (True or False) to strip unwanted values from the data, and "normalize_text" (True or False) to normalize the text.
#### Using list_to_replace
You can customize what gets replaced (cleaned) in your data. Either append a single value with "cleaner.list_to_replace.append('what_you_need_to_add')",
or assign a whole new list of values: cleaner.list_to_replace = ['item1', 'item2', 'item3']
# Default value of list_to_replace
>> cleaner.list_to_replace
['https://', 'http://', 'R$', '$']
# Adding a new value
>> cleaner.list_to_replace.append('item1')
>> cleaner.list_to_replace
['https://', 'http://', 'R$', '$', 'item1']
# Replacing the whole list
>> cleaner.list_to_replace = ['item1', 'item2', 'item3']
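As an end-to-end sketch (the 'bit.ly/' token below is just a hypothetical addition, and the stemmed output is omitted), you can add your own garbage value and then clean a sentence with replace_garbage=True:
>> cleaner.list_to_replace.append('bit.ly/')
>> cleaner.stem_sentence(sentence='Clique em bit.ly/promo para ganhar dinheiro',
                         remove_stop_words=True,
                         remove_punctuation=True,
                         normalize_text=True,
                         replace_garbage=True
                         )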
#### Using tokenizer()
>> cleaner.tokenizer('Um exemplo de tokens.')
>> ['Um', 'exemplo', 'de', 'tokens']
## Examples
#### Using all parameters of stem_sentence()
>> string = "Eu sou uma sentença comum. Serei pré-processada com este modulo, veremos a serguir usando os métodos disponiveis"
>> cleaner.stem_sentence(sentence=string,
remove_stop_words=True,
remove_punctuation=True,
normalize_text=True,
replace_garbage=True
)
>> sentenc comum pre-process modul ver segu us metod disponi
#### Without remove_stop_words
>> print(cleaner.stem_sentence(sentence=string,
remove_stop_words=False,
remove_punctuation=True,
normalize_text=True,
replace_garbage=True
)
)
>> eu sou uma sentenc comum ser pre-process com est modul ver a segu us os metod disponi
#### Tokenizer
>> print(cleaner.tokenizer('Um exemplo de tokens.'))
>> ['Um', 'exemplo', 'de', 'tokens']
#### Cleaning garbage words
>> string_web = 'Acesse esses links para ganhar dinheiro: https://easymoney.com.net and http://falselink.com'
>> cleaner.stem_sentence(sentence=string_web,
remove_stop_words=False,
remove_punctuation=True,
replace_garbage=True
)
>> acess ess link par ganh dinh easymoney.com.net and falselink.com
#### English example
>> en_cleaner = CleanSentence(idiom='english')
>> string_web = 'Access these links to gain money: https://easymoney.com.net and http://falselink.com'
>> print(en_cleaner.stem_sentence(sentence=string_web,
remove_stop_words=True,
remove_punctuation=True,
replace_garbage=True
)
)
>> acc link gain money easymoney.com.net falselink.com
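#### Preprocessing a list of sentences
To preprocess a whole corpus, one straightforward approach (a minimal sketch using only the methods shown above; the variable names are our own) is to map stem_sentence and tokenizer over a list of documents:
>> from preprocessingtext import CleanSentence
>> en_cleaner = CleanSentence(idiom='english')
>> docs = ['First example sentence.', 'Second example, with punctuation!']
>> cleaned = [en_cleaner.stem_sentence(sentence=d,
                                       remove_stop_words=True,
                                       remove_punctuation=True,
                                       normalize_text=True,
                                       replace_garbage=True
                                       ) for d in docs]
>> tokens = [en_cleaner.tokenizer(c) for c in cleaned]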
# Author
{
'name': Everton Tomalok,
'email': evertontomalok123@gmail.com
}