Text Preprocessor
Take Text Pre-Process
This package is a tool for pre-processing a sentence.
The basic functionality available in this package is:
- Converting to lower case
- Removing non-ASCII characters
- Adding a space between punctuation and words
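To make the three basic steps concrete, here is a minimal sketch using only the Python standard library. It is not the package's implementation: the function name `basic_pre_process_sketch` is hypothetical, and the real `pre_process` may behave differently (for example, it might transliterate accented letters rather than drop them).

```python
import re
import string


def basic_pre_process_sketch(sentence: str) -> str:
    """Illustrative only: lower-case, drop non-ASCII, space out punctuation."""
    # Step 1: convert to lower case.
    lowered = sentence.lower()
    # Step 2: drop every character that cannot be encoded as ASCII.
    ascii_only = lowered.encode("ascii", errors="ignore").decode("ascii")
    # Step 3: surround each ASCII punctuation mark with spaces.
    punctuation_class = f"([{re.escape(string.punctuation)}])"
    spaced = re.sub(punctuation_class, r" \1 ", ascii_only)
    # Collapse the runs of whitespace left behind by steps 2 and 3.
    return " ".join(spaced.split())


print(basic_pre_process_sketch("Bom dia, meu ẞ caro"))  # bom dia , meu caro
```

The sketch simply discards non-ASCII characters, which is the simplest reading of the list above.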
The customizable functionality available is:
- Replacing URLs with a token
- Replacing emails with a token
- Replacing numbers with a token
- Replacing codes (numbers and letters) with a token
- Removing symbols
- Replacing abbreviations
- Keeping emojis
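The idea of replacing a match with a token can be sketched with regular expressions. Everything below is an assumption made for illustration: the patterns, the placeholder strings, and the function name `replace_by_tokens` are hypothetical and are not taken from the package, whose actual patterns and tokens may differ.

```python
import re

# Hypothetical patterns and placeholder tokens, chosen for illustration only.
PATTERNS = {
    "EMAIL": (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "EMAIL"),
    "URL": (re.compile(r"https?://\S+|www\.\S+"), "URL"),
    "NUMBER": (re.compile(r"\b\d+(?:[-.,]\d+)*\b"), "NUMBER"),
}


def replace_by_tokens(sentence: str, options: list) -> str:
    """Replace each requested kind of match with its placeholder token."""
    # Fixed order: emails and URLs first, so digits inside them are
    # already gone when the number pattern runs.
    for name in ("EMAIL", "URL", "NUMBER"):
        if name in options:
            pattern, token = PATTERNS[name]
            sentence = pattern.sub(token, sentence)
    return sentence


print(replace_by_tokens("Bom dia, meu https://www.take.net caro", ["URL"]))
# Bom dia, meu URL caro
```

Options that are not requested leave the sentence untouched, which mirrors the opt-in behaviour described in the list.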
Installation
TakeTextPreProcess can be installed from PyPI:
pip install take-text-preprocess
Usage
Basic pre-process
To use the basic pre-process:
from take_text_preprocess.presentation import pre_process
sentence = 'Bom dia, meu ẞ caro'
pre_process(sentence)
Customized pre-process
To use the customized pre-process, pass a list of all the pre-processing options you want to apply as the second argument.
The following examples show all the customized pre-processes available.
- URL
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['URL']
sentence = 'Bom dia, meu https://www.take.net caro'
pre_process(sentence, optional_tokenization)
- EMAIL
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['EMAIL']
sentence = 'Bom dia, meu teste@gmail.com caro'
pre_process(sentence, optional_tokenization)
- NUMBER
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['NUMBER']
sentence = 'Este é um número 99999-9999'
pre_process(sentence, optional_tokenization)
- CODE
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['CODE']
sentence = 'Este é um código 91234abc'
pre_process(sentence, optional_tokenization)
- SYMBOLS
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['SYMBOL']
sentence = 'Este é um símbolo %'
pre_process(sentence, optional_tokenization)
- ABBREVIATIONS
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['ABBR']
sentence = 'Este é uma abreviação vc'
pre_process(sentence, optional_tokenization)
- EMOJI
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['EMOJI']
sentence = 'Este é um emoji 😀'
pre_process(sentence, optional_tokenization)
Contribute
If this is the first time you are contributing to this project, first create the virtual environment using the following command:
conda env create -f env/environment.yml
Then activate the environment:
conda activate taketextpreprocess
To test your modifications, build the package and install the resulting wheel, replacing VERSION with the version you built:
pip install dist\take-text-preprocess-VERSION-py3-none-any.whl --force-reinstall
Then run the tests:
pytest
Hashes for take-text-preprocess-0.0.6b2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 554f378971983f053d18a03c6dd197b425a8a95e5f5108113b53cd799e05bbc3
MD5 | 8fcb310a0e6a9e7ce9881cc7a6792a64
BLAKE2b-256 | 350a19633dce85c07b9f1c83a13f4a3b0f4e26e5f99de5bc305f5665259c4039

Hashes for take_text_preprocess-0.0.6b2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 68c90d1b420365d6bed5b34fe9476e80f14633ca233455b41c09e2a90da6f394
MD5 | 13c49109e31f871b2a0d5b0738de125e
BLAKE2b-256 | c71c6dee3c3c2be8b5ca0eecadd286542fab62dc47b94d0865ca9f8524674f35
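A downloaded file can be checked against the SHA256 digests listed above with the standard `hashlib` module. This is a minimal sketch: the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("take-text-preprocess-0.0.6b2.tar.gz", "554f378971983f053d18a03c6dd197b425a8a95e5f5108113b53cd799e05bbc3")` should return `True` for an intact copy of the source distribution saved in the current directory.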