
Text Preprocessor


Take Text Pre-Process

This package is a tool for pre-processing a sentence.

The basic functionality available in this package is:

  • Convert text to lower case
  • Remove non-ASCII characters
  • Add a space between punctuation and words
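
The three basic steps listed above can be illustrated with a minimal sketch. This is not the package's own implementation: the function name `basic_pre_process_sketch`, the punctuation set, and the choice to simply drop non-ASCII characters are all assumptions made for illustration, and the package may handle these cases differently.

```python
import re


def basic_pre_process_sketch(sentence: str) -> str:
    """Illustrative only; a hypothetical stand-in, not the package's code."""
    # 1. Convert to lower case.
    text = sentence.lower()
    # 2. Drop every character outside the ASCII range (assumed behaviour).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 3. Surround common punctuation marks with spaces so they separate from words.
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    # Collapse the extra whitespace introduced by the previous steps.
    return re.sub(r"\s+", " ", text).strip()


print(basic_pre_process_sketch("Hello, World!"))
```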

The customizable functionality available is:

  • Replace URLs with a token
  • Replace emails with a token
  • Replace numbers with a token
  • Replace codes (numbers and letters) with a token
  • Remove symbols
  • Replace abbreviations
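
To make the idea of token replacement concrete, here is a rough sketch using regular expressions. The patterns, the token strings, the order in which they are applied, and the function name `replace_with_tokens` are all hypothetical; the package's actual rules and tokens are not documented here and may differ.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "URL": re.compile(r"https?://\S+|www\.\S+"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # A "code" is assumed to be a word mixing at least one digit and one letter.
    "CODE": re.compile(r"\b(?=[a-z0-9]*\d)(?=[a-z0-9]*[a-z])[a-z0-9]+\b", re.IGNORECASE),
    # A "number" is assumed to be digit groups joined by '.', ',' or '-'.
    "NUMBER": re.compile(r"\b\d+(?:[.,-]\d+)*\b"),
}


def replace_with_tokens(sentence: str, options: list) -> str:
    """Replace each requested kind of span with its (assumed) token name."""
    text = sentence
    # Fixed order: the more specific patterns run before the more general ones.
    for name in ("URL", "EMAIL", "CODE", "NUMBER"):
        if name in options:
            text = PATTERNS[name].sub(name, text)
    return text


print(replace_with_tokens("meu https://www.take.net caro", ["URL"]))
```

Running the code pattern before the number pattern keeps a mixed string such as `91234abc` from being partially matched as a number.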

Installation

TakeTextPreProcess can be installed from PyPI:

pip install take-text-preprocess

Usage

Basic pre-process

To use the basic pre-process:

from take_text_preprocess.presentation import pre_process
sentence = 'Bom dia, meu ẞ caro'
pre_process(sentence)

Customized pre-process

To use the customized pre-process, pass a list of all the pre-processes you want to apply as the second argument.

The following examples show all the customized pre-processes available.

  • URL
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['URL']
sentence = 'Bom dia, meu https://www.take.net  caro'
pre_process(sentence, optional_tokenization)
  • EMAIL
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['EMAIL']
sentence = 'Bom dia, meu teste@gmail.com  caro'
pre_process(sentence, optional_tokenization)
  • NUMBER
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['NUMBER']
sentence = 'Este é um número 99999-9999'
pre_process(sentence, optional_tokenization)
  • CODE
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['CODE']
sentence = 'Este é um código 91234abc'
pre_process(sentence, optional_tokenization)
  • SYMBOLS
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['SYMBOL']
sentence = 'Este é um símbolo %'
pre_process(sentence, optional_tokenization)
  • ABBREVIATIONS
from take_text_preprocess.presentation import pre_process
optional_tokenization = ['ABBR']
sentence = 'Este é uma abreviação vc'
pre_process(sentence, optional_tokenization)
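
Because the second argument is described as a list of all the pre-processes to apply, several option names can presumably be combined in one list. The sketch below builds such a list and rejects misspelled names before calling the package. `build_options` and `KNOWN_OPTIONS` are hypothetical helpers, not part of the package; the option names are copied from the examples above.

```python
# Option names exactly as they appear in the examples above.
KNOWN_OPTIONS = ("URL", "EMAIL", "NUMBER", "CODE", "SYMBOL", "ABBR")


def build_options(*names: str) -> list:
    """Hypothetical helper: validate option names and remove duplicates."""
    unknown = [name for name in names if name not in KNOWN_OPTIONS]
    if unknown:
        raise ValueError(f"Unknown pre-process option(s): {unknown}")
    # dict.fromkeys keeps first-seen order while dropping repeats.
    return list(dict.fromkeys(names))


optional_tokenization = build_options("URL", "EMAIL", "NUMBER")
# The resulting list could then be passed as the second argument:
# pre_process(sentence, optional_tokenization)
```

A check like this catches slips such as writing 'SYMBOLS' where the example uses 'SYMBOL'.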

Contribute

If this is the first time you are contributing to this project, first create the virtual environment using the following command:

conda env create -f env/environment.yml

Then activate the environment:

conda activate taketextpreprocess

To test your modifications, build the package and install the resulting wheel, replacing VERSION with the version you built:

pip install dist\take-text-preprocess-VERSION-py3-none-any.whl --force-reinstall

Then run the tests:

pytest
