
Text Preprocessor


Take Text Pre-Process

This package is a tool for pre-processing a sentence.

The basic functionality available in this package is:

  • Convert to lower case
  • Remove non-ASCII characters
  • Add a space between punctuation and words
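To make the three basic steps concrete, here is a minimal sketch written with the standard library only. It is not the package's implementation: the function name `basic_pre_process_sketch` is hypothetical, and the real `pre_process` may handle accented letters, punctuation, or whitespace differently.

```python
import re
import string


def basic_pre_process_sketch(sentence: str) -> str:
    """Illustrative only: lower-case, drop non-ASCII, space out punctuation."""
    lowered = sentence.lower()
    # Drop every character outside the ASCII range (accented letters included).
    ascii_only = lowered.encode('ascii', errors='ignore').decode('ascii')
    # Put a space on both sides of every ASCII punctuation mark.
    spaced = re.sub(f'([{re.escape(string.punctuation)}])', r' \1 ', ascii_only)
    # Collapse the runs of whitespace created by the previous steps.
    return ' '.join(spaced.split())


print(basic_pre_process_sketch('Bom dia, meu ẞ caro'))  # bom dia , meu caro
```

Note that this sketch discards non-ASCII characters outright instead of transliterating them, which is an assumption about what "remove" means here.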

The customizable functionality available is:

  • Replace URLs with a token
  • Replace e-mail addresses with a token
  • Replace numbers with a token
  • Replace codes (mixed numbers and letters) with a token
  • Remove symbols
  • Replace abbreviations
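Token replacement of this kind is commonly done with regular expressions. The sketch below shows the idea for four of the options; the patterns, the token strings, and the function name `replace_with_tokens` are all assumptions made for illustration and are not taken from the package.

```python
import re

# Hypothetical patterns and token names, chosen for illustration only.
# Order matters: CODE runs before NUMBER so that '91234abc' is not
# partially consumed by the number pattern.
PATTERNS = {
    'URL': re.compile(r'https?://\S+'),
    'EMAIL': re.compile(r'\S+@\S+\.\S+'),
    'CODE': re.compile(r'\b(?=\w*\d)(?=\w*[a-zA-Z])\w+\b'),
    'NUMBER': re.compile(r'\d+(?:[-.,]\d+)*'),
}


def replace_with_tokens(sentence: str, options) -> str:
    """Replace each requested kind of match with its token name."""
    for name, pattern in PATTERNS.items():
        if name in options:
            sentence = pattern.sub(name, sentence)
    return sentence


print(replace_with_tokens('Este é um número 99999-9999', ['NUMBER']))
print(replace_with_tokens('Este é um código 91234abc', ['CODE']))
```

Iterating over the pattern table rather than over the caller's list keeps the replacement order fixed regardless of the order in which options are requested.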

Installation

TakeTextPreProcess can be installed from PyPI:

pip install take-text-preprocess
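One way to confirm the installation succeeded is to ask Python for the installed distribution's version. This sketch uses the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version('take-text-preprocess'))
```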

Usage

Basic pre-process

To use the basic pre-process:

from take_text_preprocess.presentation import pre_process
sentence = 'Bom dia, meu ẞ caro'
pre_process(sentence)

Customize pre-process

To use the customized pre-process, pass a list of all the optional pre-processing steps you want to apply.

The following examples show all the customized pre-processes available.

  • URL

from take_text_preprocess.presentation import pre_process
optional_tokenization = ['URL']
sentence = 'Bom dia, meu https://www.take.net  caro'
pre_process(sentence, optional_tokenization)

  • EMAIL

from take_text_preprocess.presentation import pre_process
optional_tokenization = ['EMAIL']
sentence = 'Bom dia, meu teste@gmail.com  caro'
pre_process(sentence, optional_tokenization)

  • NUMBER

from take_text_preprocess.presentation import pre_process
optional_tokenization = ['NUMBER']
sentence = 'Este é um número 99999-9999'
pre_process(sentence, optional_tokenization)

  • CODE

from take_text_preprocess.presentation import pre_process
optional_tokenization = ['CODE']
sentence = 'Este é um código 91234abc'
pre_process(sentence, optional_tokenization)

  • SYMBOLS

from take_text_preprocess.presentation import pre_process
optional_tokenization = ['SYMBOL']
sentence = 'Este é um símbolo %'
pre_process(sentence, optional_tokenization)

  • ABBREVIATIONS

from take_text_preprocess.presentation import pre_process
optional_tokenization = ['ABBR']
sentence = 'Este é uma abreviação vc'
pre_process(sentence, optional_tokenization)
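Because the options are plain strings, a typo such as 'SYMBOLS' instead of 'SYMBOL' is easy to make. The sketch below builds the option list and rejects any name that does not appear in the examples above. The helper `build_options` is hypothetical and not part of the package, and how `pre_process` itself reacts to an unknown name is not documented here.

```python
# The six option names used in the examples above.
KNOWN_OPTIONS = ('URL', 'EMAIL', 'NUMBER', 'CODE', 'SYMBOL', 'ABBR')


def build_options(*names):
    """Return the option list, rejecting names not shown in the examples."""
    unknown = [name for name in names if name not in KNOWN_OPTIONS]
    if unknown:
        raise ValueError(f'Unknown pre-process option(s): {unknown}')
    return list(names)


optional_tokenization = build_options('URL', 'EMAIL')
# With the package installed, the list is passed exactly as in the examples:
# from take_text_preprocess.presentation import pre_process
# pre_process('Bom dia, meu teste@gmail.com  caro', optional_tokenization)
```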

Contribute

If this is the first time you are contributing to this project, first create the virtual environment using the following command:

conda env create -f env/environment.yml

Then activate the environment:

conda activate taketextpreprocess

To test your modifications, build the package and install the resulting wheel, replacing VERSION with the version you built:

pip install dist\take-text-preprocess-VERSION-py3-none-any.whl --force-reinstall

Then run the tests:

pytest
