
TakeSentenceTokenizer is a tool for tokenizing and preprocessing messages.

Project description

TakeSentenceTokenizer

TakeSentenceTokenizer is a tool for preprocessing and tokenizing sentences. The package is used to:
- convert the first word of the sentence to lowercase
- convert from uppercase to lowercase
- convert a word to lowercase after punctuation
- replace words with placeholders: laugh, date, time, ddd, measures (10kg, 20m, 5gb, etc.), code, phone number, cnpj, cpf, email, money, url, number (ordinal and cardinal)
- replace abbreviations
- replace common typos
- split punctuation
- remove emoji
- remove characters that are not letters or punctuation
- add missing accentuation
- tokenize the sentence
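The placeholder replacement step can be illustrated with a minimal regex sketch. The patterns below are simplified assumptions for demonstration only, not the package's actual rules:

```python
import re

# Illustrative patterns (assumptions) for the kind of placeholder
# replacement the package performs; not the library's real rule set.
PATTERNS = [
    (re.compile(r'\bk?(?:ha)+h*\b|\bk{2,}\b', re.IGNORECASE), 'LAUGH'),
    (re.compile(r'https?://\S+|www\.\S+'), 'URL'),
    (re.compile(r'\b\d+(?:[.,]\d+)?\s?(?:kg|g|m|cm|km|gb|mb)\b', re.IGNORECASE), 'MEASURE'),
    (re.compile(r'\b\d+\b'), 'NUMBER'),
]

def replace_placeholders(text: str) -> str:
    # Apply each pattern in order, so more specific placeholders
    # (e.g. MEASURE) win before the generic NUMBER pattern.
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

For example, `replace_placeholders('pesa 10kg')` yields `'pesa MEASURE'`, while a bare number falls through to `NUMBER`.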

Installation

Use the package manager pip to install TakeSentenceTokenizer:

pip install TakeSentenceTokenizer

Usage

Example 1: full processing without keeping a registry of removed punctuation

Code:

from SentenceTokenizer import SentenceTokenizer
sentence = 'P/ saber disso eh c/ vc ou consigo ver pelo site www.dúvidas.com.br/minha-dúvida ??'
tokenizer = SentenceTokenizer()
processed_sentence = tokenizer.process_message(sentence)
print(processed_sentence)

Output:

para saber disso é com você ou consigo ver pelo site URL ? ?
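Note how the trailing `??` in the input becomes `? ?` in the output: that is the punctuation-splitting step. A minimal sketch of that step alone, illustrative rather than the library's actual implementation:

```python
import re

def split_punctuation(text: str) -> str:
    # Put spaces around each punctuation mark, then collapse whitespace,
    # so runs like "??" become separate "?" tokens.
    return ' '.join(re.sub(r'([!?.,;:])', r' \1 ', text).split())
```

For example, `split_punctuation('como assim??')` returns `'como assim ? ?'`.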

Example 2: full processing keeping a registry of removed punctuation

Code:

from SentenceTokenizer import SentenceTokenizer
sentence = 'como assim $@???'
tokenizer = SentenceTokenizer(keep_registry_punctuation=True)
processed_sentence = tokenizer.process_message(sentence)
print(processed_sentence)
print(tokenizer.removal_registry_lst)

Output:

como assim ? ? ?
[['como assim $@ ? ? ?', {'punctuation': '$', 'position': 11}, {'punctuation': '@', 'position': 12}, {'punctuation': ' ', 'position': 13}]]
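Given the registry format shown above — the original string followed by one record per removed character — the removed punctuation can be reinserted into the processed sentence. The sketch below assumes each `position` refers to the string as it is rebuilt left to right; that is inferred from the example output, not from a documented API:

```python
def restore_punctuation(processed: str, registry_entry: list) -> str:
    # registry_entry = [original_text, {'punctuation': ch, 'position': i}, ...]
    # Reinsert each removed character at its recorded position.
    chars = list(processed)
    for record in registry_entry[1:]:
        chars.insert(record['position'], record['punctuation'])
    return ''.join(chars)
```

Applied to the example above, `restore_punctuation('como assim ? ? ?', entry)` reproduces `'como assim $@ ? ? ?'`.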

Author

Take Data&Analytics Research

License

MIT
