TakeSentenceTokenizer is a tool for tokenizing and preprocessing messages.
Project description
TakeSentenceTokenizer
TakeSentenceTokenizer is a tool for preprocessing and tokenizing sentences. The package is used to:
- convert the first word of the sentence to lowercase
- convert from uppercase to lowercase
- convert words to lowercase after punctuation
- replace words with placeholders: laugh, date, time, ddd, measures (10kg, 20m, 5gb, etc.), code, phone number, cnpj, cpf, email, money, url, number (ordinal and cardinal)
- replace abbreviations
- replace common typos
- split punctuation
- remove emoji
- remove characters that are not letters or punctuation
- add missing accentuation
- tokenize the sentence
Installation
Use the package manager pip to install TakeSentenceTokenizer:
pip install TakeSentenceTokenizer
Usage
Example 1: full processing without keeping a registry of removed punctuation
Code:
from SentenceTokenizer import SentenceTokenizer
sentence = 'P/ saber disso eh c/ vc ou consigo ver pelo site www.dúvidas.com.br/minha-dúvida ??'
tokenizer = SentenceTokenizer()
processed_sentence = tokenizer.process_message(sentence)
print(processed_sentence)
Output:
para saber disso é com você ou consigo ver pelo site URL ? ?
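The same process_message call applies the remaining rules from the feature list (placeholders, abbreviations, typos). A hedged illustration follows; the EMAIL placeholder and the exact output are assumptions modeled on the URL placeholder above, not confirmed behavior:
Code:
from SentenceTokenizer import SentenceTokenizer
tokenizer = SentenceTokenizer()
# 'eh' -> 'é' and 'vc' -> 'você' follow Example 1; the EMAIL token is a guess
print(tokenizer.process_message('vc sabe se eh o email joao@empresa.com.br ?'))
Hypothetical output:
você sabe se é o email EMAIL ?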
Example 2: full processing keeping a registry of removed punctuation
Code:
from SentenceTokenizer import SentenceTokenizer
sentence = 'como assim $@???'
tokenizer = SentenceTokenizer(keep_registry_punctuation=True)
processed_sentence = tokenizer.process_message(sentence)
print(processed_sentence)
print(tokenizer.removal_registry_lst)
Output:
como assim ? ? ?
[['como assim $@ ? ? ?', {'punctuation': '$', 'position': 11}, {'punctuation': '@', 'position': 12}, {'punctuation': ' ', 'position': 13}]]
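Each registry entry starts with the sentence with its punctuation restored, followed by one dictionary per removed character recording that character and its position. As a minimal sketch (not part of the library's API), the entry shown above can be used to re-insert the removed punctuation into the processed sentence:
Code:
processed = 'como assim ? ? ?'
entry = ['como assim $@ ? ? ?',
         {'punctuation': '$', 'position': 11},
         {'punctuation': '@', 'position': 12},
         {'punctuation': ' ', 'position': 13}]
restored = processed
# insert each removed character back at its recorded position
for removal in entry[1:]:
    pos, char = removal['position'], removal['punctuation']
    restored = restored[:pos] + char + restored[pos:]
print(restored)
Output:
como assim $@ ? ? ?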
Author
Take Data&Analytics Research
License
Download files
Source Distribution: TakeSentenceTokenizer-1.0.2.tar.gz
Built Distribution: TakeSentenceTokenizer-1.0.2-py3-none-any.whl
File details
Details for the file TakeSentenceTokenizer-1.0.2.tar.gz.
File metadata
- Download URL: TakeSentenceTokenizer-1.0.2.tar.gz
- Upload date:
- Size: 7.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1fd6be085c4ce022fa5ab9620f85a85ac3c10ba7b14a3302f44a08e7c0dbf947
MD5 | 675a0a1de585e9b40e790578655d6159
BLAKE2b-256 | c04aca551bbff747a4c414382a1a390eb4b429dbc9ad587009e2b25fdaa7da95
File details
Details for the file TakeSentenceTokenizer-1.0.2-py3-none-any.whl.
File metadata
- Download URL: TakeSentenceTokenizer-1.0.2-py3-none-any.whl
- Upload date:
- Size: 408.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | f9f533ef3a880c1ac01859f19bf0de44105e96a4615a8f8d67f5fb4a50465532
MD5 | 5735c9974532de2b8467125f894cfcf9
BLAKE2b-256 | ec0294d828621011be09aee4736151dd1e6a1f5e819ffe5a045de1618f481027
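Either distribution can be checked against the digests above with Python's standard hashlib module. A minimal sketch, assuming the wheel has been downloaded to the current directory:
Code:
import hashlib
expected = 'f9f533ef3a880c1ac01859f19bf0de44105e96a4615a8f8d67f5fb4a50465532'
sha256 = hashlib.sha256()
with open('TakeSentenceTokenizer-1.0.2-py3-none-any.whl', 'rb') as f:
    # hash in chunks so large files need not fit in memory
    for chunk in iter(lambda: f.read(8192), b''):
        sha256.update(chunk)
assert sha256.hexdigest() == expected, 'SHA256 mismatch'
print('SHA256 verified')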