
TakeSentenceTokenizer is a tool for tokenizing and preprocessing messages.

Project description

TakeSentenceTokenizer

TakeSentenceTokenizer is a tool for preprocessing and tokenizing sentences. The package is used to:

  • convert the first word of the sentence to lowercase
  • convert from uppercase to lowercase
  • convert a word to lowercase after punctuation
  • replace words with placeholders: laugh, date, time, ddd, measures (10kg, 20m, 5gb, etc.), code, phone number, cnpj, cpf, email, money, url, number (ordinal and cardinal)
  • replace abbreviations
  • replace common typos
  • split punctuation
  • remove emoji
  • remove characters that are not letters or punctuation
  • add missing accentuation
  • tokenize the sentence
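A minimal sketch of the placeholder behavior, assuming the same API as the examples under Usage below (the exact placeholder strings, beyond the URL one shown in Example 1, are assumptions):

from SentenceTokenizer import SentenceTokenizer

tokenizer = SentenceTokenizer()
# Laughs ('kkkk') and measures ('10kg') should come back as placeholder
# tokens; the exact placeholder strings are an assumption here.
print(tokenizer.process_message('kkkk comprei 10kg ontem'))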

Installation

Use the package manager pip to install TakeSentenceTokenizer:

pip install TakeSentenceTokenizer
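This page documents release 1.0.2, so the version can be pinned to match the files listed below:

pip install TakeSentenceTokenizer==1.0.2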

Usage

Example 1: full processing without keeping a registry of removed punctuation

Code:

from SentenceTokenizer import SentenceTokenizer

# Sentence with abbreviations (P/, eh, c/, vc) and a URL.
sentence = 'P/ saber disso eh c/ vc ou consigo ver pelo site www.dúvidas.com.br/minha-dúvida ??'
tokenizer = SentenceTokenizer()
processed_sentence = tokenizer.process_message(sentence)
print(processed_sentence)

Output:

'para saber disso é com você ou consigo ver pelo site URL ? ?'

Example 2: full processing keeping a registry of removed punctuation

Code:

from SentenceTokenizer import SentenceTokenizer

sentence = 'como assim $@???'
# keep_registry_punctuation=True records every punctuation character
# removed during processing in tokenizer.removal_registry_lst.
tokenizer = SentenceTokenizer(keep_registry_punctuation=True)
processed_sentence = tokenizer.process_message(sentence)
print(processed_sentence)
print(tokenizer.removal_registry_lst)

Output:

como assim ? ? ?
[['como assim $@ ? ? ?', {'punctuation': '$', 'position': 11}, {'punctuation': '@', 'position': 12}, {'punctuation': ' ', 'position': 13}]]
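Based on the registry structure shown above (the tokenized sentence before removal, followed by one dict per removed character), a minimal sketch of how the registry might be inspected:

from SentenceTokenizer import SentenceTokenizer

tokenizer = SentenceTokenizer(keep_registry_punctuation=True)
tokenizer.process_message('como assim $@???')

# Each registry entry: the tokenized sentence before removal, then one
# dict per removed character (structure inferred from the output above).
for entry in tokenizer.removal_registry_lst:
    sentence_before, *removals = entry
    print('before removal:', sentence_before)
    for removal in removals:
        print(f"removed {removal['punctuation']!r} at position {removal['position']}")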

Author

Take Data&Analytics Research

License

MIT



Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide for guidance on installing packages.

Source Distribution

TakeSentenceTokenizer-1.0.2.tar.gz (7.5 kB)

Uploaded Source

Built Distribution

TakeSentenceTokenizer-1.0.2-py3-none-any.whl (408.2 kB)

Uploaded Python 3

File details

Details for the file TakeSentenceTokenizer-1.0.2.tar.gz.

File metadata

  • Download URL: TakeSentenceTokenizer-1.0.2.tar.gz
  • Upload date:
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.13

File hashes

Hashes for TakeSentenceTokenizer-1.0.2.tar.gz
  • SHA256: 1fd6be085c4ce022fa5ab9620f85a85ac3c10ba7b14a3302f44a08e7c0dbf947
  • MD5: 675a0a1de585e9b40e790578655d6159
  • BLAKE2b-256: c04aca551bbff747a4c414382a1a390eb4b429dbc9ad587009e2b25fdaa7da95

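To verify a downloaded archive against the SHA256 digest listed above, a minimal sketch using Python's standard hashlib (the file path assumes the archive is in the current directory):

import hashlib

path = "TakeSentenceTokenizer-1.0.2.tar.gz"
expected = "1fd6be085c4ce022fa5ab9620f85a85ac3c10ba7b14a3302f44a08e7c0dbf947"

# Hash the downloaded file and compare against the published digest.
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("match" if digest == expected else "MISMATCH")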

File details

Details for the file TakeSentenceTokenizer-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: TakeSentenceTokenizer-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 408.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.13

File hashes

Hashes for TakeSentenceTokenizer-1.0.2-py3-none-any.whl
  • SHA256: f9f533ef3a880c1ac01859f19bf0de44105e96a4615a8f8d67f5fb4a50465532
  • MD5: 5735c9974532de2b8467125f894cfcf9
  • BLAKE2b-256: ec0294d828621011be09aee4736151dd1e6a1f5e819ffe5a045de1618f481027

