Sentence and word tokenizers for the Turkish language

TrTokenizer 🇹🇷

TrTokenizer is a complete solution for Turkish sentence and word tokenization, with extensive coverage of the language's conventions. If you think natural-language models always need robust, fast, and accurate tokenizers, you are in the right place. The sentence tokenizer relies on the non-suffix keywords listed in the 'tr_non_suffixes' file. This file can be expanded as required; for developer convenience, lines starting with the # symbol are treated as comments. The regular expressions are pre-compiled to speed up tokenization.
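
Based on the comment convention described above, the 'tr_non_suffixes' file presumably lists one keyword per line. A purely illustrative excerpt (these entries are hypothetical, not necessarily the shipped list) could look like:

# Abbreviations that take a trailing period without ending the sentence
Dr
Prof
Av
vb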

Install

pip install trtokenizer

Usage

from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer

sentence_tokenizer_object = SentenceTokenizer()  # regexes are compiled only once, at object creation

sentence_tokenizer_object.tokenize(<given paragraph as string>)

word_tokenizer_object = WordTokenizer()  # regexes are compiled only once, at object creation

word_tokenizer_object.tokenize(<given sentence as string>)
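
Putting the two together, a minimal end-to-end sketch (the sample paragraph is illustrative, and this assumes tokenize returns a list of strings, as the usage above suggests):

from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer

# Both tokenizers compile their regexes once, at construction time.
sentence_tokenizer = SentenceTokenizer()
word_tokenizer = WordTokenizer()

paragraph = "Dün İstanbul'a gittim. Hava çok güzeldi."  # hypothetical sample text

# Split the paragraph into sentences, then each sentence into words.
for sentence in sentence_tokenizer.tokenize(paragraph):
    print(word_tokenizer.tokenize(sentence))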

To-do

  • Usage examples (done)
  • Cython C-API for performance (done; see build/tr_tokenizer.c)
  • Release platform-specific shared dynamic libraries (done; build/tr_tokenizer.cpython-38-x86_64-linux-gnu.so, currently only for Debian Linux with the gcc compiler)
  • Document limitations
  • Prepare a simple guide for contribution
