Sentence and word tokenizers for the Turkish language

Project description

TrTokenizer 🇹🇷

TrTokenizer is a complete solution for Turkish sentence and word tokenization that covers Turkish language conventions extensively. If you believe natural language models need robust, fast, and accurate tokenizers, you are in the right place. Sentence tokenization relies on the non-prefix keywords listed in the 'tr_non_suffixes' file. This file can be extended as needed; for developer convenience, lines starting with the # symbol are treated as comments. The regular expressions are pre-compiled to improve performance.
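To make the idea concrete, here is a minimal sketch of a keyword-file-driven sentence splitter. It is not TrTokenizer's actual implementation: the file contents, the `load_non_suffixes` helper, the `SimpleSentenceSplitter` class, and the assumption that each entry is an abbreviation after which a period does not end a sentence are all hypothetical, chosen only to illustrate '#' comment handling and one-time regex compilation.

```python
import re

# Hypothetical stand-in for the contents of a 'tr_non_suffixes'-style file.
NON_SUFFIXES_TEXT = """\
# abbreviations after which a period does not end a sentence (assumed meaning)
Dr
Prof
vb
"""


def load_non_suffixes(text):
    """Return the set of entries, skipping blank lines and '#' comment lines."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.add(line)
    return entries


class SimpleSentenceSplitter:
    """Illustrative splitter; the pattern is compiled once in __init__."""

    def __init__(self, non_suffixes):
        self.non_suffixes = non_suffixes
        # Candidate boundary: a run of '.', '!' or '?' followed by whitespace.
        self._boundary = re.compile(r"[.!?]+\s+")

    def split(self, paragraph):
        sentences = []
        start = 0
        for match in self._boundary.finditer(paragraph):
            words = paragraph[start:match.start()].split()
            last_word = words[-1] if words else ""
            # Do not split after a listed abbreviation such as "Dr".
            if last_word in self.non_suffixes:
                continue
            sentences.append(paragraph[start:match.end()].strip())
            start = match.end()
        tail = paragraph[start:].strip()
        if tail:
            sentences.append(tail)
        return sentences


splitter = SimpleSentenceSplitter(load_non_suffixes(NON_SUFFIXES_TEXT))
print(splitter.split("Dr. Ahmet geldi. Prof. Ayşe de geldi."))
```

Because the pattern is compiled in the constructor, creating the splitter once and reusing it across many paragraphs avoids repeated compilation, which is the same motivation the description gives for pre-compiling the library's regexes.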

Install

pip install trtokenizer

Usage

from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer

# Regexes are compiled only once, when the tokenizer object is created.
sentence_tokenizer_object = SentenceTokenizer()

paragraph = "Bugün hava çok güzel. Yarın yağmur yağacak."  # example input: any paragraph as a string
sentence_tokenizer_object.tokenize(paragraph)

# Regexes are compiled only once, when the tokenizer object is created.
word_tokenizer_object = WordTokenizer()

sentence = "Bugün hava çok güzel."  # example input: any sentence as a string
word_tokenizer_object.tokenize(sentence)

To-do

  • Usage examples (Done)
  • Cython C-API for performance (Done, build/tr_tokenizer.c)
  • Release platform-specific shared dynamic libraries (Done, build/tr_tokenizer.cpython-38-x86_64-linux-gnu.so, only for Debian Linux with the gcc compiler)
  • Limitations
  • Prepare a simple guide for contribution

Resources

Download files

Source Distribution

trtokenizer-0.0.3.tar.gz (6.7 kB view details)

Uploaded Source

Built Distribution

trtokenizer-0.0.3-py3-none-any.whl (7.7 kB view details)

Uploaded Python 3

File details

Details for the file trtokenizer-0.0.3.tar.gz.

File metadata

  • Download URL: trtokenizer-0.0.3.tar.gz
  • Upload date:
  • Size: 6.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.7.9

File hashes

Hashes for trtokenizer-0.0.3.tar.gz
Algorithm Hash digest
SHA256 8100eca1c1c4dcfdd2ec13aef6606c6a4f54d9e9318107bd5cf31d3642ed66fd
MD5 ef73b08f4c7aea1e570f5eb55638b096
BLAKE2b-256 f5bb38488f537662d0e25b7a77c529bebb87e8f43336730b391ad2c71ca00283

File details

Details for the file trtokenizer-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: trtokenizer-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 7.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.7.9

File hashes

Hashes for trtokenizer-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 8afab11883ad97f5f8b91d994c2c0c8e8044980e09f6a6ba1cfb6a4fa0528027
MD5 e7a447d8dba2227dacbecbd1d592199e
BLAKE2b-256 f1fae3df3c1523ff16a69b10a0e5bda08d9321bd459966bc4cff8ef624166136
