Sentence and word tokenizers for the Turkish language
Project description
TrTokenizer 🇹🇷
TrTokenizer is a complete solution for Turkish sentence and word tokenization that covers a wide range of Turkish language conventions. If you think that natural language models always need robust, fast, and accurate tokenizers, you are in the right place. Sentence tokenization uses the non-prefix keywords listed in the 'tr_non_suffixes' file. This file can be extended if required; for developer convenience, lines that start with the # symbol are treated as comments. The regular expressions are pre-compiled to improve performance.
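The keyword-file idea described above can be sketched roughly as follows. This is not TrTokenizer's actual implementation: the names `load_keywords` and `SimpleSentenceSplitter`, the boundary regex, and the assumption that the keyword file lists tokens after which a period does not end a sentence are all hypothetical and chosen only for illustration.

```python
import re
from typing import List, Set


def load_keywords(lines: List[str]) -> Set[str]:
    """Collect keywords, skipping blank lines and lines that start with '#'."""
    keywords = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or empty line
        keywords.add(line.lower())
    return keywords


class SimpleSentenceSplitter:
    """Hypothetical splitter for illustration; not the library's own class."""

    def __init__(self, keywords: Set[str]) -> None:
        self.keywords = keywords
        # Compiled once at object creation and reused by every split() call.
        self._boundary = re.compile(r"(\S+?)([.!?]+)\s+")

    def split(self, text: str) -> List[str]:
        sentences, start = [], 0
        for match in self._boundary.finditer(text):
            # Assumption: a single period after a listed keyword is not a boundary.
            if match.group(2) == "." and match.group(1).lower() in self.keywords:
                continue
            sentences.append(text[start:match.end(2)].strip())
            start = match.end()
        tail = text[start:].strip()
        if tail:
            sentences.append(tail)
        return sentences


keyword_lines = ["# tokens after which a period does not end a sentence", "Dr", "Prof", ""]
splitter = SimpleSentenceSplitter(load_keywords(keyword_lines))
print(splitter.split("Dr. Ahmet geldi. Prof. Ayşe de geldi."))
```

Compiling the pattern in `__init__` means the cost is paid once per object, so creating one splitter and reusing it for many paragraphs is the cheaper pattern.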
Install
pip install trtokenizer
Usage
from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer
paragraph = "Merhaba dünya. Bugün hava çok güzel."  # example input: a paragraph as a string
sentence = "Bugün hava çok güzel."  # example input: a sentence as a string
sentence_tokenizer_object = SentenceTokenizer()  # regexes are compiled once, at object creation
sentence_tokenizer_object.tokenize(paragraph)
word_tokenizer_object = WordTokenizer()  # regexes are compiled once, at object creation
word_tokenizer_object.tokenize(sentence)
To-do
- Usage examples (Done)
- Cython C-API for performance (Done, build/tr_tokenizer.c)
- Release platform-specific shared libraries (Done, build/tr_tokenizer.cpython-38-x86_64-linux-gnu.so; built only for Debian Linux with the gcc compiler)
- Limitations
- Prepare a simple guide for contribution
Resources
Download files
File details
Details for the file trtokenizer-0.0.3.tar.gz.
File metadata
- Download URL: trtokenizer-0.0.3.tar.gz
- Upload date:
- Size: 6.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8100eca1c1c4dcfdd2ec13aef6606c6a4f54d9e9318107bd5cf31d3642ed66fd |
| MD5 | ef73b08f4c7aea1e570f5eb55638b096 |
| BLAKE2b-256 | f5bb38488f537662d0e25b7a77c529bebb87e8f43336730b391ad2c71ca00283 |
File details
Details for the file trtokenizer-0.0.3-py3-none-any.whl.
File metadata
- Download URL: trtokenizer-0.0.3-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8afab11883ad97f5f8b91d994c2c0c8e8044980e09f6a6ba1cfb6a4fa0528027 |
| MD5 | e7a447d8dba2227dacbecbd1d592199e |
| BLAKE2b-256 | f1fae3df3c1523ff16a69b10a0e5bda08d9321bd459966bc4cff8ef624166136 |
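The SHA256 digests listed above can be used to check a downloaded file. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches_published_digest` are hypothetical, and the file is assumed to have been saved in the current directory under its published name.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digests copied from the file details above.
PUBLISHED_SHA256 = {
    "trtokenizer-0.0.3.tar.gz": "8100eca1c1c4dcfdd2ec13aef6606c6a4f54d9e9318107bd5cf31d3642ed66fd",
    "trtokenizer-0.0.3-py3-none-any.whl": "8afab11883ad97f5f8b91d994c2c0c8e8044980e09f6a6ba1cfb6a4fa0528027",
}

if __name__ == "__main__":
    for name, expected in PUBLISHED_SHA256.items():
        candidate = Path(name)  # assumed download location
        if candidate.exists():
            print(name, "OK" if matches_published_digest(candidate, expected) else "MISMATCH")
```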