High-performance Turkish tokenizer with Rust backend and Python wrapper

These details have not been verified by PyPI

Project links

Project description

Turkish Tokenizer

A high-performance Turkish tokenizer with Rust backend and Python wrapper. This package combines linguistic rules with BPE (Byte Pair Encoding) for optimal tokenization of Turkish text.

🚀 Features

High Performance: Rust backend for ultra-fast tokenization
Linguistic Rules: Root and suffix matching based on Turkish morphology
BPE Support: Byte Pair Encoding for unknown words
Special Tokens: Handles spaces, newlines, tabs, and case sensitivity
Decoding: Convert token IDs back to readable text
Easy Integration: Simple Python API and command-line interface

📦 Installation

From PyPI (Recommended)

pip install turkish-tokenizer

From Source

git clone https://github.com/malibayram/turkish-tokenizer.git
cd turkish-tokenizer
pip install -e .

Prerequisites

Python 3.8 or higher
Rust (for building the backend): Install from rustup.rs

🔧 Quick Start

Python API

from turkish_tokenizer import TurkishTokenizer

# Initialize the tokenizer
tokenizer = TurkishTokenizer()

# Tokenize text
text = "Merhaba dünya! Bu bir test cümlesidir."
tokens, token_ids = tokenizer.tokenize(text)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded:", decoded_text)

Command Line

# Tokenize text
turkish-tokenizer "Merhaba dünya!"

# Tokenize from file
turkish-tokenizer -f input.txt

# Save output to file
turkish-tokenizer "Merhaba dünya!" -o output.json

# Build Rust backend
turkish-tokenizer --build

# Check if Rust backend is available
turkish-tokenizer --check

📚 API Reference

TurkishTokenizer Class

`init()`

Initialize the tokenizer. No parameters required.

`tokenize(text: str) -> Tuple[List[str], List[int]]`

Tokenize the input text.

Parameters:

text (str): The text to tokenize

Returns:

tokens (List[str]): List of token strings
token_ids (List[int]): List of token IDs

`decode(token_ids: List[int]) -> str`

Decode a list of token IDs back to text.

Parameters:

token_ids (List[int]): List of token IDs to decode

Returns:

str: The decoded text

`tokenize_batch(texts: List[str]) -> List[Tuple[List[str], List[int]]]`

Tokenize a batch of texts.

Parameters:

texts (List[str]): List of texts to tokenize

Returns:

List of tuples, each containing (tokens, token_ids)

`get_vocab_size() -> int`

Get the vocabulary size.

Returns:

int: Total number of tokens in the vocabulary

`get_special_tokens() -> Dict[str, int]`

Get the special tokens and their IDs.

Returns:

Dict[str, int]: Dictionary mapping special token names to their IDs

Convenience Functions

`tokenize(text: str) -> Tuple[List[str], List[int]]`

Quick tokenization function for simple use cases.

from turkish_tokenizer import tokenize

tokens, ids = tokenize("Merhaba dünya!")

🛠️ Development

Setup Development Environment

git clone https://github.com/malibayram/turkish-tokenizer.git
cd turkish-tokenizer
pip install -e ".[dev]"

Running Tests

pytest

Code Formatting

black turkish_tokenizer/
isort turkish_tokenizer/

Type Checking

mypy turkish_tokenizer/

📦 Building and Publishing

Building the Package

# Build source distribution
python -m build --sdist

# Build wheel
python -m build --wheel

# Build both
python -m build

Publishing to PyPI

Register on PyPI (if you haven't already):
```
pip install twine
```

Upload to Test PyPI first:

twine upload --repository testpypi dist/*

Upload to PyPI:
```
twine upload dist/*
```

Publishing to Test PyPI

# Upload to Test PyPI
twine upload --repository testpypi dist/*

# Install from Test PyPI
pip install --index-url https://test.pypi.org/simple/ turkish-tokenizer

🔍 How It Works

The tokenizer uses a multi-stage approach:

Special Token Detection: Identifies spaces, newlines, tabs, and case changes
Root Matching: Matches words against a comprehensive Turkish root dictionary
Suffix Matching: Applies Turkish morphological rules for suffixes
BPE Tokenization: Uses Byte Pair Encoding for unknown words
Decoding: Applies reverse transformations to reconstruct text

Token Types

Roots: Base Turkish words (e.g., "kitab", "defter")
Suffixes: Turkish grammatical suffixes (e.g., "ler", "i", "nin")
BPE Tokens: Subword units for unknown words
Special Tokens: <space>, <newline>, <tab>, <uppercase>, <unknown>

📊 Performance

Method	Speed	Time for 1K words
Python (pure)	~100 words/sec	~10 seconds
Rust backend	~10K words/sec	~0.1 seconds

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Turkish linguistic resources and vocabulary
Rust programming language and ecosystem
Python packaging community

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: your.email@example.com

🔄 Changelog

0.1.0 (2024-01-XX)

Initial release
Python API with Rust backend
Command-line interface
Comprehensive Turkish tokenization
Decoding functionality

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.0.4

Feb 1, 2026

1.0.1

Feb 1, 2026

1.0.0

Feb 1, 2026

0.2.26

Sep 3, 2025

0.2.25

Aug 27, 2025

0.2.24

Aug 23, 2025

0.2.23

Aug 23, 2025

0.2.22

Aug 20, 2025

0.2.21

Aug 20, 2025

0.2.20

Aug 20, 2025

0.2.18

Aug 20, 2025

0.2.17

Aug 20, 2025

0.2.16

Aug 20, 2025

0.2.15

Aug 20, 2025

0.2.14

Aug 18, 2025

0.2.13

Aug 16, 2025

0.2.12

Aug 16, 2025

0.2.11

Aug 16, 2025

0.2.8

Aug 16, 2025

0.2.4

Aug 16, 2025

0.2.0

Aug 16, 2025

0.1.9

Aug 16, 2025

0.1.8

Aug 16, 2025

0.1.7

Aug 16, 2025

0.1.6

Aug 16, 2025

0.1.5

Aug 16, 2025

0.1.4

Jun 22, 2025

0.1.3

Jun 22, 2025

0.1.2

Jun 22, 2025

This version

0.1.1

Jun 22, 2025

0.1.0

Jun 22, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turkish_tokenizer-0.1.1.tar.gz (241.0 kB view details)

Uploaded Jun 22, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

turkish_tokenizer-0.1.1-py3-none-any.whl (239.1 kB view details)

Uploaded Jun 22, 2025 Python 3

File details

Details for the file turkish_tokenizer-0.1.1.tar.gz.

File metadata

Download URL: turkish_tokenizer-0.1.1.tar.gz
Upload date: Jun 22, 2025
Size: 241.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for turkish_tokenizer-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`69ba3778f8ccb91f294b0ed8648ffd14e0175fa54c62100bc02bdf3fa981f398`
MD5	`c82c1d98226be25b52e1f9fe619c2f97`
BLAKE2b-256	`9c8a3717dd5227a286c0a312e7556a5763e878c53cc4c862b924c4c4bf421a63`

See more details on using hashes here.

File details

Details for the file turkish_tokenizer-0.1.1-py3-none-any.whl.

File metadata

Download URL: turkish_tokenizer-0.1.1-py3-none-any.whl
Upload date: Jun 22, 2025
Size: 239.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for turkish_tokenizer-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ac0f678123991b27d62af8adac96cf27f74ec297ede0e970ff71d9732caa7077`
MD5	`42e2f7591071c4ac51d91fada70cfa7f`
BLAKE2b-256	`45c9e0f13b825c83a1e2b6fb6b047666e967a940d217f2bc4f54a7f359aa782c`

See more details on using hashes here.

turkish-tokenizer 0.1.1

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

Turkish Tokenizer

🚀 Features

📦 Installation

From PyPI (Recommended)

From Source

Prerequisites

🔧 Quick Start

Python API

Command Line

📚 API Reference

TurkishTokenizer Class

__init__()

tokenize(text: str) -> Tuple[List[str], List[int]]

decode(token_ids: List[int]) -> str

tokenize_batch(texts: List[str]) -> List[Tuple[List[str], List[int]]]

get_vocab_size() -> int

get_special_tokens() -> Dict[str, int]

Convenience Functions

tokenize(text: str) -> Tuple[List[str], List[int]]

🛠️ Development

Setup Development Environment

Running Tests

Code Formatting

Type Checking

📦 Building and Publishing

Building the Package

Publishing to PyPI

Publishing to Test PyPI

🔍 How It Works

Token Types

📊 Performance

🤝 Contributing

📄 License

🙏 Acknowledgments

📞 Support

🔄 Changelog

0.1.0 (2024-01-XX)

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`init()`

`tokenize(text: str) -> Tuple[List[str], List[int]]`

`decode(token_ids: List[int]) -> str`

`tokenize_batch(texts: List[str]) -> List[Tuple[List[str], List[int]]]`

`get_vocab_size() -> int`

`get_special_tokens() -> Dict[str, int]`

`tokenize(text: str) -> Tuple[List[str], List[int]]`