High-performance Turkish tokenizer with Rust backend and Python wrapper
Project description
Turkish Tokenizer
A high-performance Turkish tokenizer with Rust backend and Python wrapper. This package combines linguistic rules with BPE (Byte Pair Encoding) for optimal tokenization of Turkish text.
🚀 Features
- High Performance: Rust backend for ultra-fast tokenization
- Linguistic Rules: Root and suffix matching based on Turkish morphology
- BPE Support: Byte Pair Encoding for unknown words
- Special Tokens: Handles spaces, newlines, tabs, and case sensitivity
- Decoding: Convert token IDs back to readable text
- Easy Integration: Simple Python API and command-line interface
📦 Installation
From PyPI (Recommended)
pip install turkish-tokenizer
From Source
git clone https://github.com/malibayram/turkish-tokenizer.git
cd turkish-tokenizer
pip install -e .
Prerequisites
- Python 3.8 or higher
- Rust (for building the backend): Install from rustup.rs
🔧 Quick Start
Python API
from turkish_tokenizer import TurkishTokenizer
# Initialize the tokenizer
tokenizer = TurkishTokenizer()
# Tokenize text
text = "Merhaba dünya! Bu bir test cümlesidir."
tokens, token_ids = tokenizer.tokenize(text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)
# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded:", decoded_text)
Command Line
# Tokenize text
turkish-tokenizer "Merhaba dünya!"
# Tokenize from file
turkish-tokenizer -f input.txt
# Save output to file
turkish-tokenizer "Merhaba dünya!" -o output.json
# Build Rust backend
turkish-tokenizer --build
# Check if Rust backend is available
turkish-tokenizer --check
📚 API Reference
TurkishTokenizer Class
__init__()
Initialize the tokenizer. No parameters required.
tokenize(text: str) -> Tuple[List[str], List[int]]
Tokenize the input text.
Parameters:
text(str): The text to tokenize
Returns:
tokens(List[str]): List of token stringstoken_ids(List[int]): List of token IDs
decode(token_ids: List[int]) -> str
Decode a list of token IDs back to text.
Parameters:
token_ids(List[int]): List of token IDs to decode
Returns:
str: The decoded text
tokenize_batch(texts: List[str]) -> List[Tuple[List[str], List[int]]]
Tokenize a batch of texts.
Parameters:
texts(List[str]): List of texts to tokenize
Returns:
- List of tuples, each containing (tokens, token_ids)
get_vocab_size() -> int
Get the vocabulary size.
Returns:
int: Total number of tokens in the vocabulary
get_special_tokens() -> Dict[str, int]
Get the special tokens and their IDs.
Returns:
Dict[str, int]: Dictionary mapping special token names to their IDs
Convenience Functions
tokenize(text: str) -> Tuple[List[str], List[int]]
Quick tokenization function for simple use cases.
from turkish_tokenizer import tokenize
tokens, ids = tokenize("Merhaba dünya!")
🛠️ Development
Setup Development Environment
git clone https://github.com/malibayram/turkish-tokenizer.git
cd turkish-tokenizer
pip install -e ".[dev]"
Running Tests
pytest
Code Formatting
black turkish_tokenizer/
isort turkish_tokenizer/
Type Checking
mypy turkish_tokenizer/
📦 Building and Publishing
Building the Package
# Build source distribution
python -m build --sdist
# Build wheel
python -m build --wheel
# Build both
python -m build
Publishing to PyPI
-
Register on PyPI (if you haven't already):
pip install twine
-
Upload to Test PyPI first:
twine upload --repository testpypi dist/*
-
Upload to PyPI:
twine upload dist/*
Publishing to Test PyPI
# Upload to Test PyPI
twine upload --repository testpypi dist/*
# Install from Test PyPI
pip install --index-url https://test.pypi.org/simple/ turkish-tokenizer
🔍 How It Works
The tokenizer uses a multi-stage approach:
- Special Token Detection: Identifies spaces, newlines, tabs, and case changes
- Root Matching: Matches words against a comprehensive Turkish root dictionary
- Suffix Matching: Applies Turkish morphological rules for suffixes
- BPE Tokenization: Uses Byte Pair Encoding for unknown words
- Decoding: Applies reverse transformations to reconstruct text
Token Types
- Roots: Base Turkish words (e.g., "kitab", "defter")
- Suffixes: Turkish grammatical suffixes (e.g., "ler", "i", "nin")
- BPE Tokens: Subword units for unknown words
- Special Tokens:
<space>,<newline>,<tab>,<uppercase>,<unknown>
📊 Performance
| Method | Speed | Time for 1K words |
|---|---|---|
| Python (pure) | ~100 words/sec | ~10 seconds |
| Rust backend | ~10K words/sec | ~0.1 seconds |
🤝 Contributing
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Turkish linguistic resources and vocabulary
- Rust programming language and ecosystem
- Python packaging community
📞 Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: your.email@example.com
🔄 Changelog
0.1.0 (2024-01-XX)
- Initial release
- Python API with Rust backend
- Command-line interface
- Comprehensive Turkish tokenization
- Decoding functionality
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file turkish_tokenizer-0.1.1.tar.gz.
File metadata
- Download URL: turkish_tokenizer-0.1.1.tar.gz
- Upload date:
- Size: 241.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
69ba3778f8ccb91f294b0ed8648ffd14e0175fa54c62100bc02bdf3fa981f398
|
|
| MD5 |
c82c1d98226be25b52e1f9fe619c2f97
|
|
| BLAKE2b-256 |
9c8a3717dd5227a286c0a312e7556a5763e878c53cc4c862b924c4c4bf421a63
|
File details
Details for the file turkish_tokenizer-0.1.1-py3-none-any.whl.
File metadata
- Download URL: turkish_tokenizer-0.1.1-py3-none-any.whl
- Upload date:
- Size: 239.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ac0f678123991b27d62af8adac96cf27f74ec297ede0e970ff71d9732caa7077
|
|
| MD5 |
42e2f7591071c4ac51d91fada70cfa7f
|
|
| BLAKE2b-256 |
45c9e0f13b825c83a1e2b6fb6b047666e967a940d217f2bc4f54a7f359aa782c
|