High-performance Turkish tokenizer with Rust backend and Python wrapper
Turkish Tokenizer
A high-performance Turkish tokenizer with Rust backend and Python wrapper, designed for efficient natural language processing of Turkish text.
Features
- High Performance: Rust backend for fast tokenization
- Turkish Language Support: Optimized for Turkish morphology and grammar
- Python Integration: Easy-to-use Python wrapper
- Comprehensive Coverage: Handles Turkish roots, suffixes, and BPE tokens
- Command Line Interface: CLI tool for batch processing
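To make the "roots, suffixes, and BPE tokens" idea concrete, here is a minimal toy sketch of greedy longest-match segmentation. It is not the package's actual algorithm: `ROOTS`, `SUFFIXES`, `longest_prefix`, and `segment_word` are hypothetical names, the vocabularies are tiny stand-ins for the bundled JSON files, and a single-character fallback stands in for BPE.

```python
from typing import List, Set

# Hypothetical toy vocabularies; the real package ships its own vocabulary files.
ROOTS: Set[str] = {"kitap", "ev", "göz"}
SUFFIXES: Set[str] = {"lar", "ler", "da", "de", "ım", "im"}


def longest_prefix(word: str, vocab: Set[str]) -> str:
    """Return the longest entry of vocab that is a prefix of word, or '' if none."""
    for end in range(len(word), 0, -1):
        if word[:end] in vocab:
            return word[:end]
    return ""


def segment_word(word: str) -> List[str]:
    """Split a word into a root plus suffixes, with a per-character fallback."""
    pieces: List[str] = []
    root = longest_prefix(word, ROOTS)
    if root:
        pieces.append(root)
    rest = word[len(root):]
    while rest:
        # Try a known suffix first; otherwise emit one character
        # (a stand-in for a subword/BPE fallback).
        piece = longest_prefix(rest, SUFFIXES) or rest[0]
        pieces.append(piece)
        rest = rest[len(piece):]
    return pieces


if __name__ == "__main__":
    print(segment_word("kitaplarım"))  # ['kitap', 'lar', 'ım']
```

A real tokenizer would then map each piece to an integer ID; this sketch stops at the string pieces to keep the matching logic visible.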
Installation
pip install turkish-tokenizer
Quick Start
from turkish_tokenizer import TurkishTokenizer
# Initialize the tokenizer
tokenizer = TurkishTokenizer()
# Tokenize text
text = "Merhaba dünya! Bu bir test cümlesidir."
tokens = tokenizer.tokenize(text)
print(tokens)
# Decode tokens back to text
decoded_text = tokenizer.decode(tokens)
print(decoded_text)
Command Line Usage
# Tokenize a text file
turkish-tokenizer tokenize input.txt output.txt
# Decode tokens back to text
turkish-tokenizer decode input_tokens.txt output_text.txt
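The same kind of batch processing can be sketched from Python. The file layout here is an assumption, since the CLI's format is not documented above: one sentence per input line, and one line of space-separated token IDs per output line. `tokenize_file` and `SupportsTokenize` are hypothetical helper names; the function accepts any object that exposes a `tokenize(text)` method.

```python
from pathlib import Path
from typing import List, Protocol


class SupportsTokenize(Protocol):
    """Anything with a tokenize(text) -> list of ints method."""

    def tokenize(self, text: str) -> List[int]: ...


def tokenize_file(tokenizer: SupportsTokenize, src: Path, dst: Path) -> int:
    """Tokenize src line by line and write space-separated IDs to dst.

    Returns the number of lines processed. The one-line-in, one-line-out
    layout is an assumption, not the CLI's documented format.
    """
    count = 0
    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            ids = tokenizer.tokenize(line.rstrip("\n"))
            fout.write(" ".join(str(i) for i in ids) + "\n")
            count += 1
    return count
```

With the package installed, the `TurkishTokenizer()` instance from Quick Start could be passed as `tokenizer`, for example `tokenize_file(tokenizer, Path("input.txt"), Path("output.txt"))`.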
API Reference
TurkishTokenizer
The main tokenizer class.
Methods
- tokenize(text: str) -> List[int]: Tokenize input text into token IDs
- decode(tokens: List[int]) -> str: Decode token IDs back to text
- encode(text: str) -> List[int]: Alias for the tokenize method
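A quick way to exercise these three methods together is a small consistency check. The sketch below is illustrative only: `check_api` is a hypothetical helper, and `CharTokenizer` is a stand-in that mimics the method signatures so the example runs without the package. Whether the real `decode` reproduces the input exactly is not stated here, so the check reports the result instead of asserting it.

```python
from typing import Dict, List


class CharTokenizer:
    """Hypothetical stand-in with the same three methods; not the real class."""

    def tokenize(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: List[int]) -> str:
        return "".join(chr(t) for t in tokens)

    def encode(self, text: str) -> List[int]:
        # Alias for tokenize, mirroring the documented API.
        return self.tokenize(text)


def check_api(tokenizer, text: str) -> Dict[str, object]:
    """Report whether encode matches tokenize and whether decode round-trips."""
    ids = tokenizer.tokenize(text)
    return {
        "alias_matches": tokenizer.encode(text) == ids,
        "round_trip": tokenizer.decode(ids) == text,
        "num_tokens": len(ids),
    }


if __name__ == "__main__":
    print(check_api(CharTokenizer(), "Merhaba dünya!"))
```

Swapping `CharTokenizer()` for a real `TurkishTokenizer()` instance would run the same check against the package.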
Development
Setup
- Clone the repository
- Install development dependencies:
pip install -e ".[dev]"
- Run tests:
pytest
Building
python -m build
License
MIT License - see LICENSE file for details.
Changelog
0.1.3 (2024-12-19)
- FIXED: JSON vocabulary files are now properly included in the package distribution
- FIXED: MANIFEST.in corrected to include JSON files from the right directory structure
- FIXED: Package data configuration updated to ensure JSON files are bundled
0.1.2 (2024-12-19)
- ADDED: Command line interface (CLI) for batch processing
- ADDED: Comprehensive test suite
- IMPROVED: Better error handling and validation
- IMPROVED: Enhanced documentation and examples
0.1.1 (2024-12-19)
- FIXED: Package metadata and dependencies
- IMPROVED: Better package structure and organization
0.1.0 (2024-12-19)
- INITIAL: First release with basic tokenization functionality
- FEATURES: Turkish root and suffix matching
- FEATURES: BPE tokenization support
- FEATURES: Python wrapper for Rust backend
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Author
Ali Bayram - malibayram20@gmail.com
File details
Details for the file turkish_tokenizer-0.1.3.tar.gz.
File metadata
- Download URL: turkish_tokenizer-0.1.3.tar.gz
- Upload date:
- Size: 252.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bc0ca8fd5f235ce52c06a74e87e24be775e39ec4fd18923dede296d422960636 |
| MD5 | 00edd229e221f10b6380c6f9cddd9615 |
| BLAKE2b-256 | aa6864f07e0d36411229c6ea155902a2d16b453c75bf92ef15fba96a17c7c11e |
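A downloaded archive can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: `sha256_of` and `matches_published` are hypothetical helper names, and the local file path is whatever location the file was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published(Path("turkish_tokenizer-0.1.3.tar.gz"), "bc0ca8fd5f235ce52c06a74e87e24be775e39ec4fd18923dede296d422960636")` should return `True` for an intact copy of the source distribution.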
File details
Details for the file turkish_tokenizer-0.1.3-py3-none-any.whl.
File metadata
- Download URL: turkish_tokenizer-0.1.3-py3-none-any.whl
- Upload date:
- Size: 269.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43dce42a9448454c9f0ae7f2ab372aa637fc68fe3b911522b7b52fbbd11adcb9 |
| MD5 | 1beec71474ae5d797208fa8163107964 |
| BLAKE2b-256 | 9d965a94d50a64eb05f409f5285d6bc16de3a1efaa261d7212c84d53c2d6df74 |