Skip to main content

High-performance Turkish tokenizer with Rust backend and Python wrapper

Project description

Turkish Tokenizer

A high-performance Turkish tokenizer with Rust backend and Python wrapper. This package combines linguistic rules with BPE (Byte Pair Encoding) for optimal tokenization of Turkish text.

🚀 Features

  • High Performance: Rust backend for ultra-fast tokenization
  • Linguistic Rules: Root and suffix matching based on Turkish morphology
  • BPE Support: Byte Pair Encoding for unknown words
  • Special Tokens: Handles spaces, newlines, tabs, and case sensitivity
  • Decoding: Convert token IDs back to readable text
  • Easy Integration: Simple Python API and command-line interface

📦 Installation

From PyPI (Recommended)

pip install turkish-tokenizer

From Source

git clone https://github.com/malibayram/turkish-tokenizer.git
cd turkish-tokenizer
pip install -e .

Prerequisites

  • Python 3.8 or higher
  • Rust (for building the backend): Install from rustup.rs

🔧 Quick Start

Python API

from turkish_tokenizer import TurkishTokenizer

# Initialize the tokenizer
tokenizer = TurkishTokenizer()

# Tokenize text
text = "Merhaba dünya! Bu bir test cümlesidir."
tokens, token_ids = tokenizer.tokenize(text)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print("Decoded:", decoded_text)

Command Line

# Tokenize text
turkish-tokenizer "Merhaba dünya!"

# Tokenize from file
turkish-tokenizer -f input.txt

# Save output to file
turkish-tokenizer "Merhaba dünya!" -o output.json

# Build Rust backend
turkish-tokenizer --build

# Check if Rust backend is available
turkish-tokenizer --check

📚 API Reference

TurkishTokenizer Class

__init__()

Initialize the tokenizer. No parameters required.

tokenize(text: str) -> Tuple[List[str], List[int]]

Tokenize the input text.

Parameters:

  • text (str): The text to tokenize

Returns:

  • tokens (List[str]): List of token strings
  • token_ids (List[int]): List of token IDs

decode(token_ids: List[int]) -> str

Decode a list of token IDs back to text.

Parameters:

  • token_ids (List[int]): List of token IDs to decode

Returns:

  • str: The decoded text

tokenize_batch(texts: List[str]) -> List[Tuple[List[str], List[int]]]

Tokenize a batch of texts.

Parameters:

  • texts (List[str]): List of texts to tokenize

Returns:

  • List of tuples, each containing (tokens, token_ids)

get_vocab_size() -> int

Get the vocabulary size.

Returns:

  • int: Total number of tokens in the vocabulary

get_special_tokens() -> Dict[str, int]

Get the special tokens and their IDs.

Returns:

  • Dict[str, int]: Dictionary mapping special token names to their IDs

Convenience Functions

tokenize(text: str) -> Tuple[List[str], List[int]]

Quick tokenization function for simple use cases.

from turkish_tokenizer import tokenize

tokens, ids = tokenize("Merhaba dünya!")

🛠️ Development

Setup Development Environment

git clone https://github.com/malibayram/turkish-tokenizer.git
cd turkish-tokenizer
pip install -e ".[dev]"

Running Tests

pytest

Code Formatting

black turkish_tokenizer/
isort turkish_tokenizer/

Type Checking

mypy turkish_tokenizer/

📦 Building and Publishing

Building the Package

# Build source distribution
python -m build --sdist

# Build wheel
python -m build --wheel

# Build both
python -m build

Publishing to PyPI

  1. Register on PyPI (if you haven't already):

    pip install twine
    
  2. Upload to Test PyPI first:

    twine upload --repository testpypi dist/*
    
  3. Upload to PyPI:

    twine upload dist/*
    

Publishing to Test PyPI

# Upload to Test PyPI
twine upload --repository testpypi dist/*

# Install from Test PyPI
pip install --index-url https://test.pypi.org/simple/ turkish-tokenizer

🔍 How It Works

The tokenizer uses a multi-stage approach:

  1. Special Token Detection: Identifies spaces, newlines, tabs, and case changes
  2. Root Matching: Matches words against a comprehensive Turkish root dictionary
  3. Suffix Matching: Applies Turkish morphological rules for suffixes
  4. BPE Tokenization: Uses Byte Pair Encoding for unknown words
  5. Decoding: Applies reverse transformations to reconstruct text

Token Types

  • Roots: Base Turkish words (e.g., "kitab", "defter")
  • Suffixes: Turkish grammatical suffixes (e.g., "ler", "i", "nin")
  • BPE Tokens: Subword units for unknown words
  • Special Tokens: <space>, <newline>, <tab>, <uppercase>, <unknown>

📊 Performance

Method Speed Time for 1K words
Python (pure) ~100 words/sec ~10 seconds
Rust backend ~10K words/sec ~0.1 seconds

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Turkish linguistic resources and vocabulary
  • Rust programming language and ecosystem
  • Python packaging community

📞 Support

🔄 Changelog

0.1.2 (2024-01-XX)

  • Added GitHub repository and proper project metadata
  • Updated author and maintainer information
  • Fixed security issues (removed API tokens from git history)
  • Improved documentation and README
  • Added proper .gitignore configuration

0.1.1 (2024-01-XX)

  • Updated package metadata with correct GitHub URLs
  • Fixed author and maintainer information

0.1.0 (2024-01-XX)

  • Initial release
  • Python API with Rust backend
  • Command-line interface
  • Comprehensive Turkish tokenization
  • Decoding functionality

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turkish_tokenizer-0.1.2.tar.gz (241.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

turkish_tokenizer-0.1.2-py3-none-any.whl (239.2 kB view details)

Uploaded Python 3

File details

Details for the file turkish_tokenizer-0.1.2.tar.gz.

File metadata

  • Download URL: turkish_tokenizer-0.1.2.tar.gz
  • Upload date:
  • Size: 241.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for turkish_tokenizer-0.1.2.tar.gz
Algorithm Hash digest
SHA256 d09c3562a68eeb19983adbf00bf42d17cd4b655a83a29582dc3531d72ffcb089
MD5 d6b4778206cb2d15f478402d5b7af5a1
BLAKE2b-256 4baa5a38b922cbeff8a7d2ce2a21f7e9833eda995c4177156f90f7fec1613ea8

See more details on using hashes here.

File details

Details for the file turkish_tokenizer-0.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for turkish_tokenizer-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 705e014dc15d3c75a26864857e4f6097761edb9cc6c963c02e7d605e9141f819
MD5 688ab8b85aa7126f558997b85bd7f7cd
BLAKE2b-256 85e235b82fd5b2e3c8176dd96845465a69fb8007730be72dd99374538c718e28

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page