Skip to main content

High-performance Turkish tokenizer with Rust backend and Python wrapper

Project description

Turkish Tokenizer

A high-performance Turkish tokenizer with Rust backend and Python wrapper, designed for efficient natural language processing of Turkish text.

Features

  • High Performance: Rust backend for fast tokenization
  • Turkish Language Support: Optimized for Turkish morphology and grammar
  • Python Integration: Easy-to-use Python wrapper
  • Comprehensive Coverage: Handles Turkish roots, suffixes, and BPE tokens
  • Command Line Interface: CLI tool for batch processing

Installation

pip install turkish-tokenizer

Quick Start

from turkish_tokenizer import TurkishTokenizer

# Initialize the tokenizer
tokenizer = TurkishTokenizer()

# Tokenize text
text = "Merhaba dünya! Bu bir test cümlesidir."
tokens = tokenizer.tokenize(text)
print(tokens)

# Decode tokens back to text
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

Command Line Usage

# Tokenize a text file
turkish-tokenizer tokenize input.txt output.txt

# Decode tokens back to text
turkish-tokenizer decode input_tokens.txt output_text.txt

API Reference

TurkishTokenizer

The main tokenizer class.

Methods

  • tokenize(text: str) -> List[int]: Tokenize input text into token IDs
  • decode(tokens: List[int]) -> str: Decode token IDs back to text
  • encode(text: str) -> List[int]: Alias for tokenize method

Development

Setup

  1. Clone the repository
  2. Install development dependencies: pip install -e ".[dev]"
  3. Run tests: pytest

Building

python -m build

License

MIT License - see LICENSE file for details.

Changelog

0.1.3 (2024-12-19)

  • FIXED: JSON vocabulary files are now properly included in the package distribution
  • FIXED: MANIFEST.in corrected to include JSON files from the right directory structure
  • FIXED: Package data configuration updated to ensure JSON files are bundled

0.1.2 (2024-12-19)

  • ADDED: Command line interface (CLI) for batch processing
  • ADDED: Comprehensive test suite
  • IMPROVED: Better error handling and validation
  • IMPROVED: Enhanced documentation and examples

0.1.1 (2024-12-19)

  • FIXED: Package metadata and dependencies
  • IMPROVED: Better package structure and organization

0.1.0 (2024-12-19)

  • INITIAL: First release with basic tokenization functionality
  • FEATURES: Turkish root and suffix matching
  • FEATURES: BPE tokenization support
  • FEATURES: Python wrapper for Rust backend

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Author

Ali Bayram - malibayram20@gmail.com

Repository

https://github.com/malibayram/turkish-tokenizer

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turkish_tokenizer-0.1.4.tar.gz (252.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

turkish_tokenizer-0.1.4-py3-none-any.whl (269.0 kB view details)

Uploaded Python 3

File details

Details for the file turkish_tokenizer-0.1.4.tar.gz.

File metadata

  • Download URL: turkish_tokenizer-0.1.4.tar.gz
  • Upload date:
  • Size: 252.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for turkish_tokenizer-0.1.4.tar.gz
Algorithm Hash digest
SHA256 1265b780334ce96276cb6c50fea050d54ec007df2c13fc5c2b2508577ee932e7
MD5 02ea7802cd0269d6c7c7f4e41da0bb8d
BLAKE2b-256 b784eac2e236a852e6e5cbcedffbc7569630c7e43c3c6c42fb96d166efa5c9db

See more details on using hashes here.

File details

Details for the file turkish_tokenizer-0.1.4-py3-none-any.whl.

File metadata

File hashes

Hashes for turkish_tokenizer-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 eaa7055cda9f169cab52310498aff3ec4205c3ee354cd1164511c1b0df8e4cdd
MD5 952aff7119d68d6b79c3e0efc570deba
BLAKE2b-256 87f006b352a8ed41bdb1f6ca4895dc3e30b7a2ecc6575c630cb6f8ea5adb0272

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page