Skip to main content

High-performance Turkish tokenizer with Rust backend and Python wrapper

Project description

Turkish Tokenizer

A high-performance Turkish tokenizer with Rust backend and Python wrapper, designed for efficient natural language processing of Turkish text.

Features

  • High Performance: Rust backend for fast tokenization
  • Turkish Language Support: Optimized for Turkish morphology and grammar
  • Python Integration: Easy-to-use Python wrapper
  • Comprehensive Coverage: Handles Turkish roots, suffixes, and BPE tokens
  • Command Line Interface: CLI tool for batch processing

Installation

pip install turkish-tokenizer

Quick Start

from turkish_tokenizer import TurkishTokenizer

# Initialize the tokenizer
tokenizer = TurkishTokenizer()

# Tokenize text
text = "Merhaba dünya! Bu bir test cümlesidir."
tokens = tokenizer.tokenize(text)
print(tokens)

# Decode tokens back to text
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

Command Line Usage

# Tokenize a text file
turkish-tokenizer tokenize input.txt output.txt

# Decode tokens back to text
turkish-tokenizer decode input_tokens.txt output_text.txt

API Reference

TurkishTokenizer

The main tokenizer class.

Methods

  • tokenize(text: str) -> List[int]: Tokenize input text into token IDs
  • decode(tokens: List[int]) -> str: Decode token IDs back to text
  • encode(text: str) -> List[int]: Alias for tokenize method

Development

Setup

  1. Clone the repository
  2. Install development dependencies: pip install -e ".[dev]"
  3. Run tests: pytest

Building

python -m build

License

MIT License - see LICENSE file for details.

Changelog

0.1.3 (2024-12-19)

  • FIXED: JSON vocabulary files are now properly included in the package distribution
  • FIXED: MANIFEST.in corrected to include JSON files from the right directory structure
  • FIXED: Package data configuration updated to ensure JSON files are bundled

0.1.2 (2024-12-19)

  • ADDED: Command line interface (CLI) for batch processing
  • ADDED: Comprehensive test suite
  • IMPROVED: Better error handling and validation
  • IMPROVED: Enhanced documentation and examples

0.1.1 (2024-12-19)

  • FIXED: Package metadata and dependencies
  • IMPROVED: Better package structure and organization

0.1.0 (2024-12-19)

  • INITIAL: First release with basic tokenization functionality
  • FEATURES: Turkish root and suffix matching
  • FEATURES: BPE tokenization support
  • FEATURES: Python wrapper for Rust backend

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Author

Ali Bayram - malibayram20@gmail.com

Repository

https://github.com/malibayram/turkish-tokenizer

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turkish_tokenizer-0.1.3.tar.gz (252.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

turkish_tokenizer-0.1.3-py3-none-any.whl (269.0 kB view details)

Uploaded Python 3

File details

Details for the file turkish_tokenizer-0.1.3.tar.gz.

File metadata

  • Download URL: turkish_tokenizer-0.1.3.tar.gz
  • Upload date:
  • Size: 252.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for turkish_tokenizer-0.1.3.tar.gz
Algorithm Hash digest
SHA256 bc0ca8fd5f235ce52c06a74e87e24be775e39ec4fd18923dede296d422960636
MD5 00edd229e221f10b6380c6f9cddd9615
BLAKE2b-256 aa6864f07e0d36411229c6ea155902a2d16b453c75bf92ef15fba96a17c7c11e

See more details on using hashes here.

File details

Details for the file turkish_tokenizer-0.1.3-py3-none-any.whl.

File metadata

File hashes

Hashes for turkish_tokenizer-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 43dce42a9448454c9f0ae7f2ab372aa637fc68fe3b911522b7b52fbbd11adcb9
MD5 1beec71474ae5d797208fa8163107964
BLAKE2b-256 9d965a94d50a64eb05f409f5285d6bc16de3a1efaa261d7212c84d53c2d6df74

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page