A Python package for transliterating English text to Khmer script.

Khmer Text Transliteration

A Python-based system for transliterating English text to Khmer script using sequence-to-sequence neural networks.

Overview

This project converts English phonetic spellings of Khmer words (for example, "somlor") into Khmer script. The conversion is performed by a sequence-to-sequence model built from LSTM layers.

Features

  • English to Khmer text transliteration
  • Multiple prediction variants
  • Fuzzy matching and similarity search
  • Web interface using Gradio

Project Structure

Pre-trained Models

The project includes pre-trained models located in khmer_text_transliteration/models/pretrained/:

  • khmer_transliterator.keras: A pre-trained sequence-to-sequence model for English to Khmer transliteration

Training Assets

Tokenizer and model assets are stored in data/processed/:

  • khmer_transliteration_assets.pkl: Contains the English and Khmer tokenizers, along with sequence length information

Training Data

Raw data for training and reference is available in data/raw/:

  • eng_khm_data.txt: Training data with English-Khmer word pairs
  • khmer_words.txt: Dictionary of Khmer words
  • 1000-most-common-khmer-words/: Collection of common Khmer words for reference
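
As a rough illustration of how word pairs like those in eng_khm_data.txt might be read, the sketch below parses lines into (English, Khmer) tuples. The file layout is not documented here, so the tab separator is an assumption, and parse_pairs is a hypothetical helper, not part of the package. The sample pairs reuse examples from the Core Functions section.

```python
def parse_pairs(lines, sep="\t"):
    """Collect (english, khmer) tuples from an iterable of text lines.

    Assumption: one pair per line, separated by `sep` (a tab here).
    Blank lines and lines without the separator are skipped.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue
        english, khmer = line.split(sep, 1)
        pairs.append((english.strip().lower(), khmer.strip()))
    return pairs


# In-memory sample standing in for the real file contents.
sample = ["somlor\tសម្ល", "", "srolanh\tស្រឡាញ់"]
pairs = parse_pairs(sample)
```

To read the real file, pass an open file handle (opened with encoding="utf-8") in place of the sample list, and adjust sep if the data uses a different delimiter.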

Training Process

The model training process is documented in the notebooks:

  • notebooks/khmer_seq2seq.ipynb: Jupyter notebook containing the complete training pipeline, including:
    • Data preprocessing
    • Model architecture
    • Training configuration
    • Evaluation metrics
    • Example predictions

To train a new model or experiment with the existing one, refer to the training notebook for detailed instructions and parameters.

Core Functions

1. Basic Transliteration

from khmer_text_transliteration.predict import transliterate

# Convert English text to Khmer
result = transliterate("somlor")  # Returns: សម្ល

2. Generate Multiple Variants

from khmer_text_transliteration.predict import transliterate_variants

# Get multiple possible transliterations
variants = transliterate_variants("srolanh", num_variants=3, temperature=0.7)
# Returns: ['ស្រឡាញ់', 'ស្រលាញ', 'ស្រលាញ់']
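
How transliterate_variants applies temperature internally is not described here. In sampling-based decoders, temperature commonly divides the model's raw scores before the softmax: values below 1 sharpen the distribution (fewer, safer variants), values above 1 flatten it (more diverse variants). The sketch below shows that general technique with the standard library only; the function names and the scores are illustrative assumptions, not the package's code.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities, rescaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate characters at one decoding step.
logits = [2.0, 1.0, 0.1]
cool = softmax_with_temperature(logits, temperature=0.7)
warm = softmax_with_temperature(logits, temperature=1.5)
```

With these scores, the top candidate receives a larger share of the probability in cool than in warm, which is why a lower temperature tends to yield fewer distinct variants.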

3. Find Similar Words

from khmer_text_transliteration.predict_with_clean import find_similar

# Find similar Khmer words
similar_words = find_similar("min", max_results=2)
# Returns: ['មិន', 'មីន']
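
The requirements list python-Levenshtein and rapidfuzz, which suggests that fuzzy matching relies on edit distance, although the actual implementation of find_similar is not shown here. As a minimal sketch of that idea, the code below ranks dictionary words by Levenshtein distance to a candidate Khmer string. Both functions are hypothetical, and distance is counted per Unicode code point, which is a simplification for Khmer script.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_similar(candidate, dictionary, max_results=2):
    """Return the dictionary words closest to `candidate` by edit distance."""
    ranked = sorted(dictionary, key=lambda w: (edit_distance(candidate, w), w))
    return ranked[:max_results]


# Small stand-in for a dictionary such as khmer_words.txt.
words = ["មិន", "មីន", "ស្រឡាញ់"]
closest = rank_similar(words[0], words, max_results=2)
```

In practice a library such as rapidfuzz would be much faster on a full dictionary; the pure-Python version is only meant to make the ranking logic visible.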

4. TF-IDF Based Similarity Search

from khmer_text_transliteration.predict_with_clean import find_similar_tfidf

# Find similar words using TF-IDF
similar = find_similar_tfidf("min", max_results=2)
# Returns: ['មិន', 'មីន']
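
For readers unfamiliar with TF-IDF on single words, the sketch below shows one common approach: represent each word by its character bigrams, weight them by TF-IDF, and rank by cosine similarity. It is an assumption that find_similar_tfidf works along these lines. All names are hypothetical, and the code uses only the standard library, whereas a real implementation would more likely use scikit-learn's TfidfVectorizer with analyzer="char".

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Overlapping character n-grams; short words fall back to the whole word."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_tfidf(words, n=2):
    """Return one sparse TF-IDF vector (a dict) per word, plus the IDF table."""
    grams = [Counter(char_ngrams(w, n)) for w in words]
    doc_freq = Counter(g for counts in grams for g in counts)
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    idf = {g: math.log((1 + len(words)) / (1 + df)) + 1
           for g, df in doc_freq.items()}
    vectors = [{g: tf * idf[g] for g, tf in counts.items()} for counts in grams]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(g, 0.0) for g, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query, words, max_results=2, n=2):
    """Rank `words` by TF-IDF cosine similarity to `query`."""
    vectors, idf = build_tfidf(words, n)
    query_counts = Counter(char_ngrams(query, n))
    query_vec = {g: tf * idf.get(g, 0.0) for g, tf in query_counts.items()}
    scored = sorted(zip(words, vectors),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [w for w, _ in scored[:max_results]]


# Latin-letter stand-ins keep the bigrams easy to inspect by eye.
vocabulary = ["min", "mean", "snam"]
best = most_similar("min", vocabulary, max_results=1)
```

Unlike edit distance, this scoring ignores character order beyond the n-gram window, so the two methods can rank the same candidates differently.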

5. Last Result Prediction

from khmer_text_transliteration.predict_with_clean import predict_last_result

# Get final predictions with scoring
results = predict_last_result("snam", num_results=3)
# Returns: ['ស្នាម', 'ស្នំ', 'សម្នាម']

Requirements

  • TensorFlow 2.x
  • NumPy
  • scikit-learn
  • python-Levenshtein
  • rapidfuzz
  • gradio (for web interface)

Installation

pip install -r requirements.txt

License

MIT License
