A Python package for transliterating English text to Khmer script.
Khmer Text Transliteration
A Python-based system for transliterating English text to Khmer script using sequence-to-sequence neural networks.
Overview
This project provides tools to convert romanized Khmer, that is, Khmer words spelled phonetically in English letters (for example "somlor"), into Khmer script. Transliteration is performed by a sequence-to-sequence model with LSTM layers.
Features
- English to Khmer text transliteration
- Multiple prediction variants
- Fuzzy matching and similarity search
- Web interface using Gradio
Project Structure
Pre-trained Models
The project includes pre-trained models located in khmer_text_transliteration/models/pretrained/:
- khmer_transliterator.keras: A pre-trained sequence-to-sequence model for English to Khmer transliteration
Training Assets
Tokenizer and model assets are stored in data/processed/:
- khmer_transliteration_assets.pkl: Contains the English and Khmer tokenizers, along with sequence length information
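To make "tokenizers and sequence length information" concrete, here is a minimal sketch of character-level tokenization with fixed-length padding. It is an illustration only: the names `build_vocab`, `encode`, and `PAD` are hypothetical, and the actual contents and layout of khmer_transliteration_assets.pkl are not described here and may differ.

```python
# Hypothetical character-level tokenizer; not the package's actual classes.
PAD = 0  # index reserved for padding positions


def build_vocab(words):
    """Map each distinct character to an integer index, starting at 1."""
    chars = sorted({ch for word in words for ch in word})
    return {ch: i for i, ch in enumerate(chars, start=1)}


def encode(word, vocab, max_len):
    """Convert a word to a fixed-length list of indices, right-padded with PAD.

    Characters missing from the vocabulary raise KeyError in this sketch.
    """
    ids = [vocab[ch] for ch in word][:max_len]
    return ids + [PAD] * (max_len - len(ids))


english_words = ["somlor", "srolanh", "min"]
vocab = build_vocab(english_words)
max_len = max(len(w) for w in english_words)  # the stored "sequence length"

encoded = encode("min", vocab, max_len)
print(encoded)  # [5, 3, 6, 0, 0, 0, 0]
```

A sequence-to-sequence model needs one such vocabulary and maximum length per side (English input, Khmer output), which is presumably why both are saved together with the model.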
Training Data
Raw data for training and reference is available in data/raw/:
- eng_khm_data.txt: Training data with English-Khmer word pairs
- khmer_words.txt: Dictionary of Khmer words
- 1000-most-common-khmer-words/: Collection of common Khmer words for reference
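As a rough sketch of how a word-pair file could be read, the snippet below parses one pair per line. The tab separator and the `read_pairs` helper are assumptions made for illustration; the real layout of eng_khm_data.txt is not documented here, so check the file before relying on this.

```python
import io


def read_pairs(handle, sep="\t"):
    """Read (english, khmer) pairs, one per line, skipping malformed lines.

    The separator is an assumption; adjust it to match the actual file.
    """
    pairs = []
    for line in handle:
        line = line.strip()
        if not line or sep not in line:
            continue  # ignore blank lines and lines without a separator
        english, khmer = line.split(sep, 1)
        pairs.append((english.strip(), khmer.strip()))
    return pairs


# In-memory stand-in for the data file, using word pairs from this README.
sample = io.StringIO("somlor\tសម្ល\nmin\tមិន\n\nmalformed line\n")
pairs = read_pairs(sample)
print(len(pairs))  # 2
```

With a real file, pass `open(path, encoding="utf-8")` instead of the `StringIO` object so the Khmer text is decoded correctly.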
Training Process
The model training process is documented in the notebooks:
- notebooks/khmer_seq2seq.ipynb: Jupyter notebook containing the complete training pipeline, including:
  - Data preprocessing
  - Model architecture
  - Training configuration
  - Evaluation metrics
  - Example predictions
To train a new model or experiment with the existing one, refer to the training notebook for detailed instructions and parameters.
Core Functions
1. Basic Transliteration
from khmer_text_transliteration.predict import transliterate
# Convert English text to Khmer
result = transliterate("somlor") # Returns: សម្ល
2. Generate Multiple Variants
from khmer_text_transliteration.predict import transliterate_variants
# Get multiple possible transliterations
variants = transliterate_variants("srolanh", num_variants=3, temperature=0.7)
# Returns: ['ស្រឡាញ់', 'ស្រលាញ', 'ស្រលាញ់']
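The `temperature` argument is not explained above. In most sampling-based decoders it rescales the model's output scores before a character is drawn, and the sketch below shows that general technique. It is an assumption that `transliterate_variants` works this way; the helper names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate characters
sharp = softmax_with_temperature(logits, 0.3)  # low temperature: near-greedy
flat = softmax_with_temperature(logits, 1.5)   # high temperature: more varied

rng = random.Random(0)  # seeded so repeated runs draw the same picks
picks = [sample_index(logits, 0.7, rng) for _ in range(5)]
```

Lower temperatures concentrate probability on the top-scoring character, so variants look alike; higher temperatures spread it out and yield more diverse, but less reliable, spellings.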
3. Find Similar Words
from khmer_text_transliteration.predict_with_clean import find_similar
# Find similar Khmer words
similar_words = find_similar("min", max_results=2)
# Returns: ['មិន', 'មីន']
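Fuzzy matching of this kind is commonly built on edit distance. The following is a small standard-library sketch of ranking dictionary words by Levenshtein distance; it is not the implementation behind `find_similar`, which presumably relies on the python-Levenshtein or rapidfuzz packages listed under Requirements. The `edit_distance` and `closest` helpers are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest(query, candidates, max_results=2):
    """Return the candidates with the smallest edit distance to the query."""
    return sorted(candidates, key=lambda w: edit_distance(query, w))[:max_results]


dictionary = ["ស្នាម", "មីន", "មិន"]
ranked = closest("មិន", dictionary)
print(ranked)
```

Note that this compares Unicode code points, so a Khmer consonant and its dependent vowel count as separate symbols; a production system might compare grapheme clusters instead.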
4. TF-IDF Based Similarity Search
from khmer_text_transliteration.predict_with_clean import find_similar_tfidf
# Find similar words using TF-IDF
similar = find_similar_tfidf("min", max_results=2)
# Returns: ['មិន', 'មីន']
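For readers unfamiliar with TF-IDF similarity on single words, the idea can be sketched with character n-grams and cosine similarity using only the standard library. This illustrates the technique, not the code behind `find_similar_tfidf` (which likely uses scikit-learn); the function names, the bigram size, and the smoothed IDF formula are assumptions.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Split a word into overlapping character n-grams, padded with spaces."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(words, n=2):
    """Compute a smoothed inverse document frequency for every n-gram."""
    doc_freq = Counter(g for w in words for g in set(char_ngrams(w, n)))
    total = len(words)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def vectorize(word, idf, n=2):
    """Build a sparse TF-IDF vector; n-grams unseen during fitting are dropped."""
    counts = Counter(char_ngrams(word, n))
    return {g: tf * idf[g] for g, tf in counts.items() if g in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank(query, words, max_results=2, n=2):
    """Return the words whose TF-IDF vectors are most similar to the query."""
    idf = fit_idf(words, n)
    q = vectorize(query, idf, n)
    scored = [(cosine(q, vectorize(w, idf, n)), w) for w in words]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [w for _, w in scored[:max_results]]


words = ["min", "mean", "snam", "srolanh"]
top = rank("min", words)
print(top)  # ['min', 'mean']
```

The same functions work unchanged on Khmer strings, since they only operate on characters. Compared with edit distance, n-gram TF-IDF gives less weight to fragments that occur in many dictionary words.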
5. Last Result Prediction
from khmer_text_transliteration.predict_with_clean import predict_last_result
# Get final predictions with scoring
results = predict_last_result("snam", num_results=3)
# Returns: ['ស្នាម', 'ស្នំ', 'សម្នាម']
Requirements
- TensorFlow 2.x
- NumPy
- scikit-learn
- python-Levenshtein
- rapidfuzz
- gradio (for web interface)
Installation
pip install -r requirements.txt
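After installing, a quick way to confirm that the listed requirements are importable is to probe for each module without importing it. The import names below are the usual ones for these distributions (for example, scikit-learn imports as `sklearn` and python-Levenshtein as `Levenshtein`); the `missing_packages` helper itself is hypothetical and not part of this package.

```python
from importlib.util import find_spec

# Display name -> top-level module name to probe for.
REQUIRED = {
    "TensorFlow": "tensorflow",
    "NumPy": "numpy",
    "scikit-learn": "sklearn",
    "python-Levenshtein": "Levenshtein",
    "rapidfuzz": "rapidfuzz",
    "gradio": "gradio",
}


def missing_packages(modules=REQUIRED):
    """Return the display names of requirements that cannot be found."""
    return [name for name, module in modules.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing requirements:", ", ".join(missing))
else:
    print("All requirements are installed.")
```

This only checks that each module can be located; it does not verify versions such as the TensorFlow 2.x requirement.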
License
MIT License
File details
Details for the file khmer-english-transliteration-0.1.4.tar.gz.
File metadata
- Download URL: khmer-english-transliteration-0.1.4.tar.gz
- Upload date:
- Size: 1.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2e2cadf5eec442fbb5cd9945fe1e37b8bf3ccecd7c87c99b04fa798868a8d969
MD5 | 5a6e30e2ce31dc09ab3da0d0df36e50d
BLAKE2b-256 | 2f9167350d9e5b0755c62a86eb84b32b663a4c5104ca8dbc52da40a93454076f
File details
Details for the file khmer_english_transliteration-0.1.4-py3-none-any.whl.
File metadata
- Download URL: khmer_english_transliteration-0.1.4-py3-none-any.whl
- Upload date:
- Size: 1.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 631c34c8b5d081f1aa6261786f6b6e2116a6770879d39eca8e7c886483cf577d
MD5 | 42eb0a8fb9ce1cda6618e49f72fa0920
BLAKE2b-256 | 439cf04bad7bef8525f43a56a6b79f3ae065748a68dfe0252e8d2217d60c8c55