A Khmer Speech Toolkit.
khmerspeech
KhmerSpeech is a text-normalization toolkit tailored for Khmer speech applications. It provides a set of focused processors that clean raw text and verbalize numbers, currencies, dates, URLs, and other tokens that regularly appear in transcripts.
This project is heavily inspired by, and builds on, the open-source tha repository created by Seanghay Yat; the modules here adapt his original work for Khmer-specific speech use cases.
Requirements
- Python 3.8+
- Bundled processors rely on regex, phonenumbers, urlextract, and ftfy.
Installation
PyPI
pip install khmerspeech
From source
You can work directly from the repository today:
git clone https://github.com/MetythornPenn/khmerspeech
cd khmerspeech
pip install -e .
This will install the package in editable mode along with its core dependencies.
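To confirm that the install pulled in the dependencies listed under Requirements, a quick check along these lines may help. It is only a sketch: `missing_modules` is a hypothetical helper, not part of khmerspeech, and it assumes each dependency's import name matches its package name.

```python
from importlib.util import find_spec

# Import names assumed to match the package names listed under Requirements.
REQUIRED = ("regex", "phonenumbers", "urlextract", "ftfy")


def missing_modules(names=REQUIRED):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result suggests the core dependencies are importable; any listed name can be installed with pip.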
Quick start
All processors live under the khmerspeech.text_normalization namespace. Compose them to fit your pipeline:
from khmerspeech.text_normalization import (
    normalize,
    datetime as km_datetime,
    phone_numbers,
    currency,
    cardinals,
    decimals,
    urls,
    dict_verbalize,
)
raw = "010123123 គិតថ្លៃ $100.25 នៅថ្ងៃទី 2024-01-02 វេលា 10:23AM ចូលតាម https://google.com.kh"
# 1. Normalize Khmer punctuation / Unicode issues.
clean = normalize.processor(raw)
# 2. Handle structured tokens.
clean = phone_numbers.processor(clean, chunk_size=3)
clean = km_datetime.date_processor(clean)
clean = km_datetime.time_processor(clean)
clean = currency.processor(clean)
clean = urls.processor(clean)
# 3. Expand dictionary-driven spellings / units.
clean = dict_verbalize(clean)
print(clean)
# 0▁10▁123▁123 គិតថ្លៃ មួយរយដុល្លារ▁ម្ភៃប្រាំសេន នៅថ្ងៃទី 2024 01 02 វេលា 10 23▁A▁M ចូលតាម google dot com dot k▁h
# Standalone helpers
print(cardinals.processor("1234")) # មួយពាន់▁ពីររយ▁សាមសិបបួន
print(decimals.processor("-123.45")) # ដក▁មួយរយ▁ម្ភៃបី▁ចុច▁សែសិបប្រាំ
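Because each step in the Quick start takes a string and returns a string, the chain can be wrapped in a single callable. The sketch below assumes exactly that str -> str shape; `compose` and `collapse_spaces` are hypothetical names used for illustration and are not part of khmerspeech. Stand-in steps are used so the example runs without the package installed.

```python
from functools import reduce
from typing import Callable

Processor = Callable[[str], str]


def compose(*steps: Processor) -> Processor:
    """Chain str -> str steps into one callable, applied left to right."""
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


# Stand-in step so the sketch runs without khmerspeech installed.
def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


pipeline = compose(str.strip, collapse_spaces)
print(pipeline("  a   b  "))  # a b
```

With khmerspeech installed, the stand-ins could presumably be swapped for the real steps, for example `normalize.processor` and `functools.partial(phone_numbers.processor, chunk_size=3)`, in the same order as the Quick start.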
Processors at a glance
| Module / helper | What it does | Example |
|---|---|---|
| normalize.processor | Unicode cleanup, punctuation collapsing, whitespace normalization | "មិន\u200bឲ្យ" → "មិនឱ្យ" |
| datetime.date_processor, datetime.time_processor | Normalize numeric dates and clock times (AM/PM verbalized) | "2024-01-02" → "2024 01 02" |
| phone_numbers.processor | Chunk Khmer phone numbers while keeping carrier prefixes | "010123123" → "0▁10▁123▁123" |
| currency.processor | Verbalize USD / KHR amounts (supports $, USD, ៛, រៀល) | "$100.01" → "មួយរយដុល្លារ▁មួយសេន" |
| cardinals.processor | Convert integers to Khmer wording | "1234" → "មួយពាន់▁ពីររយ▁សាមសិបបួន" |
| decimals.processor | Render decimal numbers with “ចុច/ក្បៀស” markers | "123.001" → "មួយរយ▁ម្ភៃបី▁ចុច▁សូន្យ▁សូន្យ▁មួយ" |
| ordinals.processor | Turn English ordinals (1st, 5th, …) into Khmer | "5th" → "ទី▁ប្រាំ" |
| urls.processor | Verbalize URLs/emails, normalizing domain suffixes | "google.com.kh" → "google dot com dot k▁h" |
| hashtags.processor | Drop Khmer or Latin hashtags inline | "Hello #ពិសោធន៍" → "Hello " |
| ascii_lines.processor | Remove ASCII dividers / ruler lines | "--- title ---" → " title " |
| license_plate.processor | Reformat Cambodian license plates with syllable separators | "1A 1234" → "1 A 12▁34" |
| parenthesis.processor | Remove text inside () or [] | "Hello (secret) world" → "Hello world" |
| repeater.processor | Expand Khmer iteration mark ៗ; accepts custom tokenizers | "បន្តិចម្ដងៗ" → "បន្តិចម្ដង▁បន្តិចម្ដង" |
| punctuations.processor | Collapse repeated sentence-ending punctuation and enforce spacing | "។។។" → "។ " |
| dict_verbalize / dictionary.processor | Apply spelling, verbatim, and measurement-unit replacements from dict/ TSVs | "10 kg" → "10▁គីឡូក្រាម" |
Each module is intentionally tiny. Inspect the source under khmerspeech/text_normalization/ if you need to tweak behaviour or build bespoke processors.
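As an illustration of what a bespoke processor might look like, here is a minimal sketch that follows the same text-in, text-out shape as the bundled modules. `percent_processor` is hypothetical and not part of khmerspeech; the default Khmer word for "percent" and the "▁" separator are assumptions modelled on the examples in the table above.

```python
import re

# Matches an integer or decimal followed by a percent sign,
# allowing optional whitespace between them.
_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def percent_processor(text: str, word: str = "ភាគរយ", sep: str = "▁") -> str:
    """Replace the percent sign after a number with a spoken word.

    The number itself is left as digits for a later step to verbalize.
    """
    return _PERCENT.sub(lambda m: f"{m.group(1)}{sep}{word}", text)


print(percent_processor("50%"))
```

Leaving the digits untouched keeps the processor focused on one token type, so a numeric step such as cardinals.processor or decimals.processor can presumably handle them afterwards.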
Dictionary resources
The text_normalization/dict/ folder ships pragmatic TSVs containing common spellings, pronunciations, and measurement units. dict_verbalize loads them at runtime:
from khmerspeech.text_normalization import dict_verbalize
text = "100 km ឬ 10 kg"
print(dict_verbalize(text))
# 100▁គីឡូម៉ែត្រ ឬ 10▁គីឡូក្រាម
Add new terms by editing the TSV files and re-running your pipeline; no extra build step is required.
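The general idea of a TSV-driven replacement pass can be sketched as follows. This is not the library's implementation: the two-column `term<TAB>replacement` layout, and the helper names `load_terms` and `apply_terms`, are assumptions made for illustration. Check the files under text_normalization/dict/ for the real column layout.

```python
import csv
import io
import re


def load_terms(tsv_text: str) -> dict:
    """Parse 'term<TAB>replacement' rows into a dict (layout is an assumption)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def apply_terms(text: str, terms: dict) -> str:
    """Replace each term that is not embedded in a longer Latin word."""
    if not terms:
        return text
    # Longest terms first so a short entry cannot shadow a longer one.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"(?<![A-Za-z])(?:{alternation})(?![A-Za-z0-9])")
    return pattern.sub(lambda m: terms[m.group(0)], text)


SAMPLE_TSV = "km\tគីឡូម៉ែត្រ\nkg\tគីឡូក្រាម\n"
print(apply_terms("100 km ឬ 10 kg", load_terms(SAMPLE_TSV)))
```

Unlike dict_verbalize, this sketch keeps the original space between the number and the unit rather than joining them with "▁".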
Development
After installing in editable mode you can run the included assertions to spot regressions:
python tests/tests.py
Code contributions are welcome. Please format new code with ruff or black, keep functions focused, and add tests for new processors.
Acknowledgements
- Seanghay Yat for the original tha project, released under the MIT license.
- NVIDIA NeMo text processing
License
khmerspeech is distributed under the Apache License 2.0. See the LICENSE file for the full text.