A Khmer Speech Toolkit.

khmerspeech

KhmerSpeech is a text-normalization toolkit tailored for Khmer speech applications. It provides a set of focused processors that clean raw text and verbalize numbers, currencies, dates, URLs, and other tokens that regularly appear in transcripts.

This project is heavily inspired by, and builds on, the open-source tha repository created by Seanghay Yat; the modules here adapt his original work for Khmer-specific speech use cases.

Requirements

  • Python 3.8+
  • Bundled processors rely on regex, phonenumbers, urlextract, and ftfy.
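A quick way to confirm these requirements in an environment is sketched below. It uses only the standard library and assumes that each dependency's import name matches the name listed above; `missing_modules` and `python_is_supported` are hypothetical helper names, not part of khmerspeech.

```python
import importlib.util
import sys

# Import names are assumed to match the dependency names in the list above.
REQUIRED_MODULES = ("regex", "phonenumbers", "urlextract", "ftfy")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 8)):
    """True when the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


print("Python OK:", python_is_supported())
print("Missing:", missing_modules() or "none")
```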

Installation

PyPI

pip install khmerspeech

From source

To work directly from the repository, clone it and install from the checkout:

git clone https://github.com/MetythornPenn/khmerspeech
cd khmerspeech
pip install -e .

This will install the package in editable mode along with its core dependencies.

Quick start

All processors live under the khmerspeech.text_normalization namespace. Compose them to fit your pipeline:

from khmerspeech.text_normalization import (
  normalize,
  datetime as km_datetime,
  phone_numbers,
  currency,
  cardinals,
  decimals,
  urls,
  dict_verbalize,
)

raw = "010123123 គិតថ្លៃ $100.25 នៅថ្ងៃទី 2024-01-02 វេលា 10:23AM ចូលតាម https://google.com.kh"

# 1. Normalize Khmer punctuation / Unicode issues.
clean = normalize.processor(raw)

# 2. Handle structured tokens.
clean = phone_numbers.processor(clean, chunk_size=3)
clean = km_datetime.date_processor(clean)
clean = km_datetime.time_processor(clean)
clean = currency.processor(clean)
clean = urls.processor(clean)

# 3. Expand dictionary-driven spellings / units.
clean = dict_verbalize(clean)

print(clean)
# 0▁10▁123▁123 គិតថ្លៃ មួយរយដុល្លារ▁ម្ភៃប្រាំសេន នៅថ្ងៃទី 2024 01 02 វេលា 10 23▁A▁M ចូលតាម google dot com dot k▁h

# Standalone helpers
print(cardinals.processor("1234"))   # មួយពាន់▁ពីររយ▁សាមសិបបួន
print(decimals.processor("-123.45")) # ដក▁មួយរយ▁ម្ភៃបី▁ចុច▁សែសិបប្រាំ
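When the same sequence of steps is reused in several places, it can help to wrap it in a single callable. The following is a minimal sketch of that idea, not something khmerspeech ships: `build_pipeline` is a hypothetical helper, and the two stand-in processors exist only so the example runs without the package installed.

```python
from functools import reduce
from typing import Callable, Iterable

Processor = Callable[[str], str]


def build_pipeline(steps: Iterable[Processor]) -> Processor:
    """Chain processors so each one receives the previous one's output."""
    steps = list(steps)

    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)

    return run


# Stand-in processors (hypothetical) with the same "str in, str out" shape.
def strip_brackets(text: str) -> str:
    return text.replace("(", "").replace(")", "")


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


pipeline = build_pipeline([strip_brackets, collapse_spaces])
print(pipeline("hello   (world)  "))  # hello world
```

With khmerspeech installed, the list could instead hold the calls from the quick start, for example `normalize.processor` and `functools.partial(phone_numbers.processor, chunk_size=3)`, in the order your pipeline needs.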

Processors at a glance

| Module / helper | What it does | Example |
| --- | --- | --- |
| normalize.processor | Unicode cleanup, punctuation collapsing, whitespace normalization | "មិន\u200bឲ្យ" → "មិនឱ្យ" |
| datetime.date_processor, datetime.time_processor | Normalize numeric dates and clock times (AM/PM verbalized) | "2024-01-02" → "2024 01 02" |
| phone_numbers.processor | Chunk Khmer phone numbers while keeping carrier prefixes | "010123123" → "0▁10▁123▁123" |
| currency.processor | Verbalize USD / KHR amounts (supports $, USD, ៛, រៀល) | "$100.01" → "មួយរយដុល្លារ▁មួយសេន" |
| cardinals.processor | Convert integers to Khmer wording | "1234" → "មួយពាន់▁ពីររយ▁សាមសិបបួន" |
| decimals.processor | Render decimal numbers with “ចុច/ក្បៀស” markers | "123.001" → "មួយរយ▁ម្ភៃបី▁ចុច▁សូន្យ▁សូន្យ▁មួយ" |
| ordinals.processor | Turn English ordinals (1st, 5th, …) into Khmer | "5th" → "ទី▁ប្រាំ" |
| urls.processor | Verbalize URLs/emails, normalizing domain suffixes | "google.com.kh" → "google dot com dot k▁h" |
| hashtags.processor | Drop Khmer or Latin hashtags inline | "Hello #ពិសោធន៍" → "Hello " |
| ascii_lines.processor | Remove ASCII dividers / ruler lines | "--- title ---" → " title " |
| license_plate.processor | Reformat Cambodian license plates with syllable separators | "1A 1234" → "1 A 12▁34" |
| parenthesis.processor | Remove text inside () or [] | "Hello (secret) world" → "Hello world" |
| repeater.processor | Expand the Khmer iteration mark ៗ; accepts custom tokenizers | "បន្តិចម្ដងៗ" → "បន្តិចម្ដង▁បន្តិចម្ដង" |
| punctuations.processor | Collapse repeated sentence-ending punctuation and enforce spacing | "។។។" → "។ " |
| dict_verbalize / dictionary.processor | Apply spelling, verbatim, and measurement-unit replacements from dict/ TSVs | "10 kg" → "10▁គីឡូក្រាម" |

Each module is intentionally tiny—inspect the source under khmerspeech/text_normalization/ if you need to tweak behaviour or build bespoke processors.
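As an illustration of what a bespoke processor might look like, here is a minimal sketch that drops @mentions, loosely analogous to the hashtag processor. `mentions_processor` is hypothetical and not part of khmerspeech; it only assumes the "text in, text out" shape seen in the quick start. It uses the standard `re` module and is only exercised on Latin handles here.

```python
import re

# Hypothetical processor, not part of khmerspeech.
# The lookbehind keeps e-mail addresses such as a@b.com intact, because
# their "@" is preceded by a word character.
_MENTION_RE = re.compile(r"(?<!\w)@\w+")


def mentions_processor(text: str) -> str:
    """Remove @mentions and leave the rest of the text untouched."""
    return _MENTION_RE.sub("", text)


print(repr(mentions_processor("Hello @someone")))  # 'Hello '
```

Like the hashtag example in the table, this leaves surrounding whitespace alone, so a later whitespace-normalizing step would be expected to tidy it up.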

Dictionary resources

The text_normalization/dict/ folder ships pragmatic TSVs containing common spellings, pronunciations, and measurement units. dict_verbalize loads them at runtime:

from khmerspeech.text_normalization import dict_verbalize

text = "100 km ឬ 10 kg"
print(dict_verbalize(text))
# 100▁គីឡូម៉ែត្រ ឬ 10▁គីឡូក្រាម

Add new terms by editing the TSV files and re-running your pipeline—no extra build step is required.
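To make the idea of TSV-driven replacement concrete, the sketch below loads a small in-memory table and applies it to number-plus-unit patterns. It is an assumption-laden illustration, not the library's implementation: the two-column "source, replacement" layout, `load_units`, and `verbalize_units` are all hypothetical, and the real files under text_normalization/dict/ may be structured differently.

```python
import csv
import io
import re

# Assumed layout: one "source<TAB>replacement" pair per line.
SAMPLE_TSV = "kg\tគីឡូក្រាម\nkm\tគីឡូម៉ែត្រ\n"


def load_units(handle) -> dict:
    """Read two-column TSV rows into a {source: replacement} mapping."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def verbalize_units(text: str, units: dict) -> str:
    """Replace '<number> <unit>' with '<number>▁<replacement>'."""
    if not units:
        return text
    # Longest keys first so a longer unit is tried before any shorter prefix.
    keys = sorted(units, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    pattern = re.compile(rf"(\d+)\s*({alternation})\b")
    return pattern.sub(lambda m: f"{m.group(1)}▁{units[m.group(2)]}", text)


units = load_units(io.StringIO(SAMPLE_TSV))
print(verbalize_units("100 km ឬ 10 kg", units))
```

Because the mapping is read each time the script runs, adding a row to the table changes the output on the next run without any compile step, which mirrors the workflow described above.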

Development

After installing in editable mode you can run the included assertions to spot regressions:

python tests/tests.py

Code contributions are welcome. Please format new code with ruff or black, keep functions focused, and add tests for new processors.
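Tests for a new processor can be as simple as a table of input and expected-output pairs. The helper below is a sketch of that pattern and makes no claim about how tests/tests.py is organized; `check_cases` and the stand-in `upper_processor` are hypothetical names used so the example runs on its own.

```python
from typing import Callable, Iterable, List, Tuple


def check_cases(
    processor: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Run `processor` on each input and collect (input, expected, actual) mismatches."""
    failures = []
    for raw, expected in cases:
        actual = processor(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


# Stand-in processor so the sketch runs without the package installed.
def upper_processor(text: str) -> str:
    return text.upper()


failures = check_cases(upper_processor, [("abc", "ABC"), ("Khmer", "KHMER")])
assert not failures, failures
print("all cases passed")
```

With khmerspeech installed, the same helper could be pointed at a real processor, using pairs taken from the "Processors at a glance" examples as the expected values.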

Acknowledgements

License

khmerspeech is distributed under the Apache License 2.0. See the LICENSE file for the full text.

