khmerspeech

KhmerSpeech is a text-normalization toolkit tailored for Khmer speech applications. It provides a set of focused processors that clean raw text and verbalize numbers, currencies, dates, URLs, and other tokens that regularly appear in transcripts.

This project is heavily inspired by, and builds on, the open-source tha repository created by Seanghay Yat; the modules here adapt his original work for Khmer-specific speech use cases.

Requirements

  • Python 3.8+ (tested with CPython)
  • Bundled processors rely on regex, phonenumbers, urlextract, and ftfy. These are installed automatically when you install the package.

Installation

PyPI

Version 0.1.0 is published on PyPI. Install it with:

pip install khmerspeech

From source

You can work directly from the repository today:

git clone https://github.com/MetythornPenn/khmerspeech
cd khmerspeech

# (optional) create a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate

pip install -e .

This will install the package in editable mode along with its core dependencies.

Quick start

All processors live under the khmerspeech namespace. Compose them to fit your pipeline:

from khmerspeech import (
  normalize,
  datetime as km_datetime,
  phone_numbers,
  currency,
  cardinals,
  decimals,
  urls,
  dict_verbalize,
)

raw = "010123123 គិតថ្លៃ $100.25 នៅថ្ងៃទី 2024-01-02 វេលា 10:23AM ចូលតាម https://google.com.kh"

# 1. Normalize Khmer punctuation / Unicode issues.
clean = normalize.processor(raw)

# 2. Handle structured tokens.
clean = phone_numbers.processor(clean, chunk_size=3)
clean = km_datetime.date_processor(clean)
clean = km_datetime.time_processor(clean)
clean = currency.processor(clean)
clean = urls.processor(clean)

# 3. Expand dictionary-driven spellings / units.
clean = dict_verbalize(clean)

print(clean)
# 0▁10▁123▁123 គិតថ្លៃ មួយរយដុល្លារ▁ម្ភៃប្រាំសេន នៅថ្ងៃទី 2024 01 02 វេលា 10 23▁A▁M ចូលតាម google dot com dot k▁h

# Standalone helpers
print(cardinals.processor("1234"))   # មួយពាន់▁ពីររយ▁សាមសិបបួន
print(decimals.processor("-123.45")) # ដក▁មួយរយ▁ម្ភៃបី▁ចុច▁សែសិបប្រាំ
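
Because every processor takes a string and returns a string, the steps above can be folded into one reusable callable. The following is a minimal sketch of that idea, not part of khmerspeech: compose, strip_zero_width, and collapse_spaces are hypothetical names, and the two stand-in functions only imitate the shape of a processor so the example runs without the package installed.

```python
from functools import reduce
from typing import Callable, Iterable

# A processor is assumed to be any callable that maps text to text.
Processor = Callable[[str], str]


def compose(processors: Iterable[Processor]) -> Processor:
    """Return one function that applies each processor in order."""
    steps = list(processors)

    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)

    return run


# Stand-ins for illustration only; they are NOT khmerspeech functions.
def strip_zero_width(text: str) -> str:
    """Remove zero-width spaces (U+200B)."""
    return text.replace("\u200b", "")


def collapse_spaces(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


pipeline = compose([strip_zero_width, collapse_spaces])
print(pipeline("Hello\u200b   world "))  # Hello world
```

With the real package, a step that needs keyword arguments, such as phone_numbers.processor with chunk_size=3, could be wrapped in functools.partial before being passed to compose.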

Processors at a glance

| Module / helper | What it does | Example |
| --- | --- | --- |
| normalize.processor | Unicode cleanup, punctuation collapsing, whitespace normalization | "មិន\u200bឲ្យ" → "មិនឱ្យ" |
| datetime.date_processor, datetime.time_processor | Normalize numeric dates and clock times (AM/PM verbalized) | "2024-01-02" → "2024 01 02" |
| phone_numbers.processor | Chunk Khmer phone numbers while keeping carrier prefixes | "010123123" → "0▁10▁123▁123" |
| currency.processor | Verbalize USD / KHR amounts (supports $, USD, រៀល) | "$100.01" → "មួយរយដុល្លារ▁មួយសេន" |
| cardinals.processor | Convert integers to Khmer wording | "1234" → "មួយពាន់▁ពីររយ▁សាមសិបបួន" |
| decimals.processor | Render decimal numbers with "ចុច/ក្បៀស" markers | "123.001" → "មួយរយ▁ម្ភៃបី▁ចុច▁សូន្យ▁សូន្យ▁មួយ" |
| ordinals.processor | Turn English ordinals (1st, 5th, …) into Khmer | "5th" → "ទី▁ប្រាំ" |
| urls.processor | Verbalize URLs/emails, normalizing domain suffixes | "google.com.kh" → "google dot com dot k▁h" |
| hashtags.processor | Drop Khmer or Latin hashtags inline | "Hello #ពិសោធន៍" → "Hello " |
| ascii_lines.processor | Remove ASCII dividers / ruler lines | "--- title ---" → " title " |
| license_plate.processor | Reformat Cambodian license plates with syllable separators | "1A 1234" → "1 A 12▁34" |
| parenthesis.processor | Remove text inside () or [] | "Hello (secret) world" → "Hello world" |
| repeater.processor | Expand the Khmer iteration mark (ៗ); accepts custom tokenizers | "បន្តិចម្ដងៗ" → "បន្តិចម្ដង▁បន្តិចម្ដង" |
| punctuations.processor | Collapse repeated sentence-ending punctuation and enforce spacing | "។។។" → "។ " |
| dict_verbalize / dictionary.processor | Apply spelling, verbatim, and measurement-unit replacements from dict/ TSVs | "10 kg" → "10▁គីឡូក្រាម" |

Each module is intentionally tiny—inspect the source under khmerspeech/ if you need to tweak behaviour or build bespoke processors.
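
As a rough illustration of what a bespoke processor might look like, the sketch below reproduces the phone-number example from the table using only the standard library. It is an assumption about the general shape (a regex match plus a formatting function), not the code in khmerspeech/phone_numbers; chunk_digits, chunk_phone_numbers, and the 9-to-10-digit pattern are all hypothetical.

```python
import re

SEPARATOR = "▁"  # the separator character used in the README examples

# Assumption: a local number is a leading 0 followed by 8 or 9 more digits.
PHONE_RE = re.compile(r"(?<!\d)0\d{8,9}(?!\d)")


def chunk_digits(digits: str, chunk_size: int = 3) -> str:
    """Split into leading 0, two-digit prefix, then fixed-size chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not digits.startswith("0") or len(digits) < 3:
        return digits  # not shaped like a local number; leave untouched
    head, prefix, rest = digits[0], digits[1:3], digits[3:]
    chunks = [rest[i:i + chunk_size] for i in range(0, len(rest), chunk_size)]
    return SEPARATOR.join([head, prefix, *chunks])


def chunk_phone_numbers(text: str, chunk_size: int = 3) -> str:
    """Rewrite every phone-like digit run found in the text."""
    return PHONE_RE.sub(lambda m: chunk_digits(m.group(0), chunk_size), text)


print(chunk_phone_numbers("010123123"))  # 0▁10▁123▁123
```

Keeping the matching (PHONE_RE) separate from the formatting (chunk_digits) makes each half easy to test in isolation.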

Dictionary resources

The dict/ folder ships pragmatic TSVs containing common spellings, pronunciations, and measurement units. dict_verbalize loads them at runtime:

from khmerspeech import dict_verbalize

text = "100 km ឬ 10 kg"
print(dict_verbalize(text))
# 100▁គីឡូម៉ែត្រ ឬ 10▁គីឡូក្រាម

Add new terms by editing the TSV files and re-running your pipeline—no extra build step is required.
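
To make the TSV-driven approach concrete, here is a minimal sketch of how a unit table could be loaded and applied. The two-column layout (token, then Khmer reading), the in-memory SAMPLE_TSV, and the function names load_units and verbalize_units are assumptions for illustration; the actual files in dict/ and the internals of dict_verbalize may differ.

```python
import csv
import io
import re

# Hypothetical two-column layout: token <TAB> Khmer reading.
SAMPLE_TSV = "km\tគីឡូម៉ែត្រ\nkg\tគីឡូក្រាម\n"


def load_units(handle) -> dict:
    """Read token/reading pairs from a tab-separated file object."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def verbalize_units(text: str, units: dict) -> str:
    """Replace '<number> <unit>' with '<number>▁<Khmer reading>'."""
    if not units:
        return text
    # Longest tokens first so a short unit never shadows a longer one.
    alternation = "|".join(
        re.escape(unit) for unit in sorted(units, key=len, reverse=True)
    )
    pattern = re.compile(rf"(\d+)\s*({alternation})\b")
    return pattern.sub(lambda m: f"{m.group(1)}▁{units[m.group(2)]}", text)


units = load_units(io.StringIO(SAMPLE_TSV))
print(verbalize_units("100 km ឬ 10 kg", units))
```

Because the table is read when the function is called, editing the TSV changes the output on the next run, which matches the no-build-step workflow described above.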

Development

After installing in editable mode you can run the included assertions to spot regressions:

python tests.py

Code contributions are welcome. Please format new code with ruff or black, keep functions focused, and add tests for new processors.
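
A test for a new processor can be a plain list of input/expected pairs checked with assert. The sketch below shows one possible layout; collapse_full_stops is a stand-in written for this example, not the library's punctuations.processor, and the structure of the real tests.py is not shown in this README.

```python
import re


def collapse_full_stops(text: str) -> str:
    """Stand-in processor: collapse runs of the Khmer full stop into one plus a space."""
    return re.sub(r"។+\s*", "។ ", text)


# (input, expected output) pairs; add a row for each behaviour you rely on.
CASES = [
    ("។។។", "។ "),
    ("ok។។ yes", "ok។ yes"),
    ("no punctuation", "no punctuation"),
]


def run_cases() -> int:
    """Check every case and return how many were verified."""
    for raw, expected in CASES:
        actual = collapse_full_stops(raw)
        assert actual == expected, f"{raw!r}: expected {expected!r}, got {actual!r}"
    return len(CASES)


if __name__ == "__main__":
    print(f"{run_cases()} cases passed")
```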

Acknowledgements

Credit goes to Seanghay Yat for the original tha project, released under the MIT license. KhmerSpeech reuses and extends his processors to fit Khmer speech-normalization needs.

License

khmerspeech is distributed under the Apache License 2.0. See the LICENSE file for the full text.
