
Fast multilingual text-to-phoneme converter for South East Asian languages.


🦭 SEA-G2P


Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
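
For corpora too large to hold in one list, the batch call can be fed in fixed-size chunks. The sketch below is one way to do that; `iter_batches` and `phonemize_stream` are hypothetical helpers, not part of sea-g2p, and the code assumes that `run` returns one result per input, in input order, when given a list.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def iter_batches(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def phonemize_stream(
    run: Callable[[list[str]], list[str]],
    lines: Iterable[str],
    size: int = 10_000,
) -> Iterator[str]:
    """Apply a batch function chunk by chunk and yield results one at a time.

    Assumption: `run` returns one output per input, in the same order.
    """
    for batch in iter_batches(lines, size):
        yield from run(batch)
```

With sea-g2p installed, `phonemize_stream(pipeline.run, open("corpus.txt", encoding="utf-8"))` would be the intended call, where `corpus.txt` is a placeholder file name. The chunk size is a trade-off between memory use and per-call overhead and is worth tuning per workload.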

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# A list input is processed in parallel automatically
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
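
The normalized output above marks English spans with `<en>...</en>` tags. When inspecting or post-processing that intermediate text, it can help to split it into Vietnamese and English chunks. The sketch below does this with a regular expression; `split_language_spans` is a hypothetical helper and makes no claim about how sea-g2p parses the tags internally.

```python
import re

# Matches one <en>...</en> span; non-greedy so adjacent spans stay separate.
_EN_SPAN = re.compile(r"<en>(.*?)</en>", re.DOTALL)


def split_language_spans(text: str) -> list[tuple[str, str]]:
    """Split normalized text into ("vi" or "en", chunk) pairs using <en> tags."""
    spans: list[tuple[str, str]] = []
    pos = 0
    for match in _EN_SPAN.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            spans.append(("vi", before))
        inner = match.group(1).strip()
        if inner:
            spans.append(("en", inner))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        spans.append(("vi", tail))
    return spans


if __name__ == "__main__":
    sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
    for lang, chunk in split_language_spans(sample):
        print(lang, chunk)
```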

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled abi3 wheels for CPython 3.10+ on Windows (x86, x86-64), Linux (glibc 2.17+, x86-64 and ARM64), and macOS (x86-64 and ARM64).
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.
  • Markup tags: Wrap a span to control how it is read:
    • <en>...</en>: hand the content to the English phonemizer (e.g. <en>hello</en>).
    • <math>...</math>: read the content as a math formula. Variable clusters are spelled letter by letter and operators/symbols are voiced, while function names (sin, cos, log, lim, ...) are preserved. For example, <math>b² - 4ac</math> is read as "bê bình phương trừ bốn a xê", and <math>∫f dx</math> as "tích phân ép đê ích".

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module       Implementation          Throughput
Normalizer   Rust Core (Parallel)    ~41,000 sentences/s
G2P          Rust Core (Parallel)    ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
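
These figures depend on hardware, so it may be useful to measure them locally. The sketch below times any batch function and reports sentences per second; `measure_throughput` and `combined_throughput` are hypothetical helpers, not part of sea-g2p. The second helper is a sanity check under the assumption that the two stages run back to back, in which case per-sentence times add up and the combined rate comes out close to the reported pipeline figure.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    fn: Callable[[list[str]], object],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed sentences/second for `fn` over `texts`."""
    batch = list(texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(batch)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return len(batch) / best if best > 0 else float("inf")


def combined_throughput(*stage_rates: float) -> float:
    """Rate of stages run back to back: per-sentence times add up."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


if __name__ == "__main__":
    # Stand-in workload; with sea-g2p installed, pass pipeline.run instead.
    sample = ["Giá SP500 hôm nay là 4.200,5 điểm."] * 10_000
    rate = measure_throughput(lambda batch: [s.lower() for s in batch], sample)
    print(f"{rate:,.0f} sentences/s")
    print(round(combined_throughput(41_000, 415_000)))  # 37314
```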

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON or SQLite lexicon into RAM, SEA-G2P stores it in a custom binary format (.bin) that is mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, every unique string (word or phoneme) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, so each lookup is an O(log n) binary search directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
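
To make the three ideas concrete, here is a toy Python sketch that combines them. The file layout (a header, an offset table, sorted ID pairs, then the string pool) is invented for illustration and is not the actual .bin specification; `build_lexicon` and `MmapLexicon` are hypothetical names. The sample entries are taken from the example output in the Usage section.

```python
from __future__ import annotations

import mmap
import struct
import tempfile
from pathlib import Path


def build_lexicon(entries: dict[str, str], path: Path) -> None:
    """Write header, string offsets, sorted (word_id, phoneme_id) pairs, pool."""
    pool: list[bytes] = []
    ids: dict[bytes, int] = {}

    def intern(s: str) -> int:
        # String pooling: each unique string is stored once, referenced by ID.
        raw = s.encode("utf-8")
        if raw not in ids:
            ids[raw] = len(pool)
            pool.append(raw)
        return ids[raw]

    # Sort by UTF-8 bytes so the reader can compare raw bytes directly.
    ordered = sorted(entries.items(), key=lambda kv: kv[0].encode("utf-8"))
    pairs = [(intern(word), intern(phonemes)) for word, phonemes in ordered]

    offsets = [0]
    for raw in pool:
        offsets.append(offsets[-1] + len(raw))

    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(pairs), len(pool)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for word_id, phoneme_id in pairs:
            f.write(struct.pack("<II", word_id, phoneme_id))
        f.write(b"".join(pool))


class MmapLexicon:
    """Read-only lookup over the toy format without loading it into a dict."""

    def __init__(self, path: Path) -> None:
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n_entries, n_strings = struct.unpack_from("<II", self._mm, 0)
        self._offsets_at = 8
        self._entries_at = self._offsets_at + 4 * (n_strings + 1)
        self._pool_at = self._entries_at + 8 * self._n_entries

    def _string(self, string_id: int) -> bytes:
        at = self._offsets_at + 4 * string_id
        start, end = struct.unpack_from("<II", self._mm, at)
        return self._mm[self._pool_at + start : self._pool_at + end]

    def lookup(self, word: str) -> str | None:
        """Binary search over the sorted entry table: O(log n) comparisons."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            at = self._entries_at + 8 * mid
            word_id, phoneme_id = struct.unpack_from("<II", self._mm, at)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self) -> None:
        self._mm.close()
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lexicon_path = Path(tmp) / "toy.bin"
        build_lexicon({"giá": "zˈaːɜ", "hôm": "hˈom", "nay": "nˈaj"}, lexicon_path)
        lexicon = MmapLexicon(lexicon_path)
        print(lexicon.lookup("hôm"))
        lexicon.close()
```

Opening the file only reads the 8-byte header; pages are faulted in by the operating system as lookups touch them, which is what keeps startup time and resident memory low in a design like this.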

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
