Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
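
To see how batch size affects speed on your own machine, a small timing helper can be wrapped around any batch function. The sketch below is illustrative only: `sentences_per_second` and `stand_in_run` are hypothetical names, and the stand-in function merely lowercases its input so the example runs without sea-g2p installed. It assumes the batch call returns one output per input; with the package installed, `pipeline.run` would presumably take the place of the stand-in.

```python
import time
from typing import Callable, Sequence


def sentences_per_second(
    run_batch: Callable[[Sequence[str]], Sequence[str]],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput of run_batch over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        results = run_batch(texts)
        elapsed = time.perf_counter() - start
        if len(results) != len(texts):
            raise ValueError("expected one output per input text")
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer on tiny batches.
    return len(texts) / max(best, 1e-9)


def stand_in_run(batch: Sequence[str]) -> list[str]:
    """Hypothetical placeholder for a real batch call such as pipeline.run."""
    return [text.lower() for text in batch]


texts = ["Hãy gửi email đến support@example.com."] * 1000
rate = sentences_per_second(stand_in_run, texts)
print(f"{rate:,.0f} sentences/s")
```

Taking the best of several repeats reduces noise from warm-up and other processes; the number printed depends entirely on the machine and on the function being timed.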

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Lists are processed in parallel automatically
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
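
The normalized strings above mark English spans with <en>...</en>. If you want to inspect or route those spans yourself, one option is to split on the tags. This is a minimal sketch, not how SEA-G2P handles the tags internally: `split_language_spans` is a hypothetical helper, and it assumes tags are well formed and not nested.

```python
import re

# Matches one <en>...</en> span; non-greedy so adjacent spans stay separate.
EN_SPAN = re.compile(r"<en>(.*?)</en>", re.DOTALL)


def split_language_spans(text: str, default_lang: str = "vi") -> list[tuple[str, str]]:
    """Split text into (language, chunk) pairs using <en> tags as boundaries."""
    segments: list[tuple[str, str]] = []
    position = 0
    for match in EN_SPAN.finditer(text):
        before = text[position:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        inside = match.group(1).strip()
        if inside:
            segments.append(("en", inside))
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
for lang, chunk in split_language_spans(sample):
    print(lang, repr(chunk))
```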

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.
  • Markup Tags: Wrap a span in a tag to control how it is read:
    • <en>...</en>: sends the content to the English phonemizer (e.g. <en>hello</en>).
    • <math>...</math>: reads the content as a math formula. Variable clusters are spelled letter by letter and operators and symbols are voiced, while function names (sin, cos, log, lim, ...) are preserved. Examples: <math>b² - 4ac</math> → "bê bình phương trừ bốn a xê"; <math>∫f dx</math> → "tích phân ép đê ích".
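
To make the <math> reading rule concrete, here is a toy sketch that reproduces the two examples above. It is an illustration, not the library's normalizer: `READINGS` and `read_math` are hypothetical names, the table contains only the readings that appear in those two examples, and digits are read one at a time, which a real implementation would not do for multi-digit numbers.

```python
import re

# Readings taken only from the two examples above; a real table would be far larger.
READINGS = {
    "a": "a", "b": "bê", "c": "xê", "d": "đê", "f": "ép", "x": "ích",
    "4": "bốn", "²": "bình phương", "-": "trừ", "∫": "tích phân",
}
FUNCTION_NAMES = ("sin", "cos", "log", "lim")

# Try function names first, then fall back to single non-space characters.
TOKEN_RE = re.compile("|".join(FUNCTION_NAMES) + r"|\S")


def read_math(expr: str) -> str:
    """Spell a formula symbol by symbol, leaving function names untouched."""
    words = []
    for token in TOKEN_RE.findall(expr):
        # Function names and symbols missing from the toy table pass through unchanged.
        words.append(READINGS.get(token, token))
    return " ".join(words)


print(read_math("b² - 4ac"))
print(read_math("∫f dx"))
```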

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module       Implementation          Throughput
Normalizer   Rust Core (Parallel)    ~41,000 sentences/s
G2P          Rust Core (Parallel)    ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
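
As a rough sanity check on these figures: if the two stages run one after the other on the same batch (an assumption; the benchmark setup is not described in more detail here), the per-sentence times add up, so the combined rate is 1 / (1/r1 + 1/r2). The helper name `combined_throughput` below is hypothetical.

```python
def combined_throughput(*stage_rates: float) -> float:
    """Throughput of stages run back to back: per-sentence times add up."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


normalizer_rate = 41_000   # sentences/s, Normalizer figure reported above
g2p_rate = 415_000         # sentences/s, G2P figure reported above

estimate = combined_throughput(normalizer_rate, g2p_rate)
print(f"{estimate:,.0f} sentences/s")
```

The estimate comes out near 37,300 sentences/s, in line with the reported ~37,000, and it shows that the Normalizer dominates the total: the G2P stage is roughly ten times faster.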

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, the dictionary is stored in a custom binary format (.bin) that is mapped directly into memory. This gives near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, every unique string (word or phoneme sequence) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, so each lookup is an O(log n) binary search performed directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
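
The three ideas above can be sketched in a few lines of Python. This is a toy layout invented for illustration, not the actual .bin specification (see src/g2p/mod.rs for that): the header fields, table order, and the names `build_lexicon` and `MappedLexicon` are all assumptions. The sample entries are taken from the phoneme output shown in the Usage section.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<II")  # hypothetical header: entry count, pooled-string count
PAIR = struct.Struct("<II")    # two 4-byte little-endian unsigned integers


def build_lexicon(lexicon: dict[str, str], path: str) -> None:
    """Write header, string table, sorted entry table, then the string blob."""
    pool: dict[str, int] = {}
    blobs: list[bytes] = []

    def intern_string(text: str) -> int:
        # String pooling: each unique string is stored once and gets a 4-byte ID.
        if text not in pool:
            pool[text] = len(blobs)
            blobs.append(text.encode("utf-8"))
        return pool[text]

    # Sort by UTF-8 bytes so lookups can compare raw bytes without decoding.
    words = sorted(lexicon, key=lambda w: w.encode("utf-8"))
    entries = [(intern_string(w), intern_string(lexicon[w])) for w in words]

    with open(path, "wb") as f:
        f.write(HEADER.pack(len(entries), len(blobs)))
        offset = 0
        for blob in blobs:
            f.write(PAIR.pack(offset, len(blob)))  # (offset, length) into the blob
            offset += len(blob)
        for word_id, phoneme_id in entries:
            f.write(PAIR.pack(word_id, phoneme_id))
        f.write(b"".join(blobs))


class MappedLexicon:
    """Read-only lookup over the memory-mapped file; nothing is parsed up front."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._count, string_count = HEADER.unpack_from(self._mm, 0)
        self._strings_at = HEADER.size
        self._entries_at = self._strings_at + string_count * PAIR.size
        self._blob_at = self._entries_at + self._count * PAIR.size

    def _string(self, string_id: int) -> bytes:
        offset, length = PAIR.unpack_from(self._mm, self._strings_at + string_id * PAIR.size)
        start = self._blob_at + offset
        return self._mm[start:start + length]

    def lookup(self, word: str):
        """Binary search over the sorted entry table; returns None if absent."""
        key = word.encode("utf-8")
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = PAIR.unpack_from(self._mm, self._entries_at + mid * PAIR.size)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self) -> None:
        self._mm.close()
        self._file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        build_lexicon({"nay": "nˈaj", "hôm": "hˈom", "năm": "nˈam"}, path)
        lexicon = MappedLexicon(path)
        print(lexicon.lookup("hôm"), lexicon.lookup("nay"), lexicon.lookup("missing"))
        lexicon.close()  # unmap before the temporary directory is removed
```

Because the file is mapped rather than read, opening it costs roughly the same regardless of dictionary size, and only the pages touched by a lookup are brought into memory by the operating system.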

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
