Skip to main content

Fast multilingual text-to-phoneme converter for South East Asian languages.

Project description

🦭 SEA-G2P

image

Fast multilingual text-to-phoneme converter for South East Asian languages.

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Automatic parallel processing when list is passed
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module Implementation Throughput
Normalizer Rust Core (Parallel) ~41,000 sentences/s
G2P Rust Core (Parallel) ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a huge JSON/SQLite into RAM, we use a custom binary format (.bin) mapped directly into memory. This allows near-instant startup and extremely low memory overhead.
  • String Pooling: To minimize file size, all unique strings (words and phonemes) are stored once in a global string pool and referenced by 4-byte IDs.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookup speeds directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sea_g2p-0.7.6.tar.gz (20.1 MB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sea_g2p-0.7.6-cp310-abi3-win_amd64.whl (21.2 MB view details)

Uploaded CPython 3.10+Windows x86-64

sea_g2p-0.7.6-cp310-abi3-win32.whl (21.1 MB view details)

Uploaded CPython 3.10+Windows x86

sea_g2p-0.7.6-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.5 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ x86-64

sea_g2p-0.7.6-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.5 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ ARM64

sea_g2p-0.7.6-cp310-abi3-macosx_11_0_arm64.whl (21.3 MB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

sea_g2p-0.7.6-cp310-abi3-macosx_10_12_x86_64.whl (21.4 MB view details)

Uploaded CPython 3.10+macOS 10.12+ x86-64

File details

Details for the file sea_g2p-0.7.6.tar.gz.

File metadata

  • Download URL: sea_g2p-0.7.6.tar.gz
  • Upload date:
  • Size: 20.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.13.3

File hashes

Hashes for sea_g2p-0.7.6.tar.gz
Algorithm Hash digest
SHA256 d0b17fe023b5ae528027774b1189abb5540fc6c008ad383ed253fd5f21675746
MD5 990b98c06cac23c06e636574b8de6ad0
BLAKE2b-256 f8ab662b0b3513e1079d0377fad25fbb6b8b78608e4a682c190222a9b89b6516

See more details on using hashes here.

File details

Details for the file sea_g2p-0.7.6-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: sea_g2p-0.7.6-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 21.2 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.13.3

File hashes

Hashes for sea_g2p-0.7.6-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 5447ce7b7fe769aca3b9e941626c55e2259327120a4266cfc1fa36094bfe361d
MD5 0dc2aa1a5295c86cb64993a78633117c
BLAKE2b-256 c727a8857c2f1a6db407d6d119364967fef74136da94fde96e95ed4f006ab345

See more details on using hashes here.

File details

Details for the file sea_g2p-0.7.6-cp310-abi3-win32.whl.

File metadata

  • Download URL: sea_g2p-0.7.6-cp310-abi3-win32.whl
  • Upload date:
  • Size: 21.1 MB
  • Tags: CPython 3.10+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.13.3

File hashes

Hashes for sea_g2p-0.7.6-cp310-abi3-win32.whl
Algorithm Hash digest
SHA256 74f2fd365796b0e6693ee8ddc73d8478dacaa37695e1ced159265757a3d5f685
MD5 a68f2b34a78a0db5a82c786446c10772
BLAKE2b-256 a3d56e5276170509b8e5da5c6b2ff457ffc0cac540415437d5656e9b1e70a997

See more details on using hashes here.

File details

Details for the file sea_g2p-0.7.6-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for sea_g2p-0.7.6-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e11cf603224c2146eb12b86c5d46056e27d1a7f97eab1298ed035be495c24804
MD5 282ca548b7f0ef89b2cb39147f09e760
BLAKE2b-256 a7c2b4bbf5cdc8dea3dde546fc761437108cb9b1db1c0a180c4209e4759fd8e8

See more details on using hashes here.

File details

Details for the file sea_g2p-0.7.6-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for sea_g2p-0.7.6-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 6ac90ff89dff3ac729e6340629364192afda4f6fde50621a259c87a19a1f2664
MD5 994496b315b64f57ac92b6c990e47bca
BLAKE2b-256 77e15e4faee0da3663ada997a7156359b49b3205b2a6773909ff2d77f0de205d

See more details on using hashes here.

File details

Details for the file sea_g2p-0.7.6-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for sea_g2p-0.7.6-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7a950dc04ae4ed1f5d87b626a7eb6d6baafdcf0e1e786f5ffb9c3cc422e10889
MD5 67b989189468fd2bb671519d2bf84ef9
BLAKE2b-256 7b915b6a04040748d6369582f8f23e1d28df50aec91fddf50f9dc28e06f7b764

See more details on using hashes here.

File details

Details for the file sea_g2p-0.7.6-cp310-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sea_g2p-0.7.6-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 f25d0b107f19cbdaf090893045a111b68cd0be231f884c4bc511291512bf2010
MD5 f0e6c1bc2d02328ebb434c01eb8bd424
BLAKE2b-256 2e6d34e348d5ebea9e3cc1824d66b344fe25d0f31dc3a8048f9cc0e31a130d56

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page