Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
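
To check batch throughput on your own machine, a small timing helper can wrap any batch call. This is a minimal sketch: `sentences_per_second` and `stand_in_run` are hypothetical names, not part of sea-g2p, and the helper assumes the batch call returns one result per input sentence. The stand-in function only exists so the snippet runs without the package installed.

```python
import time
from typing import Callable, Sequence


def sentences_per_second(run: Callable[[list[str]], Sequence[str]], texts: list[str]) -> float:
    """Time one batch call and return sentences processed per second."""
    start = time.perf_counter()
    results = run(texts)
    elapsed = time.perf_counter() - start
    if len(results) != len(texts):
        raise ValueError("expected one result per input sentence")
    # Guard against a zero reading on very coarse timers.
    return len(texts) / max(elapsed, 1e-9)


def stand_in_run(texts: list[str]) -> list[str]:
    # Placeholder for a real batch call; it only lowercases the input.
    return [t.lower() for t in texts]


batch = ["Hãy gửi email đến support@example.com."] * 1000
rate = sentences_per_second(stand_in_run, batch)
print(f"{rate:,.0f} sentences/s")
```

Passing `pipeline.run` in place of `stand_in_run` should measure the real pipeline, provided it behaves as assumed above.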

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# A list input is processed in parallel automatically
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
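
The normalized output above wraps English spans in `<en>…</en>` tags. If you need to inspect or route those spans yourself, a sketch like the following splits a normalized string into language-tagged segments. `split_segments` is a hypothetical helper, not part of sea-g2p, and it assumes the tags are never nested.

```python
import re

_EN_TAG = re.compile(r"<en>(.*?)</en>")


def split_segments(normalized: str, default_lang: str = "vi") -> list[tuple[str, str]]:
    """Split normalized text into (language, text) pairs using <en> tags."""
    segments = []
    pos = 0
    for match in _EN_TAG.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


segments = split_segments(
    "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
)
for lang, text in segments:
    print(lang, text)
```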

Features

  • Fast: Core engine rewritten in Rust, with lookups against a memory-mapped binary file.
  • Multithreading: Batch inputs are processed in parallel automatically using Rayon (Rust).
  • Zero Dependency: Pre-compiled wheels are available for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module     | Implementation       | Throughput
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P        | Rust Core (Parallel) | ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
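
The pipeline figure is close to what you would expect if the two stages run one after the other on each batch, so that their per-sentence times add. The quick check below assumes strictly sequential stages and no other overhead. That is an assumption for illustration; it is not a description of how the pipeline actually schedules work.

```python
# Reported module throughputs, in sentences per second.
normalizer_rate = 41_000
g2p_rate = 415_000

# With sequential stages, time per sentence is the sum of the stage times.
seconds_per_sentence = 1 / normalizer_rate + 1 / g2p_rate
pipeline_rate = 1 / seconds_per_sentence

print(f"{pipeline_rate:,.0f} sentences/s")
```

The estimate comes out at roughly 37,300 sentences/s, in line with the reported ~37,000. This suggests the normalizer is the dominant cost in the pipeline.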

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, SEA-G2P uses a custom binary format (.bin) mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, every unique string (word or phoneme sequence) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookups directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
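
To make the three ideas above concrete, here is a minimal Python sketch of a memory-mapped lexicon with a string pool and binary search. The file layout, `build`, and `MappedLexicon` are all hypothetical and chosen for illustration; they are not the actual SEA-G2P .bin format or its Rust implementation. The toy entries are taken from the sample outputs earlier in this document.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout, NOT the real SEA-G2P .bin format:
#   header : n_strings (u32), n_entries (u32)
#   offsets: (n_strings + 1) x u32 byte offsets into the blob
#   entries: n_entries x (word_id u32, phoneme_id u32), sorted by word bytes
#   blob   : UTF-8 bytes of every unique string, stored once


def build(lexicon: dict[str, str]) -> bytes:
    # Sort entries by encoded word so lookups can binary-search them.
    pairs = sorted((w.encode("utf-8"), p.encode("utf-8")) for w, p in lexicon.items())
    pool = sorted({s for pair in pairs for s in pair})  # each string stored once
    ids = {s: i for i, s in enumerate(pool)}
    offsets = [0]
    for s in pool:
        offsets.append(offsets[-1] + len(s))
    out = struct.pack("<II", len(pool), len(pairs))
    out += struct.pack(f"<{len(offsets)}I", *offsets)
    for word, phonemes in pairs:
        out += struct.pack("<II", ids[word], ids[phonemes])  # 4-byte IDs
    return out + b"".join(pool)


class MappedLexicon:
    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        # Map the file read-only; pages are loaded lazily by the OS.
        self._buf = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        n_strings, self._n_entries = struct.unpack_from("<II", self._buf, 0)
        self._offsets_at = 8
        self._entries_at = self._offsets_at + 4 * (n_strings + 1)
        self._blob_at = self._entries_at + 8 * self._n_entries

    def _string(self, string_id: int) -> bytes:
        # Two adjacent offsets give the start and end of the pooled string.
        start, end = struct.unpack_from("<II", self._buf, self._offsets_at + 4 * string_id)
        return self._buf[self._blob_at + start:self._blob_at + end]

    def lookup(self, word: str):
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:  # binary search over the sorted entries
            mid = (lo + hi) // 2
            word_id, phoneme_id = struct.unpack_from("<II", self._buf, self._entries_at + 8 * mid)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None  # word not in the lexicon

    def close(self) -> None:
        self._buf.close()
        self._file.close()


toy = {"hôm": "hˈom", "nay": "nˈaj", "email": "ˈiːmeɪl", "com": "kˈɔm"}
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(build(toy))
    path = f.name

lexicon = MappedLexicon(path)
found = lexicon.lookup("email")
missing = lexicon.lookup("xyz")
lexicon.close()
os.remove(path)
print(found, missing)
```

Because the entries are fixed-width and sorted, each lookup touches only O(log n) records and the strings they reference, so nothing has to be parsed or copied into Python objects at startup.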

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    

Download files

Source Distribution

  • sea_g2p-0.7.10.tar.gz (20.1 MB): Source

Built Distributions

  • sea_g2p-0.7.10-cp310-abi3-win_amd64.whl (21.2 MB): CPython 3.10+, Windows x86-64
  • sea_g2p-0.7.10-cp310-abi3-win32.whl (21.1 MB): CPython 3.10+, Windows x86
  • sea_g2p-0.7.10-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.5 MB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • sea_g2p-0.7.10-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.5 MB): CPython 3.10+, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.7.10-cp310-abi3-macosx_11_0_arm64.whl (21.3 MB): CPython 3.10+, macOS 11.0+, ARM64
  • sea_g2p-0.7.10-cp310-abi3-macosx_10_12_x86_64.whl (21.4 MB): CPython 3.10+, macOS 10.12+, x86-64
