🦭 SEA-G2P

Fast multilingual text-to-phoneme converter for South East Asian languages.

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Passing a list triggers automatic parallel processing
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
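
Because the two modules are separate, a custom step can be placed between normalization and phoneme conversion. The following is a minimal sketch, not part of SEA-G2P: `phonemize_batch` and its `post_normalize` hook are hypothetical names, and the only library calls assumed are `normalize(list)` and `convert(list)` as shown above.

```python
from typing import Callable, Iterable, List, Optional, Protocol


class SupportsNormalize(Protocol):
    def normalize(self, texts: List[str]) -> List[str]: ...


class SupportsConvert(Protocol):
    def convert(self, texts: List[str]) -> List[str]: ...


def phonemize_batch(
    texts: Iterable[str],
    normalizer: SupportsNormalize,
    g2p: SupportsConvert,
    post_normalize: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Normalize a batch, optionally rewrite each normalized string, then convert."""
    batch = list(texts)
    normalized = normalizer.normalize(batch)
    if post_normalize is not None:
        # Hypothetical hook, e.g. for project-specific word substitutions.
        normalized = [post_normalize(s) for s in normalized]
    return g2p.convert(normalized)


# Intended use with the objects created above:
# phonemes = phonemize_batch(texts, normalizer, g2p)
```

Any object exposing these two methods works, which also makes the chain easy to test with stand-in classes.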

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.
  • Markup tags: Wrap a span in a tag to control how it is read:
    • <en>...</en>: route the content to the English phonemizer (e.g. <en>hello</en>).
    • <math>...</math>: read the content as a math formula. Variable clusters are spelled letter by letter and operators/symbols are voiced, while function names (sin, cos, log, lim, ...) are preserved. Examples: <math>b² - 4ac</math> → "bê bình phương trừ bốn a xê", <math>∫f dx</math> → "tích phân ép đê ích".
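
To illustrate how tagged spans separate from the surrounding text, here is a small sketch in Python. It is not the library's parser (that lives in the Rust core and may treat nesting or malformed tags differently); `split_markup` and the `"vi"` default label are assumptions made for this example.

```python
import re
from typing import List, Tuple

# Matches <en>...</en> or <math>...</math>; the backreference \1 forces
# the closing tag to match the opening one.
_TAG_RE = re.compile(r"<(en|math)>(.*?)</\1>", re.DOTALL)


def split_markup(text: str, default: str = "vi") -> List[Tuple[str, str]]:
    """Split text into (label, content) segments, one per tagged or untagged span."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for m in _TAG_RE.finditer(text):
        if m.start() > pos:
            # Untagged text before this tag keeps the default label.
            segments.append((default, text[pos:m.start()]))
        segments.append((m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(text):
        segments.append((default, text[pos:]))
    return segments


print(split_markup("nghiệm của <math>b² - 4ac</math> là <en>hello</en>"))
```

Each segment could then be dispatched to a different reader depending on its label.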

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

| Module     | Implementation       | Throughput           |
|------------|----------------------|----------------------|
| Normalizer | Rust Core (Parallel) | ~41,000 sentences/s  |
| G2P        | Rust Core (Parallel) | ~415,000 sentences/s |

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
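
One way to sanity-check these figures: if the two stages run one after the other on the same batch and nothing else adds overhead (an assumption, not something the benchmark states), the time per sentence adds up and the rates combine harmonically. The helper name `serial_throughput` is made up for this sketch.

```python
def serial_throughput(*stage_rates: float) -> float:
    """Combined rate of stages executed back to back.

    Per-sentence time is the sum of 1/rate over all stages,
    so the combined rate is the reciprocal of that sum.
    """
    if not stage_rates or any(r <= 0 for r in stage_rates):
        raise ValueError("at least one positive stage rate is required")
    return 1.0 / sum(1.0 / r for r in stage_rates)


combined = serial_throughput(41_000, 415_000)
print(f"{combined:,.0f} sentences/s")  # roughly 37,300 sentences/s
```

The result is close to the reported ~37,000 sentences/s, which suggests the normalizer dominates end-to-end cost.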

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, SEA-G2P maps a custom binary format (.bin) directly into memory. This gives near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, every unique string (words and phonemes) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted at build time, so each lookup is an O(log n) binary search directly over the memory-mapped data.
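
The three ideas above can be sketched together in a few lines of Python. This is a toy layout invented for illustration (header, offset table, sorted entries, string blob); it is not the actual .bin specification, and the names `build` and `Lexicon` are hypothetical.

```python
import mmap
import struct
import tempfile
from pathlib import Path
from typing import Dict, Optional

HEADER = struct.Struct("<II")  # n_strings, n_entries
U32 = struct.Struct("<I")
ENTRY = struct.Struct("<II")   # word_id, phoneme_id (4-byte pool IDs)


def build(path: Path, lexicon: Dict[str, str]) -> None:
    """Write the toy file: header, string offsets, sorted entries, string blob."""
    # String pool: every unique string is stored once and addressed by index.
    pool = sorted({s.encode("utf-8") for kv in lexicon.items() for s in kv})
    ids = {s: i for i, s in enumerate(pool)}
    offsets, blob = [0], b""
    for s in pool:
        blob += s
        offsets.append(len(blob))
    # Entries are pre-sorted by the word's UTF-8 bytes so lookups can bisect.
    entries = sorted((w.encode("utf-8"), p.encode("utf-8")) for w, p in lexicon.items())
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(pool), len(entries)))
        f.write(b"".join(U32.pack(o) for o in offsets))
        f.write(b"".join(ENTRY.pack(ids[w], ids[p]) for w, p in entries))
        f.write(blob)


class Lexicon:
    def __init__(self, path: Path) -> None:
        self._file = open(path, "rb")
        # Map the file read-only; pages are loaded lazily by the OS.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n_strings, self._n_entries = HEADER.unpack_from(self._mm, 0)
        self._offsets_at = HEADER.size
        self._entries_at = self._offsets_at + U32.size * (self._n_strings + 1)
        self._blob_at = self._entries_at + ENTRY.size * self._n_entries

    def _string(self, sid: int) -> bytes:
        start, end = struct.unpack_from("<II", self._mm, self._offsets_at + U32.size * sid)
        return self._mm[self._blob_at + start : self._blob_at + end]

    def lookup(self, word: str) -> Optional[str]:
        """Binary search over the sorted entries, reading straight from the map."""
        key, lo, hi = word.encode("utf-8"), 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            wid, pid = ENTRY.unpack_from(self._mm, self._entries_at + ENTRY.size * mid)
            candidate = self._string(wid)
            if candidate == key:
                return self._string(pid).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self) -> None:
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.bin"
    build(toy_path, {"giá": "zˈaːɜ", "hôm": "hˈom", "nay": "nˈaj"})
    lex = Lexicon(toy_path)
    print(lex.lookup("hôm"))  # hˈom
    print(lex.lookup("xyz"))  # None
    lex.close()
```

Opening the file only parses the 8-byte header, so startup cost does not grow with lexicon size in this sketch; the real format may differ in every detail.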

For full details on the specification, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
