🦭 SEA-G2P

Fast multilingual text-to-phoneme converter for South East Asian languages.

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
# Output: zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (parallel): pass a list of texts
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
] * 1000
results = pipeline.run(texts)
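
For corpora too large to hold in one list, the batch call above can be fed in fixed-size chunks. The sketch below is a hypothetical helper, not part of the sea_g2p API: `run_in_chunks` and `chunk_size` are names introduced here, and it assumes that the callable passed as `run` (for example `pipeline.run`) takes a list of texts and returns one result per text in the same order.

```python
from itertools import islice


def run_in_chunks(run, texts, chunk_size=10_000):
    """Yield results for an iterable of texts, one batch at a time.

    `run` is any callable that maps a list of texts to a list of results
    (assumption: pipeline.run behaves this way for list input).
    """
    iterator = iter(texts)
    while True:
        # Pull at most chunk_size texts without materializing the whole corpus.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield from run(chunk)
```

With the pipeline from the example, this would be called as `for phonemes in run_in_chunks(pipeline.run, open("corpus.txt", encoding="utf-8")): ...`, where `corpus.txt` is a placeholder file name.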

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Lists are processed in parallel automatically
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
]
normalized = normalizer.normalize(texts)
print(normalized)
# ['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
# ['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
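
In the normalized output above, English spans are wrapped in `<en>…</en>` markers. If you need to inspect or route those spans yourself, a small splitter along the following lines may help. It is a sketch based only on the tag format visible in the example output; `split_by_language` is a hypothetical helper and not part of the sea_g2p API.

```python
import re

# Matches one <en>...</en> span, as seen in the normalizer's example output.
_EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized, default_lang="vi"):
    """Split normalized text into (language, text) segments."""
    segments = []
    cursor = 0
    for match in _EN_SPAN.finditer(normalized):
        # Text between the previous tag and this one is in the default language.
        before = normalized[cursor:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        segments.append(("en", match.group(1).strip()))
        cursor = match.end()
    tail = normalized[cursor:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


print(split_by_language("hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."))
# [('vi', 'hãy gửi email đến'), ('en', 'support'), ('vi', 'a còng'), ('en', 'example'), ('vi', 'chấm com.')]
```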

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.
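
The usage examples contain numbers in two conventions: `4.200,5` (dot for grouping, comma for decimals) and `1,234.5678` (the reverse). To illustrate the kind of disambiguation a number normalizer has to perform, here is one possible heuristic. It is an assumption made for illustration, not the rule SEA-G2P implements, and `parse_number` is a hypothetical helper.

```python
from decimal import Decimal


def parse_number(token):
    """Parse a digit string that may use '.' and ',' as separators (heuristic)."""
    last_dot, last_comma = token.rfind("."), token.rfind(",")
    if last_dot == -1 and last_comma == -1:
        return Decimal(token)
    if last_dot != -1 and last_comma != -1:
        # Both separators present: the rightmost one marks the decimal part.
        decimal_sep = "." if last_dot > last_comma else ","
    else:
        sep = "." if last_dot != -1 else ","
        trailing_digits = len(token) - token.rfind(sep) - 1
        # A repeated separator, or exactly three trailing digits, is read as
        # grouping. The three-digit case is genuinely ambiguous.
        if token.count(sep) > 1 or trailing_digits == 3:
            return Decimal(token.replace(sep, ""))
        decimal_sep = sep
    group_sep = "," if decimal_sep == "." else "."
    integer, _, fraction = token.replace(group_sep, "").rpartition(decimal_sep)
    return Decimal(f"{integer}.{fraction}")


print(parse_number("4.200,5"))     # 4200.5
print(parse_number("1,234.5678"))  # 1234.5678
print(parse_number("0.000045"))    # 0.000045
```

A token such as `4.200` stays ambiguous under any purely local rule, which is why this sketch picks one reading and says so in the comment.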

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module      Implementation        Throughput
Normalizer  Rust Core (Parallel)  ~41,000 sentences/s
G2P         Rust Core (Parallel)  ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
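
The pipeline figure is consistent with the two module figures if the stages run one after the other on the same batch, since the time per sentence then adds up. That reading is an assumption, not something the benchmark description states. The sketch below works through the arithmetic and adds a generic timing helper; both function names are introduced here for illustration.

```python
import time


def combined_throughput(*stage_rates):
    """Throughput of stages run back to back, each rate given in items/s."""
    # Per-item times add up, so the combined rate is the inverse of their sum.
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


def measure_throughput(fn, items):
    """Time a single call of fn(items) and return items processed per second."""
    start = time.perf_counter()
    fn(items)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")


print(round(combined_throughput(41_000, 415_000)))  # 37314
```

To reproduce numbers on your own hardware, `measure_throughput(pipeline.run, texts)` can be called with the `pipeline` and `texts` objects from the usage section; results will vary with CPU, thread count, and input length.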

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, we use a custom binary format (.bin) that is mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, every unique string (word or phoneme) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, so each lookup is an O(log n) binary search directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
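
The three ideas above (memory mapping, string pooling, binary search) can be sketched in a few lines of Python. The layout below is a toy format invented for illustration and is not the .bin specification in src/g2p/mod.rs: the header fields, the offset table, and the names `build` and `Lexicon` are all assumptions. The sample entries are taken from the example output in the usage section.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<II")  # entry count, string count
ENTRY = struct.Struct("<II")   # word ID, phoneme ID (4 bytes each)
OFFSET = struct.Struct("<I")   # byte offset into the string blob


def build(path, lexicon):
    """Write a word -> phonemes dict using a hypothetical binary layout."""
    pool = {}  # UTF-8 bytes -> string ID; each unique string is stored once

    def intern(text):
        return pool.setdefault(text.encode("utf-8"), len(pool))

    # Sort by UTF-8 bytes so that lookups can compare raw bytes.
    words = sorted(lexicon, key=lambda w: w.encode("utf-8"))
    entries = [(intern(w), intern(lexicon[w])) for w in words]
    strings = list(pool)  # dicts keep insertion order, so index == string ID
    offsets = [0]
    for s in strings:
        offsets.append(offsets[-1] + len(s))
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(entries), len(strings)))
        f.write(b"".join(OFFSET.pack(o) for o in offsets))
        f.write(b"".join(ENTRY.pack(*e) for e in entries))
        f.write(b"".join(strings))


class Lexicon:
    """Read-only view over the file; nothing is parsed up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._buf = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n_entries, n_strings = HEADER.unpack_from(self._buf, 0)
        self._offsets_at = HEADER.size
        self._entries_at = self._offsets_at + OFFSET.size * (n_strings + 1)
        self._blob_at = self._entries_at + ENTRY.size * self._n_entries

    def _string(self, string_id):
        # Two consecutive offsets delimit the string inside the blob.
        pos = self._offsets_at + OFFSET.size * string_id
        start, end = struct.unpack_from("<II", self._buf, pos)
        return self._buf[self._blob_at + start:self._blob_at + end]

    def lookup(self, word):
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:  # binary search over the sorted entry table
            mid = (lo + hi) // 2
            pos = self._entries_at + ENTRY.size * mid
            word_id, phoneme_id = ENTRY.unpack_from(self._buf, pos)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._buf.close()
        self._file.close()


if __name__ == "__main__":
    sample = {"giá": "zˈaːɜ", "hôm": "hˈom", "nay": "nˈaj", "điểm": "ɗˈiɛ4m"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "lexicon.bin")
        build(path, sample)
        lex = Lexicon(path)
        print(lex.lookup("nay"))      # nˈaj
        print(lex.lookup("missing"))  # None
        lex.close()
```

Opening the file costs one `mmap` call regardless of lexicon size, and each lookup touches only the pages that the binary search visits, which is the property the list above describes.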

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
