Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (a list input is processed in parallel)
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
] * 1000
results = pipeline.run(texts)
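
For corpora too large to hold in one list, the batch call can be fed in chunks. The sketch below is not part of sea_g2p: `run_in_chunks` and `chunk_size` are hypothetical names, and it assumes the callable passed in (for example `pipeline.run`) returns one result per input text, in input order. A stand-in function is used so the example runs without the package installed.

```python
from itertools import islice


def run_in_chunks(run, texts, chunk_size=10_000):
    """Yield results of `run` over `texts`, one list-sized chunk at a time.

    `run` is any callable that maps a list of strings to an iterable of
    results in the same order (assumed here for `pipeline.run`).
    """
    iterator = iter(texts)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield from run(chunk)


# Stand-in for pipeline.run so the sketch is runnable on its own.
def fake_run(batch):
    return [text.lower() for text in batch]


print(list(run_in_chunks(fake_run, ["A", "B", "C"], chunk_size=2)))
# ['a', 'b', 'c']
```

Each chunk is still a list, so the parallel batch path described above would be used for every call.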

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# A list input is processed in parallel automatically
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
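
In the normalized output above, English spans are wrapped in `<en>…</en>` tags. If you need to inspect those spans yourself, a small helper can split a normalized string by language. This is a sketch based only on the tag format visible in the printed output; `split_language_segments` is a hypothetical helper, not a sea_g2p function, and nested tags are assumed not to occur.

```python
import re

# Matches one English span as it appears in the normalized output.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_language_segments(normalized, default_lang="vi"):
    """Split a normalized string into (lang, text) pairs."""
    segments = []
    cursor = 0
    for match in EN_SPAN.finditer(normalized):
        # Text between the previous tag and this one is in the default language.
        before = normalized[cursor:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        segments.append(("en", match.group(1).strip()))
        cursor = match.end()
    tail = normalized[cursor:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


line = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
for lang, text in split_language_segments(line):
    print(lang, text)
```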

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module      Implementation         Throughput
Normalizer  Rust Core (Parallel)   ~41,000 sentences/s
G2P         Rust Core (Parallel)   ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
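
The total figure is consistent with the two per-module figures if the stages run one after another on every sentence. That is an assumption; the benchmark setup is not described in more detail here. Under it, the time per sentence is the sum of the per-stage times, so the combined rate is the reciprocal of the summed reciprocals. `sequential_throughput` is a hypothetical helper for this arithmetic.

```python
def sequential_throughput(*stage_rates):
    """Combined rate of stages applied one after another to each item."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


combined = sequential_throughput(41_000, 415_000)
print(round(combined))  # 37314, in line with the ~37,000 sentences/s reported
```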

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, the lexicon is stored in a custom binary format (.bin) that is mapped directly into memory. This gives near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, each unique string (word or phoneme sequence) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, so lookups run in O(log n) time directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
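
The three ideas above can be sketched in a few lines of Python. This is an illustration only: the byte layout below (header, offset table, sorted ID pairs, string blob) is a hypothetical one invented for the example and is not the actual .bin format, and `build` and `MappedLexicon` are hypothetical names. The toy entries reuse phonemes from the example output earlier in this README.

```python
import mmap
import struct
import tempfile

HEADER = struct.Struct("<II")  # entry count, string count
U32 = struct.Struct("<I")      # one offset into the string blob
PAIR = struct.Struct("<II")    # (word_id, phoneme_id), 4 bytes each


def build(path, lexicon):
    """Write {word: phonemes} as: header, offsets, sorted pairs, blob."""
    pool = sorted({s.encode("utf-8") for item in lexicon.items() for s in item})
    ids = {s: i for i, s in enumerate(pool)}  # each unique string stored once
    entries = sorted((w.encode("utf-8"), p.encode("utf-8"))
                     for w, p in lexicon.items())
    offsets, pos = [0], 0
    for s in pool:
        pos += len(s)
        offsets.append(pos)
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(entries), len(pool)))
        for off in offsets:
            f.write(U32.pack(off))
        for w, p in entries:  # pre-sorted by the word's UTF-8 bytes
            f.write(PAIR.pack(ids[w], ids[p]))
        f.write(b"".join(pool))


class MappedLexicon:
    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n, n_strings = HEADER.unpack_from(self._mm, 0)
        self._offsets_at = HEADER.size
        self._entries_at = self._offsets_at + U32.size * (n_strings + 1)
        self._blob_at = self._entries_at + PAIR.size * self._n

    def _string(self, sid):
        """Resolve a string ID to its bytes via the offset table."""
        start, = U32.unpack_from(self._mm, self._offsets_at + U32.size * sid)
        end, = U32.unpack_from(self._mm, self._offsets_at + U32.size * (sid + 1))
        return self._mm[self._blob_at + start:self._blob_at + end]

    def lookup(self, word):
        """Binary search over the sorted pairs; returns None if absent."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = PAIR.unpack_from(
                self._mm, self._entries_at + PAIR.size * mid)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = f"{tmp}/toy.bin"
    build(path, {"hai": "hˈaːj", "năm": "nˈam", "com": "kˈɔm"})
    lex = MappedLexicon(path)
    print(lex.lookup("năm"))  # nˈam
    print(lex.lookup("xyz"))  # None
    lex.close()
```

Opening the file only reads the 8-byte header; pages of the table and blob are touched on demand during a lookup, which is why a design like this can start quickly even for a large lexicon.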

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
