Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
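
To gauge batch speed on your own hardware, a small timing helper along these lines may be useful. This is only a sketch: `measure_throughput` is a hypothetical helper (not part of sea-g2p), and it assumes the callable you pass accepts a list of strings and returns one result per input, as `pipeline.run` does in the example above.

```python
import time
from typing import Callable, List, Sequence, Tuple


def measure_throughput(
    run: Callable[[List[str]], Sequence[str]],
    texts: List[str],
) -> Tuple[Sequence[str], float]:
    """Time a single batch call and return (results, sentences per second)."""
    start = time.perf_counter()
    results = run(texts)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small batches.
    return results, len(texts) / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Stand-in for pipeline.run so the sketch runs without sea-g2p installed.
    def fake_run(batch: List[str]) -> List[str]:
        return [text.lower() for text in batch]

    batch = ["Hãy gửi email đến support@example.com."] * 1000
    results, rate = measure_throughput(fake_run, batch)
    print(f"{len(results)} sentences at {rate:,.0f} sentences/s")
```

With the package installed, passing `pipeline.run` in place of `fake_run` should time the real pipeline. A single run on a small batch is noisy, so larger batches and repeated runs give a steadier figure.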

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Lists are processed in parallel automatically
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
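
In the normalized output above, English spans are wrapped in `<en>…</en>` tags. If you need to inspect or route those spans yourself, a minimal sketch such as the following splits a normalized string into language-labelled segments. It assumes the tags always appear as well-formed, non-nested pairs like in the sample output; `split_by_language` is a hypothetical helper, and labelling untagged text as `"vi"` is an assumption made for illustration.

```python
import re
from typing import List, Tuple

# Matches one <en>...</en> span; assumes tags are well formed and not nested.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized: str) -> List[Tuple[str, str]]:
    """Split normalized text into (language, text) segments."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append(("vi", before))  # untagged text, assumed Vietnamese
        inner = match.group(1).strip()
        if inner:
            segments.append(("en", inner))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append(("vi", tail))
    return segments


if __name__ == "__main__":
    sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
    for lang, text in split_by_language(sample):
        print(lang, "|", text)
```

Note that in the sample output the word "email" is not tagged even though it is phonemized as English, so the tags should not be read as a complete list of English words.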

Features

  • Fast: Core engine rewritten in Rust, with lookups on a memory-mapped binary dictionary (see Performance for measured throughput).
  • Multithreading: Batch inputs are processed in parallel automatically via Rayon, a Rust data-parallelism library.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

| Module | Implementation | Throughput |
|---|---|---|
| Normalizer | Rust Core (Parallel) | ~41,000 sentences/s |
| G2P | Rust Core (Parallel) | ~415,000 sentences/s |

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
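
The three figures are consistent with each other if the two stages run one after the other, because per-sentence times then add up. The short calculation below shows this; it is a back-of-the-envelope check that assumes no overhead between stages, not a description of how the benchmark was run, and `combined_throughput` is a hypothetical helper.

```python
def combined_throughput(*stage_rates: float) -> float:
    """Throughput of stages run back to back: per-sentence times add up."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


# Normalizer at ~41,000/s followed by G2P at ~415,000/s.
estimate = combined_throughput(41_000, 415_000)
print(f"{estimate:,.0f} sentences/s")
```

The estimate comes out at about 37,300 sentences/s, in line with the reported ~37,000. It also shows that the normalizer dominates the total: speeding up G2P further would barely change the pipeline figure.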

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, we use a custom binary format (.bin) mapped directly into memory. This allows near-instant startup and very low memory overhead, since pages are only read when a lookup touches them.
  • String Pooling: To minimize file size, all unique strings (words and phonemes) are stored once in a global string pool and referenced by 4-byte IDs.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookup speeds directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
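
To make the three ideas concrete, here is a minimal Python sketch that combines them: a string pool referenced by 4-byte IDs, a pre-sorted entry table, and binary search performed directly on a memory-mapped file. The byte layout is invented for illustration and is not the actual .bin format (which is specified in src/g2p/mod.rs); `build_dictionary` and `MappedDictionary` are hypothetical names, and the sample entries are simply taken from the example output earlier in this README.

```python
import mmap
import os
import struct
import tempfile
from typing import Dict, List, Optional

# Hypothetical layout (little-endian), NOT the real sea-g2p format:
#   header : n_entries, n_strings                   (2 x uint32)
#   entries: (word_id, phoneme_id) per entry        (2 x uint32 each)
#   offsets: n_strings + 1 offsets into the pool    (uint32 each)
#   pool   : all unique strings, UTF-8, concatenated
HEADER = struct.Struct("<II")
ENTRY = struct.Struct("<II")
OFFSET = struct.Struct("<I")


def build_dictionary(entries: Dict[str, str]) -> bytes:
    """Serialize word -> phoneme pairs, storing each unique string once."""
    pool_ids: Dict[bytes, int] = {}
    strings: List[bytes] = []

    def intern(text: str) -> int:
        raw = text.encode("utf-8")
        if raw not in pool_ids:
            pool_ids[raw] = len(strings)
            strings.append(raw)
        return pool_ids[raw]

    # Sort by UTF-8 bytes so lookups can compare raw bytes without decoding.
    ordered = sorted(entries.items(), key=lambda kv: kv[0].encode("utf-8"))
    table = b"".join(ENTRY.pack(intern(w), intern(p)) for w, p in ordered)

    offsets, cursor = [0], 0
    for raw in strings:
        cursor += len(raw)
        offsets.append(cursor)
    offset_table = b"".join(OFFSET.pack(o) for o in offsets)
    header = HEADER.pack(len(ordered), len(strings))
    return header + table + offset_table + b"".join(strings)


class MappedDictionary:
    """Read-only view over the layout above; works on bytes or an mmap."""

    def __init__(self, buf) -> None:
        self._buf = buf
        self._n_entries, n_strings = HEADER.unpack_from(buf, 0)
        self._entries_at = HEADER.size
        self._offsets_at = self._entries_at + self._n_entries * ENTRY.size
        self._pool_at = self._offsets_at + (n_strings + 1) * OFFSET.size

    def _string(self, string_id: int) -> bytes:
        base = self._offsets_at + string_id * OFFSET.size
        (start,) = OFFSET.unpack_from(self._buf, base)
        (end,) = OFFSET.unpack_from(self._buf, base + OFFSET.size)
        return bytes(self._buf[self._pool_at + start:self._pool_at + end])

    def lookup(self, word: str) -> Optional[str]:
        """Binary search over the sorted entry table: O(log n) comparisons."""
        target = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            at = self._entries_at + mid * ENTRY.size
            word_id, phoneme_id = ENTRY.unpack_from(self._buf, at)
            candidate = self._string(word_id)
            if candidate == target:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < target:
                lo = mid + 1
            else:
                hi = mid
        return None


if __name__ == "__main__":
    sample = {"nay": "nˈaj", "hôm": "hˈom", "năm": "nˈam"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.bin")
        with open(path, "wb") as f:
            f.write(build_dictionary(sample))
        # Map the file read-only; nothing is parsed or copied up front.
        with open(path, "rb") as f, mmap.mmap(
            f.fileno(), 0, access=mmap.ACCESS_READ
        ) as mapped:
            print(MappedDictionary(mapped).lookup("hôm"))
```

Opening the dictionary only reads the 8-byte header, which is why this kind of design starts quickly: the operating system pages in the rest of the file lazily as lookups touch it.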

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
