Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p
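
To confirm the install from Python, a minimal sketch using only the standard library is shown here. The helper name `installed_version` is hypothetical and is not part of SEA-G2P; it only asks the packaging metadata whether a distribution named `sea-g2p` is present.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("sea-g2p")
print(f"sea-g2p version: {found}" if found else "sea-g2p is not installed")
```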

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Automatic parallel processing when a list is passed
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
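
The normalized output above marks English spans with `<en>…</en>` tags. If you need to inspect or route those spans yourself, the sketch below splits a normalized string into language-tagged segments. It assumes the tags are never nested and are always closed, as in the sample output; the helper `split_by_language` is hypothetical and not part of the SEA-G2P API.

```python
import re

# Matches one English span as it appears in the sample normalizer output.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized, default_lang="vi"):
    """Split normalized text into (lang, text) segments, in order."""
    segments = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        # Text between the previous tag and this one is in the default language.
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
for lang, text in split_by_language(sample):
    print(lang, text)
```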

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Batch inputs are processed in parallel automatically using Rayon (Rust).
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module     | Implementation       | Throughput
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P        | Rust Core (Parallel) | ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
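
As a sanity check, the pipeline figure is roughly what you would expect if the two stages run one after the other on the same sentences, since the time spent per sentence then adds up across stages. The sketch below works through that arithmetic; the serial-stage model is an assumption made here for illustration, not a description of how the benchmark was run.

```python
def serial_throughput(*stage_rates):
    """Combined rate of stages run one after another on the same items.

    Time per item adds up across stages, so the combined rate is the
    reciprocal of the sum of the reciprocals.
    """
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


normalizer_rate = 41_000   # sentences/s, Normalizer figure reported above
g2p_rate = 415_000         # sentences/s, G2P figure reported above

combined = serial_throughput(normalizer_rate, g2p_rate)
print(f"{combined:,.0f} sentences/s")  # prints "37,314 sentences/s"
```

The result is close to the reported ~37,000 sentences/s, which suggests the normalizer accounts for most of the pipeline time.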

Technical Architecture

SEA-G2P keeps startup time and memory use low through three design choices:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, the lookup data is stored in a custom binary format (.bin) and mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, each unique string (words and phonemes) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, so each lookup is an O(log n) binary search performed directly on the memory-mapped data.
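
The three ideas above can be illustrated with a small Python sketch. The file layout used here (a two-field header, a string offset table, sorted 8-byte entries, then the string pool) is a toy layout invented for this example and is not SEA-G2P's actual `.bin` format; the names `build` and `Lexicon` are likewise hypothetical. The three sample words and their phonemes are taken from the usage output earlier in this README.

```python
import mmap
import os
import struct
import tempfile


def build(path, lexicon):
    """Write {word: phonemes} to a toy binary file (hypothetical layout)."""
    pool, ids = [], {}

    def intern(s):
        # String pooling: store each unique string once, refer to it by ID.
        if s not in ids:
            ids[s] = len(pool)
            pool.append(s.encode("utf-8"))
        return ids[s]

    # Sort by UTF-8 bytes so lookups can compare raw bytes directly.
    words = sorted(lexicon, key=lambda w: w.encode("utf-8"))
    entries = [(intern(w), intern(lexicon[w])) for w in words]

    offsets, total = [0], 0
    for raw in pool:
        total += len(raw)
        offsets.append(total)

    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(pool), len(entries)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for word_id, phon_id in entries:
            f.write(struct.pack("<II", word_id, phon_id))  # two 4-byte IDs
        f.write(b"".join(pool))


class Lexicon:
    """Read-only view over the toy file; nothing is parsed up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        n_strings, self._n_entries = struct.unpack_from("<II", self._map, 0)
        self._offsets_at = 8
        self._entries_at = self._offsets_at + 4 * (n_strings + 1)
        self._blob_at = self._entries_at + 8 * self._n_entries

    def _string(self, string_id):
        # Two adjacent offsets give the start and end of the pooled string.
        start, end = struct.unpack_from(
            "<II", self._map, self._offsets_at + 4 * string_id)
        return self._map[self._blob_at + start:self._blob_at + end]

    def lookup(self, word):
        """Binary search over the sorted entries: O(log n) comparisons."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phon_id = struct.unpack_from(
                "<II", self._map, self._entries_at + 8 * mid)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phon_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._map.close()
        self._file.close()


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
build(path, {"giá": "zˈaːɜ", "hôm": "hˈom", "nay": "nˈaj"})
lex = Lexicon(path)
print(lex.lookup("hôm"))  # hˈom
print(lex.lookup("xyz"))  # None
lex.close()
```

Because the reader only maps the file and reads three integers from the header, opening it costs about the same regardless of how many entries it holds; pages are pulled in by the operating system as lookups touch them.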

For the full specification of the binary format, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode (this compiles the Rust core, so a Rust toolchain must be available):

    pip install -e .
    

Download files

Source distribution:

  • sea_g2p-0.7.11.tar.gz (20.1 MB)

Built distributions (CPython 3.10+):

  • sea_g2p-0.7.11-cp310-abi3-win_amd64.whl (21.2 MB): Windows x86-64
  • sea_g2p-0.7.11-cp310-abi3-win32.whl (21.1 MB): Windows x86
  • sea_g2p-0.7.11-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.5 MB): manylinux, glibc 2.17+, x86-64
  • sea_g2p-0.7.11-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.5 MB): manylinux, glibc 2.17+, ARM64
  • sea_g2p-0.7.11-cp310-abi3-macosx_11_0_arm64.whl (21.3 MB): macOS 11.0+, ARM64
  • sea_g2p-0.7.11-cp310-abi3-macosx_10_12_x86_64.whl (21.4 MB): macOS 10.12+, x86-64
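
The wheel file names above follow the standard `{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl` pattern. As a rough sketch, the fields can be pulled apart with a plain split; this assumes there is no optional build tag, which holds for the files listed here. The helper `parse_wheel_name` is hypothetical, written for this example only.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_name(filename):
    """Split a wheel file name into its tag fields (no build tag expected)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dash-separated fields, got {len(parts)}")
    return WheelTags(*parts)


tags = parse_wheel_name("sea_g2p-0.7.11-cp310-abi3-win_amd64.whl")
print(tags.python_tag, tags.abi_tag, tags.platform_tag)  # cp310 abi3 win_amd64
```

Here `cp310` with `abi3` indicates a CPython stable-ABI wheel that works on 3.10 and later, which matches the "CPython 3.10+" label. A platform field such as `manylinux_2_17_x86_64.manylinux2014_x86_64` is a dot-separated set of equivalent platform tags.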
