Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Automatic parallel processing when list is passed
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.
  • Markup tags: Wrap a span to control how it is read:
    • <en>...</en>: route the content to the English phonemizer (e.g. <en>hello</en>).
    • <math>...</math>: read the content as a math formula. Variable clusters are spelled letter by letter and operators/symbols are voiced, while function names (sin, cos, log, lim, ...) are preserved. For example, <math>b² - 4ac</math> is read as "bê bình phương trừ bốn a xê", and <math>∫f dx</math> as "tích phân ép đê ích".
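
When input is assembled programmatically, a small helper can keep the markup tags balanced. The sketch below is only string handling: the `tag` helper is a hypothetical name and not part of sea-g2p. The tagged text would then be passed to `SEAPipeline.run` as in the Usage section, which is left as a comment here so the snippet runs without the package installed.

```python
# Hypothetical helper (not part of sea-g2p): wraps a span in one of the
# two markup tags described above so that open/close tags always match.
ALLOWED_TAGS = {"en", "math"}


def tag(kind: str, span: str) -> str:
    """Return span wrapped as <kind>span</kind>; reject unknown tag names."""
    if kind not in ALLOWED_TAGS:
        raise ValueError(f"unsupported tag: {kind!r}")
    return f"<{kind}>{span}</{kind}>"


text = f"Tính {tag('math', 'b² - 4ac')} rồi nói {tag('en', 'hello')}."
print(text)

# With the package installed, the tagged text would go through the pipeline
# shown in the Usage section:
#   from sea_g2p import SEAPipeline
#   print(SEAPipeline(lang="vi").run(text))
```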

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module     | Implementation       | Throughput
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P        | Rust Core (Parallel) | ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
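
One way to sanity-check these figures: if every sentence passes through the normalizer and then through G2P, the per-sentence times add, so the combined rate is the reciprocal of the summed reciprocals. That sequential model is an assumption made here for illustration, not something the benchmark states, and `combined_throughput` is a hypothetical helper name.

```python
def combined_throughput(*stage_rates: float) -> float:
    """Sentences/s when each sentence goes through every stage in turn.

    Assumes stages run one after another, so per-sentence times add up.
    """
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


normalizer_rate = 41_000   # sentences/s, from the table above
g2p_rate = 415_000         # sentences/s, from the table above

total = combined_throughput(normalizer_rate, g2p_rate)
print(f"{total:,.0f} sentences/s")  # 37,314 sentences/s
```

Under that assumption the result lands close to the reported ~37,000 sentences/s, and it shows that the normalizer, being roughly ten times slower than G2P, dominates end-to-end throughput.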

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, the lexicon is stored in a custom binary format (.bin) that is mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, each unique string (word or phoneme sequence) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, so lookups run in O(log n) time directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
