Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (runs in parallel when a list is passed)
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
] * 1000
results = pipeline.run(texts)
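
When the corpus is too large to hold as a single list, the batch call can be fed in fixed-size chunks. The helper below is a minimal sketch, not part of sea_g2p: `iter_chunks` and `run_in_chunks` are hypothetical names, and the code assumes only that the `run` callable accepts a list of strings and returns one result per input in the same order, as the batch example above suggests.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_chunks(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_in_chunks(
    run: Callable[[List[str]], List[str]],
    texts: Iterable[str],
    chunk_size: int = 10_000,
) -> Iterator[str]:
    """Call `run` on one chunk at a time and stream the results back.

    `run` is assumed (not confirmed by the library docs) to return one
    output per input, in input order.
    """
    for chunk in iter_chunks(texts, chunk_size):
        yield from run(chunk)
```

Under those assumptions, `for phonemes in run_in_chunks(pipeline.run, open("corpus.txt", encoding="utf-8")): ...` would process a file line by line while keeping only one chunk in memory. The chunk size trades memory for parallel efficiency; 10,000 is an arbitrary default.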

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Processing is parallelized automatically when a list is passed
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
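
The normalized output above wraps English spans in `<en>…</en>` tags. If you need to inspect or post-process those spans yourself, the sketch below splits a normalized string into language-tagged segments. It is an illustration based only on the tag format visible in the sample output; `split_by_language` is a hypothetical helper, not a sea_g2p function, and it assumes tags are never nested.

```python
import re
from typing import List, Tuple

# Matches one English span as it appears in the sample normalizer output.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized: str, default_lang: str = "vi") -> List[Tuple[str, str]]:
    """Split normalized text into (language, text) segments.

    Text inside <en>...</en> is labeled "en"; everything else gets
    `default_lang`. Assumes non-nested tags (an assumption, not a spec).
    """
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


print(split_by_language("hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."))
# [('vi', 'hãy gửi email đến'), ('en', 'support'), ('vi', 'a còng'), ('en', 'example'), ('vi', 'chấm com.')]
```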

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Batch inputs are processed in parallel automatically via Rayon (Rust).
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were run on a dataset of 1,000,000 sentences:

Module     | Implementation       | Throughput
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P        | Rust Core (Parallel) | ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
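
As a sanity check on these figures: if the two stages run back to back, their per-sentence times add, so the combined rate is the reciprocal of the summed reciprocals. The sketch below works that out and also includes a small timing helper for reproducing a throughput number on your own hardware. Both functions are illustrative names, not part of sea_g2p, and the sequential-stages model is an assumption about how the pipeline figure arises, not a description of how the benchmark was run.

```python
import time
from typing import Callable, List, Sequence


def combined_throughput(*stage_rates: float) -> float:
    """Rate of stages run one after another: per-item times add up."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


def measure_throughput(run: Callable[[List[str]], object], texts: Sequence[str]) -> float:
    """Time one call of `run` over `texts` and return items per second."""
    batch = list(texts)
    start = time.perf_counter()
    run(batch)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a coarse timer on trivial inputs.
    return len(batch) / elapsed if elapsed > 0 else float("inf")


# Normalizer (~41,000/s) followed by G2P (~415,000/s):
print(round(combined_throughput(41_000, 415_000)))  # 37314
```

The result, roughly 37,300 sentences/s, is consistent with the reported pipeline throughput of ~37,000. To time the real pipeline, something like `measure_throughput(pipeline.run, texts)` should work, given that `pipeline.run` accepts a list as shown in the usage section.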

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, we use a custom binary format (.bin) that is mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, all unique strings (words and phonemes) are stored once in a global string pool and referenced by 4-byte IDs.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookup speeds directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
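
To make the three ideas above concrete, here is a small Python sketch that combines a string pool with 4-byte IDs, a pre-sorted entry table, and binary search over a memory-mapped file. It is not the real SEA-G2P format: the layout (header, entry table, offset table, string blob), the `DEMO` magic value, and the names `build_lexicon` and `Lexicon` are all invented for illustration, and the sample phonemes are only demo data. The actual specification lives in src/g2p/mod.rs.

```python
import mmap
import struct
from typing import Dict, Optional

# Hypothetical layout, for illustration only:
#   header  : magic, entry count, string count
#   entries : (word string ID, phoneme string ID), sorted by the word's UTF-8 bytes
#   offsets : string count + 1 offsets into the blob
#   blob    : all unique strings, concatenated
MAGIC = b"DEMO"
HEADER = struct.Struct("<4sII")
ENTRY = struct.Struct("<II")   # two 4-byte string IDs
U32 = struct.Struct("<I")


def build_lexicon(path: str, mapping: Dict[str, str]) -> None:
    """Write `mapping` (word -> phonemes) in the demo binary layout."""
    pool: Dict[bytes, int] = {}

    def intern(text: str) -> int:
        # Each unique string is stored once and referenced by its ID.
        raw = text.encode("utf-8")
        return pool.setdefault(raw, len(pool))

    # Pre-sort by raw bytes so lookups can compare bytes directly.
    words = sorted(mapping, key=lambda w: w.encode("utf-8"))
    entries = [(intern(w), intern(mapping[w])) for w in words]
    strings = list(pool)  # dict insertion order equals ID order
    offsets = [0]
    for raw in strings:
        offsets.append(offsets[-1] + len(raw))
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, len(entries), len(strings)))
        for entry in entries:
            f.write(ENTRY.pack(*entry))
        for offset in offsets:
            f.write(U32.pack(offset))
        f.write(b"".join(strings))


class Lexicon:
    """Read-only view over a demo lexicon file via mmap."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        magic, self._n_entries, n_strings = HEADER.unpack_from(self._mm, 0)
        if magic != MAGIC:
            raise ValueError("not a demo lexicon file")
        self._entries_at = HEADER.size
        self._offsets_at = self._entries_at + self._n_entries * ENTRY.size
        self._blob_at = self._offsets_at + (n_strings + 1) * U32.size

    def _string(self, string_id: int) -> bytes:
        # Two adjacent offsets delimit the string inside the blob.
        start, end = struct.unpack_from("<II", self._mm, self._offsets_at + string_id * U32.size)
        return self._mm[self._blob_at + start:self._blob_at + end]

    def lookup(self, word: str) -> Optional[str]:
        """Binary search over the sorted entry table: O(log n) comparisons."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = ENTRY.unpack_from(self._mm, self._entries_at + mid * ENTRY.size)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self) -> None:
        self._mm.close()
        self._file.close()
```

After `build_lexicon("demo.bin", {"hôm": "hˈom", "nay": "nˈaj"})`, `Lexicon("demo.bin").lookup("nay")` returns `"nˈaj"` and an unknown word returns `None`. Opening the file only parses the fixed-size header; the operating system pages in the rest on demand, which is where the fast startup of an mmap-based design comes from.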

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
