🦭 SEA-G2P

Fast multilingual text-to-phoneme converter for South East Asian languages.

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
# zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
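
For a corpus too large to hold in a single list, one option is to feed the pipeline fixed-size batches. The sketch below assumes only what the example above shows: that `pipeline.run` accepts a list and returns one result per input. The names `run_in_batches`, `batch_size`, and the stand-in `fake_run` are hypothetical and are not part of sea_g2p.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def run_in_batches(
    run: Callable[[List[str]], List[str]],
    texts: Iterable[str],
    batch_size: int = 10_000,
) -> Iterator[str]:
    """Yield results for `texts`, calling `run` on at most `batch_size` items at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # Each call receives a list, the input shape that the README
        # describes as being processed in parallel.
        yield from run(batch)


# Stand-in for `pipeline.run` so the sketch runs without sea_g2p installed.
def fake_run(batch: List[str]) -> List[str]:
    return [text.upper() for text in batch]


results = list(run_in_batches(fake_run, ["a", "b", "c", "d", "e"], batch_size=2))
print(results)  # ['A', 'B', 'C', 'D', 'E']
```

With the real library, `run` would presumably be `SEAPipeline(lang="vi").run`, and `texts` could be a generator that reads lines from a file.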

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Automatic parallel processing when list is passed
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
# ['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
# ['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
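
In the normalizer output above, spans wrapped in `<en>…</en>` appear to mark text that is pronounced as English. To inspect how a sentence was split between the two languages, a small helper like the one below may be useful. It is a sketch that relies only on the tag format visible in the sample output and assumes the tags are never nested; `split_by_language` is a hypothetical name, not a sea_g2p function.

```python
import re
from typing import List, Tuple

# Matches one English span as it appears in the sample normalizer output.
_EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized: str) -> List[Tuple[str, str]]:
    """Split normalized text into ("vi", text) and ("en", text) segments."""
    segments: List[Tuple[str, str]] = []
    position = 0
    for match in _EN_SPAN.finditer(normalized):
        # Text between the previous tag and this one is untagged (Vietnamese).
        before = normalized[position:match.start()].strip()
        if before:
            segments.append(("vi", before))
        segments.append(("en", match.group(1).strip()))
        position = match.end()
    tail = normalized[position:].strip()
    if tail:
        segments.append(("vi", tail))
    return segments


sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
segments = split_by_language(sample)
for language, text in segments:
    print(language, text)
```

For the sample sentence this yields five segments that alternate between Vietnamese and English, with `support` and `example` as the English spans.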

Features

  • Blazing Fast: Core engine rewritten in Rust, with binary-search lookups on a memory-mapped dictionary.
  • Multithreading: Batch inputs are processed in parallel automatically, using Rayon in the Rust core.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module       Implementation         Throughput
Normalizer   Rust Core (Parallel)   ~41,000 sentences/s
G2P          Rust Core (Parallel)   ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
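
The total figure is consistent with the two stages running one after the other on each sentence, in which case the per-sentence times add and the rates combine harmonically. The short check below illustrates that arithmetic. It is an observation about the published numbers, not a description of how the benchmark was actually computed, and `sequential_throughput` is a hypothetical helper name.

```python
def sequential_throughput(*stage_rates: float) -> float:
    """Combined rate of stages that process the same items one after another."""
    # Time per sentence adds up across stages, so the rates combine harmonically.
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


normalizer_rate = 41_000   # sentences/s, from the table above
g2p_rate = 415_000         # sentences/s, from the table above

pipeline_rate = sequential_throughput(normalizer_rate, g2p_rate)
print(f"{pipeline_rate:,.0f} sentences/s")                    # 37,314 sentences/s
print(f"{1_000_000 / pipeline_rate:.1f} s per 1M sentences")  # 26.8 s per 1M sentences
```

Because the normalizer is roughly ten times slower than the G2P stage, it dominates the end-to-end rate.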

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, SEA-G2P maps a custom binary dictionary (.bin) directly into memory. Startup is near-instant and memory overhead stays low, because the operating system pages data in only when it is accessed.
  • String Pooling: To minimize file size, every unique string (words and phonemes) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted at build time, so each lookup is an O(log n) binary search performed directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
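
To make the three ideas concrete, here is a minimal Python sketch of a memory-mapped, string-pooled, binary-searched dictionary. The file layout is invented for illustration (header, offset table, sorted entries, string pool) and is not the SEA-G2P .bin specification; `build_lexicon` and `MappedLexicon` are hypothetical names. The sample words and phonemes are taken from the usage output earlier in this README.

```python
import mmap
import struct
import tempfile
from pathlib import Path


def build_lexicon(path, lexicon):
    """Write a toy file: header, string offsets, sorted entries, string pool."""
    # String pool: every unique word or phoneme string is stored exactly once.
    strings = sorted({s for pair in lexicon.items() for s in pair})
    string_id = {s: i for i, s in enumerate(strings)}
    blobs = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    # Entries are pre-sorted by the word's UTF-8 bytes so lookups can bisect.
    entries = sorted(lexicon.items(), key=lambda kv: kv[0].encode("utf-8"))
    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(entries), len(strings)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for word, phonemes in entries:
            # Each entry is two 4-byte IDs pointing into the string pool.
            f.write(struct.pack("<II", string_id[word], string_id[phonemes]))
        f.write(b"".join(blobs))


class MappedLexicon:
    """Read-only view over the toy file; nothing is parsed up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._count, n_strings = struct.unpack_from("<II", self._mm, 0)
        self._offsets_at = 8
        self._entries_at = self._offsets_at + 4 * (n_strings + 1)
        self._pool_at = self._entries_at + 8 * self._count

    def _string(self, sid):
        # Two consecutive offsets give the start and end of string `sid`.
        start, end = struct.unpack_from("<II", self._mm, self._offsets_at + 4 * sid)
        return self._mm[self._pool_at + start:self._pool_at + end]

    def lookup(self, word):
        """Binary search over the sorted entries: O(log n) comparisons."""
        key = word.encode("utf-8")
        lo, hi = 0, self._count
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = struct.unpack_from(
                "<II", self._mm, self._entries_at + 8 * mid
            )
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


lexicon = {"hãy": "hˈa5j", "gửi": "ɣˈy4j", "đến": "ɗˌeɜn", "email": "ˈiːmeɪl"}
with tempfile.TemporaryDirectory() as tmp:
    lexicon_path = Path(tmp) / "toy.bin"
    build_lexicon(lexicon_path, lexicon)
    lex = MappedLexicon(lexicon_path)
    found = lex.lookup("gửi")
    missing = lex.lookup("xin")
    all_found = all(lex.lookup(w) == p for w, p in lexicon.items())
    lex.close()

print(found, missing)
```

Opening the file costs only a header read, which is where the near-instant startup comes from. Each lookup touches a handful of pages rather than the whole dictionary.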

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    

Download files

Source Distribution

  • sea_g2p-0.6.12.tar.gz (20.0 MB): Source

Built Distributions

  • sea_g2p-0.6.12-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.4 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.6.12-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.4 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.6.12-cp310-abi3-win_amd64.whl (21.1 MB): CPython 3.10+, Windows x86-64
  • sea_g2p-0.6.12-cp310-abi3-win32.whl (21.0 MB): CPython 3.10+, Windows x86
  • sea_g2p-0.6.12-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.4 MB): CPython 3.10+, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.6.12-cp310-abi3-macosx_11_0_arm64.whl (21.3 MB): CPython 3.10+, macOS 11.0+ ARM64
  • sea_g2p-0.6.12-cp310-abi3-macosx_10_12_x86_64.whl (21.3 MB): CPython 3.10+, macOS 10.12+ x86-64
  • sea_g2p-0.6.12-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.4 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64
