Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
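
For corpora too large to hold in one list, the batch call can be wrapped in a chunking helper. The following is a minimal sketch, not part of SEA-G2P: phonemize_in_chunks and fake_run are hypothetical names, and the helper only assumes a callable that maps a list of strings to a list of results of the same length, as pipeline.run appears to do in the example above.

```python
from typing import Callable, Iterable, Iterator, List


def phonemize_in_chunks(
    run: Callable[[List[str]], List[str]],
    texts: Iterable[str],
    chunk_size: int = 10_000,
) -> Iterator[str]:
    """Yield results one by one while calling `run` on fixed-size chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[str] = []
    for text in texts:
        chunk.append(text)
        if len(chunk) == chunk_size:
            yield from run(chunk)
            chunk = []
    if chunk:  # flush the final, possibly shorter, chunk
        yield from run(chunk)


# Stand-in for pipeline.run so the sketch runs without sea-g2p installed.
def fake_run(batch: List[str]) -> List[str]:
    return [s.upper() for s in batch]


results = list(phonemize_in_chunks(fake_run, ["a", "b", "c"], chunk_size=2))
print(results)  # ['A', 'B', 'C']
```

With the real pipeline this would presumably be called as phonemize_in_chunks(pipeline.run, texts). Each chunk is still processed in parallel by the library; only the amount of text held in memory at once is bounded.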

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Parallel processing is applied automatically when a list is passed
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
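
The normalized output above marks English spans with <en>...</en> tags. When debugging code-switching, it can help to see which parts were tagged. The sketch below splits a normalized string into (language, text) segments. It is an illustration only: split_by_language is a hypothetical helper, and the tag format is inferred from the sample output rather than from a documented specification. G2P.convert accepts the tagged string directly, so this step is not required for normal use.

```python
import re
from typing import List, Tuple

# Non-greedy match so that each <en>...</en> pair is captured separately.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized: str, default_lang: str = "vi") -> List[Tuple[str, str]]:
    """Split normalized text into (lang, text) segments using <en> tags."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
for lang, text in split_by_language(sample):
    print(lang, "|", text)
```

For the sample sentence this yields five segments, alternating between Vietnamese and English.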

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module        Implementation          Throughput
Normalizer    Rust Core (Parallel)    ~41,000 sentences/s
G2P           Rust Core (Parallel)    ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
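
The total figure is consistent with the two stages running one after the other on the same batch: the time per sentence adds up, so the combined rate is the reciprocal of the summed reciprocals. The sketch below checks this arithmetic. The back-to-back model is an assumption made here, not something the benchmark description states, and combined_throughput is a hypothetical helper.

```python
def combined_throughput(*stage_rates: float) -> float:
    """Throughput of stages run one after another: per-sentence times add up."""
    if not stage_rates or any(r <= 0 for r in stage_rates):
        raise ValueError("need at least one positive rate")
    return 1.0 / sum(1.0 / r for r in stage_rates)


normalizer_rate = 41_000   # sentences/s, from the table above
g2p_rate = 415_000         # sentences/s, from the table above

pipeline_rate = combined_throughput(normalizer_rate, g2p_rate)
print(f"{pipeline_rate:,.0f} sentences/s")
print(f"{1_000_000 / pipeline_rate:.1f} s for 1,000,000 sentences")
```

This gives roughly 37,300 sentences/s, close to the measured ~37,000, which suggests that normalization dominates the pipeline cost.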

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, we use a custom binary format (.bin) mapped directly into memory. Pages are read only when they are accessed, which allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, all unique strings (words and phonemes) are stored once in a global string pool and referenced by 4-byte IDs.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookup speeds directly on the memory-mapped data.
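
The three ideas above can be sketched together in a few lines of Python. This is a toy illustration under stated assumptions, not the actual SEA-G2P format: the file layout (an 8-byte header, sorted pairs of 4-byte IDs, an offset table, then the string pool), the build function, and the Lexicon class are all hypothetical.

```python
import mmap
import os
import struct
import tempfile


def build(path, lexicon):
    """Write a toy lexicon: header, sorted (word_id, phoneme_id) pairs, string pool."""
    pool, ids = [], {}

    def intern(s):
        # Store each unique string once and hand out a 4-byte ID.
        if s not in ids:
            ids[s] = len(pool)
            pool.append(s.encode("utf-8"))
        return ids[s]

    # Sort by UTF-8 bytes so lookups can compare raw bytes.
    words = sorted(lexicon, key=lambda w: w.encode("utf-8"))
    entries = [(intern(w), intern(lexicon[w])) for w in words]
    offsets, blob = [0], b""
    for raw in pool:
        blob += raw
        offsets.append(len(blob))
    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(entries), len(pool)))
        for pair in entries:
            f.write(struct.pack("<II", *pair))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        f.write(blob)


class Lexicon:
    def __init__(self, path):
        self._file = open(path, "rb")
        # Map the file read-only; nothing is parsed or copied up front.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n, m = struct.unpack_from("<II", self._mm, 0)
        self._entries = 8
        self._offsets = self._entries + 8 * self._n
        self._blob = self._offsets + 4 * (m + 1)

    def _string(self, sid):
        # Two consecutive offsets delimit string `sid` inside the pool.
        start, end = struct.unpack_from("<II", self._mm, self._offsets + 4 * sid)
        return self._mm[self._blob + start:self._blob + end]

    def lookup(self, word):
        """Binary search over the sorted entries, O(log n) comparisons."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n
        while lo < hi:
            mid = (lo + hi) // 2
            wid, pid = struct.unpack_from("<II", self._mm, self._entries + 8 * mid)
            candidate = self._string(wid)
            if candidate == key:
                return self._string(pid).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


# Toy data only; a real dictionary would hold far more entries.
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
build(path, {"hôm": "hˈom", "nay": "nˈaj", "năm": "nˈam"})
lex = Lexicon(path)
print(lex.lookup("nay"))
print(lex.lookup("xyz"))
lex.close()
```

Opening the lexicon only maps the file and reads the 8-byte header, so startup cost does not grow with dictionary size. Each lookup touches only the pages that the binary search visits.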

For full details on the specification, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
