Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing: passing a list of strings runs in parallel
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Parallel processing is automatic when a list is passed
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
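
In the normalized output above, English spans appear wrapped in <en>…</en> tags. The following is a minimal sketch of how such text could be split into per-language segments for inspection or debugging. The helper name split_by_language is hypothetical and not part of sea_g2p, and the meaning of the tags is inferred from the sample output only.

```python
import re

# Assumption: English spans are marked as <en>...</en>, as in the sample output.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized):
    """Split normalized text into (lang, text) pairs: 'en' for tagged spans, 'vi' otherwise."""
    segments = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append(("vi", before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append(("vi", tail))
    return segments


sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
segments = split_by_language(sample)
for lang, text in segments:
    print(lang, text)
```

This only reads the tags. It makes no claim about how G2P consumes them internally.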

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Batch inputs are processed in parallel automatically via Rayon (Rust).
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module | Implementation | Throughput
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P | Rust Core (Parallel) | ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
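
As a rough sanity check on these figures, the sketch below assumes the two stages run one after the other, so their per-sentence times add. This is an illustrative model, not a description of how the benchmark was measured, and combined_throughput is a hypothetical helper name.

```python
def combined_throughput(*stage_rates):
    """Throughput of stages run back to back: per-sentence times add up."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


normalizer_rate = 41_000   # sentences/s, from the table above
g2p_rate = 415_000         # sentences/s, from the table above

pipeline_rate = combined_throughput(normalizer_rate, g2p_rate)
print(f"{pipeline_rate:,.0f} sentences/s")  # prints 37,314 sentences/s
```

Under that assumption the result lands close to the reported ~37,000 sentences/s, which suggests the normalizer dominates the total pipeline cost.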

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON or SQLite file into RAM, the lexicon is stored in a custom binary format (.bin) that is mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, each unique string (word or phoneme) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookups directly on the memory-mapped data.
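
The three ideas above can be sketched together in a few lines of Python. The file layout here (header, offset table, sorted entries, string pool) is a toy layout invented for illustration, not the actual .bin specification. The names build and Lexicon are likewise hypothetical, and the phoneme strings are sample values.

```python
import mmap
import os
import struct
import tempfile


def build(path, lexicon):
    """Write a toy file: header, string offsets, sorted (word_id, phoneme_id) entries, pool."""
    # String pooling: every unique string is stored once and referenced by a 4-byte ID.
    pool = sorted({s for pair in lexicon.items() for s in pair})
    ids = {s: i for i, s in enumerate(pool)}
    blobs = [s.encode("utf-8") for s in pool]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    # Entries are pre-sorted by the word's UTF-8 bytes so lookups can binary search.
    entries = sorted(lexicon.items(), key=lambda kv: kv[0].encode("utf-8"))
    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(pool), len(entries)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for word, phonemes in entries:
            f.write(struct.pack("<II", ids[word], ids[phonemes]))
        f.write(b"".join(blobs))


class Lexicon:
    """Read-only view over the toy file; nothing is parsed into Python dicts."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        n_strings, self._n_entries = struct.unpack_from("<II", self._mm, 0)
        self._offsets_at = 8
        self._entries_at = self._offsets_at + 4 * (n_strings + 1)
        self._pool_at = self._entries_at + 8 * self._n_entries

    def _string(self, string_id):
        start, end = struct.unpack_from("<II", self._mm, self._offsets_at + 4 * string_id)
        return self._mm[self._pool_at + start:self._pool_at + end]

    def lookup(self, word):
        """Binary search over the sorted entries: O(log n) comparisons."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = struct.unpack_from("<II", self._mm, self._entries_at + 8 * mid)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    build(path, {"giá": "zˈaːɜ", "hôm": "hˈom", "nay": "nˈaj"})
    lex = Lexicon(path)
    found = lex.lookup("hôm")
    missing = lex.lookup("xyz")
    lex.close()  # close the mapping before the temporary directory is removed
    print(found, missing)
```

Opening the file costs only a header read, and each lookup touches a handful of pages, which is the property the bullets above describe.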

For full details on the specification, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode (building from source requires a Rust toolchain, since the core engine is written in Rust):

    pip install -e .
    
