Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
# zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing: passing a list runs in parallel
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Parallel processing is automatic when a list is passed
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
]
normalized = normalizer.normalize(texts)
print(normalized)
# ['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
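
The normalizer output above wraps English spans in <en>…</en> tags. As a minimal sketch of how a downstream step might consume those tags, the snippet below splits a normalized string into language-labelled chunks. The helper name split_language_spans is hypothetical and not part of sea_g2p, and labelling every untagged chunk as "vi" is a simplifying assumption: words such as "email" pass through untagged in the sample output.

```python
import re

# Matches one <en>…</en> span, as seen in the normalizer output above.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_language_spans(normalized):
    """Split normalized text into (lang, text) chunks using its <en> tags.

    Hypothetical helper for illustration; untagged text is labelled "vi".
    """
    spans = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            spans.append(("vi", before))
        spans.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        spans.append(("vi", tail))
    return spans


sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
for lang, text in split_language_spans(sample):
    print(lang, text)
```

G2P.convert accepts the tagged string directly, as the example above shows, so a split like this is only useful when the spans need separate handling, for example for logging or inspection.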

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Ships as pre-compiled wheels for Windows, Linux, and macOS, so installing from a wheel needs no Rust toolchain.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

| Module     | Implementation       | Throughput           |
|------------|----------------------|----------------------|
| Normalizer | Rust Core (Parallel) | ~41,000 sentences/s  |
| G2P        | Rust Core (Parallel) | ~415,000 sentences/s |

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
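
As a rough sanity check on these figures: if the two stages run back to back on each batch (an assumption, not something the benchmark states), the per-sentence times add up and the combined rate is the reciprocal of that sum. The function name below is illustrative only.

```python
def combined_throughput(*stage_rates):
    """Sentences/s for stages run one after another: per-sentence times add."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


normalizer_rate = 41_000   # sentences/s, from the benchmark table
g2p_rate = 415_000         # sentences/s, from the benchmark table

total = combined_throughput(normalizer_rate, g2p_rate)
print(f"{total:,.0f} sentences/s")  # 37,314 sentences/s
```

This agrees with the reported ~37,000 sentences/s and shows that the normalizer, being roughly ten times slower than G2P, dominates the pipeline time.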

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a huge JSON/SQLite into RAM, we use a custom binary format (.bin) mapped directly into memory. This allows near-instant startup and extremely low memory overhead.
  • String Pooling: To minimize file size, all unique strings (words and phonemes) are stored once in a global string pool and referenced by 4-byte IDs.
  • Binary Search: Words are pre-sorted during the build process, so each lookup is an O(log n) binary search directly on the memory-mapped data.
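
The three ideas above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the file layout (a header, a string offset table, sorted ID pairs, then the pool bytes) and the names build and Lexicon are hypothetical, and the real format is the one specified in src/g2p/mod.rs.

```python
import mmap
import struct
import tempfile
from pathlib import Path

# Hypothetical layout, NOT the real SEA-G2P format:
#   header  : entry count, string count          (2 x uint32)
#   offsets : string count + 1 pool offsets      (uint32 each)
#   entries : (word ID, phoneme ID) per word     (2 x uint32), sorted by word
#   pool    : UTF-8 bytes of every unique string, concatenated
HEADER = struct.Struct("<II")
ENTRY = struct.Struct("<II")


def build(path, lexicon):
    """Write a {word: phonemes} dict to `path` in the layout above."""
    pool = {}  # string bytes -> 4-byte ID; each unique string is stored once

    def intern(text):
        return pool.setdefault(text.encode("utf-8"), len(pool))

    # Pre-sort by UTF-8 bytes so lookups can compare raw bytes.
    words = sorted(lexicon, key=lambda w: w.encode("utf-8"))
    entries = [(intern(w), intern(lexicon[w])) for w in words]

    blobs = list(pool)  # dicts keep insertion order, so index == ID
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))

    with open(path, "wb") as f:
        f.write(HEADER.pack(len(entries), len(blobs)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for entry in entries:
            f.write(ENTRY.pack(*entry))
        f.write(b"".join(blobs))


class Lexicon:
    """Read-only view over the file; nothing is parsed up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n_entries, n_strings = HEADER.unpack_from(self._mm, 0)
        self._offsets_at = HEADER.size
        self._entries_at = self._offsets_at + 4 * (n_strings + 1)
        self._pool_at = self._entries_at + ENTRY.size * self._n_entries

    def _string(self, string_id):
        # Two adjacent offsets delimit the string inside the pool.
        start, end = struct.unpack_from(
            "<II", self._mm, self._offsets_at + 4 * string_id
        )
        return self._mm[self._pool_at + start:self._pool_at + end]

    def lookup(self, word):
        """Binary search over the sorted entries; returns None if absent."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = ENTRY.unpack_from(
                self._mm, self._entries_at + ENTRY.size * mid
            )
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


if __name__ == "__main__":
    # Word/phoneme pairs taken from the sample output in the Usage section.
    sample = {"giá": "zˈaːɜ", "hôm": "hˈom", "nay": "nˈaj"}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "lexicon.bin"
        build(path, sample)
        lex = Lexicon(path)
        print(lex.lookup("nay"))  # nˈaj
        print(lex.lookup("xin"))  # None
        lex.close()  # unmap before the temporary directory is removed
```

Opening the file only maps it; pages are read from disk as lookups touch them, which is why startup cost and resident memory stay low in this kind of design.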

For full details on the specification, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
