🦭 SEA-G2P

Fast multilingual text-to-phoneme converter for South East Asian languages.

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Passing a list triggers automatic parallel processing
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
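
The normalized output above wraps English spans in <en>...</en> tags. As a minimal sketch of how downstream code might consume that tagging, the snippet below splits a normalized string into language-labelled segments. It assumes the tags always appear exactly as in the sample output (lowercase, not nested); split_by_language is a hypothetical helper written for this illustration and is not part of the sea_g2p API.

```python
import re

# Assumption: English spans are marked exactly as <en>...</en>, never nested.
_EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized):
    """Split normalizer output into (language, text) segments."""
    segments = []
    pos = 0
    for match in _EN_SPAN.finditer(normalized):
        # Text before the tag is treated as Vietnamese.
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append(("vi", before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    # Whatever follows the last tag is Vietnamese as well.
    tail = normalized[pos:].strip()
    if tail:
        segments.append(("vi", tail))
    return segments


sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
segments = split_by_language(sample)
for lang, text in segments:
    print(lang, repr(text))
```

A text with no tags comes back as a single ("vi", ...) segment, so the helper can be applied to every normalized sentence without special-casing.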

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

| Module     | Implementation       | Throughput            |
|------------|----------------------|-----------------------|
| Normalizer | Rust Core (Parallel) | ~41,000 sentences/s   |
| G2P        | Rust Core (Parallel) | ~415,000 sentences/s  |

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
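
The total figure is consistent with the two stages running back to back on each batch. This is an assumption about how the pipeline is scheduled, not something the benchmark states. Under that assumption the per-sentence times add, so the combined rate is the reciprocal of the summed reciprocals. The helper name combined_throughput below is made up for this illustration.

```python
def combined_throughput(*stage_rates):
    """Rate of stages executed one after another on the same items.

    Each stage spends 1/rate seconds per item; those times add up,
    so the overall rate is 1 / sum(1 / rate).
    """
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


normalizer_rate = 41_000   # sentences/s, from the table above
g2p_rate = 415_000         # sentences/s, from the table above

pipeline_rate = combined_throughput(normalizer_rate, g2p_rate)
print(f"{pipeline_rate:,.0f} sentences/s")
```

This gives roughly 37,300 sentences/s, close to the measured ~37,000. It also shows that the normalizer dominates: speeding up G2P further would barely change the end-to-end number.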

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, SEA-G2P stores the lexicon in a custom binary format (.bin) that is mapped directly into memory. This allows near-instant startup and very low memory overhead, because the operating system pages data in only as it is accessed.
  • String Pooling: To minimize file size, each unique string (word or phoneme sequence) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted at build time, so lookups run in O(log n) directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
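
To make the three ideas above concrete, here is a small Python sketch that combines them: a string pool addressed by 4-byte IDs, entries sorted by word, and a binary search performed directly on a memory-mapped file. The file layout (header, entry table, offset table, pool) is invented for this illustration and is not the actual .bin specification, which lives in the Rust source. The names build and Lexicon are likewise hypothetical.

```python
import mmap
import os
import struct
import tempfile


def build(path, lexicon):
    """Write a toy {word: phonemes} lexicon in a made-up binary layout."""
    # String pool: every unique string is stored once and gets a 4-byte ID.
    strings = sorted({s.encode("utf-8") for pair in lexicon.items() for s in pair})
    ids = {s: i for i, s in enumerate(strings)}
    # Entries are sorted by the word's UTF-8 bytes so lookups can bisect.
    entries = sorted((w.encode("utf-8"), p.encode("utf-8")) for w, p in lexicon.items())
    offsets, pos = [], 0
    for s in strings:
        offsets.append(pos)
        pos += len(s)
    offsets.append(pos)  # sentinel marking the end of the last string
    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(entries), len(strings)))  # header
        for w, p in entries:
            f.write(struct.pack("<II", ids[w], ids[p]))          # entry table
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))      # offset table
        f.write(b"".join(strings))                               # string pool


class Lexicon:
    """Read-only view over the file; nothing is parsed up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n, n_strings = struct.unpack_from("<II", self._mm, 0)
        self._entries = 8                               # right after the header
        self._offsets = self._entries + 8 * self._n
        self._pool = self._offsets + 4 * (n_strings + 1)

    def _string(self, string_id):
        start, end = struct.unpack_from("<II", self._mm, self._offsets + 4 * string_id)
        return self._mm[self._pool + start:self._pool + end]

    def lookup(self, word):
        key = word.encode("utf-8")
        lo, hi = 0, self._n
        while lo < hi:  # classic binary search over the sorted entry table
            mid = (lo + hi) // 2
            word_id, phon_id = struct.unpack_from("<II", self._mm, self._entries + 8 * mid)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phon_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


# Toy data; the phoneme strings are copied from the example output above.
toy = {"giá": "zˈaːɜ", "hôm": "hˈom", "nay": "nˈaj", "năm": "nˈam"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    build(path, toy)
    lex = Lexicon(path)
    found = lex.lookup("hôm")
    missing = lex.lookup("xyz")
    lex.close()  # unmap before the directory is deleted (required on Windows)
print(found, missing)
```

Opening the lexicon only reads the 8-byte header, which is why startup cost stays flat as the dictionary grows. Each lookup touches about log2(n) entries plus the strings they point to.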

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
