Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# A list input is processed in parallel automatically
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
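
The normalized output above wraps English spans in <en>…</en> tags. If downstream code needs those spans separately (for example, to inspect code-switching boundaries), they can be split with a regular expression. This is a minimal sketch that only assumes the tag format visible in the sample output; `split_by_language` is a hypothetical helper, not part of sea_g2p, and untagged text is simply labeled "vi" here.

```python
import re

# Matches one English span as it appears in the sample normalizer output.
EN_SPAN = re.compile(r"<en>(.*?)</en>")

def split_by_language(normalized):
    """Split a normalized string into (label, text) segments.

    Hypothetical helper: text inside <en>...</en> is labeled "en",
    everything else is labeled "vi".
    """
    segments = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append(("vi", before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append(("vi", tail))
    return segments

sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
for label, text in split_by_language(sample):
    print(label, "|", text)
```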

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS (CPython 3.10+), so installing from a wheel needs no Rust toolchain.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module       Implementation          Throughput
Normalizer   Rust Core (Parallel)    ~41,000 sentences/s
G2P          Rust Core (Parallel)    ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
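
As a sanity check, the total figure is roughly what you would expect if the two stages ran one after the other, so that their per-sentence costs add up. The sketch below makes that assumption explicit; it is a back-of-the-envelope calculation, not a description of how the pipeline actually schedules its work.

```python
def combined_throughput(*rates):
    """Throughput of stages run back to back: per-item times add up."""
    return 1.0 / sum(1.0 / rate for rate in rates)

# Per-module figures from the table above (sentences per second).
normalizer_rate = 41_000
g2p_rate = 415_000

total = combined_throughput(normalizer_rate, g2p_rate)
print(f"~{total:,.0f} sentences/s")
```

The result lands close to the reported ~37,000 sentences/s, which suggests the Normalizer dominates the end-to-end cost.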

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON or SQLite lexicon into RAM, the engine maps a custom binary file (.bin) directly into memory. The operating system pages data in on demand, so startup is near-instant and memory overhead stays low.
  • String Pooling: To minimize file size, all unique strings (words and phonemes) are stored once in a global string pool and referenced by 4-byte IDs.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookup speeds directly on the memory-mapped data.
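
The three ideas above can be illustrated with a small Python sketch. The file layout, the `build` function, and the `Lexicon` class are all hypothetical and chosen only for illustration; the actual .bin format is the one specified in src/g2p/mod.rs and may differ in every detail.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout (NOT the real SEA-G2P format), little-endian:
#   header  : u32 n_strings, u32 n_entries
#   offsets : (n_strings + 1) x u32 byte offsets into the pool
#   entries : n_entries x (u32 word_id, u32 phoneme_id), sorted by word bytes
#   pool    : every unique UTF-8 string, stored exactly once

def build(path, lexicon):
    """Write a word -> phoneme dict into the toy binary format."""
    strings = sorted({s for pair in lexicon.items() for s in pair})
    ids = {s: i for i, s in enumerate(strings)}
    blobs = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    # Pre-sort entries by the word's bytes so lookups can binary-search.
    entries = sorted(lexicon.items(), key=lambda kv: kv[0].encode("utf-8"))
    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(strings), len(entries)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for word, phonemes in entries:
            f.write(struct.pack("<II", ids[word], ids[phonemes]))
        f.write(b"".join(blobs))

class Lexicon:
    """Read-only view over the toy format via a memory map."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n_strings, self._n_entries = struct.unpack_from("<II", self._mm, 0)
        self._offsets_at = 8
        self._entries_at = self._offsets_at + 4 * (self._n_strings + 1)
        self._pool_at = self._entries_at + 8 * self._n_entries

    def _string(self, string_id):
        # Resolve a 4-byte ID to its bytes in the string pool.
        start, end = struct.unpack_from(
            "<II", self._mm, self._offsets_at + 4 * string_id
        )
        return self._mm[self._pool_at + start:self._pool_at + end]

    def lookup(self, word):
        """O(log n) binary search directly on the mapped entries."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = struct.unpack_from(
                "<II", self._mm, self._entries_at + 8 * mid
            )
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()

# Toy data for the demo; the phoneme strings are placeholders.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
build(demo_path, {"hello": "həlˈoʊ", "world": "wˈɜːld", "email": "ˈiːmeɪl"})
lex = Lexicon(demo_path)
print(lex.lookup("email"))
print(lex.lookup("missing"))
lex.close()
```

Nothing is parsed at load time: opening the lexicon reads only the 8-byte header, and each lookup touches just the pages that the binary search visits.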

For full details on the specification, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
