Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Passing a list triggers automatic parallel processing
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
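
The normalizer output above wraps English spans in <en>…</en> tags. As a small illustration of how downstream code might consume that markup, the sketch below splits a normalized string into language-tagged segments. Note that split_segments is a hypothetical helper written for this example, not part of the sea_g2p API, and it assumes the tags are never nested.

```python
import re

# Matches one English span as emitted in the normalized example above.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_segments(normalized):
    """Split normalized text into (lang, text) pairs.

    Hypothetical helper for illustration only; assumes <en> tags are not nested.
    """
    segments = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append(("vi", before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append(("vi", tail))
    return segments


segments = split_segments(
    "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
)
for lang, text in segments:
    print(lang, text)
```

Segments like these could be useful for inspecting how code-switched input was tagged before it reaches the G2P step.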

Features

  • Fast: Core engine written in Rust, with lookups served from a memory-mapped binary file (see Performance for measured throughput).
  • Multithreading: Batch inputs are processed in parallel automatically via Rayon (Rust).
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module     | Implementation       | Throughput
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P        | Rust Core (Parallel) | ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
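
As a rough sanity check on these figures: if the two stages ran strictly one after the other, with no overlap and no extra overhead (an assumption, since the benchmark setup is not described beyond the line above), the per-sentence times would add, and the combined throughput would be the reciprocal of that sum. The short calculation below shows that this simple model lands close to the reported pipeline figure.

```python
# Reported stage throughputs, in sentences per second.
normalizer_rate = 41_000
g2p_rate = 415_000

# Assumption: stages run back to back, so per-sentence times add up.
seconds_per_sentence = 1 / normalizer_rate + 1 / g2p_rate
combined_rate = 1 / seconds_per_sentence

print(f"Estimated pipeline throughput: ~{combined_rate:,.0f} sentences/s")
```

Under this model the normalizer dominates the total time, so speeding up G2P further would change the pipeline figure very little.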

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON or SQLite dictionary into RAM, SEA-G2P maps a custom binary file (.bin) directly into memory. The operating system pages data in on demand, which gives near-instant startup and low memory overhead.
  • String Pooling: To minimize file size, every unique string (word or phoneme) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, so lookups run in O(log n) time directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
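
The three ideas above (memory mapping, string pooling, and binary search over pre-sorted words) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file layout, the build function, and the Lexicon class below are hypothetical and are not the actual SEA-G2P binary format, which is specified in the Rust source.

```python
from __future__ import annotations

import mmap
import struct
import tempfile
from pathlib import Path

# Hypothetical layout (NOT the real SEA-G2P format):
#   header  : n_strings, n_entries           (2 x uint32)
#   offsets : n_strings + 1 offsets          (uint32 each, into the blob)
#   entries : (word_id, phoneme_id) pairs    (2 x uint32, sorted by word bytes)
#   blob    : all pooled strings as UTF-8, concatenated
HEADER = struct.Struct("<II")
ENTRY = struct.Struct("<II")


def build(lexicon: dict[str, str], path: Path) -> None:
    pool: dict[bytes, int] = {}  # string pool: each unique string gets one 4-byte ID

    def intern(text: str) -> int:
        return pool.setdefault(text.encode("utf-8"), len(pool))

    # Pre-sort words by their UTF-8 bytes so lookups can binary-search.
    words = sorted(lexicon, key=lambda w: w.encode("utf-8"))
    entries = [(intern(w), intern(lexicon[w])) for w in words]

    strings = list(pool)  # dict insertion order matches ID order
    offsets = [0]
    for raw in strings:
        offsets.append(offsets[-1] + len(raw))

    with open(path, "wb") as f:
        f.write(HEADER.pack(len(strings), len(entries)))
        f.write(struct.pack(f"<{len(offsets)}I", *offsets))
        for entry in entries:
            f.write(ENTRY.pack(*entry))
        f.write(b"".join(strings))


class Lexicon:
    def __init__(self, path: Path) -> None:
        self._file = open(path, "rb")
        # Map the file read-only; pages are loaded by the OS on demand.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        n_strings, self._n_entries = HEADER.unpack_from(self._mm, 0)
        self._offsets_at = HEADER.size
        self._entries_at = self._offsets_at + 4 * (n_strings + 1)
        self._blob_at = self._entries_at + ENTRY.size * self._n_entries

    def _string(self, string_id: int) -> bytes:
        start, end = struct.unpack_from("<II", self._mm, self._offsets_at + 4 * string_id)
        return self._mm[self._blob_at + start : self._blob_at + end]

    def get(self, word: str):
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:  # classic binary search, O(log n) comparisons
            mid = (lo + hi) // 2
            word_id, phoneme_id = ENTRY.unpack_from(self._mm, self._entries_at + ENTRY.size * mid)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self) -> None:
        self._mm.close()
        self._file.close()


# Demo with three word/phoneme pairs taken from the usage example above.
with tempfile.TemporaryDirectory() as tmp:
    bin_path = Path(tmp) / "lexicon.bin"
    build({"hôm": "hˈom", "nay": "nˈaj", "giá": "zˈaːɜ"}, bin_path)
    lex = Lexicon(bin_path)
    found = lex.get("nay")
    missing = lex.get("xyz")
    lex.close()
    print(found, missing)
```

Opening the lexicon in this sketch only reads the 8-byte header; string and entry data are touched lazily during lookups, which is the property that keeps startup cost and resident memory small.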

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    

Download files

Version 0.6.6 is published as a source distribution and the following prebuilt wheels:

  • sea_g2p-0.6.6.tar.gz (20.0 MB): Source
  • sea_g2p-0.6.6-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.4 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.6.6-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.4 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.6.6-cp310-abi3-win_amd64.whl (21.1 MB): CPython 3.10+, Windows x86-64
  • sea_g2p-0.6.6-cp310-abi3-win32.whl (21.0 MB): CPython 3.10+, Windows x86
  • sea_g2p-0.6.6-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.4 MB): CPython 3.10+, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.6.6-cp310-abi3-macosx_11_0_arm64.whl (21.2 MB): CPython 3.10+, macOS 11.0+, ARM64
  • sea_g2p-0.6.6-cp310-abi3-macosx_10_12_x86_64.whl (21.3 MB): CPython 3.10+, macOS 10.12+, x86-64
  • sea_g2p-0.6.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.4 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64