
Fast multilingual text-to-phoneme converter for South East Asian languages.


🦭 SEA-G2P


Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
# zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (a list input is processed in parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
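
For corpora too large to hold in one list, the batch call above can be driven in fixed-size chunks. The following is a minimal sketch, not part of SEA-G2P: `run_in_chunks` and `chunk_size` are hypothetical names, and the sketch assumes the batch function returns one result per input, in input order.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def run_in_chunks(
    run: Callable[[List[str]], List[str]],
    texts: Iterable[str],
    chunk_size: int = 10_000,
) -> Iterator[str]:
    """Feed `texts` to a batch function chunk by chunk, yielding results in order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    it = iter(texts)
    while True:
        # Pull at most chunk_size items so the whole corpus never sits in memory.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        results = run(chunk)
        # Assumption: one output per input; fail loudly if that does not hold.
        if len(results) != len(chunk):
            raise RuntimeError("batch function returned a different number of items")
        yield from results


if __name__ == "__main__":
    # Stand-in for pipeline.run so the sketch runs without sea-g2p installed.
    def fake_run(batch: List[str]) -> List[str]:
        return [t.upper() for t in batch]

    print(list(run_in_chunks(fake_run, ["a", "b", "c"], chunk_size=2)))
```

With the real library, `run` would be `pipeline.run`. Larger chunks give the parallel backend more work per call, at the cost of more memory per batch.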

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Passing a list triggers automatic parallel processing
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
# ['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
# ['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
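
The normalized output above marks some English spans with `<en>…</en>` tags. When debugging normalization it can help to see those spans separately. The sketch below splits a normalized string into language-tagged segments. It is not part of SEA-G2P: `split_by_language` is a hypothetical helper, and the tag format is inferred only from the sample output, so nested tags or tags for other languages are not handled.

```python
import re
from typing import List, Tuple

# Matches one <en>...</en> span as it appears in the sample normalizer output.
_EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized: str, default_lang: str = "vi") -> List[Tuple[str, str]]:
    """Split a normalized string into (lang, text) segments using <en> tags."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in _EN_SPAN.finditer(normalized):
        # Text between the previous tag and this one is in the default language.
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        inner = match.group(1).strip()
        if inner:
            segments.append(("en", inner))
        pos = match.end()
    # Whatever follows the last tag is in the default language again.
    tail = normalized[pos:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


if __name__ == "__main__":
    sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
    for lang, text in split_by_language(sample):
        print(lang, text)
```

This helper is for inspection only. As the example above shows, `g2p.convert` accepts the tagged strings directly.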

Features

  • Blazing Fast: The core engine is rewritten in Rust and resolves words through a memory-mapped binary lookup.
  • Multithreading: Batch inputs are processed in parallel automatically, using Rayon on the Rust side.
  • Zero Dependency: Pre-compiled wheels are available for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module     | Implementation       | Throughput
-----------|----------------------|----------------------
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P        | Rust Core (Parallel) | ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
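
The pipeline figure is consistent with the two per-module figures above if the stages run back to back. The check below is a rough sketch: it assumes every sentence passes through the normalizer and then G2P with no overlap between stages and negligible extra overhead, which the benchmark description does not state. `sequential_throughput` is a hypothetical helper name.

```python
def sequential_throughput(*stage_rates: float) -> float:
    """Combined rate of stages run back to back: 1 / sum(1 / r_i)."""
    if not stage_rates or any(r <= 0 for r in stage_rates):
        raise ValueError("need at least one positive stage rate")
    # Time per sentence adds up across stages, so the rates combine harmonically.
    return 1.0 / sum(1.0 / r for r in stage_rates)


normalizer_rate = 41_000   # sentences/s, reported figure
g2p_rate = 415_000         # sentences/s, reported figure

combined = sequential_throughput(normalizer_rate, g2p_rate)
print(f"{combined:,.0f} sentences/s")
```

Under that assumption the combined rate is about 37,300 sentences/s, which is close to the reported ~37,000. It also shows that the normalizer, not G2P, dominates end-to-end time.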

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, SEA-G2P uses a custom binary format (.bin) that is mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, each unique string (word or phoneme sequence) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookups directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
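
To make the three ideas above concrete, here is a toy Python version of a memory-mapped lexicon with a string pool and binary search. It is only an illustration: the byte layout (header, offset table, entry table, pool), the `build` function and the `Lexicon` class are all assumptions made for this sketch and do not reproduce the actual .bin specification.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<II")   # entry count, string count
U32 = struct.Struct("<I")       # one offset into the string pool
ENTRY = struct.Struct("<II")    # word string ID, phoneme string ID


def build(path, lexicon):
    """Write {word: phonemes} to `path` in the toy layout."""
    pool, ids = [], {}

    def intern(s):
        # Store each unique string once and hand back its 4-byte ID.
        if s not in ids:
            ids[s] = len(pool)
            pool.append(s.encode("utf-8"))
        return ids[s]

    # Sort by UTF-8 bytes so lookups can compare raw bytes.
    words = sorted(lexicon, key=lambda w: w.encode("utf-8"))
    entries = [(intern(w), intern(lexicon[w])) for w in words]
    offsets, total = [0], 0
    for blob in pool:
        total += len(blob)
        offsets.append(total)
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(entries), len(pool)))
        f.write(b"".join(U32.pack(o) for o in offsets))
        f.write(b"".join(ENTRY.pack(*e) for e in entries))
        f.write(b"".join(pool))


class Lexicon:
    """Read-only view over the file; nothing is parsed up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._n, n_strings = HEADER.unpack_from(self._mm, 0)
        self._offsets_at = HEADER.size
        self._entries_at = self._offsets_at + U32.size * (n_strings + 1)
        self._pool_at = self._entries_at + ENTRY.size * self._n

    def _string(self, sid):
        # Resolve a string ID to its bytes via the offset table.
        start, = U32.unpack_from(self._mm, self._offsets_at + U32.size * sid)
        end, = U32.unpack_from(self._mm, self._offsets_at + U32.size * (sid + 1))
        return self._mm[self._pool_at + start:self._pool_at + end]

    def lookup(self, word):
        key, lo, hi = word.encode("utf-8"), 0, self._n
        while lo < hi:  # binary search over the sorted entry table
            mid = (lo + hi) // 2
            wid, pid = ENTRY.unpack_from(self._mm, self._entries_at + ENTRY.size * mid)
            current = self._string(wid)
            if current == key:
                return self._string(pid).decode("utf-8")
            if current < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy.bin")
    build(path, {"năm": "nˈam", "sáu": "sˈaɜw", "bốn": "bˈoɜn"})
    lex = Lexicon(path)
    print(lex.lookup("sáu"), lex.lookup("missing"))
    lex.close()
```

Opening the file only reads the 8-byte header; the operating system pages in the rest on demand, which is why startup cost does not grow with lexicon size in this design.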

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
