🦭 SEA-G2P

Fast multilingual text-to-phoneme converter for South East Asian languages.

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = [
    "Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.",
    "Hãy gửi email đến support@example.com.",
] * 1000
results = pipeline.run(texts)
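
For inputs too large to hold in memory at once, a batch call like the one above can be fed in chunks. The following is a minimal sketch and not part of SEA-G2P: phonemize_in_chunks and chunk_size are hypothetical names, and the code assumes that run accepts a list of strings and returns one result per input in the same order. A stand-in function replaces the real pipeline so the sketch runs without the package installed.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def phonemize_in_chunks(
    run: Callable[[List[str]], List[str]],
    texts: Iterable[str],
    chunk_size: int = 10_000,
) -> Iterator[str]:
    """Yield results lazily, sending at most chunk_size texts per call to run."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    iterator = iter(texts)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        # Assumption: run() returns one result per input, in input order.
        yield from run(chunk)


# Stand-in for pipeline.run so the sketch is runnable on its own.
def fake_run(batch: List[str]) -> List[str]:
    return [text.upper() for text in batch]


results = list(phonemize_in_chunks(fake_run, ["a", "b", "c"], chunk_size=2))
print(results)  # ['A', 'B', 'C']
```

With the real package, pipeline.run would be passed in place of fake_run, and texts could be any iterable, such as the lines of an open file.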

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Parallel processing is applied automatically when a list is passed
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
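
In the normalized output above, spans treated as English are wrapped in <en> and </en> tags. When inspecting or post-processing normalized text, it can help to split it into language-tagged segments. The sketch below uses only the standard library; split_by_language is a hypothetical helper, not part of SEA-G2P, and labeling untagged text as "vi" is an assumption based on lang="vi" in the example.

```python
import re
from typing import List, Tuple

# Non-greedy match so each <en>...</en> pair is captured separately.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized: str) -> List[Tuple[str, str]]:
    """Split normalized text into (language, text) segments, in order."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append(("vi", before))  # assumption: untagged text is Vietnamese
        english = match.group(1).strip()
        if english:
            segments.append(("en", english))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append(("vi", tail))
    return segments


sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
for lang, text in split_by_language(sample):
    print(lang, text)
```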

Features

  • Blazing Fast: The core engine is rewritten in Rust and uses memory-mapped (mmap) binary lookup.
  • Multithreading: Batch inputs are processed in parallel automatically, using Rayon on the Rust side.
  • Zero Dependency: Pre-compiled wheels are provided for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module     | Implementation       | Throughput
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P        | Rust Core (Parallel) | ~415,000 sentences/s

Total pipeline throughput: ~37,000 sentences/s (tested on CPython 3.12, Windows 11, multithreaded)
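
The total is consistent with the two stages running one after the other: if every sentence passes through the normalizer and then through G2P, the per-sentence times add, so the combined throughput is the reciprocal of the summed reciprocals. A quick check, assuming purely sequential stages and ignoring any Python-side overhead (combined_throughput is a name chosen for illustration):

```python
def combined_throughput(*stage_rates: float) -> float:
    """Throughput of stages run back to back: 1 / sum(1 / rate)."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


normalizer_rate = 41_000  # sentences/s, from the benchmark figures above
g2p_rate = 415_000        # sentences/s, from the benchmark figures above

total = combined_throughput(normalizer_rate, g2p_rate)
print(f"{total:,.0f} sentences/s")  # 37,314 sentences/s
```

Under this model the normalizer dominates the total, so it is the stage where further optimization would pay off most.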

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, we use a custom binary format (.bin) mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, every unique string (word or phoneme) is stored once in a global string pool and referenced by a 4-byte ID.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookups directly on the memory-mapped data.
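
The three ideas above can be illustrated with a small, self-contained Python sketch. The file layout here is hypothetical and chosen only for illustration (a header, a string offset table, sorted entries of two 4-byte IDs, then the pooled strings); it is not the actual .bin format, and the names build and Lexicon are assumptions, not SEA-G2P APIs.

```python
import mmap
import os
import struct
import tempfile
from typing import Dict, Optional

HEADER = struct.Struct("<II")  # n_strings, n_entries (hypothetical layout)
U32 = struct.Struct("<I")      # one string offset
ENTRY = struct.Struct("<II")   # word_id, phoneme_id


def build(path: str, lexicon: Dict[str, str]) -> None:
    # String pool: every unique string is stored once and gets a 4-byte ID.
    pool = sorted({s.encode("utf-8") for kv in lexicon.items() for s in kv})
    ids = {s: i for i, s in enumerate(pool)}
    offsets, pos = [0], 0
    for s in pool:
        pos += len(s)
        offsets.append(pos)
    # Entries are pre-sorted by the word's UTF-8 bytes to allow binary search.
    entries = sorted((w.encode("utf-8"), p.encode("utf-8")) for w, p in lexicon.items())
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(pool), len(entries)))
        f.write(b"".join(U32.pack(o) for o in offsets))
        f.write(b"".join(ENTRY.pack(ids[w], ids[p]) for w, p in entries))
        f.write(b"".join(pool))


class Lexicon:
    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        # Map the file read-only; pages are loaded by the OS on demand.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        n_strings, self._n = HEADER.unpack_from(self._mm, 0)
        self._offsets_at = HEADER.size
        self._entries_at = self._offsets_at + U32.size * (n_strings + 1)
        self._pool_at = self._entries_at + ENTRY.size * self._n

    def _string(self, sid: int) -> bytes:
        start, = U32.unpack_from(self._mm, self._offsets_at + U32.size * sid)
        end, = U32.unpack_from(self._mm, self._offsets_at + U32.size * (sid + 1))
        return self._mm[self._pool_at + start:self._pool_at + end]

    def lookup(self, word: str) -> Optional[str]:
        key = word.encode("utf-8")
        lo, hi = 0, self._n
        while lo < hi:  # O(log n) binary search over the sorted entries
            mid = (lo + hi) // 2
            wid, pid = ENTRY.unpack_from(self._mm, self._entries_at + ENTRY.size * mid)
            current = self._string(wid)
            if current == key:
                return self._string(pid).decode("utf-8")
            if current < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self) -> None:
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lexicon.bin")
    build(path, {"hai": "hˈaːj", "năm": "nˈam", "sáu": "sˈaɜw"})
    lex = Lexicon(path)
    found = lex.lookup("năm")
    missing = lex.lookup("mười")
    lex.close()

print(found)    # nˈam
print(missing)  # None
```

Opening the lexicon only maps the file and reads an 8-byte header, which is why startup cost in a design like this does not grow with dictionary size.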

For full details on the specification, see src/g2p/mod.rs.

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    

Download files

Source Distribution

  • sea_g2p-0.7.4.tar.gz (20.0 MB): Source

Built Distributions

  • sea_g2p-0.7.4-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.5 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.7.4-pp39-pypy39_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.5 MB): PyPy, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.7.4-cp310-abi3-win_amd64.whl (21.2 MB): CPython 3.10+, Windows x86-64
  • sea_g2p-0.7.4-cp310-abi3-win32.whl (21.1 MB): CPython 3.10+, Windows x86
  • sea_g2p-0.7.4-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (21.5 MB): CPython 3.10+, manylinux (glibc 2.17+), ARM64
  • sea_g2p-0.7.4-cp310-abi3-macosx_11_0_arm64.whl (21.3 MB): CPython 3.10+, macOS 11.0+, ARM64
  • sea_g2p-0.7.4-cp310-abi3-macosx_10_12_x86_64.whl (21.4 MB): CPython 3.10+, macOS 10.12+, x86-64
  • sea_g2p-0.7.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.5 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64
