Fast multilingual text-to-phoneme converter for South East Asian languages.

🦭 SEA-G2P

Author: Pham Nguyen Ngoc Bao

🚀 Used By

SEA-G2P is the core phonemization engine powering:

  • VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.

By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.

Installation

pip install sea-g2p

Usage

Simple Pipeline

from sea_g2p import SEAPipeline

pipeline = SEAPipeline(lang="vi")

# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.

# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
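
For inputs too large to hold in one list, the batch call above can be fed in fixed-size chunks. The sketch below is a hypothetical helper, not part of SEA-G2P: `run_in_chunks` and `chunk_size` are names introduced here, and `run` is assumed to be any callable that maps a list of strings to a list of results of the same length, as `pipeline.run` does in the example above.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def run_in_chunks(
    run: Callable[[List[str]], List[str]],
    lines: Iterable[str],
    chunk_size: int = 10_000,
) -> Iterator[str]:
    """Yield results lazily, passing at most `chunk_size` texts to `run` at a time."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        # Each chunk is still a list, so a batch-aware callable can parallelize it.
        yield from run(chunk)


if __name__ == "__main__":
    # Stand-in for pipeline.run so the snippet runs without sea_g2p installed.
    def fake_run(batch: List[str]) -> List[str]:
        return [text.upper() for text in batch]

    for result in run_in_chunks(fake_run, ["xin chào", "cảm ơn", "tạm biệt"], chunk_size=2):
        print(result)
```

With the real pipeline, `run_in_chunks(pipeline.run, open("corpus.txt", encoding="utf-8"))` would stream a file line by line; the file name is only an illustration.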

Individual Modules

from sea_g2p import Normalizer, G2P

normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")

# Automatic parallel processing when a list is passed
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
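
In the normalized output above, English spans are wrapped in `<en>…</en>` tags. When inspecting or debugging normalizer output, it can help to split a string into language segments. The sketch below is a hypothetical helper, not part of SEA-G2P, and it assumes the tags are well formed and never nested, as in the sample output.

```python
import re
from typing import List, Tuple

# Matches one <en>...</en> span; non-greedy so adjacent spans stay separate.
EN_TAG = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized: str, default_lang: str = "vi") -> List[Tuple[str, str]]:
    """Split normalized text into (language, text) segments using <en> tags."""
    segments: List[Tuple[str, str]] = []
    position = 0
    for match in EN_TAG.finditer(normalized):
        before = normalized[position:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        segments.append(("en", match.group(1).strip()))
        position = match.end()
    tail = normalized[position:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


if __name__ == "__main__":
    sample = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
    for lang, text in split_by_language(sample):
        print(lang, repr(text))
```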

Features

  • Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
  • Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
  • Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
  • Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
  • Bilingual Support: Handles mixed Vietnamese/English text seamlessly.

📊 Performance

The following benchmarks were conducted on a dataset of 1,000,000 sentences:

Module     | Implementation       | Throughput
Normalizer | Rust Core (Parallel) | ~41,000 sentences/s
G2P        | Rust Core (Parallel) | ~415,000 sentences/s

Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
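
Throughput depends on hardware, thread count, and the mix of input text, so it is worth measuring on your own data. The sketch below is one way to do that; `measure_throughput` is a hypothetical helper, not part of SEA-G2P, and `fake_run` only stands in for `pipeline.run` from the Usage section so that the snippet runs without the package installed.

```python
import time
from typing import Callable, List


def measure_throughput(
    run: Callable[[List[str]], List[str]],
    texts: List[str],
    repeats: int = 3,
) -> float:
    """Return sentences per second, using the fastest of `repeats` timed runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        results = run(texts)
        elapsed = time.perf_counter() - start
        if len(results) != len(texts):
            raise ValueError("run() must return one result per input text")
        best = min(best, elapsed)
    # Guard against a zero reading on very small inputs.
    return len(texts) / max(best, 1e-9)


if __name__ == "__main__":
    def fake_run(batch: List[str]) -> List[str]:
        return [text.lower() for text in batch]

    sentences = ["Giá SP500 hôm nay là 4.200,5 điểm."] * 10_000
    print(f"{measure_throughput(fake_run, sentences):,.0f} sentences/s")
```

Taking the fastest run reduces noise from warm-up and background load; report the mean as well if you need a conservative figure.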

Technical Architecture

SEA-G2P is designed for maximum performance in production environments:

  • Memory Mapping (mmap): Instead of loading a large JSON file or SQLite database into RAM, we use a custom binary format (.bin) mapped directly into memory. This allows near-instant startup and very low memory overhead.
  • String Pooling: To minimize file size, all unique strings (words and phonemes) are stored once in a global string pool and referenced by 4-byte IDs.
  • Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookups directly on the memory-mapped data.

For full details on the specification, see src/g2p/mod.rs.
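
To make the three ideas above concrete, here is a minimal Python sketch of a memory-mapped lexicon with a string pool and binary search. It is an illustration only: the file layout (header, records, string table, blob) and the names `build_lexicon` and `Lexicon` are assumptions made for this example and do not describe SEA-G2P's actual .bin format, which is specified in src/g2p/mod.rs. The toy entries reuse phoneme strings from the sample output earlier in this document.

```python
import mmap
import os
import struct
import tempfile


def build_lexicon(entries, path):
    """Write a toy lexicon: header, sorted records, string table, string blob."""
    pool, ids = [], {}

    def intern(text):
        # String pooling: each unique string is stored once and gets a 4-byte ID.
        if text not in ids:
            ids[text] = len(pool)
            pool.append(text.encode("utf-8"))
        return ids[text]

    # Pre-sort words by UTF-8 bytes so lookups can binary-search raw bytes.
    ordered = sorted(entries.items(), key=lambda kv: kv[0].encode("utf-8"))
    records = [(intern(word), intern(phonemes)) for word, phonemes in ordered]

    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(records), len(pool)))   # header
        for word_id, phoneme_id in records:                     # records
            f.write(struct.pack("<II", word_id, phoneme_id))
        offset = 0
        for raw in pool:                                        # string table
            f.write(struct.pack("<II", offset, len(raw)))
            offset += len(raw)
        f.write(b"".join(pool))                                 # string blob


class Lexicon:
    """Read-only view over the toy file; nothing is parsed up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self.n_records, self.n_strings = struct.unpack_from("<II", self._mm, 0)
        self._records_at = 8
        self._table_at = self._records_at + 8 * self.n_records
        self._blob_at = self._table_at + 8 * self.n_strings

    def _string(self, string_id):
        offset, length = struct.unpack_from("<II", self._mm, self._table_at + 8 * string_id)
        start = self._blob_at + offset
        return self._mm[start:start + length]

    def lookup(self, word):
        """Binary search over the sorted records: O(log n) comparisons."""
        key = word.encode("utf-8")
        lo, hi = 0, self.n_records
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = struct.unpack_from("<II", self._mm, self._records_at + 8 * mid)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


if __name__ == "__main__":
    toy = {"giá": "zˈaːɜ", "hôm": "hˈom", "nay": "nˈaj"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        build_lexicon(toy, path)
        lexicon = Lexicon(path)
        print(lexicon.lookup("nay"))   # found entry
        print(lexicon.lookup("xyz"))   # missing entry gives None
        lexicon.close()
```

Opening the file only reads the 8-byte header; pages are loaded by the operating system as lookups touch them, which is why startup cost stays small regardless of lexicon size.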

Development

To install for development purposes:

  1. Clone the repository:

    git clone https://github.com/pnnbao97/sea-g2p
    cd sea-g2p
    
  2. Install in editable mode:

    pip install -e .
    
