Fast multilingual text-to-phoneme converter for Southeast Asian languages.
🦭 SEA-G2P
Author: Pham Nguyen Ngoc Bao
🚀 Used By
SEA-G2P is the core phonemization engine powering:
- VieNeu-TTS: An advanced on-device Vietnamese Text-to-Speech model with instant voice cloning.
By using SEA-G2P, VieNeu-TTS achieves high-fidelity pronunciation and seamless Vietnamese-English code-switching.
Installation
pip install sea-g2p
Usage
Simple Pipeline
from sea_g2p import SEAPipeline
pipeline = SEAPipeline(lang="vi")
# Single text
result = pipeline.run("Giá SP500 hôm nay là 4.200,5 điểm.")
print(result)
#zˈaːɜ ˈɛɜt̪ pˈe nˈam tʃˈam hˈom nˈaj lˌaː2 bˈoɜn ŋˈi2n hˈaːj tʃˈam fˈəɪ4 nˈam ɗˈiɛ4m.
# Batch processing (Parallel)
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."] * 1000
results = pipeline.run(texts)
Individual Modules
from sea_g2p import Normalizer, G2P
normalizer = Normalizer(lang="vi")
g2p = G2P(lang="vi")
# Lists are processed in parallel automatically
texts = ["Giá cổ phiếu tăng từ $0.000045 lên $1,234.5678 trong 3.5×10^6 giao dịch.", "Hãy gửi email đến support@example.com."]
normalized = normalizer.normalize(texts)
print(normalized)
#['giá cổ phiếu tăng từ không chấm không không không không bốn lăm <en>u s d</en> lên một nghìn hai trăm ba mươi bốn phẩy năm sáu bảy tám <en>u s d</en> trong ba chấm năm nhân mười mũ sáu giao dịch.', 'hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com.']
phonemes = g2p.convert(normalized)
print(phonemes)
#['zˈaːɜ kˈo4 fˈiɛɜw t̪ˈaŋ t̪ˌy2 xˌoŋ tʃˈəɜm xˌoŋ xˌoŋ xˌoŋ xˌoŋ bˈoɜn lˈam jˈuː ˈɛs dˈiː lˈen mˈo6t̪ ŋˈi2n hˈaːj tʃˈam bˈaː mˈyəj bˈoɜn fˈəɪ4 nˈam sˈaɜw bˈa4j t̪ˈaːɜm jˈuː ˈɛs dˈiː tʃˈɔŋ bˈaː tʃˈəɜm nˈam ɲˈən mˈyə2j mˈu5 sˈaɜw zˈaːw zˈi6c.', 'hˈa5j ɣˈy4j ˈiːmeɪl ɗˌeɜn səpˈɔːɹt ˈaː kˈɔ2ŋ ɛɡzˈæmpəl tʃˈəɜm kˈɔm.']
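The normalized output above wraps English spans in `<en>...</en>` tags. If downstream code needs to treat those spans separately, the string can be split into language-tagged segments. The following is a minimal sketch, not part of the SEA-G2P API: the helper name `split_by_language` and the `(lang, text)` tuple shape are assumptions made for illustration, and it relies only on the tag format visible in the example output.

```python
import re

# Matches the <en>...</en> spans seen in the normalizer output above.
EN_SPAN = re.compile(r"<en>(.*?)</en>")


def split_by_language(normalized: str, default_lang: str = "vi"):
    """Split a normalized string into (lang, text) segments.

    Hypothetical helper: text inside <en> tags is labeled "en",
    everything else is labeled with default_lang.
    """
    segments = []
    pos = 0
    for match in EN_SPAN.finditer(normalized):
        before = normalized[pos:match.start()].strip()
        if before:
            segments.append((default_lang, before))
        segments.append(("en", match.group(1).strip()))
        pos = match.end()
    tail = normalized[pos:].strip()
    if tail:
        segments.append((default_lang, tail))
    return segments


example = "hãy gửi email đến <en>support</en> a còng <en>example</en> chấm com."
for lang, text in split_by_language(example):
    print(lang, "|", text)
```

Segments come back in their original order, so joining them again reproduces the sentence without the tags.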
Features
- Blazing Fast: Core engine rewritten in Rust with binary mmap lookup.
- Multithreading: Automatic parallel processing using Rayon/Rust for batch inputs.
- Zero Dependency: Pre-compiled wheels for Windows, Linux, and macOS.
- Smart Normalization: Specialized for Vietnamese (numbers, dates, technical terms).
- Bilingual Support: Handles mixed Vietnamese/English text seamlessly.
📊 Performance
The following benchmarks were conducted on a dataset of 1,000,000 sentences:
| Module | Implementation | Throughput |
|---|---|---|
| Normalizer | Rust Core (Parallel) | ~41,000 sentences/s |
| G2P | Rust Core (Parallel) | ~415,000 sentences/s |
Total Pipeline Throughput: ~37,000 sentences/s (Tested on CPython 3.12, Windows 11, Multithreaded)
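To get comparable numbers on your own hardware, throughput can be estimated by timing one batch call and dividing the batch size by the elapsed time. The sketch below is an assumption about how such a measurement could be done, not the script used for the figures above. The function name `sentences_per_second` and the `repeats` parameter are hypothetical, and the demo uses a stand-in workload so it runs without the package installed.

```python
import time
from typing import Callable, Sequence


def sentences_per_second(
    fn: Callable[[Sequence[str]], object],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput of fn over texts, in sentences/s."""
    if not texts:
        raise ValueError("texts must not be empty")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(texts)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very fast call on a coarse clock.
    return len(texts) / max(best, 1e-9)


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a real batch function.
    texts = ["Hãy gửi email đến support@example.com."] * 10_000
    rate = sentences_per_second(lambda batch: [t.lower() for t in batch], texts)
    print(f"{rate:,.0f} sentences/s")
```

With the package installed, passing `SEAPipeline(lang="vi").run` as `fn` would time the full pipeline on the same batch. Results depend heavily on core count and input length, so expect figures that differ from the table.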
Technical Architecture
SEA-G2P is designed for maximum performance in production environments:
- Memory Mapping (mmap): Instead of loading a huge JSON/SQLite into RAM, we use a custom binary format (.bin) mapped directly into memory. This allows near-instant startup and extremely low memory overhead.
- String Pooling: To minimize file size, all unique strings (words and phonemes) are stored once in a global string pool and referenced by 4-byte IDs.
- Binary Search: Words are pre-sorted during the build process, allowing O(log n) lookup speeds directly on the memory-mapped data.
For full details on the specification, see src/g2p/mod.rs.
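The three ideas above can be illustrated with a toy version in Python. This is a sketch under stated assumptions, not the actual SEA-G2P file format: the layout (a two-field header, an offset table, fixed-size entries of two 4-byte IDs, then the string pool) and the names `build` and `MappedLexicon` are invented for the example. The sample phonemes are copied from the usage output earlier in this document.

```python
import mmap
import os
import struct
import tempfile

# Toy layout (hypothetical): header | offset table | sorted entries | string pool
HEADER = struct.Struct("<II")  # string count, entry count
U32 = struct.Struct("<I")      # one offset into the pool
ENTRY = struct.Struct("<II")   # word ID, phoneme ID (4 bytes each)


def build(path, lexicon):
    """Write a {word: phonemes} dict to the toy binary format."""
    # String pooling: every unique string is stored exactly once.
    pool = sorted({s for pair in lexicon.items() for s in pair},
                  key=lambda s: s.encode("utf-8"))
    ids = {s: i for i, s in enumerate(pool)}
    blobs = [s.encode("utf-8") for s in pool]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    # Pre-sort entries by UTF-8 bytes, the same order lookup() compares in.
    words = sorted(lexicon, key=lambda w: w.encode("utf-8"))
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(pool), len(words)))
        for off in offsets:
            f.write(U32.pack(off))
        for w in words:
            f.write(ENTRY.pack(ids[w], ids[lexicon[w]]))
        f.write(b"".join(blobs))


class MappedLexicon:
    """Read-only view over the toy file; nothing is parsed up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        n_strings, self._n_entries = HEADER.unpack_from(self._mm, 0)
        self._offsets_at = HEADER.size
        self._entries_at = self._offsets_at + U32.size * (n_strings + 1)
        self._pool_at = self._entries_at + ENTRY.size * self._n_entries

    def _string(self, string_id):
        # Resolve a 4-byte ID to its bytes in the pool.
        start, = U32.unpack_from(self._mm, self._offsets_at + U32.size * string_id)
        end, = U32.unpack_from(self._mm, self._offsets_at + U32.size * (string_id + 1))
        return self._mm[self._pool_at + start:self._pool_at + end]

    def lookup(self, word):
        """Binary search over the sorted entries; returns None if absent."""
        key = word.encode("utf-8")
        lo, hi = 0, self._n_entries
        while lo < hi:
            mid = (lo + hi) // 2
            word_id, phoneme_id = ENTRY.unpack_from(
                self._mm, self._entries_at + ENTRY.size * mid)
            candidate = self._string(word_id)
            if candidate == key:
                return self._string(phoneme_id).decode("utf-8")
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self._mm.close()
        self._file.close()


if __name__ == "__main__":
    toy = {"bốn": "bˈoɜn", "hai": "hˈaːj", "năm": "nˈam", "sáu": "sˈaɜw"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        build(path, toy)
        lex = MappedLexicon(path)
        print(lex.lookup("năm"))
        print(lex.lookup("mười"))
        lex.close()
```

Opening the file only reads the 8-byte header. Each lookup touches roughly log2(n) entries plus the strings they point to, which is why startup cost stays flat as the dictionary grows.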
Development
To install for development purposes:
- Clone the repository:
  git clone https://github.com/pnnbao97/sea-g2p
  cd sea-g2p
- Install in editable mode:
  pip install -e .