Pure-Python, zero-dependency AhoTTS grapheme-to-phoneme (G2P) for Basque and Spanish
ahotts-g2p
Pure-Python, zero-dependency, version-aware grapheme-to-phoneme (G2P) for Basque (euskara) and Spanish, faithfully reproducing the AhoTTS text-to-speech front-end.
ahotts-g2p turns text into the single-char IPA training string used by
StyleTTS2/VITS-style models, matching the AhoTTS engines that phonemized the
public HiTZ Basque voices. It is stdlib-only: no C build, no shared libraries,
no runtime dependencies.
Install
pip install ahotts-g2p
From source:
pip install -e .[test]
Quick start
from ahotts_g2p import phonemize
phonemize("Bai.") # 'bAj .'
phonemize("Ez, horrek ez du balio!") # 'Eʂ , Orek eʂ tU βalIo !'
phonemize("Kaixo mundua", lang="eu", version="classic") # 'kajʃO mundUa'
phonemize("Hola mundo.", lang="es", version="classic") # 'Ola mUndo'
# Northern (Iparralde / Iparrahotsa) Basque dialect
phonemize("hori horrek", lang="eu", dialect="northern") # 'hOɾi hoʁEk'
CLI:
python -m ahotts_g2p "Kaixo mundua"
cat sentences.txt | python -m ahotts_g2p
Also exported: SAMPA_TO_IPA, the ordered SAMPA -> IPA mapping table.
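Order matters in a table like this because some symbols are prefixes of others. The sketch below is a minimal illustration of first-match-wins conversion, not the package's own code: the two-entry table and the `sampa_to_ipa` helper are hypothetical, and the real shape of `SAMPA_TO_IPA` (list of pairs, dict, or otherwise) is not stated here.

```python
# Toy table for illustration only; these are NOT the real SAMPA_TO_IPA entries.
# The longer symbol "rr" is listed before "r" so it is tried first.
TOY_SAMPA_TO_IPA = [
    ("rr", "r"),  # trill
    ("r", "ɾ"),   # tap
]


def sampa_to_ipa(sampa, table=TOY_SAMPA_TO_IPA):
    """Convert a string by trying the table entries in order at each position."""
    out = []
    i = 0
    while i < len(sampa):
        for src, dst in table:
            if sampa.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            # No entry matched: pass the character through unchanged.
            out.append(sampa[i])
            i += 1
    return "".join(out)


print(sampa_to_ipa("karro"))  # karo
print(sampa_to_ipa("karo"))   # kaɾo
# With the entries reversed, "rr" would be read as two taps.
print(sampa_to_ipa("karro", table=list(reversed(TOY_SAMPA_TO_IPA))))  # kaɾɾo
```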
Versions
AhoTTS exists in more than one engine generation. Different public voices were
phonemized by different generations, and their output differs visibly, so the
API takes a version argument (classic or modern). The default is modern.
| Version | Upstream source | Consuming model | Distinctive behaviour |
|---|---|---|---|
| classic | aholab/AhoTTS, original engine | HiTZ VITS voices | dictionary STR_MRK stress (original eu_dicc), vowel offglides (au -> aw) |
| modern | arrandi/phonemizer-eus-esp modulo1y2 + eu_dicc_20250326 | HiTZ/StyleTTS2-eu | dictionary STR_MRK stress (newer dict), silent-h stress shift, ʝ palatalisation, punctuation tokens |
(pyAhoTTS builds the classic engine from ekaitz-zarraga/AhoTTS, a packaging fork of aholab/AhoTTS with build/portability changes only -- no algorithmic difference.)
Full detail in docs/versions.md.
Dialects
Basque has a Northern (Iparralde) variety with its own AhoTTS engine,
AhoTTS_Iparrahotsa. It is exposed as a dialect (dialect="northern",
default "standard"), independent of version:
phonemize("Euskara Euskal Herriko hizkuntza da.", lang="eu", dialect="northern")
# 'Ewʂkaɾa ewʂkAl heʁIko hiskUnVa ðA'
The Northern dialect pronounces /h/, has the French vowel ü -> /y/, a uvular
rhotic /ʁ/, a remapped sibilant system (s -> ʂ, z -> s, ts -> tʂ), and
j/dd -> /ɟ/. It is a faithful port of the
AhoTTS_Iparrahotsa fork.
Full detail in docs/dialects.md.
Accuracy
Correctness is parity with the AhoTTS reference engines, measured per version on held-out corpora (positional word match):
| Language | classic | modern |
|---|---|---|
| Spanish (es) | 100% | 100% |
| Basque (eu) | 99.94% | 99.90% |
The Northern Basque dialect reaches 99.61% word parity (418/430 exact lines) against the AhoTTS_Iparrahotsa binary; see docs/dialects.md.
The held-out corpora ship as test fixtures, so the figures reproduce with no
binaries: pytest tests/test_oracle.py. See docs/accuracy.md.
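For readers unfamiliar with the metric, here is one plausible reading of "positional word match", sketched under assumptions: each output line is split on whitespace and words are compared position by position against the reference. The `positional_word_match` helper is hypothetical; the exact definition used by tests/test_oracle.py may differ (for example in how length mismatches are counted).

```python
def positional_word_match(reference_lines, candidate_lines):
    """Return (matched_words, total_words) for a position-by-position comparison."""
    matched = 0
    total = 0
    for ref, cand in zip(reference_lines, candidate_lines):
        ref_words = ref.split()
        cand_words = cand.split()
        # The reference sets the denominator, so missing candidate words count as misses.
        total += len(ref_words)
        matched += sum(r == c for r, c in zip(ref_words, cand_words))
    return matched, total


reference = ["bAj .", "kajʃO mundUa"]
candidate = ["bAj .", "kajʃO mundua"]  # stress mark lost on the last word
matched, total = positional_word_match(reference, candidate)
print(f"{matched}/{total} = {matched / total:.2%}")  # 3/4 = 75.00%
```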
Pipeline
text -> normalize -> g2p -> syllabify -> stress -> SAMPA -> IPA -> single-char
Numbers, ordinals and roman numerals are expanded to the target-language number
words; punctuation is preserved as separate tokens (modern) or dropped
(classic). Per-word lexical stress and the phonetic-exception rules are driven
by the decoded dictionary flags. See docs/architecture.md.
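The punctuation difference between the two versions can be pictured with a small tokenizer. This is only a sketch under assumptions, not the package's normalizer: the `tokenize` helper and the punctuation set are hypothetical, and digits are ignored because number expansion is assumed to have happened earlier.

```python
import re

# Assumed punctuation set; the real engine may recognise more symbols.
PUNCTUATION = set(".,;:!?")

# A token is either a run of letters or a single punctuation mark.
_TOKEN_RE = re.compile(r"[^\W\d_]+|[.,;:!?]")


def tokenize(text, version="modern"):
    """Split text into word and punctuation tokens; classic drops punctuation."""
    tokens = _TOKEN_RE.findall(text)
    if version == "classic":
        tokens = [t for t in tokens if t not in PUNCTUATION]
    return tokens


print(tokenize("Ez, horrek ez du balio!"))
# ['Ez', ',', 'horrek', 'ez', 'du', 'balio', '!']
print(tokenize("Hola mundo.", version="classic"))
# ['Hola', 'mundo']
```

This mirrors the Quick start outputs, where the modern result keeps `,` and `!` as standalone tokens and the classic result has no trailing period.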
Supported languages
- Basque (eu) -- full linguistic pipeline with HDIC-dictionary POS tagging and accentual-group stress.
- Spanish (es) -- dictionary-free g2p and stress.
Where it fits
| Project | Role |
|---|---|
| AhoTTS (Aholab, UPV/EHU) | upstream C++ engine; the algorithm source |
| pyAhoTTS | Python bindings to the AhoTTS C++ library (needs a build) |
| ahotts-g2p (this repo) | pure-Python reimplementation of the G2P, no build |
| phoonnx | downstream consumer -- ONNX TTS runtime that uses this G2P |
License
GPL-3.0-or-later, matching upstream AhoTTS. This is a derivative of the GPL AhoTTS linguistic rules, so it is distributed under the same licence. The AhoTTS algorithms and dictionaries are credited to Aholab (UPV/EHU). See docs/licensing.md.