Pure-Python, zero-dependency AhoTTS grapheme-to-phoneme (G2P) for Basque and Spanish

ahotts-g2p

license: GPL-3.0

Pure-Python, zero-dependency, version-aware grapheme-to-phoneme (G2P) for Basque (euskara) and Spanish, faithfully reproducing the AhoTTS text-to-speech front-end.

ahotts-g2p turns text into the single-char IPA training string used by StyleTTS2/VITS-style models, matching the AhoTTS engines that phonemized the public HiTZ Basque voices. It is stdlib-only: no C build, no shared libraries, no runtime dependencies.

Install

pip install ahotts-g2p

From source:

pip install -e ".[test]"

Quick start

from ahotts_g2p import phonemize

phonemize("Bai.")                                    # 'bAj .'
phonemize("Ez, horrek ez du balio!")                 # 'Eʂ , Orek eʂ tU βalIo !'
phonemize("Kaixo mundua", lang="eu", version="classic")   # 'kajʃO mundUa'
phonemize("Hola mundo.", lang="es", version="classic")    # 'Ola mUndo'

# Northern (Iparralde / Iparrahotsa) Basque dialect
phonemize("hori horrek", lang="eu", dialect="northern")   # 'hOɾi hoʁEk'

CLI:

python -m ahotts_g2p "Kaixo mundua"
cat sentences.txt | python -m ahotts_g2p
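The pipe form above has a simple in-process equivalent. The following is a minimal sketch, not part of the package: phonemize_lines is a hypothetical helper that accepts any callable shaped like phonemize, and it skips blank lines (whether the CLI does the same is not stated here, so treat that as an assumption). The stand-in fake_g2p exists only so the snippet runs without the package installed.

```python
from typing import Callable, Iterable, Iterator


def phonemize_lines(lines: Iterable[str], g2p: Callable[[str], str]) -> Iterator[str]:
    """Yield one phonemized string per non-empty input line.

    Hypothetical helper: `g2p` is any text -> phoneme-string callable,
    for example ahotts_g2p.phonemize or a functools.partial around it.
    """
    for line in lines:
        text = line.strip()
        if text:  # assumption: blank lines carry nothing to phonemize
            yield g2p(text)


def fake_g2p(text: str) -> str:
    """Stand-in so the sketch runs without ahotts-g2p; NOT real G2P output."""
    return text.lower()


result = list(phonemize_lines(["Kaixo mundua\n", "\n", "Bai.\n"], fake_g2p))
print(result)
```

With the package installed, functools.partial(phonemize, lang="eu", version="classic") can be passed as g2p and sys.stdin as lines.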

Also exported: SAMPA_TO_IPA, the ordered SAMPA -> IPA mapping table.
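Ordering matters in a table like this because a multi-character SAMPA symbol has to be tried before any shorter symbol that is its prefix. The sketch below illustrates that idea only. TOY_SAMPA_TO_IPA is a hypothetical four-entry table built from general Spanish SAMPA conventions, not the contents of SAMPA_TO_IPA, and sampa_to_ipa is an illustrative function, not the package's converter. The container type of the real table is not shown here, so a list of pairs is an assumption.

```python
# Hypothetical toy table; NOT the real SAMPA_TO_IPA from ahotts_g2p.
# Longer symbols come first so "rr" wins over "r".
TOY_SAMPA_TO_IPA = [
    ("rr", "r"),   # trill
    ("r", "ɾ"),    # tap
    ("tS", "tʃ"),
    ("B", "β"),
]


def sampa_to_ipa(sampa, table=TOY_SAMPA_TO_IPA):
    """Greedy left-to-right conversion; the first matching entry wins."""
    out = []
    i = 0
    while i < len(sampa):
        for src, dst in table:
            if sampa.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            # No table entry matched here: copy the character unchanged.
            out.append(sampa[i])
            i += 1
    return "".join(out)


good = sampa_to_ipa("karro")
# Same entries with "r" listed before "rr": the trill is never reached.
wrong_order = [("r", "ɾ"), ("rr", "r")]
bad = sampa_to_ipa("karro", wrong_order)
print(good, bad)
```

In the misordered table the short key consumes each r separately, which is the failure an ordered table avoids.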

Versions

AhoTTS has gone through more than one engine generation. Different public voices were phonemized by different generations, with visibly different output, so the API takes a version argument ("classic" or "modern"). The default is "modern".

| Version | Upstream source | Consuming model | Distinctive behaviour |
|---------|-----------------|-----------------|-----------------------|
| classic | aholab/AhoTTS, original engine | HiTZ VITS voices | dictionary STR_MRK stress (original eu_dicc), vowel offglides (au -> aw) |
| modern | arrandi/phonemizer-eus-esp modulo1y2 + eu_dicc_20250326 | HiTZ/StyleTTS2-eu | dictionary STR_MRK stress (newer dict), silent-h stress shift, ʝ palatalisation, punctuation tokens |

(pyAhoTTS builds the classic engine from ekaitz-zarraga/AhoTTS, a packaging fork of aholab/AhoTTS with build/portability changes only -- no algorithmic difference.)

Full detail in docs/versions.md.

Dialects

Basque has a Northern (Iparralde) variety with its own AhoTTS engine, AhoTTS_Iparrahotsa. It is exposed as a dialect (dialect="northern", default "standard"), independent of version:

phonemize("Euskara Euskal Herriko hizkuntza da.", lang="eu", dialect="northern")
# 'Ewʂkaɾa ewʂkAl heʁIko hiskUnVa ðA'

The Northern dialect pronounces /h/, has the French vowel ü -> /y/, a uvular rhotic /ʁ/, a remapped sibilant system (s -> ʂ, z -> s, ts -> tʂ), and j/dd -> /ɟ/. It is a faithful port of the AhoTTS_Iparrahotsa fork. Full detail in docs/dialects.md.
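A remap such as s -> ʂ, z -> s has to be applied in a single pass, because the output of one rule (s) is the input of another. The following toy sketch shows only that point. It reads the three listed correspondences at the grapheme level, which is an assumption, and SIBILANT_MAP and remap_sibilants are hypothetical names. Other digraphs (tz, tx and so on) and the rest of the dialect rules are outside this sketch.

```python
import re

# Toy reading of the three sibilant correspondences listed above.
# Hypothetical table; the real Iparrahotsa port covers far more.
SIBILANT_MAP = {"ts": "tʂ", "s": "ʂ", "z": "s"}

# Longest key first so "ts" is tried before "s".
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, SIBILANT_MAP), key=len, reverse=True))
)


def remap_sibilants(text):
    """Apply all three rules simultaneously in one left-to-right pass."""
    return _PATTERN.sub(lambda m: SIBILANT_MAP[m.group(0)], text)


single_pass = remap_sibilants("zu")
# Chained replacements feed one rule's output into the next: z -> s -> ʂ.
chained = "zu".replace("z", "s").replace("s", "ʂ")
print(single_pass, chained)
```

The chained version turns z into ʂ, which merges two sounds that the remap keeps distinct.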

Accuracy

Correctness is parity with the AhoTTS reference engines, measured per version on held-out corpora (positional word match):

| Language | classic | modern |
|----------|---------|--------|
| Spanish (es) | 100% | 100% |
| Basque (eu) | 99.94% | 99.90% |

The Northern Basque dialect reaches 99.61% word parity against the AhoTTS_Iparrahotsa binary, with 418 of 430 lines matching exactly; see docs/dialects.md.

The held-out corpora ship as test fixtures, so the figures reproduce with no binaries: pytest tests/test_oracle.py. See docs/accuracy.md.
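One plausible reading of "positional word match" is sketched below: each line is split on whitespace and the word at position i in the reference is compared with the word at position i in the output, with missing or extra words counted as mismatches. This definition is an assumption and the authoritative one is whatever tests/test_oracle.py implements. positional_word_match is a hypothetical name, and the second hypothesis line is an invented mismatch for illustration.

```python
from itertools import zip_longest


def positional_word_match(reference_lines, hypothesis_lines):
    """Return (word match ratio, number of exactly matching lines)."""
    if len(reference_lines) != len(hypothesis_lines):
        raise ValueError("reference and hypothesis must have the same line count")
    matched = total = exact_lines = 0
    for ref, hyp in zip(reference_lines, hypothesis_lines):
        ref_words, hyp_words = ref.split(), hyp.split()
        if ref_words == hyp_words:
            exact_lines += 1
        # zip_longest pads the shorter line with None, so a missing or
        # extra word is scored as a mismatch at that position.
        for r, h in zip_longest(ref_words, hyp_words):
            total += 1
            if r == h:
                matched += 1
    ratio = matched / total if total else 1.0
    return ratio, exact_lines


reference = ["bAj .", "kajʃO mundUa"]
hypothesis = ["bAj .", "kajʃO mUndua"]  # invented stress error on the last word
score, exact = positional_word_match(reference, hypothesis)
print(score, exact)
```

Reporting both numbers is useful because a single wrong word lowers the line count much more than the word ratio, which is why a word figure and a line figure for the same run can differ noticeably.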

Pipeline

text -> normalize -> g2p -> syllabify -> stress -> SAMPA -> IPA -> single-char

Numbers, ordinals and roman numerals are expanded to the target-language number words; punctuation is preserved as separate tokens (modern) or dropped (classic). Per-word lexical stress and the phonetic-exception rules are driven by the decoded dictionary flags. See docs/architecture.md.
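The punctuation difference between the two versions can be pictured with a small text-level sketch. This is not the package's normalizer: tokenize_punctuation is a hypothetical function, the punctuation set is an assumption, and it operates on raw text only to show the shape of the result (separate tokens versus dropped marks).

```python
import re

# Assumed punctuation set for illustration; the real engines may differ.
_PUNCT = re.compile(r"([.,;:!?])")


def tokenize_punctuation(text, version="modern"):
    """Space out punctuation (modern) or drop it (classic)."""
    if version == "modern":
        spaced = _PUNCT.sub(r" \1 ", text)   # keep each mark as its own token
    elif version == "classic":
        spaced = _PUNCT.sub(" ", text)       # discard the mark
    else:
        raise ValueError(f"unknown version: {version!r}")
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(spaced.split())


modern = tokenize_punctuation("Ez, horrek ez du balio!")
classic = tokenize_punctuation("Ez, horrek ez du balio!", version="classic")
print(modern)
print(classic)
```

The modern result has the same token layout as the Quick start output for that sentence, with the comma and exclamation mark standing alone.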

Supported languages

  • Basque (eu) -- full linguistic pipeline with HDIC-dictionary POS tagging and accentual-group stress.
  • Spanish (es) -- dictionary-free g2p and stress.

Where it fits

| Project | Role |
|---------|------|
| AhoTTS (Aholab, UPV/EHU) | upstream C++ engine; the algorithm source |
| pyAhoTTS | Python bindings to the AhoTTS C++ library (needs a build) |
| ahotts-g2p (this repo) | pure-Python reimplementation of the G2P, no build |
| phoonnx | downstream consumer -- ONNX TTS runtime that uses this G2P |

License

GPL-3.0-or-later, matching upstream AhoTTS. This is a derivative of the GPL AhoTTS linguistic rules, so it is distributed under the same licence. The AhoTTS algorithms and dictionaries are credited to Aholab (UPV/EHU). See docs/licensing.md.

