Pure-Python, zero-dependency AhoTTS grapheme-to-phoneme (G2P) for Basque and Spanish

ahotts-g2p

Pure-Python, zero-dependency, version-aware grapheme-to-phoneme (G2P) for Basque (euskara) and Spanish, faithfully reproducing the AhoTTS text-to-speech front-end.

ahotts-g2p turns text into the single-char IPA training string used by StyleTTS2/VITS-style models, matching the AhoTTS engines that phonemized the public HiTZ Basque voices. It is stdlib-only: no C build, no shared libraries, no runtime dependencies.

Install

pip install ahotts-g2p

From source:

pip install -e .[test]

Quick start

from ahotts_g2p import phonemize

phonemize("Bai.")                                    # 'bAj .'
phonemize("Ez, horrek ez du balio!")                 # 'Eʂ , Orek eʂ tU βalIo !'
phonemize("Kaixo mundua", lang="eu", version="classic")   # 'kajʃO mundUa'
phonemize("Hola mundo.", lang="es", version="classic")    # 'Ola mUndo'

# Northern (Iparralde / Iparrahotsa) Basque dialect
phonemize("hori horrek", lang="eu", dialect="northern")   # 'hOɾi hoʁEk'

CLI:

python -m ahotts_g2p "Kaixo mundua"
cat sentences.txt | python -m ahotts_g2p
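
The same line-by-line batch processing can be driven from Python. The helper below is a minimal sketch and not part of the package: phonemize_lines is a hypothetical name, the phonemizer is passed in as an argument so the helper works with any callable, and skipping blank lines is a choice of this sketch rather than documented CLI behaviour.

```python
from typing import Callable, Iterable, Iterator


def phonemize_lines(
    lines: Iterable[str],
    phonemize_fn: Callable[..., str],
    **options: str,
) -> Iterator[str]:
    """Yield one phonemized string per non-blank input line.

    Hypothetical helper: keyword options such as lang, version and
    dialect are forwarded unchanged to phonemize_fn.
    """
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption of this sketch: blank lines are skipped
        yield phonemize_fn(text, **options)
```

With ahotts-g2p installed, passing phonemize (imported from ahotts_g2p) as phonemize_fn and an open text file as lines should give roughly the same result as the piped CLI call above.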

Also exported: SAMPA_TO_IPA, the ordered SAMPA -> IPA mapping table.
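
One plausible reason for such a table to be ordered is that multi-character SAMPA symbols have to be matched before their single-character prefixes. The sketch below illustrates that idea with a toy table; TOY_TABLE and sampa_to_ipa are hypothetical names, the entries are illustrative, and the container type of the real SAMPA_TO_IPA is not described here.

```python
from typing import List, Tuple

# Illustrative entries only, NOT the package's SAMPA_TO_IPA.
# "rr" must come before "r", otherwise the trill is read as two taps.
TOY_TABLE: List[Tuple[str, str]] = [
    ("rr", "r"),
    ("r", "ɾ"),
    ("tS", "tʃ"),
]


def sampa_to_ipa(sampa: str, table: List[Tuple[str, str]] = TOY_TABLE) -> str:
    """Convert left to right; at each position the first matching entry wins."""
    out = []
    i = 0
    while i < len(sampa):
        for src, dst in table:
            if sampa.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            out.append(sampa[i])  # symbols not in the table pass through
            i += 1
    return "".join(out)
```

With this table, sampa_to_ipa("karro") gives "karo"; with the first two entries swapped it would give "kaɾɾo" instead.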

Versions

AhoTTS exists in more than one engine generation. Different public voices were phonemized by different generations, with visibly different output, so the API takes a version argument ("classic" or "modern"). The default is "modern".

| Version | Upstream source | Consuming model | Distinctive behaviour |
|---|---|---|---|
| classic | aholab/AhoTTS, original engine | HiTZ VITS voices | dictionary STR_MRK stress (original eu_dicc), vowel offglides (au -> aw) |
| modern | arrandi/phonemizer-eus-esp modulo1y2 + eu_dicc_20250326 | HiTZ/StyleTTS2-eu | dictionary STR_MRK stress (newer dict), silent-h stress shift, ʝ palatalisation, punctuation tokens |

(pyAhoTTS builds the classic engine from ekaitz-zarraga/AhoTTS, a packaging fork of aholab/AhoTTS with build/portability changes only -- no algorithmic difference.)

Full detail in docs/versions.md.
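
To check which version matches an existing voice, it can help to see both outputs side by side. The following is a minimal sketch: compare_versions is a hypothetical helper, and it takes the phonemizer as an argument, assuming only the lang and version keywords shown in the quick start.

```python
from typing import Callable, Dict

VERSIONS = ("classic", "modern")  # the two version names listed above


def compare_versions(
    text: str,
    phonemize_fn: Callable[..., str],
    lang: str = "eu",
) -> Dict[str, str]:
    """Return {version: output} for the same text under each version."""
    return {v: phonemize_fn(text, lang=lang, version=v) for v in VERSIONS}
```

With the package installed, compare_versions("Kaixo mundua", phonemize) returns one string per version, which can then be compared against a voice's training transcripts.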

Dialects

Basque has a Northern (Iparralde) variety with its own AhoTTS engine, AhoTTS_Iparrahotsa. It is exposed as a dialect (dialect="northern", default "standard"), independent of version:

phonemize("Euskara Euskal Herriko hizkuntza da.", lang="eu", dialect="northern")
# 'Ewʂkaɾa ewʂkAl heʁIko hiskUnVa ðA'

The Northern dialect pronounces /h/, has the French vowel ü -> /y/, a uvular rhotic /ʁ/, a remapped sibilant system (s -> ʂ, z -> s, ts -> tʂ), and j/dd -> /ɟ/. It is a faithful port of the AhoTTS_Iparrahotsa fork. Full detail in docs/dialects.md.
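
The sibilant remapping is a simultaneous substitution: z becomes s, but that new s must not then become ʂ. The toy sketch below shows one way to apply the three quoted rules in a single pass. It works on plain letters for brevity and is not the engine's implementation; SIBILANT_MAP and remap_sibilants are hypothetical names.

```python
import re

# The three sibilant rules quoted above.
SIBILANT_MAP = {"ts": "tʂ", "s": "ʂ", "z": "s"}

# Longest keys first so that "ts" is matched as a unit before "s".
_PATTERN = re.compile(
    "|".join(sorted(SIBILANT_MAP, key=len, reverse=True))
)


def remap_sibilants(word: str) -> str:
    """Apply all rules in one left-to-right pass, so outputs are never re-mapped."""
    return _PATTERN.sub(lambda m: SIBILANT_MAP[m.group(0)], word)
```

A chain of str.replace calls that rewrites z first and s second would turn "zu" into "ʂu"; the single-pass version keeps it at "su".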

Accuracy

Correctness here means parity with the AhoTTS reference engines, measured per version on held-out corpora as a positional word match:

| Language | classic | modern |
|---|---|---|
| Spanish (es) | 100% | 100% |
| Basque (eu) | 99.94% | 99.90% |

The Northern Basque dialect reaches 99.61% word parity, with 418 of 430 lines matching exactly, against the AhoTTS_Iparrahotsa binary; see docs/dialects.md.

The held-out corpora ship as test fixtures, so the figures reproduce with no binaries: pytest tests/test_oracle.py. See docs/accuracy.md.
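
As a rough illustration of a positional word match, the sketch below compares reference and candidate output line by line and word by word. The exact definition used in tests/test_oracle.py is not given here, so the details are assumptions: word_parity is a hypothetical name, words are split on whitespace, and a missing or extra word counts as a mismatch.

```python
from typing import Sequence, Tuple


def word_parity(
    reference: Sequence[str],
    candidate: Sequence[str],
) -> Tuple[float, int]:
    """Return (fraction of matching words, number of exactly matching lines)."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate need the same number of lines")
    matched = total = exact_lines = 0
    for ref_line, cand_line in zip(reference, candidate):
        ref_words = ref_line.split()
        cand_words = cand_line.split()
        # Compare words at the same position only; no alignment is attempted.
        matched += sum(r == c for r, c in zip(ref_words, cand_words))
        # Assumption: the longer line sets the denominator.
        total += max(len(ref_words), len(cand_words))
        exact_lines += ref_words == cand_words
    return (matched / total if total else 1.0), exact_lines
```

Because the comparison is positional, a single inserted or dropped word shifts every later word on that line, so this metric is stricter than an edit-distance-based one.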

Pipeline

text -> normalize -> g2p -> syllabify -> stress -> SAMPA -> IPA -> single-char

Numbers, ordinals and roman numerals are expanded to the target-language number words; punctuation is preserved as separate tokens (modern) or dropped (classic). Per-word lexical stress and the phonetic-exception rules are driven by the decoded dictionary flags. See docs/architecture.md.
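
The difference in punctuation handling can be pictured with a toy tokenizer. This is a sketch of the idea only, not the package's normalizer; tokenize and keep_punctuation are hypothetical names, with keep_punctuation=True standing in for the modern behaviour and False for classic.

```python
import re
from typing import List

# A token is either a run of word characters or one punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")
_WORD = re.compile(r"\w")


def tokenize(text: str, keep_punctuation: bool = True) -> List[str]:
    """Split text into words, with punctuation as separate tokens or dropped."""
    tokens = _TOKEN.findall(text)
    if keep_punctuation:
        return tokens
    return [t for t in tokens if _WORD.match(t)]
```

For "Ez, horrek ez du balio!" this keeps "," and "!" as their own tokens, which lines up with the spaced-out punctuation in the quick-start output; with keep_punctuation=False, "Hola mundo." is reduced to its two words.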

Supported languages

  • Basque (eu) -- full linguistic pipeline with HDIC-dictionary POS tagging and accentual-group stress.
  • Spanish (es) -- dictionary-free g2p and stress.

Where it fits

| Project | Role |
|---|---|
| AhoTTS (Aholab, UPV/EHU) | upstream C++ engine; the algorithm source |
| pyAhoTTS | Python bindings to the AhoTTS C++ library (needs a build) |
| ahotts-g2p (this repo) | pure-Python reimplementation of the G2P, no build |
| phoonnx | downstream consumer -- ONNX TTS runtime that uses this G2P |

License

GPL-3.0-or-later, matching upstream AhoTTS. This is a derivative of the GPL AhoTTS linguistic rules, so it is distributed under the same licence. The AhoTTS algorithms and dictionaries are credited to Aholab (UPV/EHU). See docs/licensing.md.

