A tokenizer, text cleaner, and phonemizer for many human languages.
Gruut
A tokenizer, text cleaner, and IPA phonemizer for several human languages that supports SSML.
from gruut import sentences

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent in sentences(text, lang="en-us"):
    for word in sent:
        # Only print tokens that have a pronunciation
        if word.phonemes:
            print(word.text, *word.phonemes)
which outputs:
He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖
Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.
A subset of SSML is also supported:
from gruut import sentences
ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""
for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)
with the output:
0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖
See the documentation for more details.
Installation
pip install gruut
Languages besides English can be added during installation. For example, with French and Italian support:
pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
The extra pip repo is needed for an updated num2words fork that includes support for more languages.
You may also manually download language files and put them in $XDG_CONFIG_HOME/gruut/ ($HOME/.config/gruut by default).
gruut will look for language files in the directory $XDG_CONFIG_HOME/gruut/<lang>/ if the corresponding Python package is not installed. Note that <lang> here is the full language name, e.g. de-de instead of just de.
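The directory rule above can be sketched in a few lines of Python. This is only an illustration of the path described in this section, not gruut's own lookup code; the function name language_dir and its env parameter are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def language_dir(lang: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where manually downloaded files for `lang` would live.

    Hypothetical helper: it mirrors the $XDG_CONFIG_HOME/gruut/<lang>/ rule
    described above and is not part of gruut's API.
    """
    env = os.environ if env is None else env
    # Fall back to $HOME/.config when XDG_CONFIG_HOME is unset or empty
    config_home = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    # <lang> is the full language name (de-de), not the short code (de)
    return Path(config_home) / "gruut" / lang


print(language_dir("de-de", {"XDG_CONFIG_HOME": "/tmp/config"}))
```

Passing an explicit env mapping keeps the helper easy to test without touching the real environment.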
Supported Languages
gruut currently supports:
- Arabic (ar)
- Czech (cs or cs-cz)
- German (de or de-de)
- English (en or en-us)
- Spanish (es or es-es)
- Farsi/Persian (fa)
- French (fr or fr-fr)
- Italian (it or it-it)
- Luxembourgish (lb)
- Dutch (nl)
- Russian (ru or ru-ru)
- Swedish (sv or sv-se)
- Swahili (sw)
The goal is to support all of voice2json's languages.
Dependencies
- Python 3.7 or higher
- Linux
  - Tested on Debian Bullseye
- num2words fork and Babel
  - Currency/number handling
  - num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
- gruut-ipa
  - IPA pronunciation manipulation
- pycrfsuite
  - Part of speech tagging and grapheme to phoneme models
- pydateparser
  - Date parsing for multiple languages
Numbers, Dates, and More
gruut can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., <s lang="...">).
The following types of expressions can be automatically expanded into words by gruut:
- Numbers - "123" to "one hundred and twenty three" (disable with verbalize_numbers=False or --no-numbers)
  - Relies on Babel for parsing and num2words for verbalization
- Dates - "1/1/2020" to "January first, twenty twenty" (disable with verbalize_dates=False or --no-dates)
  - Relies on pydateparser for parsing and both Babel and num2words for verbalization
- Currency - "$10" to "ten dollars" (disable with verbalize_currency=False or --no-currency)
  - Relies on Babel for parsing and both Babel and num2words for verbalization
- Times - "12:01am" to "twelve oh one A M" (disable with verbalize_times=False or --no-times)
  - English only
  - Relies on num2words for verbalization
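To make the locale-aware date interpretation concrete, here is a minimal pure-Python sketch. It does not use pydateparser or Babel and is not how gruut parses dates; the DAY_FIRST table and the parse_numeric_date function are assumptions made for illustration, chosen to match the "2/1/2000" outputs shown in the SSML example earlier.

```python
from datetime import date

# Assumed day/month order per language; illustrative only, not gruut's data
DAY_FIRST = {"en-us": False, "it-it": True}


def parse_numeric_date(text: str, lang: str) -> date:
    """Parse 'A/B/YYYY' as M/D/Y or D/M/Y depending on the language."""
    first, second, year = (int(part) for part in text.split("/"))
    if DAY_FIRST[lang]:
        day, month = first, second
    else:
        month, day = first, second
    return date(year, month, day)


print(parse_numeric_date("2/1/2000", "en-us"))  # 2000-02-01 (February first)
print(parse_numeric_date("2/1/2000", "it-it"))  # 2000-01-02 (due gennaio)
```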
Command-Line Usage
The gruut module can be executed with python3 -m gruut --language <LANGUAGE> <TEXT> or with the gruut command (from setup.py).
The gruut command is line-oriented, consuming text and producing JSONL.
You will probably want to install jq to manipulate the JSONL output from gruut.
Plain Text
Takes raw text and outputs JSONL with cleaned words/tokens.
echo 'This, right here, is some "RAW" text!' \
| gruut --language en-us \
| jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!
More information is available in the full JSON output:
gruut --language en-us 'More text.' | jq .
Output:
{
"idx": 0,
"text": "More text.",
"text_with_ws": "More text.",
"text_spoken": "More text",
"par_idx": 0,
"lang": "en-us",
"voice": "",
"words": [
{
"idx": 0,
"text": "More",
"text_with_ws": "More ",
"leading_ws": "",
"training_ws": " ",
"sent_idx": 0,
"par_idx": 0,
"lang": "en-us",
"voice": "",
"pos": "JJR",
"phonemes": [
"m",
"ˈɔ",
"ɹ"
],
"is_major_break": false,
"is_minor_break": false,
"is_punctuation": false,
"is_break": false,
"is_spoken": true,
"pause_before_ms": 0,
"pause_after_ms": 0
},
{
"idx": 1,
"text": "text",
"text_with_ws": "text",
"leading_ws": "",
"training_ws": "",
"sent_idx": 0,
"par_idx": 0,
"lang": "en-us",
"voice": "",
"pos": "NN",
"phonemes": [
"t",
"ˈɛ",
"k",
"s",
"t"
],
"is_major_break": false,
"is_minor_break": false,
"is_punctuation": false,
"is_break": false,
"is_spoken": true,
"pause_before_ms": 0,
"pause_after_ms": 0
},
{
"idx": 2,
"text": ".",
"text_with_ws": ".",
"leading_ws": "",
"training_ws": "",
"sent_idx": 0,
"par_idx": 0,
"lang": "en-us",
"voice": "",
"pos": null,
"phonemes": [
"‖"
],
"is_major_break": true,
"is_minor_break": false,
"is_punctuation": false,
"is_break": true,
"is_spoken": false,
"pause_before_ms": 0,
"pause_after_ms": 0
}
],
"pause_before_ms": 0,
"pause_after_ms": 0
}
For the whole input line and each word, the text property contains the processed input text with normalized whitespace while text_with_ws retains the original whitespace. The text_spoken property only contains words that are spoken, so punctuation and breaks are excluded.
Within each word, there is:
- idx - zero-based index of the word in the sentence
- sent_idx - zero-based index of the sentence in the input text
- pos - part of speech tag (if available)
- phonemes - list of IPA phonemes for the word (if available)
- is_minor_break - true if "word" separates phrases (comma, semicolon, etc.)
- is_major_break - true if "word" separates sentences (period, question mark, etc.)
- is_break - true if "word" is a major or minor break
- is_punctuation - true if "word" is a surrounding punctuation mark (quote, bracket, etc.)
- is_spoken - true if not a break or punctuation
See python3 -m gruut <LANGUAGE> --help for more options.
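If jq is not available, the same JSONL can be read with Python's json module. The sketch below uses a trimmed copy of the "More text." output shown above (only the fields it needs) and keeps the spoken words with their phonemes; the function name spoken_pronunciations is hypothetical.

```python
import json
from typing import List, Tuple

# Trimmed copy of the JSONL line for 'More text.' shown above
line = json.dumps({
    "text": "More text.",
    "words": [
        {"text": "More", "phonemes": ["m", "ˈɔ", "ɹ"], "is_spoken": True},
        {"text": "text", "phonemes": ["t", "ˈɛ", "k", "s", "t"], "is_spoken": True},
        {"text": ".", "phonemes": ["‖"], "is_spoken": False},
    ],
})


def spoken_pronunciations(jsonl_line: str) -> List[Tuple[str, str]]:
    """Return (word, phoneme string) pairs for the spoken words of one line."""
    sentence = json.loads(jsonl_line)
    return [
        (word["text"], " ".join(word["phonemes"]))
        for word in sentence["words"]
        if word["is_spoken"]  # skips breaks and punctuation
    ]


for text, phonemes in spoken_pronunciations(line):
    print(text, phonemes)
```

In practice each line read from the gruut command's standard output would take the place of the hard-coded line.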
SSML
A subset of SSML is supported:
- <speak> - wrap around SSML text
  - lang - set language for document
- <p> - paragraph
  - lang - set language for paragraph
- <s> - sentence (disables automatic sentence breaking)
  - lang - set language for sentence
- <w> / <token> - word (disables automatic tokenization)
  - lang - set language for word
  - role - set word role (see word roles)
- <lang lang="..."> - set language of inner text
- <voice name="..."> - set voice of inner text
- <say-as interpret-as=""> - force interpretation of inner text
  - interpret-as - one of "spell-out", "date", "number", "time", or "currency"
  - format - way to format text depending on interpret-as
    - number - one of "cardinal", "ordinal", "digits", "year"
    - date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
- <break time=""> - pause for given amount of time
  - time - seconds ("123s") or milliseconds ("123ms")
- <mark name=""> - user-defined mark (marks_before and marks_after attributes of words/sentences)
  - name - name of mark
- <sub alias=""> - substitute alias for inner text
- <phoneme ph="..."> - supply phonemes for inner text
  - ph - phonemes for each word of inner text, separated by whitespace
- <lexicon id="..."> - inline or external pronunciation lexicon
  - id - unique id of lexicon (used in <lookup ref="...">)
  - uri - if empty or missing, lexicon is inline
  - One or more <lexeme> child elements with:
    - Optional role="..." (word roles separated by whitespace)
    - <grapheme>WORD</grapheme> - word text
    - <phoneme>P H O N E M E S</phoneme> - word pronunciation (phonemes separated by whitespace)
- <lookup ref="..."> - use pronunciation lexicon for child elements
  - ref - id from a <lexicon id="...">
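SSML documents that use the <speak> and <s> elements listed above can be generated rather than written by hand. The following is a minimal sketch using only the standard library's xml.etree.ElementTree; the build_ssml helper is hypothetical, and it omits the xmlns declarations used in this document's full examples, so add them if your input needs them.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_ssml(items: Iterable[Tuple[str, str]]) -> str:
    """Return an SSML string with one <s> element per (language, text) pair."""
    speak = ET.Element("speak", {"version": "1.1"})
    for lang, text in items:
        # Each <s> carries its own language, as in the SSML example above
        s = ET.SubElement(speak, "s", {"xml:lang": lang})
        s.text = text  # ElementTree escapes &, < and > when serializing
    return ET.tostring(speak, encoding="unicode")


ssml_text = build_ssml([
    ("en-US", "Today at 4pm, 2/1/2000."),
    ("it", "Un mese fà, 2/1/2000."),
])
print(ssml_text)
```

The resulting string is intended to be passed to sentences(ssml_text, ssml=True), as in the SSML example near the top of this document.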
Word Roles
During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as gruut:<TAG>. For initialisms and spell-out, the role gruut:letter is used to indicate that e.g., "a" should be spoken as /eɪ/ instead of /ə/.
For en-us, the following additional roles are available from the part-of-speech tagger:
- gruut:CD - number
- gruut:DT - determiner
- gruut:IN - preposition or subordinating conjunction
- gruut:JJ - adjective
- gruut:NN - noun
- gruut:PRP - personal pronoun
- gruut:RB - adverb
- gruut:VB - verb
- gruut:VBD - verb (past tense)
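As a rough illustration of role-based disambiguation, the sketch below keys the two pronunciations of "wound" from the example output at the top of this document by role. It is not gruut's internal lookup; the LEXICON table, both function names, and the assumption that the tagger labels the two uses as VBD (the Penn Treebank past-tense tag) and NN are made up for this example.

```python
from typing import Dict, List

# Pronunciations copied from the example output above; the role assignment
# (past-tense verb vs. noun) is an assumption for illustration
LEXICON: Dict[str, Dict[str, List[str]]] = {
    "wound": {
        "gruut:VBD": ["w", "ˈaʊ", "n", "d"],
        "gruut:NN": ["w", "ˈu", "n", "d"],
    }
}


def role_from_pos(pos: str) -> str:
    """Derive a word role from a part of speech tag as gruut:<TAG>."""
    return f"gruut:{pos}"


def pronounce(word: str, pos: str) -> List[str]:
    """Pick the pronunciation matching the word's role, else the first one."""
    by_role = LEXICON[word]
    return by_role.get(role_from_pos(pos), next(iter(by_role.values())))


print(*pronounce("wound", "VBD"))  # w ˈaʊ n d
print(*pronounce("wound", "NN"))   # w ˈu n d
```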
Inline Lexicons
Inline pronunciation lexicons are supported via the <lexicon> and <lookup> tags. gruut diverges slightly from the SSML standard here by allowing lexicons to be defined within the SSML document itself (uri is blank or missing). Additionally, the id attribute of the <lexicon> element can be left off to indicate a "default" inline lexicon that does not require a corresponding <lookup> tag.
For example, the following document will yield three different pronunciations for the word "tomato":
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made up pronunciation for fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>
The first "tomato" will be looked up in the U.S. English lexicon (/t ə m ˈeɪ t oʊ/). Within the <lookup> tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a role attached (selecting a made up pronunciation in this case).
Even further from the SSML standard, gruut allows you to leave off the <lexicon> id entirely. With no id, a <lookup> tag is no longer needed, allowing you to override the pronunciation of any word in the document:
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- No id means change all words without a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
</speak>
This will yield a pronunciation of /t ə m ˈɑ t oʊ/ for all instances of "tomato" in the document (unless they have a <lookup>).
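To show how the lexeme entries of an inline lexicon map a word and an optional role to a list of phonemes, here is a small standard-library sketch. It is not gruut's parser: read_lexemes is a hypothetical helper, the XML is a comment-free copy of the "test" lexicon from the first example, and the role is read from either the <lexeme> or the <grapheme> element because both placements appear in this document.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

LEXICON_XML = """
<lexicon xml:id="test" alphabet="ipa">
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>t ə m ˈɑ t oʊ</phoneme>
  </lexeme>
  <lexeme>
    <grapheme role="fake-role">tomato</grapheme>
    <phoneme>t ə m ˈi t oʊ</phoneme>
  </lexeme>
</lexicon>
"""


def read_lexemes(xml_text: str) -> Dict[Tuple[str, Optional[str]], List[str]]:
    """Map (word, role) to a phoneme list for every <lexeme> in the lexicon."""
    entries: Dict[Tuple[str, Optional[str]], List[str]] = {}
    for lexeme in ET.fromstring(xml_text.strip()).iter("lexeme"):
        grapheme = lexeme.find("grapheme")
        phoneme = lexeme.find("phoneme")
        role = lexeme.get("role") or grapheme.get("role")
        word = "".join(grapheme.itertext()).strip()
        # Individual phonemes are separated by whitespace
        entries[(word, role)] = "".join(phoneme.itertext()).split()
    return entries


for key, phonemes in read_lexemes(LEXICON_XML).items():
    print(key, phonemes)
```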
Intended Audience
gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.
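The lookup-then-guess strategy described above can be sketched as follows. This is a simplified illustration rather than gruut's implementation: the phonemize function is hypothetical, the lexicon is a plain dict holding one entry taken from the command-line output earlier, and letter_guess is only a placeholder standing in for a trained grapheme-to-phoneme model.

```python
from typing import Callable, Dict, List

# One entry copied from the 'More text.' output shown earlier
LEXICON: Dict[str, List[str]] = {"text": ["t", "ˈɛ", "k", "s", "t"]}


def letter_guess(word: str) -> List[str]:
    """Placeholder for a grapheme-to-phoneme model: one symbol per letter."""
    return list(word)


def phonemize(
    word: str,
    lexicon: Dict[str, List[str]],
    guess: Callable[[str], List[str]],
) -> List[str]:
    """Look the word up in the lexicon first; guess only when it is missing."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    return guess(key)


print(*phonemize("Text", LEXICON, letter_guess))  # t ˈɛ k s t
print(*phonemize("gruut", LEXICON, letter_guess))  # g r u u t
```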
For each supported language, gruut includes:
- A word pronunciation lexicon built from open source data
  - See pron_dict
- A pre-trained grapheme-to-phoneme model for guessing word pronunciations
Some languages also include:
- A pre-trained part of speech tagger built from open source data