Skip to main content

Tools for transcribing languages into IPA.

Project description

A library and tool for transliterating orthographic text as IPA (International Phonetic Alphabet).

Usage

The principle script for transliterating orthographic text as IPA is epitranscriber.py. It takes one argument, the ISO 639-3 code for the language of the orthographic text, takes orthographic text at standard in and writes Unicode IPA to standard out. $ echo “Düğün olur bayram gelir” | epitranscribe.py “tur-Latn” dyɰyn oluɾ bajɾam ɟeliɾ $ epitranscribe.py “tur-Latn” < orthography.txt > phonetic.txt Additionally, the small Python modules epitran and epitran.vector can be used to easily write more sophisticated Python programs for deploying the Epitran mapping tables. This is documented below.

Using the epitran Module

The most general functionality in the epitran module is encapsulated in the very simple Epitran class:

Epitran(code, preproc=True, postproc=True, ligatures=False, cedict_file=None).

Its constructor takes one argument, code, the ISO 639-3 code of the language to be transliterated plus a hyphen plus a four letter code for the script (e.g. ‘Latn’ for Latin script, ‘Cyrl’ for Cyrillic script, and ‘Arab’ for a Perso-Arabic script). It also takes optional keyword arguments: * preproc and postproc enable pre- and post-processors. These are enabled by default. * ligatures enables non-standard IPA ligatures like “ʤ” and “ʨ”. * cedict_file gives the path to the CC-CEDict dictionary file (relevant only when working with Mandarin Chinese and which, because of licensing restrictions cannot be distributed with Epitran).

>>> import epitran
>>> epi = epitran.Epitran('uig-Arab')  # Uyghur in Perso-Arabic script
It is now possible to use the Epitran class for English and Mandarin Chinese (Simplified and Traditional) G2P as well as the other langugages that use Epitran's "classic" model. For Chinese, it is necessary to point the constructor to a copy of the [CC-CEDict](https://cc-cedict.org/wiki/) dictionary:

        import epitran epi = epitran.Epitran('cmn-Hans',
        cedict\_file='cedict\_1\_0\_ts\_utf-8\_mdbg.txt') The
        ``Epitran`` class has only one "public" method right now,
        ``transliterate``:

Epitran.transliterate(text, normpunc=False, ligatures=False). Convert text (in Unicode-encoded orthography of the language specified in the constructor) to IPA, which is returned. normpunc enables punctuation normalization and ligatures enables non-standard IPA ligatures like “ʤ” and “ʨ”. Usage is illustrated below:

>>> epi.transliterate(u'Düğün')
u'dy\u0270yn'
>>> print(epi.transliterate(u'Düğün'))
dyɰyn

Epitran.word_to_tuples(word, normpunc=False): Takes a word (a Unicode string) in a supported orthography as input and returns a list of tuples with each tuple corresponding to an IPA segment of the word. The tuples have the following structure:

(
    character_category :: String,
    is_upper :: Integer,
    orthographic_form :: Unicode String,
    phonetic_form :: Unicode String,
    segments :: List<Tuples>
)

Note that word_to_tuples is not implemented for all language-script pairs.

The codes for character_category are from the initial characters of the two character sequences listed in the “General Category” codes found in Chapter 4 of the Unicode Standard. For example, “L” corresponds to letters and “P” corresponds to production marks. The above data structure is likely to change in subsequent versions of the library. The structure of segments is as follows:

(
    segment :: Unicode String,
    vector :: List<Integer>
)

Here is an example of an interaction with word_to_tuples:

>>> import epitran
>>> epi = epitran.Epitran('tur-Latn')
>>> epi.word_to_tuples(u'Düğün')
[(u'L', 1, u'D', u'd', [(u'd', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])]), (u'L', 0, u'u\u0308', u'y', [(u'y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), (u'L', 0, u'g\u0306', u'\u0270', [(u'\u0270', [-1, 1, -1, 1, 0, -1, -1, 0, 1, -1, -1, 0, -1, 0, -1, 1, -1, 0, -1, 1, -1])]), (u'L', 0, u'u\u0308', u'y', [(u'y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), (u'L', 0, u'n', u'n', [(u'n', [-1, 1, 1, -1, -1, -1, 1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])])]

Preprocessors and Their Pitfalls

In order to build a maintainable orthography to phoneme mapper, it is sometimes necessary to employ preprocessors that make contextual substitutions of symbols before text is passed to a orthography-to-IPA mapping system that preserves relationships between input and output characters. This is particularly true of languages with a poor sound-symbols correspondence (like French and English). Languages like French are particularly good targets for this approach because the pronunication of a given string of letters is highly predictable even though the individual symbols often do not map neatly into sounds. (Sound-symbol correspondence is so poor in English that effective English G2P systems rely heavily on pronouncing dictionaries.)

Preprocessing the inputs words to allow for straightforward grapheme-to-phoneme mappings (as is done in the current version of epitran for some languages) is advantaeous because the restricted regular expression language used to write the preprocessing rules is more powerful than the language for the mapping rules and allows the equivalent of many mapping rules to be written with a single rule. Without them, providing epitran support for languages like French and German would not be practical. However, they do present some problems. Specifically, when using a language with a preprocessor, one must be aware that the input word will not always be identical to the concatenation of the orthographic strings (orthographic_form) output by Epitran.word_to_tuples. Instead, the output of word_to_tuple will reflect the output of the preprocessor, which may delete, insert, and change letters in order to allow direct orthography-to-phoneme mapping at the next step. The same is true of other methods that rely on Epitran.word_to_tuple such as VectorsWithIPASpace.word_to_segs from the epitran.vector module (deprecated).

Using the epitran.vector Module (deprecated)

The epitran.vector module is also very simple. It contains one class, VectorsWithIPASpace, including one method of interest, word_to_segs:

The constructor for VectorsWithIPASpace takes two arguments: - code: the language-script code for the language to be processed. - spaces: the codes for the punctuation/symbol/IPA space in which the characters/segments from the data are expected to reside. The available spaces are listed below.

Its principle method is word_to_segs:

VectorWithIPASpace.word_to_segs(word, normpunc=False) Word is a Unicode string. If the keyword argument normpunc is set to True, punctuation disovered in word is normalized to ASCII equivalents.

A typical interaction with the VectorsWithIPASpace object via the word_to_segs method is illustrated here:

>>> import epitran.vector
>>> vwis = epitran.vector.VectorsWithIPASpace('uzb-Latn', 'uzb-with_attached_suffixes-space')
>>> vwis.word_to_segs(u'darë')
[(u'L', 0, u'd', u'd\u032a', u'40', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 0, -1]), (u'L', 0, u'a', u'a', u'37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1]), (u'L', 0, u'r', u'r', u'54', [-1, 1, 1, 1, 0, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, 0, 0, 0, -1, 0, -1]), (u'L', 0, u'e\u0308', u'ja', u'46', [-1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, 0, -1, 1, -1, -1, -1, 0, -1]), (u'L', 0, u'e\u0308', u'ja', u'37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1])]

(It is important to note that, though the word that serves as input–darë–has four letters, the output contains four tuples because the last letter in darë actually corresponds to two IPA segments, /j/ and /a/.) The returned data structure is a list of tuples, each with the following structure:

(
    character_category :: String,
    is_upper :: Integer,
    orthographic_form :: Unicode String,
    phonetic_form :: Unicode String,
    in_ipa_punc_space :: Integer,
    phonological_feature_vector :: List<Integer>
)

A few notes are in order regarding this data structure: - character_category is defined as part of the Unicode standard (Chapter 4). It consists of a single, uppercase letter from the set {‘L’, ‘M’, ‘N’, ‘P’, ‘S’, ‘Z’, ‘C’}.. The most frequent of these are ‘L’ (letter), ‘N’ (number), ‘P’ (punctuation), and ‘Z’ (separator [including separating white space]). - is_upper consists only of integers from the set {0, 1}, with 0 indicating lowercase and 1 indicating uppercase. - The integer in in_ipa_punc_space is an index to a list of known characters/segments such that, barring degenerate cases, each character or segment is assignmed a unique and globally consistant number. In cases where a character is encountered which is not in the known space, this field has the value -1. - The length of the list phonological_feature_vector should be constant for any instantiation of the class (it is based on the number of features defined in panphon) but is–in principles–variable. The integers in this list are drawn from the set {-1, 0, 1}, with -1 corresponding to ‘-’, 0 corresponding to ‘0’, and 1 corresponding to ‘+’. For characters with no IPA equivalent, all values in the list are 0.

Language Support

Transliteration Language/Script Pairs

Code

Language (Script)

aar-Latn

Afar

amh-Ethi

Amharic

aze-Cyrl

Azerbaijani (Cyrillic)

aze-Latn

Azerbaijani (Latin)

ben-Beng

Bengali

ceb-Latn

Cebuano

cmn-Hans

Mandarin (Simplified)

cmn-Hant

Mandarin (Traditional)

ckb-Arab

Sorani

deu-Latn

German

deu-Latn-np

German*

eng-Latn

English**

fas-Arab

Farsi (Perso-Arabic)

fra-Latn

French

fra-Latn-np

French*

hau-Latn

Hausa

hin-Deva

Hindi

hun-Latn

Hungarian

ilo-Latn

Ilocano

ind-Latn

Indonesian

ita-Latn

Italian

jav-Latn

Javanese

kaz-Cyrl

Kazakh (Cyrillic)

kaz-Latn

Kazakh (Latin)

kin-Latn

Kinyarwanda

kir-Arab

Kyrgyz (Perso-Arabic)

kir-Cyrl

Kyrgyz (Cyrillic)

kir-Latn

Kyrgyz (Latin)

krm-Latn

Kurmanji

mar-Deva

Marathi

nld-Latn

Dutch

nya-Latn

Chichewa

orm-Latn

Oromo

pan-Guru

Punjabi (Eastern)

rus-Cyrl

Russian

sna-Latn

Shona

som-Latn

Somali

spa-Latn

Spanish

swa-Latn

Swahili

swe-Latn

Swedish

tam-Taml

Tamil

tel-Telu

Telugu

tgk-Cyrl

Tajik

tgl-Latn

Tagalog

tha-Thai

Thai

tir-Ethi

Tigrinya

tuk-Cyrl

Turkmen (Cyrillic)

tuk-Latn

Turkmen (Latin)

tur-Latn

Turkish (Latin)

uig-Arab

Uyghur (Perso-Arabic)

uzb-Cyrl

Uzbek (Cyrillic)

uzb-Latn

Uzbek (Latin)

vie-Latn

Vietnamese

xho-Latn

Xhosa

yor-Latn

Yoruba

zul-Latn

Zulu

*These language preprocessors and maps naively assume a phonemic orthography. **English G2P requires the installation of the CMU Flite speech synthesis system.

Language “Spaces”

Code

Language

Note

deu-Latn

German

nld-Latn

Dutch

spa-Latn

Spanish

tur-Latn-suf

Turkish

Based on data with suffixes attached

tur-Latn-nosuf

Turkish

Based on data with suffixes removed

uzb-Latn-suf

Uzbek

Based on data with suffixes attached

Note that major languages, including French, are missing from this table to to a lack of appropriate text data.

Installation of Flite (for English G2P)

For use with most languages, Epitran requires no special installation steps. It can be installed as an ordinarary python package, either with pip or by running python setup.py install in the root of the source directory. However, English G2P in Epitran relies on CMU Flite, a speech synthesis package by Alan Black and other speech researchers at Carnegie Mellon University. For the current version of Epitran, you should follow the installation instructions for lex_lookup, which is used as the default G2P interface for Epitran.

t2p

The epitran.flite module shells out to the flite speech synthesis system to do English G2P. Flite must be installed in order for this module to function. The t2p binary from flite is not installed by default and must be manually copied into the path. An illustration of how this can be done on a Unix-like system is given below. Note that GNU gmake is required and that, if you have another make installed, you may have to call gmake explicitly:

$ tar xjf flite-2.0.0-release.tar.bz2
$ cd flite-2.0.0-release/
$ ./configure && make
$ sudo make install
$ sudo cp bin/t2p /usr/local/bin

You should adapt these instructions to local conditions. Installation on Windows is easiest when using Cygwin. You will have to use your discretion in deciding where to put t2p.exe on Windows, since this may depend on your python setup. Other platforms are likely workable but have not been tested.

lex_lookup

t2p does not behave as expected on letter sequences that are highly infrequent in English. In such cases, t2p gives the pronunciation of the English letters of the name, rather than an attempt at the pronunciation of the name. There is a different binary included in the most recent (pre-release) versions of Flite that behaves better in this regard, but takes some extra effort to install. To install, you need to obtain at least version 2.0.5 of Flite. Untar and compile the source, following the steps below, adjusting where appropriate for your system:

$ tar xjf flite-2.0.5-current.tar.bz2
$ cd flite-2.0.5-current
$ ./configure && make
$ sudo make install
$ cd testsuite
$ make lex_lookup
$ sudo cp lex_lookup /usr/local/bin

When installing on MacOS and other systems that use a BSD version of cp, some modification to a Makefile must be made in order to install flite-2.0.5 (between steps 3 and 4). Edit main/Makefile and change both instances of cp -pd to cp -pR. Then resume the steps above at step 4.

Usage

To use lex_lookup, simply instantiate Epitran as usual, but with the code set to ‘eng-Latn’:

>>> import epitran
>>> epi = epitran.Epitran('eng-Latn')
>>> print epi.transliterate(u'Berkeley')
bɹ̩kli

Extending Epitran with map files, preprocessors and postprocessors

Language support in Epitran is provided through map files, which define mappings between orthographic and phonetic units, preprocessors that run before the map is applied, and postprocessors that run after the map is applied. These are all defined in UTF8-encoded, comma-delimited value (CSV) files. The files are each named -.csv where is the (three letter, all lowercase) ISO 639-3 code for the language and is the (four letter, capitalized) ISO 15924 code for the script. These files reside in the data directory of the Epitran installation under the map, pre, and post subdirectories, respectively.

Map files (mapping tables)

The map files are simple, two-column files where the first column contains the orthgraphic characters/sequences and the second column contains the phonetic characters/sequences. For many languages (most languages with unambiguous, phonemically adequate orthographies) just this easy-to-produce mapping file is adequate to produce a serviceable G2P system.

The first row is a header and is discarded. For consistency, it should contain the fields “Orth” and “Phon”. The following rows by consist of fields of any length, separated by a comma. The same phonetic form (the second field) may occur any number of times but an orthographic form may only occur once. Where one orthograrphic form is a prefix of another form, the longer form has priority in mapping. In other words, matching between orthographic units and orthographic strings is greedy. Mapping works by finding the longest prefix of the orthographic form and adding the corresponding phonetic string to the end of the phonetic form, then removing the prefix from the orthographic form and continuing, in the same manner, until the orthographic form is consumed. If no non-empty prefix of the orthographic form is present in the mapping table, the first character in the orthographic form is removed and appended to the phonetic form. The normal sequence then resumes. This means that non-phonetic characters may end up in the “phonetic” form, which we judge to be better than loosing information through an inadequate mapping table.

Preprocesssors and postprocessors

For language-script pairs with more complicated orthographies, it is sometimes necessary to manipulate the orthographic form prior to mapping or to manipulate the phonetic form after mapping. This is done, in Epitran, with grammars of context-sensitive string rewrite rules. In truth, these rules would be more than adequate to solve the mapping problem as well but in practical terms, it is usually easier to let easy-to-understand and easy-to-maintain mapping files carry most of the weight of conversion and reserve the more powerful context sensitive grammar formalism for pre- and post-processing.

To make it easy to edit the files in a spreadsheet (like LibreOffice Calc), the files are formatted as CSV. Of course, they can be edited in text editor as well. The first row is a header, which should have the fields “a”, “b”, “X”, and “Y”, corresponding to the parts of “a → b / X _ Y”, which can be read as “a is rewritten as b in the context between X and Y”. It is equivalent to XaY → XbY. Each subsequent row is a rule in this format. The symbol “#” matches a word-boundary (at the beginning and end of a word-length token). For example, a rule that changes “e” to “ə” at the end of a word, for use in a postprocessor, would have the following form:

e,ə,,#

Which corresponds to:

e → ə / _ #

The rules apply in order, so earlier rules may “feed” and “bleed” later rules. Therefore, their sequence is very important and can be leveraged in order to achieve valuable results.

All of the fields are strings (of zero or more characters). If “a” is the empty string, the rule will insert “b” in the environment between “X” and “Y”. If “b” is the empty string, the rule will delete “a” in the environment betwee “X” and “Y”. It is sometimes useful to write rules that insert custom symbols that trigger (or prevent the triggering of) subsequent rules (and which are subsequently deleted). By convention, these symbols consist of lowercase characters enclosed in angle brackets (“<” and “>”).

The strings are combined to form a regular expression using the python regex module (a drop-in replacement for the re module). Because of this, it is possible to use most regex notation in the strings. For example, to replace “a” with “aa” before “b”, “d”, or “g’, one would use the following rule:

a,aa,,(b|d|g)

or, less optimally:

a,aa,,[bdg]

There is a special construct for handling cases of metathesis (where “AB” is replaced with “BA”). For example, the rule:

(?P<sw1>[เแโไใไ])(?P<sw2>.),,,

Will “swap” the positions of any character in “เแโไใไ” and any following character.

Project details


Release history Release notifications | RSS feed

This version

0.10

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

epitran-0.10.tar.gz (43.3 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page