Tools for transcribing languages into IPA.

Project description

A library and tool for transliterating orthographic text as IPA (International Phonetic Alphabet).

Usage

The principal script for transliterating orthographic text as IPA is epitranscribe.py. It takes one argument, the ISO 639-3 code (plus script code) for the language of the orthographic text; it reads orthographic text from standard input and writes Unicode IPA to standard output.

$ echo "Düğün olur bayram gelir" | epitranscribe.py "tur-Latn"
dyɰyn oluɾ bajɾam ɟeliɾ
$ epitranscribe.py "tur-Latn" < orthography.txt > phonetic.txt

Additionally, the small Python modules epitran and epitran.vector can be used to easily write more sophisticated Python programs for deploying the Epitran mapping tables. This is documented below.

Using the epitran Module

The functionality in the epitran module is encapsulated in the very simple Epitran class. Its constructor takes one argument, code, the ISO 639-3 code of the language to be transliterated plus a hyphen plus a four-letter code for the script (e.g. ‘Latn’ for Latin script, ‘Cyrl’ for Cyrillic script, and ‘Arab’ for Perso-Arabic script).

>>> import epitran
>>> epi = epitran.Epitran('tur-Latn')

The Epitran class has only a few “public” methods (to the extent that such a concept exists in Python). The most important are transliterate and word_to_tuples:

Epitran.transliterate(text): Converts text (in the Unicode-encoded orthography of the language specified in the constructor) to IPA and returns the result.

>>> epi.transliterate(u'Düğün')
u'dy\u0270yn'
>>> print(epi.transliterate(u'Düğün'))
dyɰyn

Epitran.word_to_tuples(word, normpunc=False): Takes a word (a Unicode string) in a supported orthography as input and returns a list of tuples with each tuple corresponding to an IPA segment of the word. The tuples have the following structure:

(
    character_category :: String,
    is_upper :: Integer,
    orthographic_form :: Unicode String,
    phonetic_form :: Unicode String,
    segments :: List<Tuples>
)

The codes for character_category are the initial characters of the two-character “General Category” codes defined in Chapter 4 of the Unicode Standard. For example, “L” corresponds to letters and “P” corresponds to punctuation marks. The above data structure is likely to change in subsequent versions of the library. The structure of segments is as follows:

(
    segment :: Unicode String,
    vector :: List<Integer>
)

Here is an example of an interaction with word_to_tuples:

>>> import epitran
>>> epi = epitran.Epitran('tur-Latn')
>>> epi.word_to_tuples(u'Düğün')
[(u'L', 1, u'D', u'd', [(u'd', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])]), (u'L', 0, u'u\u0308', u'y', [(u'y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), (u'L', 0, u'g\u0306', u'\u0270', [(u'\u0270', [-1, 1, -1, 1, 0, -1, -1, 0, 1, -1, -1, 0, -1, 0, -1, 1, -1, 0, -1, 1, -1])]), (u'L', 0, u'u\u0308', u'y', [(u'y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), (u'L', 0, u'n', u'n', [(u'n', [-1, 1, 1, -1, -1, -1, 1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])])]
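The tuples returned by word_to_tuples can be unpacked directly with ordinary indexing. As a minimal sketch (using a hand-copied excerpt of the output above, with the feature vectors elided so it runs without epitran installed), rebuilding the IPA transcription is just a matter of joining the phonetic forms:

```python
# Hand-copied excerpt of word_to_tuples output for u'Düğün' (see above);
# feature vectors are elided ([]) since only the phonetic forms matter here.
tuples = [
    ('L', 1, 'D', 'd', [('d', [])]),
    ('L', 0, 'u\u0308', 'y', [('y', [])]),
    ('L', 0, 'g\u0306', '\u0270', [('\u0270', [])]),
    ('L', 0, 'u\u0308', 'y', [('y', [])]),
    ('L', 0, 'n', 'n', [('n', [])]),
]

# Join the phonetic_form field (index 3) of each tuple to rebuild the IPA string.
ipa = ''.join(t[3] for t in tuples)
print(ipa)  # dyɰyn
```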

Preprocessors and Their Pitfalls

In order to build a maintainable orthography to phoneme mapper, it is sometimes necessary to employ preprocessors that make contextual substitutions of symbols before text is passed to an orthography-to-IPA mapping system that preserves relationships between input and output characters. This is particularly true of languages with a poor sound-symbol correspondence (like French and English). Languages like French are particularly good targets for this approach because the pronunciation of a given string of letters is highly predictable even though the individual symbols often do not map neatly into sounds. (Sound-symbol correspondence is so poor in English that effective English G2P systems rely heavily on pronouncing dictionaries.)

Preprocessing the input words to allow for straightforward grapheme-to-phoneme mappings (as is done in the current version of epitran for some languages) is advantageous because the restricted regular expression language used to write the preprocessing rules is more powerful than the language for the mapping rules and allows the equivalent of many mapping rules to be written with a single rule. Without them, providing epitran support for languages like French and German would not be practical. However, they do present some problems. Specifically, when using a language with a preprocessor, one must be aware that the input word will not always be identical to the concatenation of the orthographic strings (orthographic_form) output by Epitran.word_to_tuples. Instead, the output of word_to_tuples will reflect the output of the preprocessor, which may delete, insert, and change letters in order to allow direct orthography-to-phoneme mapping at the next step. The same is true of other methods that rely on Epitran.word_to_tuples, such as VectorsWithIPASpace.word_to_segs from the epitran.vector module.
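When this matters, a caller can detect preprocessor rewrites by comparing the concatenated orthographic_form fields against the original word. The helper below is a hypothetical sketch, not part of the epitran API, and the French sample data is invented for illustration:

```python
from unicodedata import normalize

def orthography_roundtrips(word, tuples):
    """Return True if joining the orthographic_form fields (index 2)
    reproduces the input word, modulo Unicode normalization."""
    rebuilt = ''.join(t[2] for t in tuples)
    return normalize('NFD', word) == normalize('NFD', rebuilt)

# Hypothetical word_to_tuples-style output for a French word whose
# silent final letter a preprocessor has deleted (vectors elided).
word = 'grand'
tuples = [
    ('L', 0, 'g', 'ɡ', []),
    ('L', 0, 'r', 'ʁ', []),
    ('L', 0, 'an', 'ɑ̃', []),  # final 'd' removed by the preprocessor
]
print(orthography_roundtrips(word, tuples))  # False
```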

Using the epitran.vector Module

The epitran.vector module is also very simple. It contains one class, VectorsWithIPASpace, including one method of interest, word_to_segs:

The constructor for VectorsWithIPASpace takes two arguments:

- code: the language-script code for the language to be processed.
- space: the code for the punctuation/symbol/IPA space in which the characters/segments from the data are expected to reside. The available spaces are listed below.

Its principal method is word_to_segs:

VectorsWithIPASpace.word_to_segs(word, normpunc=False): word is a Unicode string. If the keyword argument normpunc is set to True, punctuation discovered in word is normalized to ASCII equivalents.

A typical interaction with the VectorsWithIPASpace object via the word_to_segs method is illustrated here:

>>> import epitran.vector
>>> vwis = epitran.vector.VectorsWithIPASpace('uzb-Latn', 'uzb-with_attached_suffixes-space')
>>> vwis.word_to_segs(u'darë')
[(u'L', 0, u'd', u'd\u032a', u'40', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, 1, -1, -1, -1, -1, -1, 0, -1]), (u'L', 0, u'a', u'a', u'37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1]), (u'L', 0, u'r', u'r', u'54', [-1, 1, 1, 1, 0, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, 0, 0, 0, -1, 0, -1]), (u'L', 0, u'e\u0308', u'ja', u'46', [-1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, 0, -1, 1, -1, -1, -1, 0, -1]), (u'L', 0, u'e\u0308', u'ja', u'37', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1])]

(It is important to note that, though the input word darë has four letters, the output contains five tuples because the last letter in darë actually corresponds to two IPA segments, /j/ and /a/.) The returned data structure is a list of tuples, each with the following structure:

(
    character_category :: String,
    is_upper :: Integer,
    orthographic_form :: Unicode String,
    phonetic_form :: Unicode String,
    in_ipa_punc_space :: Integer,
    phonological_feature_vector :: List<Integer>
)

A few notes are in order regarding this data structure:

- character_category is defined as part of the Unicode standard (Chapter 4). It consists of a single uppercase letter from the set {‘L’, ‘M’, ‘N’, ‘P’, ‘S’, ‘Z’, ‘C’}. The most frequent of these are ‘L’ (letter), ‘N’ (number), ‘P’ (punctuation), and ‘Z’ (separator, including separating white space).
- is_upper consists only of integers from the set {0, 1}, with 0 indicating lowercase and 1 indicating uppercase.
- The integer in in_ipa_punc_space is an index into a list of known characters/segments such that, barring degenerate cases, each character or segment is assigned a unique and globally consistent number. When a character is encountered that is not in the known space, this field has the value -1.
- The length of the list phonological_feature_vector should be constant for any instantiation of the class (it is based on the number of features defined in panphon) but is, in principle, variable. The integers in this list are drawn from the set {-1, 0, 1}, with -1 corresponding to ‘-’, 0 corresponding to ‘0’, and 1 corresponding to ‘+’. For characters with no IPA equivalent, all values in the list are 0.
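The feature-vector convention described above can be decoded back into panphon-style +/0/- annotations. A minimal sketch (the feature names here are invented for illustration; the actual inventory is defined by panphon):

```python
# Map vector integers to panphon-style feature values.
SIGN = {-1: '-', 0: '0', 1: '+'}

def decode_vector(names, vector):
    """Pair each feature name with its decoded +/0/- value."""
    return {name: SIGN[value] for name, value in zip(names, vector)}

# Illustrative feature names and a truncated vector; the real feature
# list comes from panphon and is considerably longer.
names = ['syl', 'son', 'cons']
vector = [-1, -1, 1]
print(decode_vector(names, vector))  # {'syl': '-', 'son': '-', 'cons': '+'}
```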

Language Support

Transliteration Languages

| Code        | Language (Script)      |
|-------------|------------------------|
| aze-Cyrl    | Azerbaijani (Cyrillic) |
| aze-Latn    | Azerbaijani (Latin)    |
| deu-Latn    | German                 |
| deu-Latn-np | German*                |
| fra-Latn    | French                 |
| fra-Latn-np | French*                |
| hau-Latn    | Hausa                  |
| ind-Latn    | Indonesian             |
| jav-Latn    | Javanese               |
| kaz-Cyrl    | Kazakh (Cyrillic)      |
| kaz-Latn    | Kazakh (Latin)         |
| kir-Arab    | Kyrgyz (Perso-Arabic)  |
| kir-Cyrl    | Kyrgyz (Cyrillic)      |
| kir-Latn    | Kyrgyz (Latin)         |
| nld-Latn    | Dutch                  |
| spa-Latn    | Spanish                |
| tuk-Cyrl    | Turkmen (Cyrillic)     |
| tuk-Latn    | Turkmen (Latin)        |
| tur-Latn    | Turkish (Latin)        |
| yor-Latn    | Yoruba                 |
| uig-Arab    | Uyghur (Perso-Arabic)  |
| uzb-Cyrl    | Uzbek (Cyrillic)       |
| uzb-Latn    | Uzbek (Latin)          |

*These language preprocessors and maps naively assume a phonemic orthography.

Language “Spaces”

| Code           | Language | Note                                 |
|----------------|----------|--------------------------------------|
| deu-Latn       | German   |                                      |
| nld-Latn       | Dutch    |                                      |
| spa-Latn       | Spanish  |                                      |
| tur-Latn-suf   | Turkish  | Based on data with suffixes attached |
| tur-Latn-nosuf | Turkish  | Based on data with suffixes removed  |
| uzb-Latn-suf   | Uzbek    | Based on data with suffixes attached |

Note that major languages, including French, are missing from this table due to a lack of appropriate text data.

Project details

This version: 0.2