sequence and joint-sequence modelling tool for g2p

Sequitur G2P

A trainable Grapheme-to-Phoneme converter.

Introduction

Sequitur G2P is a data-driven grapheme-to-phoneme converter written at RWTH Aachen University by Maximilian Bisani.

The method used in this software is described in

   M. Bisani and H. Ney: "Joint-Sequence Models for Grapheme-to-Phoneme
   Conversion". Speech Communication, Volume 50, Issue 5, May 2008,
   Pages 434-451

   (available online at http://dx.doi.org/10.1016/j.specom.2008.01.002)

This software is made available to you under the terms of the GNU General Public License. It can be used for experimentation and as part of other free software projects. For details see the licensing terms below.

If you publish about work that involves the use of this software, please cite the above paper. (You should feel obliged to do so by rules of good scientific conduct.)

The original README also contains these lines: "You may contact the author with any questions or comments via e-mail: maximilian.bisani@rwth-aachen.de. For questions regarding current releases of Sequitur G2P contact Pavel Golik (golik@cs.rwth-aachen.de)." We are not sure how active these contacts still are. If needed, feel free to create an issue on https://github.com/sequitur-g2p/sequitur-g2p, and we will try to help.

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License Version 2 (June 1991) as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, you will find it at http://www.gnu.org/licenses/gpl.html, or write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110, USA.

Should a provision of no. 9 and 10 of the GNU General Public License be invalid or become invalid, a valid provision is deemed to have been agreed upon which comes closest to what the parties intended commercially. In any case guarantee/warranty shall be limited to gross negligent actions or intended actions or fraudulent concealment.

Installing

To build and use this software you need to have the following parts installed: Python, NumPy, SWIG and a C++ compiler.

To install, change to the source directory and type:

    python setup.py install --prefix /usr/local

You may substitute /usr/local with some other directory. If you do so, make sure that some-other-directory/lib/python2.7/site-packages/ is in your PYTHONPATH (adjust python2.7 to match your Python version), e.g. by typing:

    export PYTHONPATH=some-other-directory/lib/python2.7/site-packages

You can also install via pip by pointing it at this repository. You still need SWIG and a C++ compiler.

pip install numpy
pip install git+https://github.com/sequitur-g2p/sequitur-g2p@master

Note: when installing on macOS, you might run into issues because the default C++ standard library is clang's libc++. If that is the case, try installing with:

CPPFLAGS="-stdlib=libstdc++" pip install git+https://github.com/sequitur-g2p/sequitur-g2p@master
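
Before building from source, it can help to confirm that the prerequisites mentioned above are visible from the current environment. The following is a minimal sketch, not part of Sequitur G2P: the function name check_prerequisites and the list of candidate compiler names are assumptions made for illustration, and finding an executable on the PATH does not guarantee that the build will succeed.

```python
import importlib.util
import shutil


def check_prerequisites(compilers=("c++", "g++", "clang++")):
    """Report which build prerequisites are visible from this environment.

    The compiler names are illustrative guesses; adjust them to your system.
    """
    return {
        # SWIG must be on the PATH for a source build.
        "swig": shutil.which("swig") is not None,
        # Finding any one of the candidate compiler names is treated as enough.
        "c++ compiler": any(shutil.which(c) is not None for c in compilers),
        # The pip instructions above install NumPy first.
        "numpy": importlib.util.find_spec("numpy") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```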

Using

Sequitur G2P is a data-driven grapheme-to-phoneme converter. In fact, it can be applied to any monotone sequence translation problem, provided the source and target alphabets are small (fewer than 255 symbols). Data-driven means that you need to train it with example pronunciations. It has no built-in linguistic knowledge whatsoever, which means that it should work for any alphabetic language. Training takes a pronunciation dictionary and creates a model file. The model file can then be used to transcribe words that were not in the dictionary.
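
To check the alphabet-size condition on your own data, you can count the distinct source and target symbols in a lexicon. The sketch below is an illustration only. It assumes the lexicon layout described in step 1 of the guide below (a word followed by whitespace-separated phonemes) and assumes that each character of the word counts as one source symbol. The function names and the sample entries are hypothetical.

```python
def symbol_inventories(lines):
    """Collect the distinct graphemes and phonemes from lexicon lines."""
    graphemes, phonemes = set(), set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        graphemes.update(word)   # assumption: one character = one source symbol
        phonemes.update(phones)  # each whitespace-separated token = one phoneme
    return graphemes, phonemes


def fits_limit(lines, limit=255):
    """Return True if both inventories have fewer than `limit` symbols."""
    graphemes, phonemes = symbol_inventories(lines)
    return len(graphemes) < limit and len(phonemes) < limit


# Hypothetical two-entry lexicon, for illustration only.
sample = ["cat k ae t", "dog d ao g"]
graphemes, phonemes = symbol_inventories(sample)
print(len(graphemes), len(phonemes), fits_limit(sample))
```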

Here is a step-by-step guide to get you started:

  1. Obtain a pronunciation dictionary for training. The format is one word per line. Each line contains the orthographic form of the word followed by the corresponding phonemic transcription. The word and all phonemes must be separated by white space, so neither words nor phoneme symbols may contain blanks. We'll assume your training lexicon is called train.lex, and that you have set aside a portion for testing as test.lex, which is disjoint from train.lex.

  2. Train a model. To create a first model type:

    g2p.py --train train.lex --devel 5% --write-model model-1

    This first model will be rather poor because it is only a unigram. To create higher order models you need to run g2p.py again:

    g2p.py --model model-1 --ramp-up --train train.lex --devel 5% --write-model model-2

    Repeat this a couple of times:

    g2p.py --model model-2 --ramp-up --train train.lex --devel 5% --write-model model-3
    g2p.py --model model-3 --ramp-up --train train.lex --devel 5% --write-model model-4
    ...
    
  3. Evaluate the model. To find out how accurately your model can transcribe unseen words type:

    g2p.py --model model-6 --test test.lex

  4. Transcribe new words. Prepare a list of words you want to transcribe as a simple text file words.txt with one word per line (and no phonemic transcription), then type:

    g2p.py --model model-3 --apply words.txt
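
The repeated ramp-up calls in step 2 follow a regular pattern, so they can be generated rather than typed by hand. The sketch below only builds the command lines using the options shown above; it does not run them. The function name training_commands and its parameters are hypothetical helpers and not part of g2p.py.

```python
def training_commands(train_lex, n_models, devel="5%", prefix="model-"):
    """Build the g2p.py command lines for a chain of n_models models."""
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    # The first model is trained from scratch (a unigram model).
    commands = [
        ["g2p.py", "--train", train_lex, "--devel", devel,
         "--write-model", f"{prefix}1"]
    ]
    # Each further model starts from the previous one and ramps up the order.
    for i in range(2, n_models + 1):
        commands.append(
            ["g2p.py", "--model", f"{prefix}{i - 1}", "--ramp-up",
             "--train", train_lex, "--devel", devel,
             "--write-model", f"{prefix}{i}"]
        )
    return commands


if __name__ == "__main__":
    for command in training_commands("train.lex", 4):
        print(" ".join(command))
```

If g2p.py is on your PATH, each list could be passed to subprocess.run(command, check=True) so that the chain stops at the first failing step.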

Random comments:

  • You cannot open models created in a Python 3 environment inside a Python 2 environment. The opposite works.
  • Whenever a file name is required, you can specify "-" to mean standard input or standard output.
  • If a file name ends in ".gz", it is assumed that the file is (or should be) compressed using gzip.
  • For the time being you need to type g2p.py --help and/or read the source to find out the other things g2p.py can do. Sorry about that.
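
If you write your own scripts around g2p.py, you may want them to follow the same file-name conventions. The sketch below shows one way to do that in Python. It is not the code g2p.py itself uses; the function name open_text and the choice of UTF-8 as the text encoding are assumptions.

```python
import gzip
import sys


def open_text(name, mode="r"):
    """Open a text file, treating "-" and a ".gz" suffix specially."""
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    if name == "-":
        # "-" means standard input when reading, standard output when writing.
        return sys.stdin if mode == "r" else sys.stdout
    if name.endswith(".gz"):
        # A ".gz" suffix selects transparent gzip (de)compression.
        return gzip.open(name, mode + "t", encoding="utf-8")
    return open(name, mode, encoding="utf-8")
```

For example, open_text("words.txt.gz", "w") returns a handle that compresses what you write, while open_text("-") returns standard input.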



