sequence and joint-sequence modelling tool for g2p

Sequitur G2P

A trainable Grapheme-to-Phoneme converter.

Introduction

Sequitur G2P is a data-driven grapheme-to-phoneme converter written at RWTH Aachen University by Maximilian Bisani.

The method used in this software is described in

   M. Bisani and H. Ney: "Joint-Sequence Models for Grapheme-to-Phoneme
   Conversion". Speech Communication, Volume 50, Issue 5, May 2008,
   Pages 434-451

   (available online at http://dx.doi.org/10.1016/j.specom.2008.01.002)

This software is made available to you under the terms of the GNU General Public License. It can be used for experimentation and as part of other free software projects. For details see the licensing terms below.

If you publish about work that involves the use of this software, please cite the above paper. (You should feel obliged to do so by rules of good scientific conduct.)

The original README also contains these lines: "You may contact the author with any questions or comments via e-mail: maximilian.bisani@rwth-aachen.de. For questions regarding current releases of Sequitur G2P contact Pavel Golik (golik@cs.rwth-aachen.de)." However, we are not sure how active these contacts are. If needed, feel free to create an issue at https://github.com/sequitur-g2p/sequitur-g2p, and we will try to help.

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License Version 2 (June 1991) as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, you will find it at http://www.gnu.org/licenses/gpl.html, or write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110, USA.

Should a provision of no. 9 and 10 of the GNU General Public License be invalid or become invalid, a valid provision is deemed to have been agreed upon which comes closest to what the parties intended commercially. In any case guarantee/warranty shall be limited to gross negligent actions or intended actions or fraudulent concealment.

Installing

To build and use this software you need to have the following parts installed: Python, NumPy, SWIG, and a C++ compiler.

To install, change to the source directory and type:

    python setup.py install --prefix /usr/local

You may substitute /usr/local with some other directory. If you do so, make sure that some-other-directory/lib/pythonX.Y/site-packages/ is in your PYTHONPATH (where X.Y is the version of your Python interpreter), e.g. by typing:

    export PYTHONPATH=some-other-directory/lib/pythonX.Y/site-packages

You can also install via pip by pointing it at this repository. You still need SWIG and a C++ compiler.

pip install numpy
pip install git+https://github.com/sequitur-g2p/sequitur-g2p@master

Note: when installing on macOS, you might run into issues because the default C++ standard library is clang's libc++. If that is the case, try installing with:

CPPFLAGS="-stdlib=libstdc++" pip install git+https://github.com/sequitur-g2p/sequitur-g2p@master

Using

Sequitur G2P is a data-driven grapheme-to-phoneme converter. In fact, it can be applied to any monotone sequence translation problem, provided the source and target alphabets are small (fewer than 255 symbols). Data-driven means that you need to train it with example pronunciations. It has no built-in linguistic knowledge whatsoever, which means that it should work for any alphabetic language. Training takes a pronunciation dictionary and creates a model file. The model file can then be used to transcribe words that were not in the dictionary.
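
The alphabet size limit mentioned above can be checked before training. The following is a minimal sketch, not part of Sequitur G2P itself. It assumes the one-word-per-line lexicon format described in the guide below, and it assumes that every character of a word counts as one source symbol. The function name alphabet_sizes and the sample entries are made up for illustration.

```python
def alphabet_sizes(lexicon_lines):
    """Return (number of distinct graphemes, number of distinct phonemes)."""
    graphemes, phonemes = set(), set()
    for line in lexicon_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        graphemes.update(word)   # assumption: each character is one grapheme
        phonemes.update(phones)  # each whitespace-separated token is one phoneme
    return len(graphemes), len(phonemes)


# Hypothetical sample entries; the phoneme symbols are illustrative only.
sample = ["cat k ae t", "dog d ao g"]
n_graphemes, n_phonemes = alphabet_sizes(sample)
if max(n_graphemes, n_phonemes) >= 255:
    print("alphabet too large for the stated limit")
```

To check a real lexicon, pass an open file object instead of the sample list.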

Here is a step-by-step guide to get you started:

  1. Obtain a pronunciation dictionary for training. The format is one word per line. Each line contains the orthographic form of the word followed by the corresponding phonemic transcription. The word and all phonemes need to be separated by white space. The word and phoneme symbols may thus not contain blanks. We'll assume your training lexicon is called train.lex, and that you set aside some portion for testing purposes as test.lex, which is disjoint from train.lex.

  2. Train a model. To create a first model type:

    g2p.py --train train.lex --devel 5% --write-model model-1

    This first model will be rather poor because it is only a unigram. To create higher order models you need to run g2p.py again:

    g2p.py --model model-1 --ramp-up --train train.lex --devel 5% --write-model model-2

    Repeat this a couple of times:

    g2p.py --model model-2 --ramp-up --train train.lex --devel 5% --write-model model-3
    g2p.py --model model-3 --ramp-up --train train.lex --devel 5% --write-model model-4
    ...
    
  3. Evaluate the model. To find out how accurately your model can transcribe unseen words type:

    g2p.py --model model-6 --test test.lex

  4. Transcribe new words. Prepare a list of words you want to transcribe as a simple text file words.txt with one word per line (and no phonemic transcription), then type:

    g2p.py --model model-3 --apply words.txt
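
The repeated ramp-up calls in step 2 follow a regular pattern, so they can be generated rather than typed by hand. The sketch below only builds the command lines and uses only the options shown in the steps above. The helper name training_commands and its parameters are hypothetical, and the number of ramp-up rounds is an assumption you should tune for your own data.

```python
def training_commands(lexicon="train.lex", n_models=4, devel="5%"):
    """Build the g2p.py command lines for model-1 .. model-<n_models>."""
    commands = []
    for order in range(1, n_models + 1):
        cmd = ["g2p.py"]
        if order > 1:
            # Every model after the first starts from the previous one.
            cmd += ["--model", f"model-{order - 1}", "--ramp-up"]
        cmd += ["--train", lexicon, "--devel", devel,
                "--write-model", f"model-{order}"]
        commands.append(cmd)
    return commands


for cmd in training_commands():
    print(" ".join(cmd))
```

Each entry is an argument list, so it could be handed to subprocess.run(cmd, check=True) to actually run the training, provided g2p.py is on your PATH.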

Random comments:

  • You cannot open models created in a python3 environment inside a python2 environment. The opposite works.
  • Whenever a file name is required, you can specify "-" to mean standard in, or standard out.
  • If a file name ends in ".gz", it is assumed that the file is (or should be) compressed using gzip.
  • For the time being you need to type g2p.py --help and/or read the source to find out the other things g2p.py can do. Sorry about that.
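
The file name conventions listed above ("-" for standard input or output, ".gz" for gzip compression) can be mirrored in your own pre- or post-processing scripts. This is a minimal sketch of the convention, not the code g2p.py actually uses. The function name open_text and the choice of UTF-8 are assumptions.

```python
import gzip
import sys


def open_text(name, mode="r"):
    """Open a text file following the "-" and ".gz" conventions."""
    if name == "-":
        # "-" means standard output when writing, standard input otherwise.
        return sys.stdout if "w" in mode else sys.stdin
    if name.endswith(".gz"):
        # gzip.open needs an explicit "t" to work in text mode.
        return gzip.open(name, mode + "t", encoding="utf-8")
    return open(name, mode, encoding="utf-8")
```

For example, open_text("words.txt.gz", "w") returns a handle that compresses what you write, while open_text("-") reads from standard input.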
