Scraping grapheme-to-phoneme data from Wiktionary.
WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronunciation dictionaries mined using this tool.
If you use WikiPron in your research, please cite the following:
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228. [bibtex]
WikiPron requires Python 3.6+. It is available through pip:
pip install wikipron
After installation, the terminal command wikipron will be available. As a basic example, the following command scrapes G2P data for French:
wikipron fra
Specifying the Language
The language is indicated by a three-letter ISO 639-2 or ISO 639-3 language code, e.g., fra for French. To find out which languages can be scraped, see the complete list of languages on Wiktionary that have pronunciation entries.
Specifying the Dialect
One can optionally specify dialects to target using the --dialect flag. The dialect name can be found together with the transcription on Wiktionary; for example, in "(UK, US) IPA: /təˈmɑːtəʊ/" the dialect names are UK and US. To restrict to the union of several dialects, separate them with the pipe character '|': e.g., --dialect='General American | US'. Transcriptions which lack a dialect specification are selected regardless of the value of this flag.
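The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior, not WikiPron's actual implementation; the helper names parse_dialects and is_selected are hypothetical.

```python
def parse_dialects(flag_value):
    """Split a value such as 'General American | US' into a set of names."""
    return {name.strip() for name in flag_value.split("|") if name.strip()}


def is_selected(entry_dialects, wanted):
    """Decide whether one transcription is kept (illustrative logic only).

    entry_dialects: dialect labels attached to the transcription (may be empty).
    wanted: set of dialect names requested by the user.
    """
    if not entry_dialects:
        # No dialect specification: selected regardless of the flag's value.
        return True
    # Union semantics: any one requested dialect is enough.
    return any(label in wanted for label in entry_dialects)


wanted = parse_dialects("General American | US")
print(is_selected(["UK", "US"], wanted))  # True: US was requested
print(is_selected(["UK"], wanted))        # False: no requested dialect
print(is_selected([], wanted))            # True: no dialect specified
```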
By default, the segments library is used to segment the transcription into whitespace-delimited segments. The segmentation tends to place IPA diacritics and modifiers on the "parent" symbol. For instance, [kʰæt] is rendered kʰ æ t. Segmentation can be disabled with a command-line option.
The scraped data is organized with each <word, pronunciation> pair on its own line, where the word and pronunciation are separated by a tab. Note that the pronunciation is in the International Phonetic Alphabet (IPA), segmented by spaces in a way that correctly handles combining and modifier diacritics for modeling purposes; e.g., we have kʰ æ t with the aspirated k instead of k ʰ æ t.
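To make the idea of keeping diacritics on their "parent" symbol concrete, here is a simplified sketch using only the Python standard library. It is not how the segments library works internally; it assumes that combining marks and Unicode modifier letters (category Lm) always attach to the preceding symbol, and it does not handle tie bars or marks that precede their base symbol. The function name segment is hypothetical.

```python
import unicodedata


def segment(transcription):
    """Group each base symbol with the diacritics and modifiers that follow it."""
    segments = []
    for char in transcription:
        attaches = (
            unicodedata.combining(char) != 0       # combining marks, e.g. a tilde
            or unicodedata.category(char) == "Lm"  # modifier letters, e.g. ʰ
        )
        if attaches and segments:
            segments[-1] += char  # glue onto the parent symbol
        else:
            segments.append(char)
    return " ".join(segments)


print(segment("kʰæt"))  # kʰ æ t
```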
For illustration, here is a snippet of French data scraped by WikiPron:
accrémentitielle	a k ʁ e m ɑ̃ t i t j ɛ l
accrescent	a k ʁ ɛ s ɑ̃
accrétion	a k ʁ e s j ɔ̃
accrétions	a k ʁ e s j ɔ̃
By default, the scraped data appears in the terminal. To save the data in a TSV file, please redirect the standard output to a filename of your choice:
wikipron fra > fra.tsv
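Once the data is saved, it can be loaded back for modeling. The following is a minimal sketch, assuming the two-column tab-separated layout described above; the function name read_pron_tsv is hypothetical, and a word may map to several pronunciations.

```python
import csv
import io


def read_pron_tsv(handle):
    """Return a dict mapping each word to a list of segmented pronunciations."""
    lexicon = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        word, pron = row
        # Segments are separated by spaces, so a plain split recovers them.
        lexicon.setdefault(word, []).append(pron.split())
    return lexicon


# In-memory stand-in for a saved file such as fra.tsv.
sample = io.StringIO("cat\tkʰ æ t\ncat\tk æ t\n")
print(read_pron_tsv(sample))
```

For a real file, the same function can be called on a handle opened with open("fra.tsv", encoding="utf-8", newline="").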
The wikipron terminal command has an array of options to configure your scraping run. For a full list of the options, please consult the command's help output.
The underlying module can also be used from Python. A standard workflow looks like:
import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
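As one possible way to fill in the loop body, the sketch below writes the scraped pairs to a TSV file. It uses only the Config(key=...) and scrape(config) calls shown above; the helper names write_pairs and scrape_to_file and the limit parameter are hypothetical, and the live scrape needs network access, so the demonstration at the end uses stand-in pairs.

```python
import io


def write_pairs(pairs, handle, limit=None):
    """Write (word, pron) pairs as tab-separated lines; return the count written."""
    count = 0
    for word, pron in pairs:
        if limit is not None and count >= limit:
            break
        handle.write(f"{word}\t{pron}\n")
        count += 1
    return count


def scrape_to_file(key, path, limit=None):
    """Scrape one language with WikiPron and save it as TSV (needs network)."""
    import wikipron  # imported lazily so write_pairs works without it

    config = wikipron.Config(key=key)
    with open(path, "w", encoding="utf-8") as handle:
        return write_pairs(wikipron.scrape(config), handle, limit=limit)


# Offline demonstration with stand-in pairs instead of a live scrape.
buffer = io.StringIO()
write_pairs([("chat", "ʃ a")], buffer)
print(buffer.getvalue(), end="")
```

Passing a small limit is a convenient way to try a scrape without waiting for a whole language to finish.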
We also make available a database of 2.5 million word/pronunciation pairs mined using WikiPron.
We host grapheme-to-phoneme models and modeling software in a separate repository.
The source code of WikiPron is hosted on GitHub at https://github.com/kylebgorman/wikipron, where development also happens.
To get the latest changes not yet released through pip, or to work on the codebase yourself, you may obtain the latest source code through GitHub and git:
Create a fork of the wikipron repo on your GitHub account.
Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc).
Download and install the library in the "editable" mode together with the core and dev dependencies within the virtual environment:
git clone https://github.com/<your-github-username>/wikipron.git
cd wikipron
pip install --upgrade pip setuptools
pip install -r requirements.txt
pip install --no-deps -e .
We keep track of notable changes in CHANGELOG.md.
For questions, bug reports, and feature requests, please file an issue.
If you would like to contribute to the wikipron codebase, please see the contribution guidelines in the repository.
WikiPron is released under an Apache 2.0 license. Please see LICENSE.txt for details.