
Scraping grapheme-to-phoneme data from Wiktionary


WikiPron


WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronunciation dictionaries mined using this tool.

If you use WikiPron in your research, please cite the following:

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman (2020). Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228. [bibtex]

Command-line tool

Installation

pip install wikipron

Usage

Quick start

After installation, the terminal command wikipron will be available. As a basic example, the following command scrapes G2P data for French:

wikipron fra

Specifying the language

The language is indicated by a three-letter ISO 639-3 language code, e.g., fra for French. To find out which languages can be scraped, see the complete list of languages on Wiktionary that have pronunciation entries.

Specifying the dialect

One can optionally specify dialects to target using the --dialect flag. The dialect name appears together with the transcription on Wiktionary, for example "(UK, US) IPA: /təˈmɑːtəʊ/". To select the union of several dialects, separate them with the pipe character '|': e.g., --dialect='General American | US'. Transcriptions which lack a dialect specification are selected regardless of the value of this flag.
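The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not WikiPron's own implementation; the function names parse_dialects and is_selected are hypothetical.

```python
def parse_dialects(flag_value):
    """Split a --dialect style value such as 'General American | US' into names."""
    return {name.strip() for name in flag_value.split("|") if name.strip()}


def is_selected(entry_dialects, wanted):
    """Decide whether a transcription is kept.

    entry_dialects: dialect labels found next to the transcription (may be empty).
    wanted: set of dialect names requested by the user.
    """
    if not entry_dialects:
        # Transcriptions without any dialect label are always selected.
        return True
    # Otherwise, keep the transcription if it carries at least one wanted label.
    return any(label in wanted for label in entry_dialects)


wanted = parse_dialects("General American | US")
print(is_selected(["UK", "US"], wanted))  # labelled with both UK and US
print(is_selected(["UK"], wanted))        # labelled with UK only
print(is_selected([], wanted))            # no dialect label at all
```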

Specifying the transcription level

By default, WikiPron selects broad transcriptions written between slashes /like this/. One can instead select narrow transcriptions written in square brackets [like this] using the --narrow flag. Note that some languages only have broad or narrow transcriptions (e.g., Russian only has the latter).

Segmentation

By default, the segments library is used to split the transcription into whitespace-separated segments. The segmentation tends to place IPA diacritics and modifiers on the "parent" symbol. For instance, [kʰæt] is rendered kʰ æ t. This can be disabled using the --no-segment flag.
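To make the idea of attaching diacritics to a "parent" symbol concrete, here is a deliberately naive sketch that uses only the standard library. It is not the segments library and not what WikiPron runs; the function naive_segment is hypothetical, and it ignores cases the real segmenter has to deal with (for example tie bars, or stress marks, which are also modifier letters in Unicode).

```python
import unicodedata


def naive_segment(transcription):
    """Attach combining marks and modifier letters to the preceding symbol."""
    segments = []
    for char in transcription:
        category = unicodedata.category(char)
        # "Mn" = nonspacing combining mark (e.g., the nasalization tilde),
        # "Lm" = modifier letter (e.g., the aspiration mark or length mark).
        if segments and category in ("Mn", "Lm"):
            segments[-1] += char
        else:
            segments.append(char)
    return " ".join(segments)


print(naive_segment("kʰæt"))
```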

Parentheses

Some transcriptions contain parentheses to indicate optional sounds (e.g., English A&E /eɪ.ən(d)ˈiː/). The --parens flag controls how they are handled: expand (default) generates all variants, skip removes parentheses and their content, and show keeps parentheses as-is in the output.
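The three behaviors can be illustrated with a small standalone function. This is a sketch of the described semantics under the assumption that parentheses are not nested; it is not WikiPron's code, and handle_parens is a hypothetical name.

```python
import itertools
import re

# Matches one parenthesized group without nested parentheses.
_PARENS = re.compile(r"\(([^()]*)\)")


def handle_parens(pron, mode="expand"):
    """Return a list of pronunciation strings according to the chosen mode."""
    if mode == "show":
        # Keep parentheses as-is.
        return [pron]
    if mode == "skip":
        # Remove parentheses together with their content.
        return [_PARENS.sub("", pron)]
    if mode == "expand":
        # With one capture group, split() alternates: literal, optional, literal, ...
        parts = _PARENS.split(pron)
        literals, optionals = parts[0::2], parts[1::2]
        variants = []
        # Try every combination of dropping or keeping each optional piece.
        for choices in itertools.product((False, True), repeat=len(optionals)):
            pieces = [literals[0]]
            for keep, optional, literal in zip(choices, optionals, literals[1:]):
                pieces.append(optional if keep else "")
                pieces.append(literal)
            variants.append("".join(pieces))
        return variants
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("expand", "skip", "show"):
    print(mode, handle_parens("eɪ.ən(d)ˈiː", mode))
```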

Output

The scraped data is organized with each <word, pronunciation> pair on its own line, where the word and pronunciation are separated by a tab. The pronunciation is in the International Phonetic Alphabet (IPA) and is segmented by spaces. For modeling purposes, the segmentation keeps combining and modifier diacritics attached to their base symbol, e.g., we have kʰ æ t with the aspirated k instead of k ʰ æ t.

For illustration, here is a snippet of French data scraped by WikiPron:

accrémentitielle    a k ʁ e m ɑ̃ t i t j ɛ l
accrescent  a k ʁ ɛ s ɑ̃
accrétion   a k ʁ e s j ɔ̃
accrétions  a k ʁ e s j ɔ̃

By default, the scraped data appears in the terminal. To save the data in a TSV file, please redirect the standard output to a filename of your choice:

wikipron fra > fra.tsv
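A file saved this way can be read back with a few lines of Python. The following is a minimal sketch that assumes the default output format described above (one tab between word and pronunciation, spaces between segments); read_pairs is a hypothetical helper, not part of WikiPron.

```python
def read_pairs(lines):
    """Yield (word, segments) tuples from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        word, tab, pron = line.partition("\t")
        if not tab:
            # Skip blank or malformed lines that contain no tab.
            continue
        yield word, pron.split(" ")


# Two lines in the same shape as the French snippet above.
sample = [
    "accrescent\ta k ʁ ɛ s ɑ̃\n",
    "accrétion\ta k ʁ e s j ɔ̃\n",
]
for word, segments in read_pairs(sample):
    print(word, segments)
```

To read a real file, pass an open file handle instead of the sample list, e.g., read_pairs(open("fra.tsv", encoding="utf-8")).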

Advanced options

The wikipron terminal command has an array of options to configure your scraping run. For a full list of the options, please run wikipron -h.

Python API

The underlying module can also be used from Python. A standard workflow looks like:

import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
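As one possible way to fill in the loop body, the pairs can be collected into a dictionary that maps each word to its pronunciations. The sketch below is independent of WikiPron itself: build_lexicon is a hypothetical helper that accepts any iterable of (word, pronunciation) pairs, and the demo data is made up for illustration.

```python
from collections import defaultdict


def build_lexicon(pairs):
    """Group (word, pronunciation) pairs into {word: [pronunciations]}."""
    lexicon = defaultdict(list)
    for word, pron in pairs:
        # Keep each distinct pronunciation once, in order of first appearance.
        if pron not in lexicon[word]:
            lexicon[word].append(pron)
    return dict(lexicon)


# With WikiPron installed and network access, the pairs could come from:
#   config = wikipron.Config(key="fra")
#   lexicon = build_lexicon(wikipron.scrape(config))
# Here, made-up placeholder pairs stand in for scraped data.
demo_pairs = [("word1", "a b"), ("word1", "a c"), ("word2", "d"), ("word1", "a b")]
print(build_lexicon(demo_pairs))
```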

Data

We also make available a database of over 3 million word/pronunciation pairs mined using WikiPron.

Models

We host grapheme-to-phoneme models and modeling software in a separate repository.

Development

Repository

The source code of WikiPron is hosted on GitHub at https://github.com/CUNY-CL/wikipron, where development also happens.

To get the latest changes not yet released through pip, or to work on the codebase yourself, you may obtain the latest source code through GitHub and git:

  1. Create a fork of the wikipron repo on your GitHub account.

  2. Clone from your fork:

    git clone https://github.com/<your-github-username>/wikipron.git
    cd wikipron
    
  3. Set up a Python virtual environment. We recommend using uv:

    uv python install 3.14
    uv venv --python 3.14
    source .venv/bin/activate
    
  4. Install WikiPron in the "editable" mode together with the core and dev dependencies:

    uv pip install -e ".[dev]"
    

We keep track of notable changes in CHANGELOG.md.

Contributing

For questions, bug reports, and feature requests, please file an issue.

If you would like to contribute to the wikipron codebase, please see CONTRIBUTING.md.

License

WikiPron is released under an Apache 2.0 license. Please see LICENSE.txt for details.

Please note that Wiktionary data in the data/ directory has its own licensing terms.
