Scraping grapheme-to-phoneme data from Wiktionary.
WikiPron
WikiPron is a command line toolkit for scraping grapheme-to-phoneme (G2P) data from Wiktionary.
Installation
WikiPron requires Python 3.6+. It is available through pip:
pip install wikipron
Usage
Quick Start
After installation, the terminal command wikipron will be available.
As a basic example, the following command scrapes G2P data for French:
wikipron fra
Specifying the Language
The language is indicated by a three-letter ISO 639-2 or ISO 639-3 language code, e.g., fra for French.
For the languages that can be scraped, see the complete list of languages on Wiktionary that have pronunciation entries.
Output
The scraped data is organized with each <word, pronunciation> pair on its
own line, where the word and pronunciation are separated by a tab.
Note that the pronunciation is in the International Phonetic Alphabet (IPA), segmented by spaces in a way that keeps combining and modifier diacritics attached to their base symbols for modeling purposes. For example, we have kʰ æ t, with the aspirated k as a single segment, instead of k ʰ æ t.
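To make the segmentation idea concrete, here is a minimal sketch of diacritic-aware splitting. It is not WikiPron's actual implementation, which this document does not describe. The function name segment_ipa is hypothetical, and the rule it applies (attach Unicode combining marks and modifier letters to the preceding symbol) is a simplifying assumption that does not cover cases such as tie bars or stress marks.

```python
import unicodedata


def segment_ipa(transcription: str) -> str:
    """Split an IPA string into space-separated segments.

    Simplifying assumption: any combining mark (category Mn) or
    modifier letter (category Lm) belongs to the symbol before it.
    """
    segments = []
    for char in transcription:
        attaches = unicodedata.category(char) in ("Mn", "Lm")
        if attaches and segments:
            # Keep the diacritic on the preceding base symbol.
            segments[-1] += char
        else:
            segments.append(char)
    return " ".join(segments)


print(segment_ipa("kʰæt"))  # prints "kʰ æ t"
```

A naive character-by-character split would instead yield k ʰ æ t, separating the aspiration mark from its consonant.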
For illustration, here is a snippet of French data scraped by WikiPron:
accrémentitielle a k ʁ e m ɑ̃ t i t j ɛ l
accrescent a k ʁ ɛ s ɑ̃
accrétion a k ʁ e s j ɔ̃
accrétions a k ʁ e s j ɔ̃
By default, the scraped data appears in the terminal. To save the data in a TSV file, please redirect the standard output to a filename of your choice:
wikipron fra > fra.tsv
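Once saved, the TSV file can be read back with a few lines of Python. The sketch below assumes the format described above (one tab-separated word and pronunciation per line, with pronunciation segments separated by spaces). The helper name read_pron_tsv is hypothetical and not part of WikiPron.

```python
from typing import Iterator, List, Tuple


def read_pron_tsv(path: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (word, segments) pairs from a WikiPron-style TSV file."""
    with open(path, encoding="utf-8") as source:
        for line in source:
            line = line.rstrip("\n")
            if not line:
                continue  # Skip blank lines.
            # The word and its pronunciation are separated by one tab.
            word, pron = line.split("\t", 1)
            # Segments within the pronunciation are separated by spaces.
            yield word, pron.split()
```

For example, list(read_pron_tsv("fra.tsv")) would produce one (word, segments) tuple per scraped entry.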
Advanced Options
The wikipron terminal command has an array of options to configure your scraping run. For a full list of the options, please run wikipron -h.
Python API
The underlying module can also be used from Python. A standard workflow looks like:
import wikipron

config = wikipron.Config(key="fra")  # French, with default options.
for word, pron in wikipron.scrape(config):
    ...
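As one way to fill in the loop body, the sketch below writes the scraped pairs to a TSV file in the same layout the command line tool prints. It uses only the Config and scrape calls shown above. The helpers write_tsv and scrape_to_tsv are hypothetical names, and the code assumes each pronunciation arrives as a ready-to-write string.

```python
from typing import Iterable, Tuple


def write_tsv(pairs: Iterable[Tuple[str, str]], path: str) -> int:
    """Write (word, pronunciation) pairs as tab-separated lines.

    Returns the number of pairs written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as sink:
        for word, pron in pairs:
            sink.write(f"{word}\t{pron}\n")
            count += 1
    return count


def scrape_to_tsv(key: str, path: str) -> int:
    """Scrape one language with default options and save it to path."""
    # Imported here so write_tsv stays usable without wikipron installed.
    import wikipron

    config = wikipron.Config(key=key)
    return write_tsv(wikipron.scrape(config), path)
```

Calling scrape_to_tsv("fra", "fra.tsv") would then be roughly equivalent to running wikipron fra > fra.tsv, and it requires network access to Wiktionary.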
Development
The source code of WikiPron is hosted on GitHub at https://github.com/kylebgorman/wikipron, where development also happens.
To get the latest changes not yet released through pip, or to work on the codebase yourself, you may obtain the latest source code through GitHub and git:
- Create a fork of the wikipron repo on your GitHub account.
- Locally, make sure you are in some sort of a virtual environment (venv, virtualenv, conda, etc.).
- Download and install the library in "editable" mode, together with the core and dev dependencies, within the virtual environment:
  git clone https://github.com/<your-github-username>/wikipron.git
  cd wikipron
  pip install --upgrade pip setuptools
  pip install -r requirements.txt
  pip install --no-deps -e .
We keep track of notable changes in CHANGELOG.md.
Contribution
For questions, bug reports, and feature requests, please file an issue.
If you would like to contribute to the wikipron codebase, please see CONTRIBUTING.md.
License
Apache 2.0. Please see LICENSE.txt for details.
Please note that Wiktionary data has its own licensing terms, as does the other data in the languages/ subdirectory.