Wikidata and Wikipedia language data extraction

Scribe-Data contains the scripts for extracting and formatting language data from Wikidata and Wikipedia. Updates to the data can be done using scribe_data/wikidata/update_data.py and the notebooks within the scribe_data/load directory.

[!NOTE]
The contributing section has information for those interested, and the articles and presentations in Featured By are also good resources for learning more about Scribe.

Scribe applications are available on iOS, Android (WIP) and Desktop (planned).

Check out Scribe's architecture diagrams for an overview of the organization including our applications, services and processes. It depicts the projects that Scribe is developing as well as the relationships between them and the external systems with which they interact. Also check out the Wikidata and Scribe Guide for an overview of Wikidata and querying language data from it.

Process

scribe_data/wikidata/update_data.py and the notebooks within the various scribe_data directories are used to update all data for Scribe-iOS, with this functionality later being expanded to update Scribe-Android and Scribe-Desktop when they're active.

The main data update process in update_data.py runs SPARQL queries against Wikidata through SPARQLWrapper to retrieve language data. The autosuggestion process derives popular words from Wikipedia, along with the words that most often follow them, as an effective baseline feature until natural language processing methods are employed. The functions that generate autosuggestions are run in gen_autosuggestions.ipynb. Emojis are further sourced from Unicode CLDR, with this process being run in gen_emoji_lexicon.ipynb.
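The autosuggestion idea described above (popular words plus the words that usually follow them) can be sketched with simple bigram counts. This is a minimal illustration and not the code in gen_autosuggestions.ipynb: the function name `build_autosuggestions`, its parameters and the sample sentence are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import re


def build_autosuggestions(text, num_words=3, num_suggestions=2):
    """Map the most frequent words in `text` to the words that most often follow them.

    Hypothetical helper for illustration only; not part of Scribe-Data.
    """
    # Lowercase and keep runs of letters only (a simplification for this sketch).
    tokens = re.findall(r"[^\W\d_]+", text.lower())

    # Count how often each word occurs and which words directly follow it.
    word_counts = Counter(tokens)
    followers = defaultdict(Counter)
    for current_word, next_word in zip(tokens, tokens[1:]):
        followers[current_word][next_word] += 1

    # For each popular word, keep its most common followers as suggestions.
    return {
        word: [follower for follower, _ in followers[word].most_common(num_suggestions)]
        for word, _ in word_counts.most_common(num_words)
    }


sample = "the cat sat on the mat and the cat ate the fish"
print(build_autosuggestions(sample, num_words=2, num_suggestions=2))
# {'the': ['cat', 'mat'], 'cat': ['sat', 'ate']}
```

A real corpus would need language-aware tokenization and filtering, which this sketch leaves out.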

Running update_data.py is done via the following CLI command:

python3 src/scribe_data/wikidata/update_data.py
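For a sense of what the Wikidata half of this process involves, the sketch below shows one way to send a SPARQL query with SPARQLWrapper and flatten the JSON results. It is an assumption-laden illustration rather than the contents of update_data.py: the names `run_query`, `flatten_bindings` and `EXAMPLE_QUERY` are hypothetical, the query is not one of Scribe-Data's own, and the sample response is hand-written so that the example runs offline.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query (not taken from Scribe-Data): five German noun lexemes.
# Q188 is the German language and Q1084 is the lexical category "noun".
EXAMPLE_QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q188 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma .
}
LIMIT 5
"""


def run_query(query, endpoint=WIKIDATA_ENDPOINT):
    """Send `query` to `endpoint` and return the decoded JSON response."""
    # Imported lazily so the parsing helper below works without the dependency.
    from SPARQLWrapper import JSON, SPARQLWrapper

    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(query)
    return sparql.query().convert()


def flatten_bindings(response):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {variable: cell["value"] for variable, cell in binding.items()}
        for binding in response["results"]["bindings"]
    ]


# Made-up response in the SPARQL JSON results format (placeholder lexeme IDs).
sample_response = {
    "head": {"vars": ["lexeme", "lemma"]},
    "results": {
        "bindings": [
            {
                "lexeme": {"type": "uri", "value": "http://www.wikidata.org/entity/L_EXAMPLE_1"},
                "lemma": {"type": "literal", "xml:lang": "de", "value": "Haus"},
            },
            {
                "lexeme": {"type": "uri", "value": "http://www.wikidata.org/entity/L_EXAMPLE_2"},
                "lemma": {"type": "literal", "xml:lang": "de", "value": "Buch"},
            },
        ]
    },
}

print(flatten_bindings(sample_response))
```

With network access and SPARQLWrapper installed, `flatten_bindings(run_query(EXAMPLE_QUERY))` would apply the same flattening to a live response.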

The ultimate goal is that this repository will house language packs that are periodically updated with new Wikidata lexicographical data and data from other sources. These packs would then be available to download by users of Scribe applications.

Contributing

Public Matrix Chat

Scribe uses Matrix for communications. You're more than welcome to join us in our public chat rooms to share ideas, ask questions or just say hi :)

Please see the contribution guidelines and Wikidata and Scribe Guide if you are interested in contributing to Scribe-Data. Work that is in progress or could be implemented is tracked in the issues and projects.

[!NOTE]
Just because an issue is assigned on GitHub doesn't mean that the team isn't interested in your contribution! Feel free to write in the issues and we can potentially reassign it to you.

Those interested can further check the -next release- and -priority- labels in the issues for those that are most important, as well as those marked good first issue, which are tailored for first-time contributors.

After your first few pull requests, organization members would be happy to discuss granting you further rights as a contributor, with a maintainer role then being possible after continued interest in the project. Scribe seeks to be an inclusive and supportive organization. We'd love to have you on the team!

Ways to Help

Road Map

The Scribe road map can be followed in the organization's project board where we list the most important issues along with their priority, status and an indication of which sub projects they're included in (if applicable).

[!NOTE]
Consider joining our bi-weekly developer syncs!

Data Edits

[!NOTE]
Please see the Wikidata and Scribe Guide for an overview of Wikidata and how Scribe uses it.

Scribe does not accept direct edits to the grammar JSON files as they are sourced from Wikidata. Edits can be discussed, and the queries themselves will be changed and run before an update. If there is a problem with one of the files, then the fix should be made on Wikidata and not on Scribe. Feel free to let us know that edits have been made by opening a data issue and we'll be happy to integrate them!

Environment Setup

[!IMPORTANT]

Suggested IDE extensions

VS Code

The development environment for Scribe-Data can be installed via the following steps:

  1. Fork the Scribe-Data repo, clone your fork, and configure the remotes:

[!NOTE]

Consider using SSH

As an alternative to the HTTPS URLs used in the instructions below, consider using SSH to interact with GitHub from the terminal. SSH lets you connect without going through a username and password authentication flow.

To run git commands with SSH, remember then to substitute the HTTPS URL, https://github.com/..., with the SSH one, git@github.com:....

  • e.g. Cloning now becomes git clone git@github.com:<your-username>/Scribe-Data.git

GitHub also has their documentation on how to Generate a new SSH key 🔑

# Clone your fork of the repo into the current directory.
git clone https://github.com/<your-username>/Scribe-Data.git
# Navigate to the newly cloned directory.
cd Scribe-Data
# Assign the original repo to a remote called "upstream".
git remote add upstream https://github.com/scribe-org/Scribe-Data.git
  • Now, if you run git remote -v you should see two remote repositories named:
    • origin (forked repository)
    • upstream (Scribe-Data repository)
  2. Use Python venv to create the local development environment within your Scribe-Data directory:
  • On Unix or macOS, run:

    python3 -m venv venv  # make an environment named venv
    source venv/bin/activate # activate the environment
    
  • On Windows (using Command Prompt), run:

    python -m venv venv
    venv\Scripts\activate.bat
    

After activating the virtual environment, install the required dependencies and set up pre-commit by running:

pip install --upgrade pip  # make sure that pip is at the latest version
pip install -r requirements.txt  # install dependencies
pip install -e .  # install the local version of Scribe-Data
pre-commit install  # install pre-commit hooks
# pre-commit run --all-files  # lint and fix common problems in the codebase
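
If it is unclear whether the virtual environment was actually activated before installing, a quick check from Python can help. This is a small optional sketch and not part of the Scribe-Data setup itself; the function name `in_virtual_environment` is made up for the example.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the environment, while sys.base_prefix
    # still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_environment())
print("interpreter:", sys.executable)
```

When the environment is active, the interpreter path printed should sit inside the venv directory created in the previous step.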

[!NOTE] Feel free to contact the team in the Data room on Matrix if you're having problems getting your environment set up!

Supported Languages

Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the language_data_extraction directory for queries for currently supported languages and those that have substantial data on Wikidata.

The following table shows the supported languages and the amount of data available for each on Wikidata and via Unicode CLDR for emojis:

| Languages | Nouns | Verbs | Translations* | Prepositions† | Emoji Keywords |
| --- | --- | --- | --- | --- | --- |
| French | 18,044 | 6,574 | 67,652 | - | 2,488 |
| German | 194,687 | 3,634 | 67,652 | 215 | 2,898 |
| Italian | 59,191 | 7,649 | 67,652 | - | 2,457 |
| Portuguese | 5,268 | 538 | 67,652 | - | 2,327 |
| Russian | 194,567 | 15 | 67,652 | 15 | 3,827 |
| Spanish | 61,650 | 7,912 | 67,652 | - | 3,134 |
| Swedish | 47,007 | 4,678 | 67,652 | - | 2,913 |

* Given the current beta status where words are machine translated.

† Only for languages for which preposition annotation is needed.

Featured By

Articles and Presentations on Scribe

2024

2023

2022



Powered By

Contributors

Many thanks to all the Scribe-Data contributors! 🚀

Blog posts

List of referenced posts

Wikimedia Communities



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution

scribe_data-3.3.0-py3-none-any.whl (46.4 MB view details)

Uploaded Python 3

File details

Details for the file scribe_data-3.3.0-py3-none-any.whl.

File metadata

  • Download URL: scribe_data-3.3.0-py3-none-any.whl
  • Upload date:
  • Size: 46.4 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.5

File hashes

Hashes for scribe_data-3.3.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 0c561efcc7b4e5b92efa6e44926b0a7cae2820580c86d93c7fccc9d30e7a0bce |
| MD5 | 1a0b2132504b7edbaef5bc0479f8867e |
| BLAKE2b-256 | 84b83ff906c1c09293d443604dbfe9bfc087b779f693cf8e35095dd5c4cd24d5 |

See more details on using hashes here.
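
As a rough illustration of how a published digest can be checked locally, the sketch below hashes a downloaded file with Python's standard hashlib module and compares the result to the SHA256 value listed above. The helper names `file_digest` and `matches_published_hash` are made up for this example, and the wheel is assumed to sit in the current directory.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large wheels are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published hex digest."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Assumed local filename; the digest is the SHA256 value from the listing above.
WHEEL = "scribe_data-3.3.0-py3-none-any.whl"
PUBLISHED_SHA256 = "0c561efcc7b4e5b92efa6e44926b0a7cae2820580c86d93c7fccc9d30e7a0bce"

if os.path.exists(WHEEL):
    print("SHA256 matches:", matches_published_hash(WHEEL, PUBLISHED_SHA256))
else:
    print(f"{WHEEL} not found; download it first to run the check.")
```

pip can perform the same kind of check itself when hashes are pinned in a requirements file and the install is run in hash-checking mode.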
