
Wikidata and Wikipedia data extraction for Scribe applications


    This repository contains the scripts for extracting and formatting data from Wikidata and Wikipedia for Scribe applications. The language keyboard and interface data can be updated using scribe_data/load/update_data.py.

    Process

    scribe_data/load/update_data.py updates all data for Scribe-iOS; this functionality will later be expanded to update Scribe-Android and Scribe-Desktop once they are active. The autosuggestion process further derives popular words from Wikipedia, along with the words that typically follow them, to provide an effective baseline feature until natural language processing techniques are employed. The functions that generate autosuggestions are run in scribe_data/load/gen_autosuggestions.ipynb.
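    The autosuggestion idea described above (popular words, plus the words that usually follow them) can be illustrated with a small bigram count. This is only a sketch under stated assumptions: the function name sketch_autosuggestions, its parameters, and the tokenization rule are hypothetical and are not taken from gen_autosuggestions.ipynb.

```python
import re
from collections import Counter, defaultdict


def sketch_autosuggestions(text, num_words=3, num_suggestions=2):
    """Map the most common words in `text` to the words that most often follow them.

    Hypothetical helper for illustration; not the Scribe-Data implementation.
    """
    # Lowercase and keep runs of letters only (assumed tokenization rule).
    tokens = re.findall(r"[^\W\d_]+", text.lower())

    # Count how often each word occurs overall.
    word_counts = Counter(tokens)

    # For each word, count the words that come directly after it.
    followers = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        followers[current][following] += 1

    # Keep the most popular words and their most frequent followers.
    return {
        word: [w for w, _ in followers[word].most_common(num_suggestions)]
        for word, _ in word_counts.most_common(num_words)
    }


example = sketch_autosuggestions(
    "the cat sat on the mat and the cat ran", num_words=2
)
print(example)
```

    In practice the input would be a large Wikipedia text sample rather than a single sentence, and stop-word handling or minimum-frequency thresholds would likely be needed; those choices are left open here.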

    The ultimate goal is that this repository will house language packs that are periodically updated with new Wikidata lexicographical data, with these packs then being available to download by users of Scribe applications.

    Contributing

    Work that is in progress or could be implemented is tracked in the issues. Please see the contribution guidelines if you are interested in contributing to Scribe-Data. Also check the -priority- labels in the issues for the most important ones, as well as those marked good first issue, which are tailored for first-time contributors.

    Ways to Help

    Data Edits

    Scribe does not accept direct edits to the grammar JSON files, as they are sourced from Wikidata. Edits can be discussed, and the queries themselves will be changed and run before an update. If there is a problem with one of the files, then the fix should be made on Wikidata and not on Scribe. Feel free to let us know that edits have been made by opening a data issue and we'll be happy to integrate them!

    Supported Languages

    Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the extract_transform directory for queries for currently supported languages and those that have substantial data on Wikidata.
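    To give a sense of what a Wikidata lexeme query can look like, the sketch below builds a SPARQL string for lexemes of one language and lexical category. It is an illustration only and is not copied from the extract_transform directory: the helper name build_lexeme_query is hypothetical, and the example identifiers (Q188 for German, Q1084 for noun) are assumptions that should be checked on Wikidata before use.

```python
def build_lexeme_query(language_qid, category_qid, limit=10):
    """Return a SPARQL query string for Wikidata lexemes.

    Hypothetical helper for illustration. The dct:, wikibase: and wd:
    prefixes are predefined by the Wikidata Query Service.
    """
    return f"""
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory wd:{category_qid} ;
          wikibase:lemma ?lemma .
}}
LIMIT {int(limit)}
""".strip()


# Assumed identifiers: Q188 (German) and Q1084 (noun).
german_nouns_query = build_lexeme_query("Q188", "Q1084", limit=5)
print(german_nouns_query)
```

    The resulting string can be pasted into the Wikidata Query Service to inspect the data that is available for a language.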

    The following table shows the supported languages and the amount of data available for each on Wikidata:

    Languages       Nouns    Verbs   Translations*   Prepositions†
    French         16,681    1,545          67,652               -
    German         29,230    3,542          67,652             187
    Italian         8,399       73          67,652               -
    Portuguese      5,176      495          67,652               -
    Russian       194,408       11          67,652              13
    Spanish        24,656    3,792          67,652               -
    Swedish        42,718    4,394          67,652               -

    * Given the current beta status where words are machine translated.

    † Only for languages for which preposition annotation is needed.



    Built Distribution

    scribe_data-2.0.0-py3-none-any.whl (70.4 kB view hashes)

    Uploaded Python 3
