Wikidata and Wikipedia data extraction for Scribe applications
This repository contains the scripts for extracting and formatting data from Wikidata and Wikipedia for Scribe applications. Updates to the language keyboard and interface data can be done using scribe_data/load/update_data.py.
Process ⇧
scribe_data/load/update_data.py is used to update all data for Scribe-iOS. This functionality will later be expanded to update Scribe-Android and Scribe-Desktop once they are active. The autosuggestion process additionally derives popular words from Wikipedia, along with the words that most often follow them, to provide an effective baseline feature until natural language processing techniques are employed. The functions that generate autosuggestions are run in scribe_data/load/gen_autosuggestions.ipynb.
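The autosuggestion idea described above (popular words paired with the words that usually follow them) can be illustrated with a minimal sketch. This is not the code in gen_autosuggestions.ipynb: the function names `tokenize` and `sketch_autosuggestions`, the parameters, and the tokenization rule are all assumptions made for illustration, and the input is a small list of strings rather than a Wikipedia dump.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def sketch_autosuggestions(texts, num_words=3, num_suggestions=2):
    """Hypothetical helper: map the most frequent words to their most common followers.

    Returns a dict whose keys are the `num_words` most frequent words and whose
    values are lists of up to `num_suggestions` words that most often come next.
    """
    word_counts = Counter()
    followers = defaultdict(Counter)

    for text in texts:
        tokens = tokenize(text)
        word_counts.update(tokens)
        # Count each (word, next word) pair within the same text.
        for current, following in zip(tokens, tokens[1:]):
            followers[current][following] += 1

    return {
        word: [w for w, _ in followers[word].most_common(num_suggestions)]
        for word, _ in word_counts.most_common(num_words)
    }


if __name__ == "__main__":
    sample = ["the cat sat on the mat", "the cat ate the fish"]
    print(sketch_autosuggestions(sample, num_words=2, num_suggestions=1))
```

A frequency-based bigram count like this needs no trained model, which is presumably why it works as a baseline before natural language processing techniques are introduced.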
The ultimate goal is that this repository will house language packs that are periodically updated with new Wikidata lexicographical data, with these packs then being available to download by users of Scribe applications.
Contributing ⇧
Work that is in progress or could be implemented is tracked in the issues and projects. Please see the contribution guidelines if you are interested in contributing to Scribe-Data. Also check the -priority- labels in the issues for those that are most important, as well as those marked good first issue that are tailored for first-time contributors.
After your first few pull requests, organization members would be happy to discuss granting you further rights as a contributor, with a maintainer role then being possible after continued interest in the project. Scribe seeks to be an inclusive and supportive organization. We'd love to have you on the team!
Ways to Help ⇧
- Reporting bugs as they're found 🐞
- Working on new features ✨
- Documentation for onboarding and project cohesion 📝
- Adding language data to Scribe-Data via Wikidata! 🗃️
Road Map ⇧
The Scribe road map can be followed in the organization's project board, where we list the most important issues along with their priority, status and an indication of which sub-projects they're included in (if applicable).
Data Edits ⇧
Scribe does not accept direct edits to the grammar JSON files, as they are sourced from Wikidata. Edits can be discussed, and the queries themselves will be changed and rerun before an update. If there is a problem with one of the files, then the fix should be made on Wikidata and not on Scribe. Feel free to let us know that edits have been made by opening a data issue and we'll be happy to integrate them!
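To show why a fix made on Wikidata flows into the data once the queries are rerun, here is a rough sketch of querying Wikidata lexemes from Python. It is not one of the repository's actual queries: the helper names `build_noun_query` and `fetch_noun_lemmas`, the `User-Agent` string, and the choice of fields are assumptions for illustration. The endpoint is the public Wikidata Query Service, and Q188 (German), Q150 (French) and Q1084 (noun) are Wikidata item identifiers.

```python
import json
import re
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
LANGUAGE_QIDS = {"German": "Q188", "French": "Q150"}  # illustrative subset
NOUN_QID = "Q1084"


def build_noun_query(language_qid, limit=10):
    """Hypothetical helper: build a SPARQL query for noun lexemes in one language."""
    if not re.fullmatch(r"Q\d+", language_qid):
        raise ValueError(f"Not a Wikidata item ID: {language_qid!r}")
    return f"""
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory wd:{NOUN_QID} ;
          wikibase:lemma ?lemma .
}}
LIMIT {int(limit)}
""".strip()


def fetch_noun_lemmas(language_qid, limit=10):
    """Hypothetical helper: run the query and return the lemma strings (needs network)."""
    params = urllib.parse.urlencode(
        {"query": build_noun_query(language_qid, limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_SPARQL_ENDPOINT}?{params}",
        headers={"User-Agent": "scribe-data-readme-example/0.1"},  # assumed value
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # Standard SPARQL JSON results: one binding per row, keyed by variable name.
    return [row["lemma"]["value"] for row in data["results"]["bindings"]]


if __name__ == "__main__":
    print(build_noun_query(LANGUAGE_QIDS["German"], limit=5))
```

Because the results come straight from Wikidata at query time, a correction made there shows up the next time the query runs, whereas a manual edit to a generated JSON file would be overwritten.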
Supported Languages ⇧
Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the extract_transform directory for queries for currently supported languages and those that have substantial data on Wikidata.
The following table shows the supported languages and the amount of data available for each on Wikidata:
Languages | Nouns | Verbs | Translations* | Prepositions† |
---|---|---|---|---|
French | 16,815 | 5,450 | 67,652 | - |
German | 29,272 | 3,557 | 67,652 | 187 |
Italian | 8,646 | 73 | 67,652 | - |
Portuguese | 5,191 | 495 | 67,652 | - |
Russian | 194,419 | 11 | 67,652 | 13 |
Spanish | 27,128 | 4,036 | 67,652 | - |
Swedish | 42,807 | 4,394 | 67,652 | - |
* Given the current beta status where words are machine translated.
† Only for languages for which preposition annotation is needed.
Featured By ⇧
Articles and Presentations on Scribe
2022
- Presentation slides for a session at the 2022 Wikimania Hackathon
- Presentation slides for a talk with CocoaHeads Berlin
- Video on Scribe for Wikimedia Celtic Knot 2022
- Presentation slides for a talk with the LD4 Wikidata Affinity Group
- Scribe featured for new developers on MediaWiki
- Presentation slides for Wikimedia Hackathon 2022
- Blog post on Scribe-iOS for Wikimedia Tech News (DE / Tweet)
- Presentation slides for Wikidata Data Reuse Days 2022
Powered By
Contributors
Many thanks to all the Scribe-Data contributors! 🚀
Built Distribution
Hashes for scribe_data-2.1.1-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | f803ec2189f2ff73375696e7f8478c5344fb99ae895963d76bf9abf5374d23c2 |
MD5 | 7a55ec2dae4fec31c1be5181bbfd666a |
BLAKE2b-256 | 11f3d7258e13617ed74f3b97d117e8ba5e001c6bce6b828dc4fb9467c15021ed |