Cervantes
Extract a compact Spanish dictionary from Wikcionario, with elegance
In Spanish literature - and all over the world - the works of Miguel de Cervantes are considered masterpieces (obras maestras) because of their stylistic 🦋 elegance, witty remarks and humanistic depth of thought...
...which is why I wish to dedicate this project to his memory: more precisely, this is my library for creating a customized, Wiktionary-based corpus of the Spanish language.
Cervantes is a type-checked library for Python, built on top of WikiPrism, focusing on:
- Parsing an XML dump of Wikcionario and extracting Spanish terms from each wiki page
- Classifying each term according to a set of grammar categories
- Providing a Spanish-related Dictionary, backed by a SQLite db, that can be used for custom analysis via SQL queries
Despite its sophisticated regex-based engine, Cervantes has a minimalist programming interface; furthermore, it is designed to be a core plugin of Jardinero - which makes it extremely simple to use, via a web-application user interface.
No matter the scenario, it is essential to explore the SQL schema of its underlying database: for details, please consult the sections below.
Installation
To install Cervantes, just run:
pip install info.gianlucacosta.cervantes
or, if you are using Poetry:
poetry add info.gianlucacosta.cervantes
Extracting a custom dictionary from Wikcionario
Once Cervantes is installed in your Python environment, you can import it just like any other Python library...
...or you can run it as an extension module within Jardinero's infrastructure! 🥳
Just make sure that both Jardinero and Cervantes are installed, then run:
python -OO -m info.gianlucacosta.jardinero info.gianlucacosta.cervantes
You will then be able to create a dictionary and perform SQL queries via Jardinero's web user interface.
Cervantes also supports Jardinero's developer mode: in that case, the system will refer to your local copy of Wikcionario - which must be a BZ2 archive residing at the following address:
http://localhost:8000/eswiktionary-latest-pages-articles.xml.bz2
Usually, you can make this URL available by running:
python -m http.server
from within the directory containing your Wikcionario dump file.
For a more detailed explanation about the developer mode, please refer to Jardinero's documentation.
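Before starting developer mode, it can be handy to check that the local URL really serves a readable BZ2 archive. The following is only a minimal sketch based on the standard library: the names peek_bz2 and peek_dump, as well as the chunk size, are illustrative choices and not part of Cervantes or Jardinero.

```python
import bz2
from typing import BinaryIO
from urllib.request import urlopen

# The address expected by the developer mode, as stated above
DUMP_URL = "http://localhost:8000/eswiktionary-latest-pages-articles.xml.bz2"


def peek_bz2(stream: BinaryIO, wanted_bytes: int = 200) -> bytes:
    """Decompress just enough of a BZ2 stream to return its first bytes."""
    decompressor = bz2.BZ2Decompressor()
    output = b""

    while len(output) < wanted_bytes and not decompressor.eof:
        chunk = stream.read(64 * 1024)
        if not chunk:
            break  # the stream ended before enough data was available
        output += decompressor.decompress(chunk)

    return output[:wanted_bytes]


def peek_dump(url: str = DUMP_URL) -> str:
    """Fetch the beginning of the dump; requires the local server to be up."""
    with urlopen(url) as response:
        return peek_bz2(response).decode("utf-8", errors="replace")
```

Calling peek_dump() while python -m http.server is running should return the first characters of the XML dump; an error at this stage usually means that the file name or the port does not match the expected address.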
Database schema
Every table in the database created by Cervantes has at least the following two fields:
- entry (TEXT NOT NULL) - denoting the term within the dictionary
- pronunciation (TEXT) - the IPA pronunciation, with an ASCII apostrophe character (and not a more sophisticated Unicode symbol) before the syllable having the primary stress
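Since the primary stress is stored as an ASCII apostrophe, a consumer wanting standard IPA notation can swap it for the IPA primary-stress mark (U+02C8) at display time. This is just a minimal sketch: the helper name and the sample transcription are made up for illustration and are not taken from an actual Cervantes database.

```python
from typing import Optional

IPA_PRIMARY_STRESS = "\u02c8"  # the IPA primary-stress mark


def to_ipa_stress(pronunciation: Optional[str]) -> Optional[str]:
    """Replace ASCII apostrophes with the IPA primary-stress mark."""
    if pronunciation is None:  # the pronunciation field is nullable
        return None
    return pronunciation.replace("'", IPA_PRIMARY_STRESS)


# Hypothetical transcription, for illustration only
print(to_ipa_stress("'kasa"))
```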
Given the nature of the extraction process, there are no foreign keys enforcing consistency between tables (for example, between verbs and verb_forms) - but one can still perform JOINs according to one's needs.
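To explore the schema of a generated database programmatically, one can rely on SQLite's own metadata. The sketch below uses only the standard sqlite3 module; describe_schema is a hypothetical helper name, and the in-memory table merely mimics the prepositions table as documented here, standing in for a real database created by Cervantes.

```python
import sqlite3


def describe_schema(connection: sqlite3.Connection) -> dict:
    """Map each table name to a list of (field, type, required, primary key)."""
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]

    schema = {}
    for table in tables:
        # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
        columns = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [
            (name, col_type, bool(notnull), pk > 0)
            for _cid, name, col_type, notnull, _default, pk in columns
        ]
    return schema


# Stand-in database: with a real dictionary, connect to its file instead
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE prepositions (entry TEXT NOT NULL PRIMARY KEY, pronunciation TEXT)"
)

for table, fields in describe_schema(demo).items():
    print(table, fields)
```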
Table: prepositions
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
Table: interjections
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
Table: conjunctions
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
Table: adverbs
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
kind | TEXT | * | |
Table: verbs
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
kind | TEXT | * | |
Table: pronouns
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
kind | TEXT | * | |
Table: articles
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
kind | TEXT | * | |
Table: adjectives
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
reference_entry | TEXT | * | |
Table: nouns
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
gender | TEXT | * | * |
number_trait | TEXT | | |
reference_entry | TEXT | * | |
Table: verb_forms
Field | Type | Required | Primary key |
---|---|---|---|
entry | TEXT | * | * |
pronunciation | TEXT | | |
infinitive | TEXT | * | * |
mode | TEXT | * | * |
tense | TEXT | * | |
person | TEXT | * | |
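As mentioned at the beginning of this section, no foreign keys link verbs and verb_forms, but the two tables can still be joined on the infinitive. The following sketch recreates both tables in memory from the fields listed above, reading the primary key of verb_forms as (entry, infinitive, mode); every row, including the values chosen for kind, mode, tense and person, is invented for illustration and does not reflect the vocabulary actually emitted by Cervantes.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE verbs (
        entry TEXT NOT NULL PRIMARY KEY,
        pronunciation TEXT,
        kind TEXT NOT NULL
    );
    CREATE TABLE verb_forms (
        entry TEXT NOT NULL,
        pronunciation TEXT,
        infinitive TEXT NOT NULL,
        mode TEXT NOT NULL,
        tense TEXT NOT NULL,
        person TEXT NOT NULL,
        PRIMARY KEY (entry, infinitive, mode)
    );
    """
)

# Hypothetical sample rows: the category values are placeholders
connection.execute("INSERT INTO verbs VALUES ('cantar', NULL, 'sample-kind')")
connection.executemany(
    "INSERT INTO verb_forms VALUES (?, NULL, ?, 'sample-mode', 'sample-tense', ?)",
    [("canto", "cantar", "1"), ("cantas", "cantar", "2"), ("fui", "ir", "1")],
)

# How many forms were extracted for each verb?
forms_per_verb = connection.execute(
    """
    SELECT v.entry, COUNT(f.entry)
    FROM verbs v
    LEFT JOIN verb_forms f ON f.infinitive = v.entry
    GROUP BY v.entry
    ORDER BY v.entry
    """
).fetchall()

# Without foreign keys, a form may point to an infinitive missing from verbs
orphan_forms = connection.execute(
    """
    SELECT f.entry, f.infinitive
    FROM verb_forms f
    LEFT JOIN verbs v ON v.entry = f.infinitive
    WHERE v.entry IS NULL
    """
).fetchall()

print(forms_per_verb)
print(orphan_forms)
```

The second query is a quick way to gauge how consistent the two tables are after an extraction.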
The API
Even though this library is designed for Jardinero, one can still use its functions in other Python programs - mostly in a custom subclass of WikiPrism's PipelineStrategy.
As a matter of fact, one just needs a few functions from the info.gianlucacosta.cervantes namespace:
- extract_terms(page: Page) -> list[SpanishTerm]: given a page, returns the list of Spanish terms found in it; the term types can be imported from the info.gianlucacosta.cervantes.terms namespace. In WikiPrism's model, this function is a TermExtractor[SpanishTerm]
- create_sqlite_dictionary(connection: Connection) -> SpanishSqliteDictionary: given a SQLite connection, creates a SqliteDictionary (from WikiPrism) that takes ownership of the connection and that can be used to read and write Spanish terms. In WikiPrism's model, this function is a SqliteDictionaryFactory[SpanishTerm]
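As a rough sketch of how the second function might be called outside Jardinero, the snippet below opens a SQLite connection and hands it over to Cervantes. It relies only on the name and the signature listed above; open_spanish_dictionary is a hypothetical wrapper, and the methods available on the returned dictionary belong to WikiPrism and are deliberately not shown here.

```python
import sqlite3


def open_spanish_dictionary(db_path: str = ":memory:"):
    """Create a SQLite connection and wrap it in a Cervantes dictionary."""
    # Imported lazily, so that this sketch can be loaded
    # even in an environment where Cervantes is not installed
    from info.gianlucacosta.cervantes import create_sqlite_dictionary

    connection = sqlite3.connect(db_path)

    # According to the description above, the dictionary becomes the owner
    # of the connection: therefore, this function does not close it
    return create_sqlite_dictionary(connection)
```

Passing a file path instead of the default ":memory:" should make the dictionary persistent, as with any other SQLite connection.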
Parting thoughts
Cervantes is a project that I created because I wanted to explore Spanish morphology in more depth: it was actually the initial kernel of Jardinero, which I later refactored into a separate library, just as I did with WikiPrism.
Since it relies on a very dynamic source like Wikcionario, and despite the carefully-crafted parsing regular expressions, its output cannot be 100% accurate.
Furthermore, it focuses on the linguistic aspects that I found most appealing for my own needs - which means that I had to discard some information during parsing, and even to include aspects that may be unnecessary in a different context.
Consequently... feel free to experiment, and maybe to create your own library! ^__^
Actually, one can even adopt Cervantes's patterns to create a linguistic module for Jardinero dedicated to another language! 🤩
Further references
- Jardinero - Python/TypeScript React web app for exploring natural languages
- WikiPrism - Parse wiki pages and create dictionaries, fast, with Python
- Eos-core - Type-checked, dependency-free utility library for modern Python