Cervantes

Extract a compact Spanish dictionary from Wikcionario, with elegance

In Spanish literature - and all over the world - the works of Miguel de Cervantes are considered an obra maestra because of their stylistic 🦋elegance, witty remarks and humanistic depth of thought...

...which is why I wish to dedicate this project to his memory: more precisely, this is my library for creating a customized, Wiktionary-based corpus of the Spanish language.

Cervantes is a type-checked library for Python, built on top of WikiPrism, focusing on:

  • Parsing an XML dump of Wikcionario and extracting Spanish terms from each wiki page

  • Classifying each term according to a set of grammar categories

  • Providing a Spanish-related Dictionary, backed by a SQLite db, that can be used for custom analysis via SQL queries

Despite its sophisticated regex-based engine, Cervantes has a minimalist programming interface; furthermore, it is designed to be a core plugin of Jardinero - which makes it extremely simple to use, via a web-application user interface.

No matter the scenario, it is essential to explore the SQL schema of its underlying database: for details, please consult the sections below.
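One way to explore that schema from Python is through SQLite's own introspection facilities. The sketch below uses only the standard sqlite3 module; describe_schema is a hypothetical helper (not part of Cervantes), and the in-memory database is a stand-in that mimics the prepositions table documented below, so that the example runs without a real dictionary file.

```python
import sqlite3


def describe_schema(
    connection: sqlite3.Connection,
) -> dict[str, list[tuple[str, str, bool, bool]]]:
    """Map each table name to (column, type, required, in_primary_key) tuples."""
    schema = {}
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
        columns = connection.execute(
            f'PRAGMA table_info("{table_name}")'
        ).fetchall()
        schema[table_name] = [
            (name, column_type, bool(notnull), pk > 0)
            for _cid, name, column_type, notnull, _default, pk in columns
        ]
    return schema


# Stand-in database (an assumption for illustration): with a real dictionary,
# one would connect to the SQLite file created by Cervantes instead.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE prepositions (entry TEXT NOT NULL PRIMARY KEY, pronunciation TEXT)"
)

schema = describe_schema(connection)
for table_name, columns in schema.items():
    print(table_name, columns)

connection.close()
```

With a real database, the same function should list every table described in the sections below.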

Installation

To install Cervantes, just run:

pip install info.gianlucacosta.cervantes

or, if you are using Poetry:

poetry add info.gianlucacosta.cervantes

Extracting a custom dictionary from Wikcionario

Once Cervantes is installed in your Python environment, you can import it just like any other Python library - or you can run it within Jardinero's infrastructure!

In the latter case, make sure Jardinero is installed, then run:

python [-OO] -m info.gianlucacosta.jardinero info.gianlucacosta.cervantes

If you start Jardinero with the -OO flag:

  • the process will run slightly faster, but with less logging

  • Jardinero will download the very latest dump from Wikcionario's official download website

On the other hand, if you omit the flag:

Database schema

Every table in the database created by Cervantes has at least these two fields:

  • entry (TEXT NOT NULL) - denoting the term within the dictionary

  • pronunciation (TEXT) - the IPA pronunciation, with an ASCII apostrophe character (and not a more sophisticated Unicode symbol) before the syllable having the primary stress
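Should the Unicode stress mark be preferred for display, the ASCII apostrophe can be replaced after reading the field. This is a minimal sketch: to_ipa_stress is a hypothetical helper, and the sample pronunciation string (including its syllable dots) is made up for illustration rather than taken from an actual Cervantes database.

```python
ASCII_STRESS = "'"
IPA_PRIMARY_STRESS = "\u02c8"  # MODIFIER LETTER VERTICAL LINE, the IPA primary stress mark


def to_ipa_stress(pronunciation: str) -> str:
    """Replace the ASCII apostrophe with the IPA primary stress mark."""
    return pronunciation.replace(ASCII_STRESS, IPA_PRIMARY_STRESS)


# Hypothetical value for the pronunciation field of "caballo"
sample = "ka'βa.ʎo"
print(to_ipa_stress(sample))
```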

Given the nature of the extraction process, there are no foreign keys enforcing consistency between tables (for example, between verbs and verb_forms) - but one can still perform JOINs according to one's needs.
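As an illustration of such a JOIN, the sketch below links verb forms to their infinitives. It runs against an in-memory stand-in database: the column names follow the tables documented below, but constraints are omitted and every row (including the values used for kind, mode, tense and person) is invented for the example. The LEFT JOIN also exposes forms whose infinitive has no row in verbs, which can happen precisely because no foreign key is enforced.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Simplified stand-in schema: same column names as the documentation,
# but without the primary keys of the real tables.
connection.executescript(
    """
    CREATE TABLE verbs (entry TEXT NOT NULL, pronunciation TEXT, kind TEXT);
    CREATE TABLE verb_forms (
        entry TEXT NOT NULL,
        pronunciation TEXT,
        infinitive TEXT,
        mode TEXT,
        tense TEXT,
        person TEXT
    );
    """
)

# Invented sample rows; the real values stored by Cervantes may differ.
connection.execute("INSERT INTO verbs VALUES ('cantar', NULL, 'transitive')")
connection.executemany(
    "INSERT INTO verb_forms VALUES (?, NULL, ?, ?, ?, ?)",
    [
        ("canto", "cantar", "indicative", "present", "1s"),
        ("cantas", "cantar", "indicative", "present", "2s"),
        ("bailo", "bailar", "indicative", "present", "1s"),  # no matching verb row
    ],
)

rows = connection.execute(
    """
    SELECT f.entry, f.infinitive, v.entry
    FROM verb_forms AS f
    LEFT JOIN verbs AS v ON v.entry = f.infinitive
    ORDER BY f.entry
    """
).fetchall()

# Forms whose infinitive is missing from the verbs table
orphans = [form for form, _infinitive, verb in rows if verb is None]

print(rows)
print(orphans)

connection.close()
```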

Table: prepositions

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT

Table: interjections

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT

Table: conjunctions

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT

Table: adverbs

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT
kind TEXT *

Table: verbs

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT
kind TEXT *

Table: pronouns

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT
kind TEXT *

Table: articles

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT
kind TEXT *

Table: adjectives

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT
reference_entry TEXT *

Table: nouns

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT
gender TEXT * *
number_trait TEXT
reference_entry TEXT *

Table: verb_forms

Field Type Required Primary key
entry TEXT * *
pronunciation TEXT
infinitive TEXT * *
mode TEXT * *
tense TEXT *
person TEXT *

The API

Even though this library is designed for Jardinero, one can still use its functions in other Python programs - mostly in a custom subclass of WikiPrism's PipelineStrategy.

As a matter of fact, one just needs a few functions from the info.gianlucacosta.cervantes namespace:

  • extract_terms(page: Page) -> list[SpanishTerm]: given a page, returns the list of Spanish terms found in it; the term types can be imported from the info.gianlucacosta.cervantes.terms namespace. In WikiPrism's model, this function is a TermExtractor[SpanishTerm].
  • create_sqlite_dictionary(connection: Connection) -> SpanishSqliteDictionary: given a SQLite connection, creates a SqliteDictionary (from WikiPrism) that takes ownership of the connection and can be used to read and write Spanish terms. In WikiPrism's model, this function is a SqliteDictionaryFactory[SpanishTerm].
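In plain Python terms, the factory described above is a callable that receives a SQLite connection and returns a dictionary object. The sketch below shows only that shape: open_dictionary is a hypothetical helper (not part of Cervantes or WikiPrism), and a trivial stand-in factory is used so that the example runs without the library installed. The commented lines show how the real function would presumably be passed in; the methods of the resulting dictionary are not shown, since this section does not document them.

```python
import sqlite3
from typing import Callable, TypeVar

T = TypeVar("T")


def open_dictionary(
    factory: Callable[[sqlite3.Connection], T], db_path: str = ":memory:"
) -> T:
    """Open a SQLite connection and hand it over to the given factory."""
    connection = sqlite3.connect(db_path)
    # From here on, the object returned by the factory is expected
    # to own the connection (and therefore to close it).
    return factory(connection)


# Assuming Cervantes is installed, the real factory could be used like this:
#
#   from info.gianlucacosta.cervantes import create_sqlite_dictionary
#   dictionary = open_dictionary(create_sqlite_dictionary, "spanish.db")
#
# ("spanish.db" is an arbitrary file name chosen for the example.)

# Stand-in factory that simply returns the connection it receives
stand_in = open_dictionary(lambda connection: connection)
print(stand_in.execute("SELECT 1").fetchone())
stand_in.close()
```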

Parting thoughts

Cervantes is a project that I created because I needed to explore Spanish morphology further: it was actually the initial kernel of Jardinero, which I later refactored into a separate library, just as I did with WikiPrism.

Because it relies on a constantly changing source such as Wikcionario, its output cannot be 100% accurate, despite the carefully crafted parsing regular expressions.

Furthermore, it focuses on the linguistic aspects that I found most appealing for my own needs - which means that I had to discard some information during parsing, and even to include aspects that may be unnecessary in a different context.

Consequently... feel free to experiment, and maybe to create your own library! ^__^

Actually, one can even adopt Cervantes's patterns to create a linguistic module for Jardinero dedicated to another language! 🤩

Further references

  • Jardinero - Python/TypeScript React web app for exploring natural languages

  • WikiPrism - Parse wiki pages and create dictionaries, fast, with Python
