Cervantes

Extract a compact Spanish dictionary from Wikcionario, with elegance

In Spanish literature - and all over the world - the works of Miguel de Cervantes are considered an obra maestra because of their stylistic 🦋elegance, witty remarks and humanistic depth of thought...

...which is why I wish to dedicate this project to his memory: more precisely, this is my library for creating a customized, Wiktionary-based corpus of the Spanish language.

Cervantes is a type-checked library for Python, built on top of WikiPrism, focusing on:

Parsing an XML dump of Wikcionario and extracting Spanish terms from each wiki page
Classifying each term according to a set of grammar categories
Providing a Spanish-related Dictionary, backed by a SQLite db, that can be used for custom analysis via SQL queries

Despite its sophisticated regex-based engine, Cervantes has a minimalist programming interface; furthermore, it is designed to be a core plugin of Jardinero - which makes it extremely simple to use, via a web-application user interface.

No matter the scenario, it is essential to explore the SQL schema of its underlying database: for details, please consult the sections below.

Installation

To install Cervantes, just run:

pip install info.gianlucacosta.cervantes

or, if you are using Poetry:

poetry add info.gianlucacosta.cervantes

Extracting a custom dictionary from Wikcionario

Once Cervantes is installed in your Python environment, you can import it just like any other Python library...

...or you can run it as an extension module within Jardinero's infrastructure! 🥳

Just make sure that both Jardinero and Cervantes are installed, then run:

python -OO -m info.gianlucacosta.jardinero info.gianlucacosta.cervantes

You will then be able to create a dictionary and perform SQL queries via Jardinero's web user interface.

Cervantes also supports Jardinero's developer mode: in that case, the system will refer to your local copy of Wikcionario - which must be a BZ2 archive residing at the following address:

http://localhost:8000/eswiktionary-latest-pages-articles.xml.bz2

Usually, you can make this URL available by running:

python -m http.server

from within the directory containing your Wikcionario dump file.

For a more detailed explanation about the developer mode, please refer to Jardinero's documentation.
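Before serving a dump file, it can be handy to check that it really is a readable BZ2-compressed XML archive. The following is only a minimal sketch based on Python's standard library: it is not how Cervantes or WikiPrism parse the dump, and both the iter_page_titles helper and the tiny in-memory sample (including its namespace URI) are hypothetical, introduced just for illustration.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from typing import BinaryIO


def iter_page_titles(compressed: BinaryIO) -> Iterator[str]:
    """Yield the <title> of each <page> found in a BZ2-compressed XML dump."""
    with bz2.open(compressed, "rb") as xml_stream:
        for _event, element in ET.iterparse(xml_stream, events=("end",)):
            # Elements may carry an XML namespace: keep only the local name.
            local_name = element.tag.rsplit("}", 1)[-1]
            if local_name != "page":
                continue

            for child in element:
                if child.tag.rsplit("}", 1)[-1] == "title":
                    yield child.text or ""
                    break

            # Drop the content of the page just processed to limit memory use.
            element.clear()


# Hypothetical miniature dump, compressed in memory just for this sketch.
sample_xml = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>casa</title></page>
  <page><title>correr</title></page>
</mediawiki>"""

sample_titles = list(iter_page_titles(io.BytesIO(bz2.compress(sample_xml))))
print(sample_titles)
```

To try it on a real file, one could pass the result of open("eswiktionary-latest-pages-articles.xml.bz2", "rb") instead of the in-memory buffer.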

Database schema

Every table in the database created by Cervantes has at least these two fields:

entry (TEXT NOT NULL) - denoting the term within the dictionary
pronunciation (TEXT) - the IPA pronunciation, with an ASCII apostrophe character (and not a more sophisticated Unicode symbol) before the syllable having the primary stress
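When comparing these pronunciations with IPA strings coming from other sources, which often use the Unicode primary-stress mark (U+02C8), a small normalization step may help. This is just a sketch under that assumption: the to_ascii_stress helper and the sample pronunciation are hypothetical and are not part of Cervantes.

```python
IPA_PRIMARY_STRESS = "\u02c8"  # the Unicode IPA primary-stress mark
ASCII_APOSTROPHE = "'"


def to_ascii_stress(pronunciation: str) -> str:
    """Replace the Unicode stress mark with the ASCII apostrophe."""
    return pronunciation.replace(IPA_PRIMARY_STRESS, ASCII_APOSTROPHE)


# Made-up pronunciation string, used only to exercise the helper.
normalized = to_ascii_stress("\u02c8ka.sa")
print(normalized)
```

Strings that already use the ASCII apostrophe pass through unchanged, so the helper can be applied to both sides of a comparison.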

Given the nature of the extraction process, there are no foreign keys enforcing consistency between tables (for example, between verbs and verb_forms) - but one can still perform JOINs according to one's needs.

Table: prepositions

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT

Table: interjections

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT

Table: conjunctions

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT
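Since every table exposes an entry field, one can look up which categories contain a given term. The sketch below recreates the three tables above in an in-memory SQLite database, following the documented fields; the tables_containing helper and the sample rows are hypothetical, and the rows say nothing about what Wikcionario actually contains.

```python
import sqlite3

SIMPLE_TABLES = ("prepositions", "interjections", "conjunctions")


def create_simple_tables(connection: sqlite3.Connection) -> None:
    """Create the three tables with the documented fields."""
    for table in SIMPLE_TABLES:
        connection.execute(
            f"CREATE TABLE {table} ("
            "entry TEXT NOT NULL PRIMARY KEY, pronunciation TEXT)"
        )


def tables_containing(connection: sqlite3.Connection, term: str) -> list[str]:
    """Return the sorted names of the tables having the term as an entry."""
    # Table names come from a fixed tuple; only the term is a bound parameter.
    query = " UNION ALL ".join(
        f"SELECT '{table}' FROM {table} WHERE entry = :term"
        for table in SIMPLE_TABLES
    )
    return sorted(row[0] for row in connection.execute(query, {"term": term}))


connection = sqlite3.connect(":memory:")
create_simple_tables(connection)

# Made-up sample rows, for illustration only.
connection.execute("INSERT INTO prepositions (entry) VALUES ('según')")
connection.execute("INSERT INTO conjunctions (entry) VALUES ('según'), ('y')")
connection.execute("INSERT INTO interjections (entry) VALUES ('ay')")

found = tables_containing(connection, "según")
missing = tables_containing(connection, "xyz")
connection.close()
print(found)
```

The same pattern extends to the other tables by adding their names to the tuple.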

Table: adverbs

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT
kind	TEXT		*

Table: verbs

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT
kind	TEXT		*

Table: pronouns

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT
kind	TEXT		*

Table: articles

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT
kind	TEXT		*

Table: adjectives

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT
reference_entry	TEXT		*

Table: nouns

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT
gender	TEXT	*	*
number_trait	TEXT
reference_entry	TEXT		*

Table: verb_forms

Field	Type	Required	Primary key
entry	TEXT	*	*
pronunciation	TEXT
infinitive	TEXT	*	*
mode	TEXT	*	*
tense	TEXT		*
person	TEXT		*
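As an example of a JOIN without foreign keys, the sketch below counts the forms recorded for each verb. It assumes that verb_forms.infinitive matches verbs.entry, which the schema suggests but does not enforce; the tables are recreated in memory from the fields listed above, and the sample rows (including the values chosen for mode, tense and person) are made up.

```python
import sqlite3

connection = sqlite3.connect(":memory:")

# Tables recreated from the documented fields and primary keys.
connection.executescript(
    """
    CREATE TABLE verbs (
        entry TEXT NOT NULL,
        pronunciation TEXT,
        kind TEXT,
        PRIMARY KEY (entry, kind)
    );

    CREATE TABLE verb_forms (
        entry TEXT NOT NULL,
        pronunciation TEXT,
        infinitive TEXT NOT NULL,
        mode TEXT NOT NULL,
        tense TEXT,
        person TEXT,
        PRIMARY KEY (entry, infinitive, mode, tense, person)
    );
    """
)

# Made-up rows: the actual values stored by Cervantes may differ.
connection.executemany(
    "INSERT INTO verbs (entry, pronunciation, kind) VALUES (?, ?, ?)",
    [("cantar", None, None), ("comer", None, None)],
)
connection.executemany(
    "INSERT INTO verb_forms "
    "(entry, pronunciation, infinitive, mode, tense, person) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("canto", None, "cantar", "indicative", "present", "1s"),
        ("cantas", None, "cantar", "indicative", "present", "2s"),
        ("come", None, "comer", "indicative", "present", "3s"),
    ],
)

# Assumption: a form belongs to the verb whose entry equals its infinitive.
form_counts = connection.execute(
    """
    SELECT v.entry, COUNT(f.entry) AS form_count
    FROM verbs AS v
    LEFT JOIN verb_forms AS f ON f.infinitive = v.entry
    GROUP BY v.entry
    ORDER BY v.entry
    """
).fetchall()
connection.close()
print(form_counts)
```

A LEFT JOIN is used so that verbs without any recorded form still appear, with a count of zero.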

The API

Even though this library is designed for Jardinero, one can still use its functions in other Python programs - mostly in a custom subclass of WikiPrism's PipelineStrategy.

As a matter of fact, one just needs a few functions from the info.gianlucacosta.cervantes namespace:

extract_terms(page: Page) -> list[SpanishTerm]: given a page, returns the list of Spanish terms found in it; the term types can be imported from the info.gianlucacosta.cervantes.terms namespace. In WikiPrism's model, this function is a TermExtractor[SpanishTerm]

create_sqlite_dictionary(connection: Connection) -> SpanishSqliteDictionary: given a SQLite connection, creates a SqliteDictionary (from WikiPrism) that will become its owner and that can be used to read and write Spanish terms. Consequently, it is a SqliteDictionaryFactory[SpanishTerm]
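To show how a term extractor with the signature above can be plugged into custom code, here is a minimal sketch. The collect_terms helper and the stand-in extractor are hypothetical and exist only to keep the example runnable without the library; with Cervantes installed, one would pass its extract_terms function together with WikiPrism pages instead.

```python
from collections.abc import Callable, Iterable
from typing import TypeVar

TPage = TypeVar("TPage")
TTerm = TypeVar("TTerm")


def collect_terms(
    pages: Iterable[TPage],
    extractor: Callable[[TPage], list[TTerm]],
) -> list[TTerm]:
    """Apply a term extractor to every page and flatten the results."""
    collected: list[TTerm] = []
    for page in pages:
        collected.extend(extractor(page))
    return collected


# With Cervantes installed, the extractor would be the library's function:
#
#     from info.gianlucacosta.cervantes import extract_terms
#     terms = collect_terms(pages, extract_terms)
#
# Below, a stand-in extractor keeps the sketch runnable on its own.
def fake_extractor(page: str) -> list[str]:
    return page.split()


demo_terms = collect_terms(["hola mundo", "", "adiós"], fake_extractor)
print(demo_terms)
```

Passing the extractor as a parameter keeps the helper independent of the actual term type, in the same spirit as WikiPrism's generic TermExtractor.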

Parting thoughts

Cervantes is a project that I created because I definitely needed to further explore Spanish morphology: actually, it was the initial kernel of Jardinero, which I later refactored into a separate library, as well as WikiPrism.

Since it relies on a very dynamic source like Wikcionario, its output cannot be 100% accurate, despite the carefully crafted parsing regular expressions.

Furthermore, it focuses on the linguistic aspects that I found most appealing for my own needs - which means that I had to discard some information during the parsing, and even to include aspects that may be unnecessary in a different context.

Consequently... feel free to experiment, and maybe to create your own library! ^__^

Actually, one can even adopt Cervantes's patterns to create a linguistic module for Jardinero dedicated to another language! 🤩

Further references

Jardinero - Python/TypeScript React web app for exploring natural languages
WikiPrism - Parse wiki pages and create dictionaries, fast, with Python

Eos-core - Type-checked, dependency-free utility library for modern Python
Miguel de Cervantes

Release history

1.1.0 - Apr 22, 2022 (this version)
1.0.0 - Apr 20, 2022
