A program for converting data from a Ġabra database dump to a more regular and accessible format.
Project description
Ġabra Converter
This program converts Ġabra's database dump files, which are in MongoDB format, into a more accessible format, while also cleaning and normalising the data.
How to use
To use this program, you will need the following command-line tools available on your computer:
- tar: 7-zip archiver
- bsondump: MongoDB tool
Make sure that you install the above applications and then test them in your command line with the following commands:
tar --version
bsondump --version
Once these applications are available on your command line, you can download a Ġabra database dump file.
Use the exporter by calling python bin/run_gabra_converter.py or gabra_converter.exe in the command line as follows:
python bin/run_gabra_converter.py --gabra_dump_path <path to dump file> --out_path <path to folder with exported files> --lexeme_cleaners <space separated list of lexeme cleaner names> --wordform_cleaners <space separated list of wordform cleaner names> --lexeme_exporter <exporter name> --wordform_exporter <exporter name, usually the same as the lexeme exporter>
Here is a typical example:
python bin/run_gabra_converter.py --gabra_dump_path path/to/gabra --out_path path/to/out --lexeme_cleaners --wordform_cleaners --lexeme_exporter csv --wordform_exporter csv
or with gabra_converter.exe:
gabra_converter --gabra_dump_path path/to/gabra --out_path path/to/out --lexeme_cleaners new_lines --wordform_cleaners --lexeme_exporter csv --wordform_exporter csv
Run python bin/run_gabra_converter.py --help or gabra_converter --help for more information.
What is exported
All the exported data is based on the official Ġabra schema.
Whilst MongoDB is a NoSQL database that allows fields to be left out of database rows entirely (note that in MongoDB rows are called documents and tables are called collections), the exported data is structured as flat tables.
All the fields in the schema appear in the export and are left empty when a row does not use them.
On the other hand, any fields that are not mentioned in the schema but are still used in the rows, such as norm_freq, are left out.
A number of files are generated to handle one-to-many relationships.
For example, since one lexeme can have many glosses (glosses are stored as a list in Ġabra), a separate file for glosses is created such that each row in the lexemes file can refer to multiple rows in the glosses file.
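As an illustration (with invented values, and with column names other than new_id and new_lexeme_id assumed for the example rather than guaranteed by the converter), a lexeme whose glosses list contains two entries would end up split across the two files roughly like this:

```python
# Hypothetical lexeme document; the values are invented for illustration only.
lexeme_document = {
    "_id": "63b1e0f314e849fa182bcfc3",        # MongoDB hexadecimal key
    "lemma": "kiteb",
    "glosses": [{"gloss": "write"}, {"gloss": "compose"}],
}

new_lexeme_id = 1  # decimal key assigned by the converter

# One row destined for lexemes.csv...
lexeme_row = {
    "new_id": new_lexeme_id,
    "_id": lexeme_document["_id"],
    "lemma": lexeme_document["lemma"],
}

# ...and one row per gloss destined for lexemes_glosses.csv, each linking back
# to the lexeme through new_lexeme_id.
gloss_rows = [
    {"new_id": i + 1, "new_lexeme_id": new_lexeme_id, "gloss": g["gloss"]}
    for i, g in enumerate(lexeme_document["glosses"])
]
```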
Non-list fields that are represented as nested objects are flattened, so that the field "root":{"radicals":"b-ħ-b-ħ","variant":2} becomes two fields, root-radicals and root-variant, with a dash used to separate parent names from child names.
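A minimal sketch of this dash-separated naming scheme is shown below; it is an illustration of the convention only, not the converter's own implementation:

```python
def flatten(obj, prefix=""):
    """Flatten nested dictionaries into dash-separated field names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}-{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

print(flatten({"root": {"radicals": "b-ħ-b-ħ", "variant": 2}}))
# {'root-radicals': 'b-ħ-b-ħ', 'root-variant': 2}
```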
Any unnecessarily nested objects produced by MongoDB to specify data types (objects consisting of just one field whose name starts with a dollar sign) are not preserved. So numbers stored in a "$numberInt" wrapper, such as "derived_form":{"$numberInt":1}, are exported as a plain derived_form value without reference to the nested object.
Boolean values are represented as 0 for false and 1 for true.
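The sketch below illustrates these two behaviours together (stripping single-field $-type wrappers and converting booleans to 0 and 1); it is an approximation for illustration, not the converter's actual code:

```python
def normalise_value(value):
    """Strip single-field $-type wrappers and convert booleans to 0/1."""
    if isinstance(value, dict) and len(value) == 1:
        key = next(iter(value))
        if key.startswith("$"):
            value = value[key]
    if isinstance(value, bool):
        return 1 if value else 0
    return value

print(normalise_value({"$numberInt": 1}))  # 1
print(normalise_value(True))               # 1
print(normalise_value(False))              # 0
```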
Finally, while MongoDB uses hexadecimal numbers for primary and foreign keys, such as 63b1e0f314e849fa182bcfc3, the export also includes its own decimal primary and foreign keys for ease of use in relational databases. These fields have their names prefixed with new_, such as new_id and new_lexeme_id.
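One way to picture this key scheme (a sketch with invented identifiers, not the converter's code) is a lookup table from each hexadecimal _id to a sequential decimal new_id, which is then reused wherever another table needs a foreign key:

```python
# Invented hexadecimal _id values, for illustration only.
lexeme_hex_ids = ["63b1e0f314e849fa182bcfc3", "63b1e0f314e849fa182bcfc4"]

# Assign sequential decimal keys alongside the original hexadecimal ones.
new_ids = {hex_id: i + 1 for i, hex_id in enumerate(lexeme_hex_ids)}

# A wordform referring to the first lexeme would then get 1 in its
# new_lexeme_id column while the hexadecimal reference is kept in _id.
print(new_ids["63b1e0f314e849fa182bcfc3"])  # 1
```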
The following exporters are supported:
csv
At the moment, the program only supports CSV (Comma Separated Values) file exports. The files generated are the following:
- lexemes.csv: Contains all the non-list fields in the lexemes collection. Includes a decimal unique ID new_id field and the original hexadecimal unique ID _id field.
- lexemes_alternatives.csv: Contains the alternative words of each lexeme on separate rows, using the new_lexeme_id field to link to the lexeme's new_id field. Includes a decimal unique ID new_id.
- lexemes_sources.csv: Contains the sources of each lexeme on separate rows, using the new_lexeme_id field to link to the lexeme's new_id field. Includes a decimal unique ID new_id.
- lexemes_glosses.csv: Contains the different glosses (definitions in English) of each lexeme on separate rows, using the new_lexeme_id field to link to the lexeme's new_id field. Includes a decimal unique ID new_id.
- lexemes_examples.csv: Contains the different examples of each lexeme's gloss on separate rows, using the new_gloss_id field to link to the gloss's new_id field. Includes a decimal unique ID new_id.
- wordforms.csv: Contains all the non-list fields in the wordforms collection. Includes a decimal unique ID new_id field, a decimal lexeme ID reference called new_lexeme_id, and the original hexadecimal unique ID _id field.
- wordforms_alternatives.csv: Contains the alternative words of each wordform on separate rows, using the new_wordform_id field to link to the wordform's new_id field. Includes a decimal unique ID new_id.
- wordforms_sources.csv: Contains the sources of each wordform on separate rows, using the new_wordform_id field to link to the wordform's new_id field. Includes a decimal unique ID new_id.
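As a usage example, the exported files can be joined back together with Python's standard csv module; the lemma and gloss column names below are assumptions based on the schema rather than guaranteed output:

```python
import csv

# Attach each gloss to its lexeme via the decimal new_id / new_lexeme_id keys.
with open("lexemes.csv", encoding="utf-8", newline="") as f:
    lexemes = {row["new_id"]: row for row in csv.DictReader(f)}

with open("lexemes_glosses.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        lexeme = lexemes[row["new_lexeme_id"]]
        print(lexeme["lemma"], "-", row["gloss"])
```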
Available cleaners
There are a number of options available for skipping or cleaning certain rows from the Ġabra database. Some are required whilst others are optional, depending on the exporter used.
Lexeme-related cleaners
- new_lines: Remove new lines from the glosses and examples of lexemes.
- lemma_capitals: Skip any lexemes whose lemma contains uppercase letters.
- lemma_nonmaltese: Skip any lexemes whose lemma contains non-Maltese letters (illustrated in the sketch after this list).
- lemma_spaces: Skip any lexemes whose lemma contains spaces.
- pending: Skip any lexemes whose pending field is not set to false.
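The following sketch illustrates the kind of lemma-level checks these cleaners perform. It is only an approximation: the converter's actual rules and letter set may differ, and the set below lists the Maltese alphabet without the digraphs għ and ie, whose characters are already included:

```python
# Approximation of the lemma checks; not the converter's actual code.
MALTESE_LETTERS = set("abċdefgġhħijklmnopqrstuvwxżz")

def has_uppercase(lemma: str) -> bool:   # cf. lemma_capitals
    return any(ch.isupper() for ch in lemma)

def has_nonmaltese(lemma: str) -> bool:  # cf. lemma_nonmaltese
    return any(ch.lower() not in MALTESE_LETTERS for ch in lemma if not ch.isspace())

def has_spaces(lemma: str) -> bool:      # cf. lemma_spaces
    return " " in lemma

print(has_nonmaltese("kiteb"))    # False: all letters are Maltese
print(has_nonmaltese("cyclone"))  # True: "c" and "y" are not Maltese letters
```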
Required cleaners:

| | csv |
|---|---|
| new_lines | |
| lemma_capitals | |
| lemma_nonmaltese | |
| lemma_spaces | |
| pending | |
Wordform-related cleaners
- missing_lexeme: Skip any wordforms whose lexeme ID does not refer to an existing lexeme.
- surfaceform_capitals: Skip any wordforms whose surface form contains uppercase letters.
- surfaceform_nonmaltese: Skip any wordforms whose surface form contains non-Maltese letters.
- surfaceform_spaces: Skip any wordforms whose surface form contains spaces.
- pending: Skip any wordforms whose pending field is not set to false.
Required cleaners:

| | csv |
|---|---|
| missing_lexeme | |
| surfaceform_capitals | |
| surfaceform_nonmaltese | |
| surfaceform_spaces | |
| pending | |
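For example, to export to CSV while applying every cleaner listed above (an illustrative invocation; choose whichever cleaners suit your needs):
python bin/run_gabra_converter.py --gabra_dump_path path/to/gabra --out_path path/to/out --lexeme_cleaners new_lines lemma_capitals lemma_nonmaltese lemma_spaces pending --wordform_cleaners missing_lexeme surfaceform_capitals surfaceform_nonmaltese surfaceform_spaces pending --lexeme_exporter csv --wordform_exporter csv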