Converts a dataset based on a specific schema
ckanext-transmute
This extension helps to validate and convert data based on a specific schema.
Working with transmute
ckanext-transmute provides the tsm_transmute action, which transmutes data according to a provided conversion schema. The action doesn't change the original data; it creates a new data dict. There are two mandatory arguments: data and schema. data is the data dict you have, and schema describes how to validate and change it.
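As a rough sketch, calling the action could look like the helper below. The helper name run_transmute and the empty context are assumptions made for illustration; in a CKAN environment you would pass ckan.plugins.toolkit.get_action as get_action. Only the action name and the data and schema keys come from the description above.

```python
from typing import Any, Callable, Dict


def run_transmute(
    get_action: Callable[[str], Callable[[Dict[str, Any], Dict[str, Any]], Any]],
    data: Dict[str, Any],
    schema: Dict[str, Any],
) -> Any:
    """Call the tsm_transmute action with its two mandatory arguments.

    run_transmute is a hypothetical helper, not part of the extension.
    """
    action = get_action("tsm_transmute")
    # CKAN actions are called as action(context, data_dict).
    # Using an empty context here is an assumption made for this sketch.
    return action({}, {"data": data, "schema": schema})
```

Passing get_action in as a parameter keeps the sketch importable outside CKAN; inside a plugin you would normally call toolkit.get_action directly.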
Example
We have a data dict:
{
"title": "Test-dataset",
"email": "test@test.ua",
"metadata_created": "",
"metadata_modified": "",
"metadata_reviewed": "",
"resources": [
{
"title": "test-res",
"extension": "xml",
"web": "https://stackoverflow.com/",
"sub-resources": [
{
"title": "sub-res",
"extension": "csv",
"extra": "should-be-removed",
}
],
},
{
"title": "test-res2",
"extension": "csv",
"web": "https://stackoverflow.com/",
},
],
}
And we want to achieve this:
{
"name": "test-dataset",
"email": "test@test.ua",
"metadata_created": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
"metadata_modified": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
"metadata_reviewed": datetime.datetime(2022, 2, 3, 15, 54, 26, 359453),
"attachments": [
{
"name": "test-res",
"format": "XML",
"url": "https://stackoverflow.com/",
"sub-resources": [{"name": "SUB-RES", "format": "CSV"}],
},
{
"name": "test-res2",
"format": "CSV",
"url": "https://stackoverflow.com/",
},
],
}
Then our schema should look something like this:
{
"root": "Dataset",
"types": {
"Dataset": {
"fields": {
"title": {
"validators": [
"tsm_string_only",
"tsm_to_lowercase",
"tsm_name_validator",
],
"map": "name",
},
"resources": {
"type": "Resource",
"multiple": True,
"map": "attachments",
},
"metadata_created": {
"validators": ["tsm_isodate"],
"default": "2022-02-03T15:54:26.359453",
},
"metadata_modified": {
"validators": ["tsm_isodate"],
"default_from": "metadata_created",
},
"metadata_reviewed": {
"validators": ["tsm_isodate"],
"replace_from": "metadata_modified",
},
}
},
"Resource": {
"fields": {
"title": {
"validators": ["tsm_string_only"],
"map": "name",
},
"extension": {
"validators": ["tsm_string_only", "tsm_to_uppercase"],
"map": "format",
},
"web": {
"validators": ["tsm_string_only"],
"map": "url",
},
"sub-resources": {
"type": "Sub-Resource",
"multiple": True,
},
},
},
"Sub-Resource": {
"fields": {
"title": {
"validators": ["tsm_string_only", "tsm_to_uppercase"],
"map": "name",
},
"extension": {
"validators": ["tsm_string_only", "tsm_to_uppercase"],
"map": "format",
},
"extra": {
"remove": True,
},
}
},
},
}
This is an example of a schema with nested types. The root field is mandatory; it must contain the name of the main type, from which the schema starts. As you can see, the Dataset type contains the Resource type, which contains Sub-Resource.
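To make the nesting easier to see, here is a small standalone sketch that starts at root and follows every field that declares a type. The walk_types function is hypothetical and is not how the extension itself traverses a schema; the schema below is a trimmed version of the example above.

```python
from typing import Any, Dict, List, Optional, Tuple


def walk_types(
    schema: Dict[str, Any], type_name: Optional[str] = None, depth: int = 0
) -> List[Tuple[int, str]]:
    """Return (depth, type name) pairs reachable from the root type."""
    current = type_name or schema["root"]
    found = [(depth, current)]
    for field in schema["types"][current]["fields"].values():
        nested = field.get("type")
        if nested:
            # A field with "type" points at another entry in schema["types"].
            found.extend(walk_types(schema, nested, depth + 1))
    return found


schema = {
    "root": "Dataset",
    "types": {
        "Dataset": {"fields": {"resources": {"type": "Resource", "multiple": True}}},
        "Resource": {"fields": {"sub-resources": {"type": "Sub-Resource", "multiple": True}}},
        "Sub-Resource": {"fields": {"extra": {"remove": True}}},
    },
}

print(walk_types(schema))
# [(0, 'Dataset'), (1, 'Resource'), (2, 'Sub-Resource')]
```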
Transmutators
There are a few default transmutators you can use in your schema. Of course, you can define a custom transmutator with the ITransmute interface.
- tsm_name_validator - Wrapper over the CKAN default name_validator validator.
- tsm_to_lowercase - Casts a string value to lowercase.
- tsm_to_uppercase - Casts a string value to uppercase.
- tsm_string_only - Validates that field.value is a string.
- tsm_isodate - Validates a datetime string. Mutates an ISO-like string to a datetime object.
- tsm_to_string - Casts field.value to str.
- tsm_get_nested - Allows you to pick up a value from a nested structure. Example:
data = {
    "title_translated": [
        {"nested_field": {"en": "en title", "ar": "العنوان ar"}},
    ],
}
schema = ...
"title": {
"replace_from": "title_translated",
"validators": [
["tsm_get_nested", 0, "nested_field", "en"],
"tsm_to_uppercase",
],
},
...
This takes the value for the title field from the title_translated field. Because title_translated is an array with nested objects, we use the tsm_get_nested transmutator to extract the value from it.
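The lookup that tsm_get_nested performs in this example can be pictured in plain Python. The get_nested function below is a hypothetical stand-in written for illustration, not the extension's code: it follows each index or key in turn, and the result is then uppercased as tsm_to_uppercase would do.

```python
from typing import Any


def get_nested(value: Any, *path: Any) -> Any:
    """Follow a sequence of list indexes and dict keys into a nested value."""
    for key in path:
        value = value[key]
    return value


data = {
    "title_translated": [
        {"nested_field": {"en": "en title", "ar": "العنوان ar"}},
    ],
}

# Same path as in the schema: index 0, then "nested_field", then "en".
title = get_nested(data["title_translated"], 0, "nested_field", "en").upper()
print(title)  # EN TITLE
```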
- tsm_trim_string - Trims a string to a maximum length. Example that trims hello world to hello:
data = {"field_name": "hello world"}
schema = ...
"field_name": {
"validators": [
["tsm_trim_string", 5]
],
},
...
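In plain Python terms, the effect described for this example matches a simple slice. The trim_string function is a hypothetical illustration, assuming the argument is a maximum number of characters.

```python
def trim_string(value: str, max_length: int) -> str:
    """Keep at most max_length characters from the start of the string."""
    return value[:max_length]


print(trim_string("hello world", 5))  # hello
```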
- tsm_concat - Concatenates strings. Use $self to refer to the field value. Example:
data = {"id": "dataset-1"}
schema = ...
"package_url": {
"replace_from": "id",
"validators": [
[
"tsm_concat",
"https://site.url/dataset/",
"$self",
]
],
},
...
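A minimal sketch of the $self substitution, assuming each argument is joined in order and $self is swapped for the current field value. The concat function is hypothetical and only mirrors the behavior described for this example.

```python
from typing import Any


def concat(value: Any, *parts: Any) -> str:
    """Join parts into one string, replacing "$self" with the field value."""
    return "".join(str(value) if part == "$self" else str(part) for part in parts)


# package_url takes its value from "id" (replace_from), then concatenates.
print(concat("dataset-1", "https://site.url/dataset/", "$self"))
# https://site.url/dataset/dataset-1
```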
- tsm_unique_only - Preserves only unique values from a list. Works only with lists.
A default transmutator must receive at least one mandatory argument: the field object. Field has a few properties: field_name, value and type.
You can pass more arguments to a validator, as with tsm_get_nested. To do so, use a nested array whose first item is the transmutator name and whose remaining items are its arguments.
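The two forms of a validators entry (a plain name, or a nested array with arguments) can be sketched as a small dispatcher. Everything here is an assumption for illustration: the Field dataclass only mirrors the three properties named above, the REGISTRY dict and apply_validators are hypothetical, and returning the field from each function is a convention chosen for this sketch rather than the extension's confirmed contract.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Union


@dataclass
class Field:
    field_name: str
    value: Any
    type: str


def to_uppercase(field: Field) -> Field:
    field.value = field.value.upper()
    return field


def get_nested(field: Field, *path: Any) -> Field:
    for key in path:
        field.value = field.value[key]
    return field


# Hypothetical name-to-function lookup used only by this sketch.
REGISTRY: Dict[str, Callable[..., Field]] = {
    "tsm_to_uppercase": to_uppercase,
    "tsm_get_nested": get_nested,
}


def apply_validators(field: Field, validators: List[Union[str, list]]) -> Field:
    for entry in validators:
        if isinstance(entry, str):
            name, args = entry, []
        else:
            # Nested array: first item is the name, the rest are arguments.
            name, *args = entry
        field = REGISTRY[name](field, *args)
    return field


field = Field("title", [{"nested_field": {"en": "en title"}}], "Dataset")
result = apply_validators(
    field, [["tsm_get_nested", 0, "nested_field", "en"], "tsm_to_uppercase"]
)
print(result.value)  # EN TITLE
```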
- tsm_mapper - Maps the current value to another value using a mapping dict.
The current value serves as a key in the mapping dictionary, and the value stored under that key becomes the new value.
An optional default value can follow the mapping; it is used when the key is not found. If no default value is provided, the current value is kept as is.
data = {"language": "English"}
schema = ...
"language": {
"validators": [
[
"tsm_mapper",
{"English": "eng"},
"English"
]
]
},
...
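The lookup rule described for tsm_mapper can be sketched as follows. The mapper function and the _MISSING sentinel are hypothetical; the sentinel is only there so that "no default given" can be told apart from a default of None.

```python
from typing import Any, Dict

_MISSING = object()  # marks "no default was passed"


def mapper(value: Any, mapping: Dict[Any, Any], default: Any = _MISSING) -> Any:
    """Return mapping[value]; fall back to default, or to value itself."""
    if value in mapping:
        return mapping[value]
    return value if default is _MISSING else default


print(mapper("English", {"English": "eng"}, "English"))  # eng
print(mapper("French", {"English": "eng"}, "English"))   # English
print(mapper("French", {"English": "eng"}))              # French
```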
- tsm_list_mapper - Maps each item of the current list value using the mapping dict.
Works like tsm_mapper but with a list. It doesn't have a default value. The optional argument after the mapping, remove, must be True or False.
If remove is set to True, values without a corresponding mapping are removed from the list. Defaults to False.
Example without remove:
data = {"topic": ["Health", "Military", "Utilities"]}
schema = ...
"topic": {
"validators": [
[
"tsm_list_mapper",
{"Military": "Army", "Utilities": "Utility"}
]
]
},
...
The result here will be ["Health", "Army", "Utility"]
And here's an example with remove:
data = {"topic": ["Health", "Military", "Utilities"]}
schema = build_schema(
"topic": {
"validators": [
[
"tsm_list_mapper",
{"Military": "Army", "Utilities": "Utility"},
True
]
]
},
...
)
This results in ["Army", "Utility"]; Health is removed because it doesn't have a mapping.
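Both tsm_list_mapper examples can be reproduced with a short standalone function. The list_mapper name is hypothetical, and the sketch only follows the behavior described above.

```python
from typing import Any, Dict, List


def list_mapper(values: List[Any], mapping: Dict[Any, Any], remove: bool = False) -> List[Any]:
    """Map each item; unmapped items are kept unless remove is True."""
    result = []
    for item in values:
        if item in mapping:
            result.append(mapping[item])
        elif not remove:
            result.append(item)
    return result


topics = ["Health", "Military", "Utilities"]
mapping = {"Military": "Army", "Utilities": "Utility"}

print(list_mapper(topics, mapping))        # ['Health', 'Army', 'Utility']
print(list_mapper(topics, mapping, True))  # ['Army', 'Utility']
```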
Keywords
- map (str) - changes the field.name in the result dict.
- validators (list[str]) - a list of transmutators that will be applied to a field.value. A transmutator can be a string or a list where the first item must be the transmutator name and the others are arbitrary values. Example:
...
"validators": [
    ["tsm_get_nested", "nested_field", "en"],
    "tsm_to_uppercase",
],
...
There are two transmutators: tsm_get_nested and tsm_to_uppercase.
- multiple (bool, default: False) - if the field could have multiple items, e.g. the resources field in a dataset, mark it as multiple to transmute all the items successively.
...
"resources": {
    "type": "Resource",
    "multiple": True
},
...
- remove (bool, default: False) - removes a field from the result dict if True.
- default (Any) - the default value that will be used if the original field.value evaluates to False.
- default_from (str | list) - acts like default but accepts the field.name of a sibling field whose value should be used. A sibling field is a field located in the same type. The current implementation doesn't allow pointing to fields from other types. Takes a string that represents the field.name, or an array of strings to use multiple fields. See the inherit_mode keyword for details.
...
"metadata_modified": {
    "validators": ["tsm_isodate"],
    "default_from": "metadata_created",
},
...
- replace_from (str | list) - acts like default_from but replaces the original value whether it's empty or not.
- inherit_mode (str, default: combine) - defines the mode for default_from and replace_from. By default, values from all the fields are combined, but we could use just the first non-false value, in case a field might be empty.
- value (Any) - a value that will be used for a field. This keyword has the highest priority. Can be used to create a new field with an arbitrary value.
- update (bool, default: False) - if the original value is mutable (array, object), you can update it. You can only update field values of the same types.
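How value, replace_from, default_from and default relate to each other can be sketched roughly as below. The resolve_value function is hypothetical. It handles only a single sibling name (not a list, so inherit_mode is not modeled), reads sibling values straight from the input data, and the ordering of default_from before default is an assumption; the extension's actual evaluation order may differ.

```python
from typing import Any, Dict


def resolve_value(name: str, field_schema: Dict[str, Any], data: Dict[str, Any]) -> Any:
    """Pick a field value following the keyword descriptions above."""
    # "value" has the highest priority.
    if "value" in field_schema:
        return field_schema["value"]
    # replace_from overrides the original value whether it is empty or not.
    if "replace_from" in field_schema:
        return data.get(field_schema["replace_from"])
    current = data.get(name)
    # default_from is consulted only when the original value is falsy.
    if not current and "default_from" in field_schema:
        current = data.get(field_schema["default_from"])
    # default is the last fallback (this ordering is an assumption).
    if not current and "default" in field_schema:
        current = field_schema["default"]
    return current


data = {
    "metadata_created": "2022-02-03T15:54:26.359453",
    "metadata_modified": "",
    "metadata_reviewed": "2020-01-01",
}

print(resolve_value("metadata_modified", {"default_from": "metadata_created"}, data))
# 2022-02-03T15:54:26.359453
print(resolve_value("metadata_reviewed", {"replace_from": "metadata_created"}, data))
# 2022-02-03T15:54:26.359453
print(resolve_value("license", {"value": "cc-by"}, data))
# cc-by
```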
Installation
To install ckanext-transmute:
- Activate your CKAN virtual environment, for example:
. /usr/lib/ckan/default/bin/activate
- Clone the source and install it in the virtualenv:
git clone https://github.com/DataShades/ckanext-transmute.git
cd ckanext-transmute
pip install -e .
pip install -r requirements.txt
- Add transmute to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/ckan.ini).
- Restart CKAN. For example, if you've deployed CKAN with Apache on Ubuntu:
sudo service apache2 reload
Developer installation
To install ckanext-transmute for development, activate your CKAN virtualenv and do:
git clone https://github.com/DataShades/ckanext-transmute.git
cd ckanext-transmute
python setup.py develop
pip install -r dev-requirements.txt
Tests
I've used TDD to write this extension, so if you change something, make sure that all the tests still pass. To run the tests, do:
pytest --ckan-ini=test.ini
License