Keyediffer
Keyediffer is a Python library for diffing structured data, such as JSON or XML files, where the two files being diffed share some common structure.
Prerequisites
Visual Studio Code recommended
Ability to execute scripts from PowerShell. Run PowerShell as Administrator and execute:
Set-ExecutionPolicy -ExecutionPolicy Unrestricted
Installation
Follow the steps to create a virtual environment
Use the package manager pip to install Keyediffer from PyPI.
pip install cdisc-library-keyediffer
Add --upgrade to upgrade an existing installation to the newest available version.
Development
Create a virtual environment (venv):
py -m venv "[path_to_virtual_environment]"
Add the following line to the file "[path_to_virtual_environment]\Scripts\Activate.ps1":
$Env:PYTHONPATH += ";$($pwd.Path)"
Switch to the virtual environment
& "[path_to_virtual_environment]\Scripts\Activate.ps1"
Use the package manager pip to install Keyediffer's requirements.
pip install -r requirements.txt
Follow the steps in this tutorial to execute the tests.
Or run the command python -m unittest from the project root.
Usage
Step 1 - Create a schema template
The first step takes as input the files to be diffed, which should share a similar structure.
from keyediffer.utils.excel_utils import save_xlsx
from keyediffer.utils.json_utils import (get_json, save_json)
from keyediffer.json_differ import (create_schema_template, json_diff)
versions = (
    get_json('https://example.com/version1.json'),
    get_json('version2.json'),
    get_json('version3.json'),
)
save_json(create_schema_template(versions), 'schema.json')
It creates a pruned schema subset. Note the following special names:
- $schema - JSON Schema identifier
- properties - Child properties within a dict/map/element/object structure
- items - Child items within a list/array/tuple structure
- key - For each list/array/tuple in the structure where the child items are dicts/maps/elements/objects, there is an empty key identifier.
The following properties are removed from the schema since they aren't used by the diff tool: required, type
{
  "$schema": "https://library.cdisc.org/api/mdr/schema",
  "properties": {
    "_links": {
      "properties": {
        ...
      }
    },
    "classes": {
      "items": {
        "key": [{"": {}}],
        "properties": {
          "_links": {
            "properties": {
              ...
              "subclasses": {
                "items": {
                  "key": [{"": {}}],
                  "properties": {
                    "href": {},
                    "title": {},
                    "type": {}
                  }
                }
              }
            }
          },
          "datasets": {
            "items": {
              "key": [{"": {}}],
              "properties": {
                "_links": {
                  "properties": {
                    ...
                  }
                },
                "datasetStructure": {},
                "datasetVariables": {
                  "items": {
                    "key": [{"": {}}],
                    "properties": {
                      "_links": {
                        "properties": {
                          "codelist": {
                            "items": {
                              "key": [{"": {}}],
                              "properties": {
                                "href": {},
                                "title": {},
                                "type": {}
                              }
                            }
                          },
                          ...
                        }
                      },
                      "core": {},
                      "describedValueDomain": {},
                      "description": {},
                      "label": {},
                      "name": {},
                      "ordinal": {},
                      "role": {},
                      "simpleDatatype": {},
                      "valueList": {
                        "items": {}
                      }
                    }
                  }
                },
                "description": {},
                "label": {},
                "name": {},
                "ordinal": {}
              }
            }
          },
          "description": {},
          "label": {},
          "name": {},
          "ordinal": {}
        }
      }
    },
    "description": {},
    "effectiveDate": {},
    "label": {},
    "name": {},
    "registrationStatus": {},
    "source": {},
    "version": {}
  }
}
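To make the pruning rules concrete, here is a minimal sketch that starts from a full JSON Schema and produces a template of the shape shown in this step. It is not Keyediffer's implementation: prune_schema and the sample schema are hypothetical, and the sketch assumes a list of objects can be recognised by its items having properties.

```python
def prune_schema(node):
    """Keep only the structural parts of a JSON-Schema-like dict."""
    pruned = {}
    if "$schema" in node:
        pruned["$schema"] = node["$schema"]
    if "properties" in node:
        # Recurse into child properties; keywords such as "type" and
        # "required" on the node itself are simply not copied over.
        pruned["properties"] = {
            name: prune_schema(child)
            for name, child in node["properties"].items()
        }
    if "items" in node:
        items = prune_schema(node["items"])
        if "properties" in items:
            # List of objects: add an empty key placeholder to fill in later.
            items = {"key": [{"": {}}], **items}
        pruned["items"] = items
    return pruned


# Hypothetical input schema, used only for illustration.
full_schema = {
    "$schema": "https://example.com/schema",
    "type": "object",
    "required": ["name"],
    "properties": {
        "name": {"type": "string"},
        "classes": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name"],
                "properties": {"name": {"type": "string"}},
            },
        },
        "valueList": {"type": "array", "items": {"type": "string"}},
    },
}

template = prune_schema(full_schema)
```

Lists of plain values, such as valueList, keep an empty items entry and get no key placeholder, because there is nothing to match objects on.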
Step 2 - Fill in additional attributes in the schema template
The following properties can be created and populated:
- doc_id - Name of the property that will uniquely identify this document across document versions.
- key - A list of key names that should correspond to an attribute/property/key that will uniquely identify the object within the parent list. The key will be used for identifying and comparing common objects across versions. If objects cannot be matched on the first key in the list, it will try to match on the next key in the list.
- exclusions - List of child property names that should be excluded from comparison in the "basic" diff output.
- alias - Renames properties in the "basic" diff output.
{
  "$schema": "https://library.cdisc.org/api/mdr/schema",
  "exclusions": ["_links", "name", "version", "effectiveDate", "label"],
  "doc_id": "name",
  "properties": {
    "_links": {
      "properties": {
        ...
      }
    },
    "classes": {
      "alias": "Class",
      "items": {
        "key": [{"name": {}}],
        "exclusions": ["_links", "ordinal"],
        "properties": {
          "_links": {
            "properties": {
              ...
              "subclasses": {
                "items": {
                  "key": [{"title": {}}],
                  "properties": {
                    "href": {},
                    "title": {},
                    "type": {}
                  }
                }
              }
            }
          },
          "datasets": {
            "alias": "Dataset",
            "items": {
              "key": [{"name": {}}],
              "exclusions": ["_links", "ordinal"],
              "properties": {
                "_links": {
                  "properties": {
                    ...
                  }
                },
                "datasetStructure": {},
                "datasetVariables": {
                  "alias": "Variable",
                  "items": {
                    "key": [{"name": {}}],
                    "exclusions": ["_links", "ordinal"],
                    "properties": {
                      "_links": {
                        "properties": {
                          "codelist": {
                            "items": {
                              "key": [{"href": {}}],
                              "properties": {
                                "href": {},
                                "title": {},
                                "type": {}
                              }
                            }
                          },
                          ...
                        }
                      },
                      "core": {
                        "alias": "Core"
                      },
                      "describedValueDomain": {},
                      "description": {
                        "alias": "CDISC Notes"
                      },
                      "label": {},
                      "name": {},
                      "ordinal": {},
                      "role": {},
                      "simpleDatatype": {},
                      "valueList": {
                        "items": {}
                      }
                    }
                  }
                },
                "description": {
                  "alias": "CDISC Notes"
                },
                "label": {},
                "name": {
                  "alias": "Variable Name"
                },
                "ordinal": {}
              }
            }
          },
          "description": {
            "alias": "CDISC Notes"
          },
          "label": {},
          "name": {
            "alias": "Dataset Name"
          },
          "ordinal": {}
        }
      }
    },
    "description": {
      "alias": "CDISC Notes"
    },
    "effectiveDate": {},
    "label": {},
    "name": {
      "alias": "Class"
    },
    "registrationStatus": {},
    "source": {},
    "version": {}
  }
}
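The key fallback described in this step can be sketched as follows. This is an illustration only, not the library's code: match_items is a hypothetical helper, it accepts the key list in the same shape as the schema (a list of single-entry dicts), and it assumes key values are hashable, such as strings.

```python
def match_items(old_items, new_items, key_spec):
    """Pair objects from two lists, trying each key name in order."""
    key_names = [name for entry in key_spec for name in entry]
    matched = []
    unmatched_old, unmatched_new = list(old_items), list(new_items)
    for key in key_names:
        # Index the still-unmatched new items by their value for this key.
        by_value = {}
        for item in unmatched_new:
            if key in item:
                by_value.setdefault(item[key], item)
        remaining_old = []
        for item in unmatched_old:
            partner = by_value.pop(item[key], None) if key in item else None
            if partner is None:
                remaining_old.append(item)  # retry with the next key
            else:
                matched.append((item, partner))
        paired = {id(new) for _, new in matched}
        unmatched_new = [item for item in unmatched_new if id(item) not in paired]
        unmatched_old = remaining_old
    return matched, unmatched_old, unmatched_new


# Hypothetical sample data: the second object has no "name", so it can
# only be matched on the fallback key "title".
previous = [{"name": "AE", "title": "Adverse Events"}, {"title": "Demographics"}]
updated = [{"title": "Demographics"}, {"name": "AE", "title": "Adverse Event"}, {"name": "CM"}]

matched, dropped, added = match_items(
    previous, updated, [{"name": {}}, {"title": {}}]
)
```

Objects left over in the old list after all keys have been tried would be reported as drops, and leftovers in the new list as adds.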
Step 3 - Generate the diff results
save_json(
    json_diff(
        get_json('schema.json'),
        get_json('version2.json'),
        get_json('version3.json'),
    ),
    'diff.json',
)
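As a rough picture of what a diff has to decide for each pair of matched objects, the sketch below compares two dicts property by property, skips excluded names, applies aliases, and labels every difference with one of the Action values listed under Output Columns. The function names, parameters, and tuple layout are hypothetical and do not describe json_diff's actual internals or output format.

```python
_MISSING = object()  # sentinel for "property not present"


def classify(old, new):
    """Return an Action label for one property, or None if unchanged."""
    if old is _MISSING:
        return "Add"
    if new is _MISSING:
        return "Drop"
    if type(old) is not type(new):
        return "Type Update"
    if old != new:
        return "Value Update"
    return None


def diff_properties(old, new, exclusions=(), aliases=None):
    """List (display name, action) pairs for two matched objects."""
    aliases = aliases or {}
    rows = []
    for name in sorted(set(old) | set(new)):
        if name in exclusions:
            continue  # excluded properties are not compared
        action = classify(old.get(name, _MISSING), new.get(name, _MISSING))
        if action is not None:
            rows.append((aliases.get(name, name), action))
    return rows


# Hypothetical sample objects.
old_variable = {"name": "AESEQ", "label": "Sequence Number", "ordinal": "1", "core": "Req"}
new_variable = {"name": "AESEQ", "label": "Sequence Num", "ordinal": 1, "role": "Identifier"}

rows = diff_properties(
    old_variable, new_variable, exclusions=["ordinal"], aliases={"core": "Core"}
)
```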
Step 4 - Save to format of choice
For example, to convert the diff results to Excel:
save_xlsx(get_json('diff.json'), 'diff.xlsx')
Output Columns
Updated Version
Document ID of new version
Previous Version
Document ID of old version
Action
- Type Update
- Drop
- Add
- Value Update
Impact
Leaf node name in the path. Convenient for filtering.
Change Level
Only applies to rows whose Action is 'Value Update'.
- Minor - Non-alphanumeric changes or case changes only
- Major - Alphanumeric changes (other than case)
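One way to express the Minor/Major rule is to compare the two values after dropping every non-alphanumeric character and ignoring case. This is a sketch of that reading of the rule, with a hypothetical function name, and not necessarily how Keyediffer computes it.

```python
def change_level(previous, updated):
    """Classify a string change as 'Minor' or 'Major' (None if unchanged)."""

    def normalize(text):
        # Keep letters and digits only, then ignore case.
        return "".join(ch for ch in text if ch.isalnum()).casefold()

    if previous == updated:
        return None
    return "Minor" if normalize(previous) == normalize(updated) else "Major"


punctuation_only = change_level("Adverse Events", "adverse-events")
wording_changed = change_level("Adverse Events", "Adverse Event")
```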
<Filter Container*>
These columns are dynamic and are generated for each list's name in the structure.
Attribute (updated)
New objects / values. Value-level adds and updates are highlighted.
Attribute (previous)
Old objects / values. Value-level drops and updates are highlighted.
Attribute Path
JSONPath style path that contains filters. These can be used to identify the exact object.
Location Path
JSONPath style path without filters. Convenient for filtering.
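The relationship between the two path columns can be illustrated with a small sketch that builds both from the same list of segments. The exact path syntax Keyediffer emits is not shown in this document, so the filter notation, the [*] placeholder, and the build_paths helper are all assumptions made for illustration.

```python
def build_paths(segments):
    """Build (attribute_path, location_path) from (name, key_filter) pairs.

    key_filter is None for plain properties, or a (key, value) tuple for
    an object selected from a list by its key.
    """
    attribute_path, location_path = "$", "$"
    for name, key_filter in segments:
        attribute_path += f".{name}"
        location_path += f".{name}"
        if key_filter is not None:
            key, value = key_filter
            attribute_path += f"[?(@.{key}=='{value}')]"  # identifies one object
            location_path += "[*]"  # same location, any object
    return attribute_path, location_path


attribute_path, location_path = build_paths(
    [
        ("classes", ("name", "Events")),
        ("datasets", ("name", "AE")),
        ("label", None),
    ]
)
```

Because the location path is identical for every object at the same position in the structure, it groups related changes when filtering, while the attribute path pins down a single object.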
Value (updated)
Only applies to rows whose Action is 'Value Update'. Array containing the list of adds and updates for string-level diffs.
Value (previous)
Only applies to rows whose Action is 'Value Update'. Array containing the list of drops and updates for string-level diffs.
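A string-level diff of this kind can be sketched with difflib from the Python standard library. The word-level tokenisation and the string_level_diff helper are assumptions; the library may split and report changes differently.

```python
from difflib import SequenceMatcher


def string_level_diff(previous, updated):
    """Return (updated_changes, previous_changes) for two strings.

    updated_changes holds adds and updates (text from the new value);
    previous_changes holds drops and updates (text from the old value).
    """
    old_words, new_words = previous.split(), updated.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    updated_changes, previous_changes = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            updated_changes.append(" ".join(new_words[j1:j2]))
        if tag in ("delete", "replace"):
            previous_changes.append(" ".join(old_words[i1:i2]))
    return updated_changes, previous_changes


value_updated, value_previous = string_level_diff(
    "Sequence number given to ensure uniqueness",
    "Sequence number assigned to ensure uniqueness within a dataset",
)
```

A replaced span shows up on both sides, while a pure insertion appears only in the updated list and a pure deletion only in the previous list.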