Create consistent and comparable fingerprints with secure hashes from unordered JSON data
Project description
json-fingerprint
Create consistent and comparable fingerprints with secure hashes from unordered JSON data.
A JSON fingerprint consists of three parts: the version of the underlying canonicalization algorithm, the hash function used and a hexadecimal digest of the hash function output. A complete example could look like this: jfpv1$sha256$5815eb0ce6f4e5ab0a771cce2a8c5432f64222f8fd84b4cc2d38e4621fae86af
.
Fingerprint element | Description |
---|---|
jfpv1 | JSON fingerprint version identifier: json fingerprint version 1 |
sha256 | Hash function identifier (sha256, sha384 or sha512) |
5815eb0c...1fae86af | The secure hash function output in hexadecimal format |
Table of Contents
v1 release checklist (jfpv1)
This is a list of high-level development and documentation tasks, which need to be completed prior to freezing the API for v1. Before v1, backwards-incompatible changes to the API are possible, although not likely from v0.10.0 onwards. Since the jfpv1 spec is work in progress, the fingerprints may not be fully comparable between different 0.y.z versions.
- Formalized the jfpv1 specification
- JSON type support
- Primitives and literals
- Arrays
- Objects
- Flattened "sibling-aware" internal data structure
- Support nested JSON data structures with mixed types
- Support most common SHA-2 hash functions
- SHA256
- SHA384
- SHA512
- Dynamic jfpv1 fingerprint comparison function (JSON string against a fingerprint)
- Performance characteristics that scale sufficiently
- Extensive verification against potential fingerprint (hash) collisions
Installation
To install the json-fingerprint package, run pip install json-fingerprint
.
Examples
The complete working examples below show how to use all core features of the json_fingerprint
package.
Create JSON fingerprints
JSON fingerprints can be created with the create()
function, which requires three arguments: input (valid JSON string), hash function (SHA256, SHA384 and SHA512 are supported) and JSON fingerprint version (1).
import json
import json_fingerprint
input_1 = json.dumps([3, 2, 1, [True, False], {'foo': 'bar'}])
input_2 = json.dumps([2, {'foo': 'bar'}, 1, [False, True], 3]) # Different order
fp_1 = json_fingerprint.create(input=input_1, hash_function='sha256', version=1)
fp_2 = json_fingerprint.create(input=input_2, hash_function='sha256', version=1)
print(f'Fingerpr. 1: {fp_1}')
print(f'Fingerpr. 2: {fp_2}')
This will output two identical fingerprints regardless of the different order of the json elements:
Fingerpr. 1: jfpv1$sha256$164e2e93056b7a0e4ace25b3c9aed9cf061f9a23c48c3d88a655819ac452b83a
Fingerpr. 2: jfpv1$sha256$164e2e93056b7a0e4ace25b3c9aed9cf061f9a23c48c3d88a655819ac452b83a
Since JSON objects with identical data content and structure will always produce identical fingerprints, the fingerprints can be used effectively for various purposes. These include finding duplicate JSON data from a larger dataset, JSON data cache validation/invalidation and data integrity checking.
Decode JSON fingerprints
JSON fingerprints can be decoded with the decode()
convenience function. It returns the version, hash function and secure hash in a tuple.
import json_fingerprint
fp = 'jfpv1$sha256$164e2e93056b7a0e4ace25b3c9aed9cf061f9a23c48c3d88a655819ac452b83a'
version, hash_function, hash = json_fingerprint.decode(fingerprint=fp)
print(f'Version (integer): {version}')
print(f'Hash function: {hash_function}')
print(f'Secure hash: {hash}')
This will output the individual elements that make up a fingerprint as follows:
Version (integer): 1
Hash function: sha256
Secure hash: 164e2e93056b7a0e4ace25b3c9aed9cf061f9a23c48c3d88a655819ac452b83a
Match fingerprints
The match()
is another convenience function that matches JSON data against a fingerprint, and returns either True
or False
depending on whether the data matches the fingerprint or not. Internally, it will automatically choose the correct version and hash function based on the target_fingerprint
argument.
import json
import json_fingerprint
input_1 = json.dumps([3, 2, 1, [True, False], {'foo': 'bar'}])
input_2 = json.dumps([3, 2, 1])
target_fp = 'jfpv1$sha256$164e2e93056b7a0e4ace25b3c9aed9cf061f9a23c48c3d88a655819ac452b83a'
match_1 = json_fingerprint.match(input=input_1, target_fingerprint=target_fp)
match_2 = json_fingerprint.match(input=input_2, target_fingerprint=target_fp)
print(f'Fingerprint matches with input_1: {match_1}')
print(f'Fingerprint matches with input_2: {match_2}')
This will output the following:
Fingerprint matches with input_1: True
Fingerprint matches with input_2: False
Find matches in fingerprint lists
The find_matches()
function takes a JSON string and a list of JSON fingerprints as input. It creates a fingerprint of the JSON input string of each different variant in the target list, and looks for matches in the fingerprint list. It can optionally also deduplicate the fingerprint input list, and results list.
import json
import json_fingerprint
# Produces SHA256: jfpv1$sha256$e69b883d...776bea81
# Produces SHA384: jfpv1$sha384$a07e46e3...3e7fa666
input = json.dumps({'foo': 'bar'})
fingerprints = [
# SHA256 match
'jfpv1$sha256$e69b883d4c554035eea01e8817285659f64f8345a12768cc2782fe29776bea81',
# SHA256 match (duplicate)
'jfpv1$sha256$e69b883d4c554035eea01e8817285659f64f8345a12768cc2782fe29776bea81',
# SHA384 match
('jfpv1$sha384$a07e4e7a13224f7bd1b80616f8874bda3fb4d18c52f5643fb1c9d5a7665a1d9'
'69412bb9bcc7c6e30cedca4953e7fa666'),
# Not a match
'jfpv1$sha256$73f7bb145f268c033ec22a0b74296cdbab1405415a3d64a1c79223aa9a9f7643',
]
matches = json_fingerprint.find_matches(input=input, fingerprints=fingerprints)
# Print raw matches, which include 2 same SHA256 fingerprints
print(*(f'\nMatch: {match[0:30]}...' for match in matches))
deduplicated_matches = json_fingerprint.find_matches(input=input,
fingerprints=fingerprints,
deduplicate=True)
# Print deduplicated matches
print(*(f'\nDeduplicated match: {match[0:30]}...' for match in deduplicated_matches))
This will output the following results, first the list with a duplicate and the latter with deduplicated results:
Match: jfpv1$sha256$e69b883d4c554035e...
Match: jfpv1$sha256$e69b883d4c554035e...
Match: jfpv1$sha384$a07e4e7a13224f7bd...
Deduplicated match: jfpv1$sha384$a07e4e7a13224f7bd...
Deduplicated match: jfpv1$sha256$e69b883d4c554035e...
JSON normalization
The jfpv1 JSON fingerprint function transforms the data internally into a normalized (canonical) format before hashing the output.
Alternative specifications
Most existing JSON normalization/canonicalization specifications and related implementations operate on three key aspects: data structures, values and data ordering. While the ordering of key-value pairs (objects) is straightforward, issues usually arise from the ordering of arrays.
The JSON specifications, including the most recent RFC 8259, have always considered the order of array elements to be meaningful. As data gets serialized, transferred, deserialized and serialized again throughout various systems, maintaining the order of array elements becomes impractical if not impossible in many cases. As a consequence, this makes the creation and comparison of secure hashes of JSON data across multiple systems a complex process.
JSON Fingerprint v1 (jfpv1)
The jfpv1 specification takes a more value-oriented approach toward JSON normalization and secure hash creation: values and value-related metadata bear most significance when JSON data gets normalized into the jfpv1 format. The original JSON data gets transformed into a flattened list of small objects, which are then hashed and sorted, and ultimately hashed again as a whole.
In practice, the jfpv1 specification purposefully ignores the original order of data elements in an array. The jfpv1 specification focuses instead on verifying that the following aspects of JSON datasets being compared match:
- All values in the compared datasets are identical
- The values exist in identical paths (arrays, object key-value pairs)
In the case of arrays, each array gets a unique hash identifier based on the data elements it holds. This way, each flattened value "knows" to which array it belongs to. This identifier is called a sibling hash because its derived from each value and its neighboring values.
Running tests
The entire internal test suite of json-fingerprint is included in its distribution package. If you wish to run the internal test suite, install the package and run the following command:
python -m json_fingerprint.tests.run
If all tests ran successfully, this will produce an output similar to the following:
..........................
----------------------------------------------------------------------
Ran 26 tests in 0.009s
OK
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for json_fingerprint-0.12.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f291107733fdbc7436a475234f61182d7830bf884098285376ca7789322103eb |
|
MD5 | 8d0e3a26e0d72756b686848364a57968 |
|
BLAKE2b-256 | 738c88e11e9e9a4d87bb3f239a1541e121cb1cb6d3a01c9244c5adba48b6a9c3 |