jsonid a json identification tool
JSONID
JSONIDentification tool and ruleset. JSONID can be downloaded from pypi.org.
Contents
- Introduction
- Why?
- What does JSONID get you?
- Ruleset
- Registry
- PRONOM
- Output format
- Sample files
- Analysis
- Utils
- Docs
- Developer install
- Packaging
Introduction
JSONID borrows Python's "easier to ask forgiveness than permission" (EAFP) approach: it attempts to open every object it scans to see if it parses as JSON. If it doesn't, we move along. If it does, we then have an opportunity to identify the characteristics of the JSON we have opened.
Python being high-level also provides an easier path to processing files and parsing JSON quickly with very little other knowledge required of the underlying data structure.
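The EAFP idea can be sketched in a few lines of standard-library Python. This is only an illustration and not JSONID's actual code; the function name `try_parse_json` is hypothetical.

```python
import json


def try_parse_json(data: bytes):
    """Return (True, document) if the bytes parse as JSON, else (False, None).

    A tuple is used because a valid JSON document can itself be null (None).
    """
    try:
        # json.loads accepts bytes and works out UTF-8/16/32 itself.
        return True, json.loads(data)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueErrors.
        return False, None


print(try_parse_json(b'{"key 1": "value"}'))
print(try_parse_json(b"\x89PNG\r\n\x1a\n"))
```

Anything that fails to parse is simply skipped; anything that parses is handed on for closer inspection.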
Why?
Consider these equivalent forms:
{
"key 1": "value",
"key 2": "value"
}
{
"key 2": "value",
"key 1": "value"
}
PRONOM signatures are not expressive enough for complicated JSON objects.
If I want DROID to find key 1 I have to use a wildcard, so I would write
something like:
BOF: "7B*226B6579203122"
EOF: "7D"
But if I then want to match on key 2 as well as key 1 things start getting
complicated as they aren't guaranteed by the JSON specification to be in the
same "position" (if we think about order visually). When other keys are used in
the object they aren't even guaranteed to be next to one another.
This particular example is a 'map' object whose most important property is consistent retrieval of information through its "keys". Further complexity can be added when we are dealing with maps embedded in a "list" or "array", or simply just maps of arbitrary depth.
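A short Python sketch makes the point concrete: the two forms differ as byte streams but are the same map once parsed. The variable names are illustrative only.

```python
import json

form_a = b'{"key 1": "value", "key 2": "value"}'
form_b = b'{"key 2": "value", "key 1": "value"}'

# The byte streams differ, so a byte-offset signature sees two different files.
print(form_a == form_b)  # False

# Parsed as JSON, both are the same map.
print(json.loads(form_a) == json.loads(form_b))  # True

# The hex in the BOF sequence above is just the quoted key name in ASCII.
print(b'"key 1"'.hex().upper())  # 226B6579203122
```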
JSONID tries to compensate for JSON's complexities by using the format's own strengths: it parses binary data as JSON and then, if that is successful, uses a JSON-inspired grammar to describe keys and key-value pairs as "markers" that can potentially identify the JSON objects we are looking at, or at least narrow down the potential instances of JSON objects that we might be looking at.
What does JSONID get you?
To begin, JSONID should identify JSON files on your system as JSON. That's already a pretty good position to be in.
The ruleset should then allow you to identify a decent number of JSON objects, especially those that have a well-defined structure. Examples we have in the registry data include things like ActivityPub streams, RO-CRATE metadata, IIIF API data and so on.
If the ruleset works for JSON, we might in future be able to apply it to other formats that can represent equivalent data structures, such as YAML and TOML.
Ruleset
JSONID currently defines a small set of rules that help us to identify JSON documents.
The rules are described in their own data-structures. The structures are processed as a list (they need not necessarily be in order) and each must match for a given set of rules to determine what kind of JSON document we might be looking at.
JSONID can identify the existence of information but you can also use wildcards and provide some negation as required, e.g. to remove false-positives between similar JSON entities.
| rule | meaning |
|---|---|
| INDEX | index (from which to read when structure is an array) |
| GOTO | goto key (read key at given key) |
| KEY | key to read |
| CONTAINS | value contains string |
| STARTSWITH | value startswith string |
| ENDSWITH | value endswith string |
| IS | value matches exactly |
| REGEX | value matches a regex pattern |
| EXISTS | key exists |
| NOEXIST | key doesn't exist |
| ISTYPE | key is a specific type (string, number, dict, array) |
Stored in a list within a RegistryEntry object, they are then processed
in order.
For example:
[
{ "KEY": "name", "IS": "value" },
{ "KEY": "schema", "CONTAINS": "/schema/version/1.1/" },
{ "KEY": "data", "IS": { "more": "data" } },
]
All rules need to match for a positive ID.
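To show how such a list could be evaluated, here is a minimal sketch covering only `KEY` combined with `IS`, `CONTAINS`, `EXISTS` and `NOEXIST` on a top-level map. It is not JSONID's implementation; the function names `rule_matches` and `identify` and the sample document are assumptions made for illustration.

```python
from typing import Any

_MISSING = object()  # sentinel so a stored null still counts as "exists"


def rule_matches(doc: Any, rule: dict) -> bool:
    """Check one rule against the top-level map of a parsed JSON document."""
    if not isinstance(doc, dict):
        return False
    value = doc.get(rule["KEY"], _MISSING)
    if "NOEXIST" in rule:
        return value is _MISSING
    if value is _MISSING:
        return False
    if "EXISTS" in rule:
        return True
    if "IS" in rule:
        return value == rule["IS"]
    if "CONTAINS" in rule:
        return isinstance(value, str) and rule["CONTAINS"] in value
    return False  # rule types not covered by this sketch


def identify(doc: Any, rules: list) -> bool:
    """All rules must match for a positive identification."""
    return all(rule_matches(doc, rule) for rule in rules)


rules = [
    {"KEY": "name", "IS": "value"},
    {"KEY": "schema", "CONTAINS": "/schema/version/1.1/"},
    {"KEY": "data", "IS": {"more": "data"}},
]
doc = {
    "name": "value",
    "schema": "https://example.com/schema/version/1.1/schema.json",
    "data": {"more": "data"},
}
print(identify(doc, rules))                # True
print(identify({"name": "value"}, rules))  # False
```

Because every rule has to match, a document satisfying only the first rule is rejected.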
NB.: JSONID is a work-in-progress and requires community input to help determine the grammar in its fullness, so there is a lot of opportunity to add to or remove from these methods as its development continues. Additionally, help formalizing the grammar/ruleset would be greatly appreciated 🙏.
Backed by testing
The ruleset has been developed using test-driven-development practices (TDD) and the current set of tests can be reviewed in the repository's test folder. More tests should be added, in general, and over time.
Registry
A temporary "registry" module is used to store JSON markers. The registry is a work in progress and must be exported and rewritten somewhere more centralized (and easier to manage) if JSONID can prove useful to the communities that might use it (see notes on PRONOM below).
The registry web-page is here:
The registry's source is here:
Registry examples
Identifying JSON-LD Generic
RegistryEntry(
identifier="id0009",
name=[{"@en": "JSON-LD (generic)"}],
markers=[
{"KEY": "@context", "EXISTS": None},
{"KEY": "id", "EXISTS": None},
],
),
Pseudo code: Test for the existence of keys: @context and id in the primary JSON object.
Identifying Tika Recursive Metadata
RegistryEntry(
identifier="id0024",
name=[{"@en": "tika recursive metadata"}],
markers=[
{"INDEX": 0, "KEY": "Content-Length", "EXISTS": None},
{"INDEX": 0, "KEY": "Content-Type", "EXISTS": None},
{"INDEX": 0, "KEY": "X-TIKA:Parsed-By", "EXISTS": None},
{"INDEX": 0, "KEY": "X-TIKA:parse_time_millis", "EXISTS": None},
],
),
Pseudo code: Test for the existence of keys: Content-Length, Content-Type, X-TIKA:Parsed-By and X-TIKA:parse_time_millis in the zeroth (first) JSON object where the primary document is a list of JSON objects.
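The `INDEX` behaviour described in this pseudo code could be sketched as follows. The helper `index_key_exists` and the sample document are hypothetical and only illustrate the idea; JSONID's own handling may differ.

```python
from typing import Any


def index_key_exists(doc: Any, index: int, key: str) -> bool:
    """True if doc is a list whose item at `index` is a map containing `key`."""
    if not isinstance(doc, list):
        return False
    try:
        item = doc[index]
    except IndexError:
        return False
    return isinstance(item, dict) and key in item


markers = [
    {"INDEX": 0, "KEY": "Content-Length", "EXISTS": None},
    {"INDEX": 0, "KEY": "Content-Type", "EXISTS": None},
    {"INDEX": 0, "KEY": "X-TIKA:Parsed-By", "EXISTS": None},
    {"INDEX": 0, "KEY": "X-TIKA:parse_time_millis", "EXISTS": None},
]

# Made-up sample shaped like the document the markers describe.
tika_like = [
    {
        "Content-Length": "0",
        "Content-Type": "text/plain",
        "X-TIKA:Parsed-By": [],
        "X-TIKA:parse_time_millis": "0",
    }
]

print(all(index_key_exists(tika_like, m["INDEX"], m["KEY"]) for m in markers))
print(all(index_key_exists({"Content-Length": "0"}, m["INDEX"], m["KEY"]) for m in markers))
```

The first check succeeds because the primary document is a list; the second fails because it is a plain map.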
Identifying SOPS encrypted secrets file
RegistryEntry(
identifier="id0012",
name=[{"@en": "sops encrypted secrets file"}],
markers=[
{"KEY": "sops", "EXISTS": None},
{"GOTO": "sops", "KEY": "kms", "EXISTS": None},
{"GOTO": "sops", "KEY": "pgp", "EXISTS": None},
],
),
Pseudo code: Test for the existence of the key sops in the primary JSON object.
Goto the sops key and test for the existence of keys: kms and pgp within the sops object/value.
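A comparable sketch for `GOTO`, assuming it means "step into the value stored at this top-level key before testing". The helper name `goto_key_exists` and the sample document are illustrative assumptions, not JSONID's API.

```python
from typing import Any


def goto_key_exists(doc: Any, goto: str, key: str) -> bool:
    """True if doc[goto] is a map that contains `key`."""
    if not isinstance(doc, dict):
        return False
    target = doc.get(goto)
    return isinstance(target, dict) and key in target


# Made-up sample shaped like the document the markers describe.
sops_like = {
    "data": "placeholder",
    "sops": {"kms": [], "pgp": []},
}

checks = [
    "sops" in sops_like,
    goto_key_exists(sops_like, "sops", "kms"),
    goto_key_exists(sops_like, "sops", "pgp"),
]
print(all(checks))  # True
print(goto_key_exists({"sops": "not a map"}, "sops", "kms"))  # False
```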
Local rules
The plan is to allow local rules to be run alongside the global ruleset. I expect this will be a bit further down the line when the ruleset and metadata are more stable.
PRONOM
Ideally JSONID can generate evidence enough to warrant the creation of PRONOM IDs that can then be referenced in the JSONID output.
Eventually, PRONOM or a PRONOM-like tool might host an authoritative version of the JSONID registry.
Output format
For ease of development, the utility currently outputs YAML. The structure
is still very fluid, and will also vary depending on the desired level of
detail in the registry, e.g. there isn't currently a lot of information about
the contents beyond a basic title and identifier.
E.g.:
---
jsonid: 0.0.0
scandate: 2025-04-21T18:40:48Z
---
file: integration_files/plain.json
additional:
- '@en': data is dict type
depth: 1
documentation:
- archive_team: http://fileformats.archiveteam.org/wiki/JSON
identifiers:
- rfc: https://datatracker.ietf.org/doc/html/rfc8259
- pronom: http://www.nationalarchives.gov.uk/PRONOM/fmt/817
- loc: https://www.loc.gov/preservation/digital/formats/fdd/fdd000381.shtml
- wikidata: https://www.wikidata.org/entity/Q2063
mime:
- application/json
name:
- '@en': JavaScript Object Notation (JSON)
---
The structure should become more concrete as JSONID is formalized.
Sample files
Integration files
Files used in the development of JSONID are available in their own repository.
Fundamental examples
There is a small samples directory included with this repository demonstrating some fundamental differences in encoding and JSON types.
Analysis
JSONID provides an analysis mechanism to help developers of identifiers. It might also help users talk about interesting properties of the objects being analysed, and provide consistent fingerprinting for data that has different byte-alignment but is otherwise identical.
NB.: Comments on existing statistics or ideas for new ones are appreciated.
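The idea of a fingerprint that ignores byte-level differences can be illustrated with a deliberately simple stand-in: re-serialize the parsed document in a fixed form and hash that. This is an assumption-laden sketch using SHA-256 over sorted, compact JSON; it is not the fingerprinting JSONID itself performs, and the function name `canonical_fingerprint` is hypothetical.

```python
import hashlib
import json


def canonical_fingerprint(data: bytes) -> str:
    """Hash a canonical re-serialization rather than the raw bytes."""
    doc = json.loads(data)
    canonical = json.dumps(
        doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same data, different whitespace, key order and encoding.
compact = b'{"a":"b","c":[1,2]}'
spaced = '{\n  "c": [1, 2],\n  "a": "b"\n}'.encode("utf-16")

print(compact == spaced)  # False
print(canonical_fingerprint(compact) == canonical_fingerprint(spaced))  # True
```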
Example analysis
{
"content_length": 329,
"number_of_lines": 32,
"line_warning": false,
"top_level_keys_count": 4,
"top_level_keys": [
"key1",
"key2",
"key3",
"key4"
],
"top_level_types": [
"list",
"map",
"list",
"list"
],
"depth": 8,
"heterogeneous_list_types": true,
"fingerprint": {
"unf": "UNF:6:sAsKNmjOtnpJtXi3Q6jVrQ==",
"cid": "bafkreibho6naw5r7j23gxu6rzocrud4pc6fjsnteyjveirtnbs3uxemv2u"
},
"encoding": "UTF-8"
}
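A few of the statistics above are easy to approximate with the standard library. The sketch below is a guess at what they might mean, not JSONID's analysis code: `content_length` is assumed to be the byte length, `depth` is assumed to count nested containers, and the type labels other than "list" and "map" are invented for the example.

```python
import json
from typing import Any


def type_name(value: Any) -> str:
    """Label a parsed JSON value; only 'list' and 'map' appear in the example."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "list"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "null"


def depth(value: Any) -> int:
    """Count levels of nested containers; scalars contribute 0."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0


def analyse(data: bytes) -> dict:
    doc = json.loads(data)
    top = doc if isinstance(doc, dict) else {}
    return {
        "content_length": len(data),
        "top_level_keys_count": len(top),
        "top_level_keys": list(top),
        "top_level_types": [type_name(v) for v in top.values()],
        "depth": depth(doc),
    }


sample = b'{"key1": [1, 2], "key2": {"a": {"b": []}}}'
report = analyse(sample)
print(json.dumps(report, indent=2))
```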
Utils
json2json
UTF-16 can be difficult to read because it uses two bytes for each character that UTF-8 stores in a single byte, e.g.
..{.".a.".:. .".b.".}. is simply {"a": "b"}. The utility json2json.py
in the utils folder will output UTF-16 as UTF-8 so that signatures can be
more easily derived. A signature derived for UTF-16 looks exactly the same
as one for UTF-8.
json2json can be called from the command line when installed via pip, or
find it in src.utils.
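The conversion can be approximated with the standard library alone. This sketch is not the json2json utility itself and may differ from it in whitespace and escaping; the function name `utf16_json_to_utf8` is an assumption.

```python
import json


def utf16_json_to_utf8(data: bytes) -> bytes:
    """Parse JSON bytes in any encoding json.loads detects and re-emit UTF-8."""
    doc = json.loads(data)  # detects UTF-16 with or without a byte order mark
    return json.dumps(doc, ensure_ascii=False).encode("utf-8")


raw = '{"a": "b"}'.encode("utf-16")
print(raw)                      # interleaved null bytes, hard to read
print(utf16_json_to_utf8(raw))  # b'{"a": "b"}'
```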
Docs
Dev docs are available.
Developer install
pip
Set up a virtual environment venv and install the local development
requirements as follows:
python3 -m venv venv
source venv/bin/activate
python -m pip install -r requirements/local.txt
tox
Run tests (all)
python -m tox
Run tests-only
python -m tox -e py3
Run linting-only
python -m tox -e linting
pre-commit
Pre-commit can be used to provide more feedback before committing code. This reduces the number of commits you might want to make when working on code, and it is also an alternative to running tox manually.
To set up pre-commit, providing pip install has been run above:
pre-commit install
This repository contains a default number of pre-commit hooks, but there may be others suited to different projects. A list of other pre-commit hooks can be found here.
Packaging
The justfile contains helper functions for packaging and release.
Run just help for more information.
pyproject.toml
Packaging consumes the metadata in pyproject.toml which helps to describe
the project on the official pypi.org repository. Have a look at the
documentation and comments there to help you create a suitably descriptive
metadata file.
Versioning
Versioning in Python can be hit and miss. You can label versions for yourself, but to make it reliable as well as meaningful, it should be controlled by your source control system. We assume git, and versions can be created by tagging your work and pushing the tag to your git repository, e.g. to create a release candidate for version 1.0.0:
git tag -a 1.0.0-rc.1 -m "release candidate for 1.0.0"
git push origin 1.0.0-rc.1
When you build, a package will be created with the correct version:
just package-source
### build process here ###
Successfully built python_repo_jsonid-1.0.0rc1.tar.gz and python_repo_jsonid-1.0.0rc1-py3-none-any.whl
Local packaging
To create a python wheel for testing locally, or distributing to colleagues run:
just package-source
A tar and whl file will be stored in a dist/ directory. The whl file
can be installed as follows:
pip install <your-package>.whl
Publishing
Publishing for public use can be achieved with:
just package-upload-test or just package-upload
just package-upload-test will upload the package to test.pypi.org
which provides a way to look at package metadata and documentation and ensure
that it is correct before uploading to the official pypi.org
repository using just package-upload.