jsonid: a JSON identification tool

jsonid

JSON ID logo based on JSON Logo by Douglas Crockford

JSON-Identification ruleset and tool.

Function

jsonid borrows Python's "easier to ask forgiveness than permission" (EAFP) approach: it attempts to open every object it scans to see whether it parses as JSON. If it doesn't, we move along. If it does, we have an opportunity to identify the characteristics of the JSON we have opened.

Python, being high-level, also provides an easy path to processing files and parsing JSON quickly, with very little prior knowledge required of the underlying data structure.
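
The EAFP idea described above can be sketched in a few lines. This is only an illustration using the standard library, not jsonid's actual code; the function name try_parse_json is hypothetical.

```python
import json
from typing import Any, Tuple


def try_parse_json(data: bytes) -> Tuple[bool, Any]:
    """Attempt to parse bytes as JSON; report success alongside the result.

    Hypothetical helper for illustration only, not part of jsonid's API.
    """
    try:
        # EAFP: just try to parse, rather than inspecting the bytes first.
        return True, json.loads(data)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        return False, None
```

A success flag is returned separately because a valid JSON document may itself be null, which Python parses to None.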

Why?

Consider these equivalent forms:

{
    "key 1": "value",
    "key 2": "value"
}
{
    "key 2": "value",
    "key 1": "value"
}

PRONOM signatures are not expressive enough for complicated JSON objects.

If I want PRONOM to find key 1 I have to use a wildcard, so something like:

BOF: "7B*226B6579203122"
EOF: "7D"

But if I then want to match on key 2 as well as key 1, things start getting complicated, as the JSON spec does not guarantee that keys appear in the same order. They're not even guaranteed to be in the same positions (from a visual perspective) when other keys are also used in the object.

jsonid tries to compensate for this by using JSON's own strengths to use its keys and values as "markers" that can help to identify what we're looking at.
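
As a small illustration of the point above, using only the Python standard library (the variable names are ours): the two forms differ byte-for-byte, which is what a byte signature sees, yet they compare equal once parsed. The last line shows where the hex string in the example signature comes from.

```python
import json

first = '{"key 1": "value", "key 2": "value"}'
second = '{"key 2": "value", "key 1": "value"}'

# A byte-level comparison treats the two documents as different.
byte_for_byte_equal = first.encode("utf-8") == second.encode("utf-8")

# Once parsed, key order no longer matters.
parsed_equal = json.loads(first) == json.loads(second)

# The quoted key name as upper-case hex, as used in a byte signature.
key_marker_hex = '"key 1"'.encode("utf-8").hex().upper()
```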

Ruleset

jsonid currently defines a small set of rules that help us to identify JSON documents.

Each rule is its own data structure. The rules are processed sequentially to determine what kind of JSON document we might be looking at. jsonid is currently designed to identify the existence of information, but you can also add negation, e.g. to remove false positives. Do this carefully!

rule        meaning
INDEX       index (from which to read when structure is an array)
GOTO        goto key (read key at given key)
KEY         key to read
CONTAINS    value contains string
STARTSWITH  value starts with string
ENDSWITH    value ends with string
IS          value matches exactly
REGEX       value matches a regex pattern
EXISTS      key exists
NOEXIST     key doesn't exist
ISTYPE      key is a specific type (string, number, dict, array)

The rules are stored in a list within a RegistryEntry object and are processed in order.

For example:

    [
        { "KEY": "name", "IS": "value" },
        { "KEY": "schema", "CONTAINS": "/schema/version/1.1/" },
        { "KEY": "data", "IS": { "more": "data" } },
    ]

All rules need to match for a positive ID.
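
To make the "all rules must match" idea concrete, here is a minimal sketch of how such a list could be evaluated against a parsed document. It is not jsonid's implementation: the function names rule_matches and identify are hypothetical, only a subset of the rules is covered (no INDEX, GOTO or ISTYPE), and writing EXISTS and NOEXIST as extra keys with a None value is an assumption.

```python
import re
from typing import Any, Dict, List

_MISSING = object()  # sentinel: distinguishes "key absent" from a null value


def rule_matches(doc: Dict[str, Any], rule: Dict[str, Any]) -> bool:
    """Evaluate one rule against a top-level JSON object (illustrative only)."""
    value = doc.get(rule["KEY"], _MISSING)
    if "NOEXIST" in rule:  # assumed shape: {"KEY": "k", "NOEXIST": None}
        return value is _MISSING
    if value is _MISSING:
        return False
    if "EXISTS" in rule:  # assumed shape: {"KEY": "k", "EXISTS": None}
        return True
    if "IS" in rule:
        return value == rule["IS"]
    if not isinstance(value, str):
        # The remaining checks in this sketch only make sense for strings.
        return False
    if "CONTAINS" in rule:
        return rule["CONTAINS"] in value
    if "STARTSWITH" in rule:
        return value.startswith(rule["STARTSWITH"])
    if "ENDSWITH" in rule:
        return value.endswith(rule["ENDSWITH"])
    if "REGEX" in rule:
        return re.search(rule["REGEX"], value) is not None
    return False


def identify(doc: Any, rules: List[Dict[str, Any]]) -> bool:
    """Positive ID only if the document is an object and every rule matches."""
    return isinstance(doc, dict) and all(rule_matches(doc, r) for r in rules)


example_rules = [
    {"KEY": "name", "IS": "value"},
    {"KEY": "schema", "CONTAINS": "/schema/version/1.1/"},
    {"KEY": "data", "IS": {"more": "data"}},
]
example_doc = {
    "name": "value",
    "schema": "https://example.com/schema/version/1.1/doc.json",
    "data": {"more": "data"},
}
is_match = identify(example_doc, example_rules)
```

Because every rule must match, a negating rule such as NOEXIST can turn an otherwise positive identification into a negative one, which is why negation should be added carefully.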

NB: jsonid is an early-days tool, so there is a lot of opportunity to add to or remove from these rules if it proves its worth.

Registry

A "registry" module is used to store JSON markers for identifying documents and objects. The registry is a work in progress and will be exported and rewritten if jsonid can prove useful to its communities.

The registry can be read in the source code here:

PRONOM

Ideally we will add PRONOM identifiers to jsonid's formats. The tool can be used to generate enough evidence to be able to add this data to PRONOM in future.

Output format

A very basic YAML format is used to output data about identified files. This will need to be reformatted and reshaped as the concept is proved.

What does jsonid get you?

At the very least, jsonid should identify JSON files on your system as JSON. That's already a pretty good position to be in.

The ruleset should then allow you to identify a decent number of JSON objects, especially those that have a well-defined structure. Examples we have in the registry data include things like ActivityPub streams, RO-CRATE metadata, IIIF API data and so on.

If the ruleset works for JSON we might be able to apply it to other formats such as YAML in future.


Developer install

pip

Set up a virtual environment venv and install the local development requirements as follows:

python3 -m venv venv
source venv/bin/activate
python -m pip install -r requirements/local.txt

tox

Run tests (all)

python -m tox

Run tests-only

python -m tox -e py3

Run linting-only

python -m tox -e linting

pre-commit

Pre-commit can be used to provide more feedback before committing code. This reduces the number of commits you might want to make when working on code. It's also an alternative to running tox manually.

To set up pre-commit, providing pip install has been run above:

  • pre-commit install

This repository contains a default number of pre-commit hooks, but there may be others suited to different projects. A list of other pre-commit hooks can be found here.

Packaging

The justfile contains helper functions for packaging and release. Run just help for more information.

pyproject.toml

Packaging consumes the metadata in pyproject.toml which helps to describe the project on the official pypi.org repository. Have a look at the documentation and comments there to help you create a suitably descriptive metadata file.

Versioning

Versioning in Python can be hit and miss. You can label versions for yourself, but to make it reliable, as well as meaningful, it should be controlled by your source control system. We assume git, and versions can be created by tagging your work and pushing the tag to your git repository, e.g. to create a release candidate for version 1.0.0:

git tag -a 1.0.0-rc.1 -m "release candidate for 1.0.0"
git push origin 1.0.0-rc.1

When you build, a package will be created with the correct version:

make package-source
### build process here ###
Successfully built python_repo_jsonid-1.0.0rc1.tar.gz and python_repo_jsonid-1.0.0rc1-py3-none-any.whl

Local packaging

To create a Python wheel for testing locally, or for distributing to colleagues, run:

  • make package-source

A tar and whl file will be stored in a dist/ directory. The whl file can be installed as follows:

  • pip install <your-package>.whl

Publishing

Publishing for public use can be achieved with:

  • make package-upload-test or make package-upload

make package-upload-test will upload the package to test.pypi.org, which provides a way to look at package metadata and documentation and ensure that it is correct before uploading to the official pypi.org repository using make package-upload.
