
A package to repair broken json strings

Project description


This simple package can be used to fix an invalid JSON string. To see all the cases in which this package works, check out the unit tests.

Inspired by https://github.com/josdejong/jsonrepair


How to cite

If you are using this library in your academic work (as I know many folks are), please find the BibTeX here:

@software{Baccianella_JSON_Repair_-_2024,
    author = {Baccianella, Stefano},
    month = aug,
    title = {{JSON Repair - A python module to repair invalid JSON, commonly used to parse the output of LLMs}},
    url = {https://github.com/mangiucugna/json_repair},
    version = {0.28.3},
    year = {2024}
}

Thank you for citing my work and please send me a link to the paper if you can!


Offer me a beer

If you find this library useful, you can help me by donating toward my monthly beer budget here: https://github.com/sponsors/mangiucugna


Demo

If you are unsure if this library will fix your specific problem, or simply want your json validated online, you can visit the demo site on GitHub pages: https://mangiucugna.github.io/json_repair/


Motivation

Some LLMs are a bit iffy when it comes to returning well-formed JSON data: sometimes they skip a parenthesis and sometimes they add some words to it, because that's what an LLM does. Luckily, the mistakes LLMs make are simple enough to be fixed without destroying the content.

I searched for a lightweight Python package that was able to reliably fix this problem but couldn't find any.

So I wrote one.

How to use

from json_repair import repair_json

good_json_string = repair_json(bad_json_string)
# If the string was super broken this will return an empty string

You can use this library to completely replace json.loads():

import json_repair

decoded_object = json_repair.loads(json_string)

or just

import json_repair

decoded_object = json_repair.repair_json(json_string, return_objects=True)

Avoid this antipattern

Some users of this library adopt the following pattern:

obj = {}
try:
    obj = json.loads(string)
except json.JSONDecodeError as e:
    obj = json_repair.loads(string)
    ...

This is wasteful because json_repair already verifies for you whether the JSON is valid. If you still want to do that yourself, add skip_json_loads=True to the call, as explained in the section below.
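To illustrate why the outer try/except is redundant, here is a minimal sketch of a "strict parser first, repair second" flow. It is not the library's actual code: `loads_with_fallback` and `toy_repair` are hypothetical names, and `toy_repair` only strips a trailing comma before a closing bracket (it would also touch such a comma inside a string value).

```python
import json
import re


def toy_repair(text: str) -> str:
    """Hypothetical stand-in for a real repair step: drop trailing commas."""
    return re.sub(r",\s*([}\]])", r"\1", text)


def loads_with_fallback(text: str, skip_json_loads: bool = False):
    """Try the strict parser first unless told to skip it, then repair."""
    if not skip_json_loads:
        try:
            return json.loads(text)  # fast path: the input is already valid
        except json.JSONDecodeError:
            pass  # fall through to the repair path
    return json.loads(toy_repair(text))


valid = loads_with_fallback('{"a": 1}')      # handled by the fast path
repaired = loads_with_fallback('{"a": 1,}')  # handled by the repair path
```

With a loader shaped like this, wrapping the call in another json.loads() attempt means every invalid input goes through the strict parser twice.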

Read json from a file or file descriptor

JSON repair also provides a drop-in replacement for json.load():

import json_repair

try:
    file_descriptor = open(fname, 'rb')
except OSError:
    ...

with file_descriptor:
    decoded_object = json_repair.load(file_descriptor)

and another method to read from a file:

import json_repair

try:
    decoded_object = json_repair.from_file(json_file)
except OSError:
    # IOError is an alias of OSError in Python 3, so this handler covers both
    ...

Keep in mind that the library will not catch any IO-related exceptions; you will need to handle those yourself.

Performance considerations

If you find this library too slow because it calls json.loads(), you can skip that step by passing skip_json_loads=True to repair_json. For example:

from json_repair import repair_json

good_json_string = repair_json(bad_json_string, skip_json_loads=True)

I made a choice of not using any fast json library to avoid having any external dependency, so that anybody can use it regardless of their stack.

Some rules of thumb to use:

  • Setting return_objects=True will always be faster because the parser already returns an object and doesn't have to serialize that object to JSON
  • skip_json_loads is faster only if you know for certain that the string is not valid JSON
  • If you are having issues with escaping, pass the string as a raw string, like: r"string with escaping\""
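The raw-string advice can be checked with the standard library alone. This small sketch (the sample strings are made up for illustration) shows that a regular literal loses its backslashes before any JSON parser sees them, while a raw literal keeps them:

```python
import json

# Regular literal: Python consumes the backslashes, so the inner quotes
# reach the JSON parser unescaped and the document becomes invalid.
regular = '{"quote": "She said \"hi\""}'

# Raw literal: the backslashes survive, so the parser sees \" escapes.
raw = r'{"quote": "She said \"hi\""}'

try:
    json.loads(regular)
    regular_is_valid = True
except json.JSONDecodeError:
    regular_is_valid = False

parsed = json.loads(raw)  # parses cleanly because the escapes are intact
```

The two literals differ by exactly the two backslashes that the regular literal drops.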

Use json_repair from CLI

Install the library for command-line with:

pipx install json-repair

then run

$ json_repair -h

usage: json_repair [-h] [-i] [--ensure_ascii] [--indent INDENT] filename

Repair and parse JSON files.

positional arguments:
  filename         The JSON file to repair

options:
  -h, --help       show this help message and exit
  -i, --inline     Replace the file inline instead of returning the output to stdout
  --ensure_ascii   Pass the ensure_ascii parameter to json.dumps()
  --indent INDENT  Number of spaces for indentation (Default 2)

to learn how to use it
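To make the options concrete, here is a hedged argparse sketch that mirrors the help text above. It is a reconstruction from that output, not the package's actual CLI source; `build_parser` and the `broken.json` file name are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same options as the help text above."""
    parser = argparse.ArgumentParser(
        prog="json_repair", description="Repair and parse JSON files."
    )
    parser.add_argument("filename", help="The JSON file to repair")
    parser.add_argument(
        "-i", "--inline", action="store_true",
        help="Replace the file inline instead of returning the output to stdout",
    )
    parser.add_argument(
        "--ensure_ascii", action="store_true",
        help="Pass the ensure_ascii parameter to json.dumps()",
    )
    parser.add_argument(
        "--indent", type=int, default=2,
        help="Number of spaces for indentation (Default 2)",
    )
    return parser


# Equivalent to running: json_repair --indent 4 broken.json
args = build_parser().parse_args(["--indent", "4", "broken.json"])
```

Here the repaired output would go to stdout with a four-space indent, because -i was not passed.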

Adding to requirements

Please pin this library only on the major version!

We use TDD and strict semantic versioning, so there will be frequent updates and no breaking changes in minor and patch versions. To pin only the major version of this library in your requirements.txt, specify the package name followed by the major version and a wildcard for the minor and patch versions. For example:

json_repair==0.*

In this example, any version that starts with 0. will be acceptable, allowing for updates on minor and patch versions.
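As a rough illustration of what the wildcard accepts, the following simplified sketch checks versions against a pin by comparing leading version components. `matches_major_pin` is a hypothetical helper, not how pip resolves versions, and it only handles plain numeric dotted versions.

```python
def matches_major_pin(version: str, pin: str) -> bool:
    """Return True if `version` satisfies a wildcard pin such as '0.*'.

    Simplified illustration: only pins ending in '.*' and purely
    numeric dotted versions are handled.
    """
    if not pin.endswith(".*"):
        raise ValueError("this sketch only handles pins ending in '.*'")
    prefix = pin[:-2].split(".")    # '0.*' -> ['0']
    parts = version.split(".")      # '0.28.3' -> ['0', '28', '3']
    return parts[: len(prefix)] == prefix


accepted = [v for v in ("0.28.3", "0.29.0", "1.0.0") if matches_major_pin(v, "0.*")]
```

Comparing whole components rather than string prefixes keeps a pin like 1.* from matching 10.1.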

How it works

This module will parse the JSON file following the BNF definition:

<json> ::= <primitive> | <container>

<primitive> ::= <number> | <string> | <boolean>
; Where:
; <number> is a valid real number expressed in one of a number of given formats
; <string> is a string of valid characters enclosed in quotes
; <boolean> is one of the literal strings 'true', 'false', or 'null' (unquoted)

<container> ::= <object> | <array>
<array> ::= '[' [ <json> *(', ' <json>) ] ']' ; A sequence of JSON values separated by commas
<object> ::= '{' [ <member> *(', ' <member>) ] '}' ; A sequence of 'members'
<member> ::= <string> ': ' <json> ; A pair consisting of a name, and a JSON value

If something is wrong (a missing parenthesis or quote, for example), it will use a few simple heuristics to fix the JSON string:

  • Add the missing parentheses if the parser believes that the array or object should be closed
  • Quote strings or add missing single quotes
  • Adjust whitespace and remove line breaks

I am sure some corner cases will be missing. If you have examples, please open an issue or, even better, push a PR.
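The first heuristic in the list above can be sketched in a few lines. This is a toy version written for illustration, not the library's parser: `close_open_containers` is a hypothetical name, and it only handles truncated input by closing an unterminated string and appending the missing closing brackets.

```python
import json


def close_open_containers(text: str) -> str:
    """Append the closers that a truncated JSON string is missing."""
    closers = {"{": "}", "[": "]"}
    stack = []         # closers we still expect, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # this character was escaped, skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in closers:
            stack.append(closers[ch])
        elif stack and ch == stack[-1]:
            stack.pop()              # container closed properly
    suffix = '"' if in_string else ""
    return text + suffix + "".join(reversed(stack))


fixed = close_open_containers('{"items": [1, 2')
decoded = json.loads(fixed)
```

Brackets that appear inside string values are ignored, which is why the scan has to track whether it is currently inside a string.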

How to develop

Just create a virtual environment with requirements.txt. The setup uses pre-commit to make sure all tests are run.

Make sure that the GitHub Actions running after you push a new commit don't fail either.

How to release

You will need owner access to this repository.

  • Edit pyproject.toml and update the version number appropriately using semver notation
  • Commit and push all changes to the repository before continuing, or the next steps will fail
  • Run python -m build
  • Create a new release in GitHub, making sure to tag all the issues solved and contributors. Create the new tag, the same as the one in the build configuration
  • Once the release is created, a new GitHub Actions workflow will start to publish on PyPI; make sure it doesn't fail

Repair JSON in other programming languages


