Fix corrupted JSON files using the jsonic JSON parser in JavaScript

Project description

SalvaJSON

salvajson is a Python package for fixing corrupted JSON files. It uses the lenient jsonic parser (a JavaScript library) through the pythonmonkey Python-JS bridge.

  • Seamlessly corrects invalid JSON from LLMs, APIs, and other sources
  • Handles missing/single quotes, trailing commas, missing commas, unquoted strings, JSON comments, and more
  • Simple Python API and command-line interface

Installation

uv pip install --system salvajson

Requirements

  • Python 3.10 or higher
  • PythonMonkey 1.1.0 or higher (automatically installed)

Usage

Python API

from salvajson import salvaj

# Fix a corrupted JSON string
corrupted_json = """{
    name: "John",
    age: 30,
    'hobbies': ['reading' 'coding'],
}"""

fixed_json = salvaj(corrupted_json)
print(fixed_json)
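
A common pattern is to try the strict parser first and only fall back to the repair step when it fails. The sketch below assumes that salvaj() returns a JSON string (as described under How It Works); load_lenient and its repair parameter are hypothetical names introduced for illustration, not part of the salvajson API.

```python
import json
from typing import Any, Callable, Optional


def load_lenient(text: str, repair: Optional[Callable[[str], str]] = None) -> Any:
    """Parse strict JSON first; fall back to a repair function on failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if repair is None:
            # Imported lazily so that valid input never touches the JS bridge.
            # Assumption: salvaj(text) returns a repaired JSON string.
            from salvajson import salvaj
            repair = salvaj
        return json.loads(repair(text))
```

Passing a different repair callable makes the helper easy to test without the JavaScript runtime.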

Command Line Interface

Salvajson comes with a CLI for processing JSON files directly:

# Process a single file
python -m salvajson path/to/corrupted.json

# Process and save to a new file
python -m salvajson input.json > output.json
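
The same CLI can be driven from a script when several files need repairing. This is a minimal sketch that assumes the CLI writes the repaired JSON to standard output, as the redirect example above implies; salvajson_command and repair_file are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def salvajson_command(path: Path) -> List[str]:
    """Build the command line shown above, using the current interpreter."""
    return [sys.executable, "-m", "salvajson", str(path)]


def repair_file(src: Path, dst: Path) -> None:
    """Run the CLI on src and save its standard output to dst."""
    result = subprocess.run(
        salvajson_command(src), capture_output=True, text=True, check=True
    )
    dst.write_text(result.stdout, encoding="utf-8")
```

With check=True, a non-zero exit status raises subprocess.CalledProcessError instead of silently writing an empty file.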

Development

Setup Development Environment

  1. Clone the repository:
git clone https://github.com/twardoch/salvajson.git; cd salvajson
  2. Install development dependencies:
uv venv && source .venv/bin/activate; uv pip install -e ".[dev,test]"
  3. Install JavaScript dependencies:
cd js_src; npm install

Project Structure

salvajson/
├── src/
│   └── salvajson/
│       ├── __init__.py      # Main package interface
│       └── salvajson.js     # Bundled JavaScript code
├── js_src/
│   ├── package.json     # JavaScript dependencies
│   └── index.js         # JavaScript source
├── tests/
│   └── test_salvajson.py
├── build.py             # Build script
└── pyproject.toml       # Python package configuration

Building

The project uses hatchling as its build backend. The build.py script handles bundling the JavaScript code before package building.

python build.py
pip install -e .

How It Works

Salvajson uses PythonMonkey to create a bridge between Python and JavaScript, allowing it to leverage the powerful jsonic parser. When you pass a JSON string to salvaj():

  1. The string is passed to the JavaScript runtime
  2. jsonic attempts to parse and fix the JSON
  3. The fixed JSON is returned to Python as a string
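
The three steps can be sketched as a small wrapper around a JavaScript callable. This is an illustration under stated assumptions, not the package's actual implementation: make_fixer and pythonmonkey_normalizer are hypothetical names, and the strict JSON.parse stands in for jsonic because loading the bundled parser is specific to the package.

```python
import json
from typing import Callable


def make_fixer(js_normalize: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a JS-side function (text in, JSON text out) as a Python function."""

    def fix(text: str) -> str:
        result = js_normalize(text)  # steps 1 and 2: hand over the string, parse it
        fixed = str(result)          # step 3: back in Python as a plain str
        json.loads(fixed)            # sanity check: the result must be strict JSON
        return fixed

    return fix


def pythonmonkey_normalizer() -> Callable[[str], str]:
    """Assumption: pm.eval returns a callable proxy for the JS arrow function."""
    import pythonmonkey as pm  # imported lazily; only needed for the real bridge

    return pm.eval("(s) => JSON.stringify(JSON.parse(s))")
```

Keeping the bridge behind a plain callable means the wrapper can be exercised with a pure-Python stand-in during tests.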

License

Apache License 2.0 - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Credits

About the jsonic lenient JSON parser

jsonic is a JavaScript library that provides a more lenient and extensible JSON parser. Unlike the standard JSON.parse, jsonic allows for a more flexible syntax, making it easier to work with human-written JSON-like data. This article provides a detailed technical overview of how jsonic works, exploring its architecture, parsing process, and extensibility features.

Core Architecture

At its core, jsonic is built around two primary components:

  1. Lexer (Tokenizer): The lexer's role is to break down the input string into a stream of tokens. Each token represents a meaningful unit of the input, such as a string, number, keyword, or punctuation. The lexer in jsonic is highly configurable, allowing users to define custom token types and matching rules.

  2. Parser: The parser takes the token stream generated by the lexer and constructs an Abstract Syntax Tree (AST) based on a predefined grammar. This grammar defines the rules of the jsonic syntax, including how tokens can be combined to form valid JSON structures.

The parser utilizes a recursive descent parsing strategy. This approach involves defining a set of functions, each responsible for parsing a specific grammar rule. These functions recursively call each other to parse nested structures.

Parsing Process

The parsing process in jsonic can be summarized as follows:

  1. Initialization: When Jsonic() is called, it initializes a new parser instance with the default or user-provided options. The options control various aspects of the parsing process, including the allowed syntax, error handling, and plugin configurations.
  2. Lexing: The input string is passed to the lexer, which scans the string character by character. The lexer uses regular expressions and custom matching functions (as defined in the options) to identify and extract tokens. Each token is represented by an object containing its type, value, and location in the source string.
  3. Parsing: The parser receives the stream of tokens from the lexer. It uses a set of parsing rules, defined by the jsonic grammar, to construct an AST. The parser starts with a top-level rule (typically 'val') and recursively applies other rules based on the current token and the grammar.
  4. AST Construction: As the parser applies the rules, it builds an AST that represents the structure of the input data. The AST is a hierarchical tree structure where each node corresponds to a grammar rule or a token.
  5. Result: Once the entire input has been parsed, the parser returns the root node of the AST. This node represents the parsed JSON value, which can be a primitive value, an object, or an array.
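
To make the lexing and parsing stages concrete, here is a toy lenient parser in Python. It is a sketch of the general lexer plus recursive-descent technique, not a port of jsonic: the token names, the Parser class, and parse() are all hypothetical, string escapes are not decoded, and only unquoted keys, single quotes, and missing or trailing commas are tolerated.

```python
import re
from typing import Any, Iterator, Tuple

Token = Tuple[str, str, int]  # (type, text, offset in source)

# One named group per token type; the first alternative that matches wins.
TOKEN_RE = re.compile(
    r"""
    (?P<SP>\s+)
  | (?P<PUNCT>[{}\[\]:,])
  | (?P<STR>"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')
  | (?P<NUM>-?\d+(?:\.\d+)?(?![^\s{}\[\]:,"']))
  | (?P<TXT>[^\s{}\[\]:,"']+)
    """,
    re.VERBOSE,
)


def lex(src: str) -> Iterator[Token]:
    """Scan the source left to right and yield non-whitespace tokens."""
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.lastgroup != "SP":
            yield (m.lastgroup, m.group(), pos)
        pos = m.end()


class Parser:
    def __init__(self, src: str) -> None:
        self.tokens = list(lex(src))
        self.i = 0

    def peek(self) -> Token:
        return self.tokens[self.i] if self.i < len(self.tokens) else ("END", "", -1)

    def next(self) -> Token:
        tok = self.peek()
        self.i += 1
        return tok

    def val(self) -> Any:
        """Top-level rule: dispatch on the next token."""
        kind, text, pos = self.next()
        if kind == "PUNCT" and text == "{":
            return self.map()
        if kind == "PUNCT" and text == "[":
            return self.list()
        if kind == "STR":
            return text[1:-1]  # simplification: escapes are left as-is
        if kind == "NUM":
            return float(text) if "." in text else int(text)
        if kind == "TXT":
            return {"true": True, "false": False, "null": None}.get(text, text)
        raise SyntaxError(f"unexpected token {text!r} at offset {pos}")

    def map(self) -> dict:
        node: dict = {}
        while self.peek()[1] != "}":
            kind, key, pos = self.next()
            if kind not in ("STR", "TXT", "NUM"):
                raise SyntaxError(f"expected a key at offset {pos}")
            if kind == "STR":
                key = key[1:-1]
            if self.next()[1] != ":":
                raise SyntaxError(f"expected ':' after key {key!r}")
            node[key] = self.val()  # recursion handles nested values
            if self.peek()[1] == ",":
                self.next()  # commas are optional separators
        self.next()  # consume "}"
        return node

    def list(self) -> list:
        node: list = []
        while self.peek()[1] != "]":
            if self.peek()[0] == "END":
                raise SyntaxError("unterminated list")
            node.append(self.val())
            if self.peek()[1] == ",":
                self.next()
        self.next()  # consume "]"
        return node


def parse(src: str) -> Any:
    parser = Parser(src)
    value = parser.val()
    if parser.peek()[0] != "END":
        raise SyntaxError("unexpected trailing input")
    return value
```

Unlike the description above, this sketch builds plain Python values directly rather than a separate tree; it is meant only to show how a token stream drives mutually recursive rule functions.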

Grammar Definition

The grammar of jsonic is defined using a set of rules. Each rule specifies how a particular syntactic construct can be recognized and parsed. Rules are defined using the jsonic.rule() method, which takes a rule name and a rule definer function.

A rule definer function takes a RuleSpec object and modifies it to define the rule's behavior. It primarily uses the open and close methods to specify the token sequences that mark the beginning and end of a rule, respectively.

Let's illustrate this with a simplified example:

jsonic.rule('map', (rs: RuleSpec) => {
  rs.bo((r: Rule) => {
    // Create a new empty map.
    r.node = {}
  })
  .open([
    // An empty map: {}.
    { s: [OB, CB], b: 1, n: { pk: 0 }, g: 'map,json' },
    // Start matching map key-value pairs: a:1.
    // Reset counter n.pk as new map (for extensions).
    { s: [OB], p: 'pair', n: { pk: 0 }, g: 'map,json,pair' },
  ])
  .close([
    // End of map.
    { s: [CB], g: 'end,json' },
  ])
});

In this example, the map rule is defined to match a JSON object. The bo method sets up the initial node for the rule as an empty object. The open method specifies two possible starting token sequences:

  1. OB, CB: Matches an empty object {}.
  2. OB: Matches the opening brace of a non-empty object, and pushes the pair rule onto the stack to parse the key-value pairs.

The close method specifies that a closing brace CB marks the end of the map.
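
The way a rule chooses among its alternates can be modelled in a few lines of Python. This is a conceptual sketch only: the Alt and RuleSpec dataclasses and first_match are hypothetical and mirror just the s, p, and g fields from the example above, not jsonic's real data structures or its matching algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Alt:
    s: Sequence[str]         # token types that must appear next, in order
    p: Optional[str] = None  # name of the rule to push when this alternate matches
    g: str = ""              # free-form group tags


@dataclass
class RuleSpec:
    name: str
    open: List[Alt] = field(default_factory=list)
    close: List[Alt] = field(default_factory=list)


def first_match(alts: List[Alt], tokens: Sequence[str], pos: int) -> Optional[Alt]:
    """Return the first alternate whose token sequence matches at pos."""
    for alt in alts:
        if list(tokens[pos:pos + len(alt.s)]) == list(alt.s):
            return alt
    return None


MAP = RuleSpec(
    "map",
    open=[
        Alt(["OB", "CB"], g="map,json"),         # empty map: {}
        Alt(["OB"], p="pair", g="map,json,pair"),  # non-empty map: descend into pair
    ],
    close=[Alt(["CB"], g="end,json")],
)
```

Because the first matching alternate wins in this model, the longer sequence (OB, CB) has to be listed before the shorter one (OB).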

Extensibility

jsonic is designed to be extensible, allowing users to customize the parsing process and add support for new syntax features. This extensibility is achieved through several mechanisms:

  1. Options: jsonic provides a wide range of options that control various aspects of the parsing process. These options can be used to modify the behavior of the lexer and parser, enabling users to fine-tune the parsing process to their specific needs.

  2. Plugins: Plugins are a powerful mechanism for extending jsonic. A plugin is a function that takes a jsonic instance as an argument and can modify its behavior by:

    • Adding new lexer matchers using jsonic.lex().
    • Defining new parsing rules or modifying existing ones using jsonic.rule().
    • Adding custom options to the jsonic.options object.
  3. Custom Lexer Matchers: Users can define custom lexer matchers to recognize new token types. These matchers are functions that take the current lexer state and return a token object if a match is found.

  4. Custom Rule Actions: Rule definitions can include custom actions that are executed when a rule matches. These actions can be used to modify the parsed data, perform validation, or trigger other custom logic.

Error Handling

jsonic includes robust error handling capabilities. When the parser encounters an unexpected token or a syntax error, it throws a JsonicError exception. This exception provides detailed information about the error, including the error code, a descriptive message, and the location of the error in the source string.

The error messages are customizable through the error option, and hints can be provided using the hint option.

Example: Adding Support for Comments

Let's illustrate how to extend jsonic with a simple example. We'll add support for single-line comments starting with //.

  1. Define a Lexer Matcher:

    function makeCommentMatcher(cfg, _opts) {
      return function matchComment(lex) {
        let pnt = lex.pnt
        let src = lex.src
        let sI = pnt.sI
    
        if (src.startsWith('//', sI)) {
          let end = src.indexOf('\n', sI)
          if (-1 === end) {
            end = src.length
          }
          let comment = src.substring(sI, end)
          let tkn = lex.token('#CM', comment, comment, pnt)
          pnt.sI += comment.length
          pnt.cI += comment.length
          return tkn
        }
      }
    }
    
  2. Register the Matcher:

    let j = Jsonic.make({
      lex: {
        match: {
          comment: { order: 1e5, make: makeCommentMatcher },
        },
      },
    })
    

    We add a new lexer matcher named comment with an order value of 1e5, chosen so that it runs before the other matchers. The makeCommentMatcher function creates a matcher that recognizes // comments and generates a #CM token.

  3. Ignore the Comment Token:

    j.options({
      tokenSet: {
        IGNORE: ['#SP', '#LN', '#CM'], // Add #CM to IGNORE
      },
    })
    

    We add the #CM token to the IGNORE token set, so the parser ignores it.

Now, jsonic will correctly parse JSON with single-line comments:

let result = j(`
{
  // This is a comment
  "a": 1,
  "b": 2 // Another comment
}
`)

console.log(result) // Output: { a: 1, b: 2 }

This example demonstrates how to extend jsonic with custom lexing and parsing logic to support new syntax features. By defining custom lexer matchers and modifying the parsing rules, you can tailor jsonic to your specific needs.
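
For comparison, the core idea of the comment matcher (find //, consume up to the end of the line, and emit nothing for the parser) can be sketched in plain Python. The strip_line_comments function is a hypothetical stand-alone helper, not part of jsonic or salvajson; unlike the matcher above, it also tracks string literals so that a // inside a quoted value is left alone.

```python
import json


def strip_line_comments(src: str) -> str:
    """Remove // comments that appear outside of quoted strings."""
    out = []
    i = 0
    quote = None  # the active quote character while inside a string
    while i < len(src):
        ch = src[i]
        if quote:
            out.append(ch)
            if ch == "\\" and i + 1 < len(src):
                out.append(src[i + 1])  # keep the escaped character verbatim
                i += 2
                continue
            if ch == quote:
                quote = None
            i += 1
        elif ch in "\"'":
            quote = ch
            out.append(ch)
            i += 1
        elif src.startswith("//", i):
            end = src.find("\n", i)
            i = len(src) if end == -1 else end  # skip to the end of the line
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

The newline that ends a comment is kept, so line numbers in later error messages still match the original input.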

Conclusion

jsonic is a powerful and flexible JSON parser that offers a more lenient syntax and extensive customization options. Its modular architecture, based on a configurable lexer and a rule-based parser, makes it highly extensible. By understanding how jsonic works, developers can leverage its capabilities to parse a wide range of JSON-like data formats and even define their own custom JSON dialects.

This detailed overview provides a solid foundation for understanding the inner workings of jsonic. For further exploration, refer to the official documentation and the source code of the library and its plugins.

About pythonmonkey

pythonmonkey is a powerful tool that embeds the SpiderMonkey JavaScript engine into the Python runtime, enabling seamless interoperability between JavaScript and Python. This article will delve into the technical details of how pythonmonkey achieves this integration, covering its core mechanisms and design choices.

Architecture Overview

At its heart, pythonmonkey has two main components:

  1. SpiderMonkey Integration: The library embeds the SpiderMonkey JavaScript engine, providing the capability to execute JavaScript code within the Python process.
  2. Python-JavaScript Bridge: This component facilitates communication and data exchange between the two language runtimes. It handles object wrapping, type coercion, and function calls across the boundary.

SpiderMonkey Embedding

pythonmonkey statically links to a specific version of SpiderMonkey. During the build process, the SpiderMonkey source code (obtained from the mozilla-central repository) is compiled and linked with the pythonmonkey library. This creates a single library that contains both the Python extension and the JavaScript engine.

The build process uses CMake as the build system, and build.py is the main Python script that orchestrates the build process.

Python-JavaScript Bridge

The bridge is the crucial part that enables interoperability. It consists of several key mechanisms:

  1. Context Creation: When pythonmonkey is initialized, it creates a JSContext. This context represents an isolated instance of the SpiderMonkey engine.

  2. Global Object: A JavaScript global object is created within the context. This object serves as the global namespace for JavaScript code executed by pythonmonkey.

  3. Object Wrapping: pythonmonkey employs a system of proxy objects to enable interaction between Python and JavaScript objects:

    • JSObjectProxy: This Python class acts as a proxy for JavaScript objects. It overrides methods like __getattr__, __setattr__, and __delattr__ to delegate operations to the underlying JavaScript object using the SpiderMonkey API.
    • JSArrayProxy: This is similar to JSObjectProxy but specifically handles JavaScript arrays, also conforming to Python's list interface.
    • JSFunctionProxy: This wraps JavaScript functions, allowing them to be called from Python.
    • JSMethodProxy: This wraps JavaScript functions that are expected to act as methods (i.e., have a 'this' context) when called from Python.
  4. Type Coercion: pythonmonkey automatically coerces data types when values cross the language boundary. It handles intrinsic types (numbers, booleans, strings, None, null, undefined), as well as more complex structures like lists/arrays and dictionaries/objects.

    • From JS to Python:
      • JavaScript strings are represented by Python's JSStringProxy type (String).
      • JavaScript numbers are represented by Python floats or integers (depending on the size).
      • JavaScript bigints are represented by pythonmonkey.bigint (Integer).
      • JavaScript booleans are represented by Python bools.
      • JavaScript functions are represented by pythonmonkey.JSFunctionProxy.
      • JavaScript Date objects are represented by Python datetime.datetime objects.
      • JavaScript Arrays are represented by pythonmonkey.JSArrayProxy (List).
      • JavaScript Objects are represented by pythonmonkey.JSObjectProxy (Dict).
      • JavaScript TypedArrays are represented by Python Buffer, sharing the same memory.
      • JavaScript Promises are awaitable.
      • JavaScript Error objects are represented by pythonmonkey.SpiderMonkeyError (Error).
      • JavaScript null and undefined are represented by pythonmonkey.null and None respectively.
    • From Python to JS:
      • Python strings are converted to JS strings, with the possibility of sharing the underlying string buffer for immutable strings.
      • Python integers are converted to JS numbers or bigints, depending on their size.
      • Python floats are converted to JS numbers.
      • Python booleans are converted to JS booleans.
      • Python lists are represented by true JS arrays and support all Array methods through a JS API Proxy.
      • Python dictionaries are represented by JS objects.
      • Python None is converted to JS undefined.
      • Python functions are wrapped so they can be called from JS.
      • Python awaitables are converted to JS Promises.
      • Python Buffer objects are converted to JS ArrayBuffer and share the same memory.
      • Python datetime.datetime objects are converted to JS Date objects.
      • Python Errors are converted to JS Error objects.
  5. Function Calls: When a Python function is called from JavaScript (e.g., through a callback), pythonmonkey creates a wrapper that:

    • Converts JS arguments to their Python equivalents.
    • Calls the Python function.
    • Converts the Python return value to a JavaScript value.
    • Handles Python exceptions by converting them to JavaScript errors.

    Similarly, when a JavaScript function is called from Python, a JSFunctionProxy handles the call by:

    • Converting Python arguments to their JavaScript equivalents.
    • Calling the JavaScript function.
    • Converting the JavaScript return value to a Python object.
    • Handling JavaScript exceptions by converting them to Python exceptions.
  6. Garbage Collection: pythonmonkey integrates with both Python's and SpiderMonkey's garbage collectors. Proxy objects hold references to their underlying objects in the other runtime, ensuring that they are not prematurely collected.

  7. Event Loop: pythonmonkey utilizes the Python asyncio event loop to manage asynchronous operations in JavaScript. It provides APIs to schedule tasks on the event loop and to await JavaScript promises from Python.
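
The JS-to-Python coercions listed under Type Coercion can be inspected directly when pythonmonkey is installed. The sketch below only records the Python type name observed for each JavaScript expression and makes no claim about what those names will be; SNIPPETS and observed_python_types are hypothetical names, and the function returns an empty dict when pythonmonkey is not available.

```python
from typing import Dict

try:
    import pythonmonkey as pm
except ImportError:  # the sketch still loads without the bridge installed
    pm = None

# JavaScript expressions whose Python-side representation we want to inspect.
SNIPPETS: Dict[str, str] = {
    "number": "1 + 1",
    "string": "'abc'",
    "boolean": "true",
    "array": "[1, 2, 3]",
    "object": "({a: 1})",
    "undefined": "undefined",
}


def observed_python_types() -> Dict[str, str]:
    """Evaluate each snippet and report the Python type name that comes back."""
    if pm is None:
        return {}
    return {label: type(pm.eval(code)).__name__ for label, code in SNIPPETS.items()}


if __name__ == "__main__":
    for label, type_name in observed_python_types().items():
        print(f"{label:10} -> {type_name}")
```

Running the script prints one line per snippet, which is a quick way to check the table above against the installed pythonmonkey version.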

Internal Bindings

pythonmonkey provides a special function called internalBinding. This function allows JavaScript code to access certain built-in modules that are implemented in C++ for performance or to expose platform-specific functionality. These internal bindings are analogous to Node.js's internal modules.

Example: eval

The pythonmonkey.eval function is a good example of how the bridge works:

import pythonmonkey as pm

result = pm.eval("1 + 1")
print(result)

In this example:

  1. The Python string "1 + 1" is passed to the pm.eval function.
  2. pm.eval uses the SpiderMonkey API to parse and compile the JavaScript code.
  3. The compiled code is executed within the SpiderMonkey context.
  4. The result (the number 2) is returned as a JavaScript value.
  5. pythonmonkey automatically coerces the JavaScript number to a Python float.
  6. The Python code then prints the result.

Example: Calling Python from JS

// In JavaScript
const python = require('pythonmonkey').python;
const result = python.eval('1 + 1');
console.log(result);

Here:

  1. The JavaScript code uses the require function (provided by pythonmonkey's CommonJS implementation) to import the python object.
  2. The python.eval function is called, which internally calls the corresponding Python function in pythonmonkey.
  3. The Python code evaluates the expression and returns the result.
  4. The result is automatically converted to a JavaScript number and printed to the console.

Conclusion

pythonmonkey provides a sophisticated integration between Python and JavaScript, enabling developers to leverage the strengths of both languages within a single application. The library carefully manages object lifetimes, handles type conversions, and provides mechanisms for asynchronous operations, making it a powerful tool for building hybrid applications. Understanding its internal workings can help developers make the most of its capabilities and write efficient and robust interoperable code.

Notes for Maintainers

Automated Workflows

The package uses GitHub Actions for automation:

  1. JS Dependencies Update (weekly + manual trigger)

    • Updates JS dependencies and rebuilds the bundle
    • Creates a PR for review
    • Trigger manually: Go to Actions → "Update JS Dependencies" → "Run workflow"
  2. PyPI Publishing (on tag)

    • Builds and publishes to PyPI when a version tag is pushed
    • To release a new version:
      git tag v0.1.1  # Use appropriate version
      git push origin v0.1.1
      
    • The workflow will automatically build and publish to PyPI

Required Secrets

Set up these secrets in your GitHub repository:

  • PYPI_API_TOKEN: API token from PyPI for publishing

License

This project is licensed under the Apache License, Version 2.0.

Versioning

This project uses semantic versioning. To create a new release:

  1. Commit your changes
  2. Create a new git tag following semantic versioning principles:
    # For a patch release
    git tag v0.1.1
    
    # For a minor release
    git tag v0.2.0
    
    # For a major release
    git tag v1.0.0
    
  3. Push the tag to trigger the PyPI publishing workflow:
    git push origin v0.1.1
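
    The next tag can also be computed rather than typed by hand. This is a small sketch with a hypothetical bump helper; it assumes tags of the form vMAJOR.MINOR.PATCH, as in the examples above, and needs Python 3.9 or later for str.removeprefix.

```python
def bump(tag: str, part: str) -> str:
    """Return the next tag after bumping the major, minor, or patch component."""
    major, minor, patch = (int(n) for n in tag.removeprefix("v").split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown part: {part!r}")
    return f"v{major}.{minor}.{patch}"
```

    Lower-order components reset to zero when a higher-order one is bumped, which matches the three example tags above.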
    

Download files

Source Distribution

salvajson-1.6.0.tar.gz (36.1 kB view details)

Uploaded Source

Built Distribution

salvajson-1.6.0-py3-none-any.whl (31.6 kB view details)

Uploaded Python 3

File details

Details for the file salvajson-1.6.0.tar.gz.

File metadata

  • Download URL: salvajson-1.6.0.tar.gz
  • Upload date:
  • Size: 36.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for salvajson-1.6.0.tar.gz
Algorithm Hash digest
SHA256 ca948bbf430e22b6fd6fad1f6d582afa947df2f99ff00083005288932e3600f7
MD5 aacddb21f9c2b4bef6283e4ca1b02c03
BLAKE2b-256 0f7b7b9f6b6c23cadad49671288291f926043da2d061816998077d9cae2e600d

File details

Details for the file salvajson-1.6.0-py3-none-any.whl.

File metadata

  • Download URL: salvajson-1.6.0-py3-none-any.whl
  • Upload date:
  • Size: 31.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for salvajson-1.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 60fee94deb6b4a58ff752c187cbe3781f77dbacd67b55ccb7decb25d84df3069
MD5 6ec942d64e1236cd5565a8ef6eca1b28
BLAKE2b-256 9830375ddb5cd5e5e694c109a4a9d69b13a6d7f7914ba9c51af246d54e027713
