A streaming JSON parser that processes JSON data incrementally, handling partial states. Useful for incrementally parsing partial responses from streaming outputs of Large Language Models (LLMs).

Streaming JSON Parser

Objective

This Python module implements a streaming JSON parser designed to process JSON data incrementally. The primary goal is to handle potentially incomplete JSON data streams, such as those produced by Large Language Models (LLMs), and return the current state of the parsed object at any time.

Requirements Subset

The parser is specifically designed for a subset of JSON where:

  • Values consist solely of strings and objects.
  • Escape sequences in strings are not expected (though the implementation handles them).
  • Duplicate keys in objects are not expected (though the implementation may tolerate them, typically keeping the last value).

Features

  • Incremental Parsing: Consumes JSON data in chunks via the consume() method.
  • Partial State Retrieval: The get() method returns the currently parsed JSON object state, even if the input stream is incomplete.
  • Partial String Values: Returns partial string values as they are received (e.g., the truncated input {"key": "val is a valid partial state and yields {'key': 'val'}).
  • Key Handling: Keys are only included in the returned object once their value type (string or object start) is identified.
  • Robustness: Attempts to parse standard JSON efficiently and falls back to a more lenient state-machine parser for incomplete or slightly non-standard input.
  • Non-Standard JSON: Tolerates some non-standard features like unquoted keys and single-quoted strings.
  • Error Handling: Attempts to recover from invalid characters or find the first valid JSON object within the buffer.
  • Support for Primitives & Arrays: Although the requirements focused on strings and objects, the implementation also handles numbers, booleans, null, and arrays as values within objects.
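
To make the partial-state and key-handling features concrete, here is a minimal sketch of a character-by-character parser for the strings-and-objects subset only. It is not the package's IterativeStateMachine: the name parse_partial is hypothetical, escape sequences and other value types are ignored, and the whole input is rescanned on every call.

```python
def parse_partial(text: str) -> dict:
    """Parse a possibly truncated JSON object whose values are strings or objects."""
    root = None
    stack = []        # currently open objects, innermost last
    key = None        # last completed key, waiting for its value
    chars = []        # characters of the key being read
    state = "start"   # start, key_or_end, in_key, colon, value, in_value, comma_or_end

    for ch in text:
        if state == "in_key":
            if ch == '"':
                key, chars, state = "".join(chars), [], "colon"
            else:
                chars.append(ch)
        elif state == "in_value":
            if ch == '"':
                state = "comma_or_end"
            else:
                stack[-1][key] += ch      # partial strings are visible immediately
        elif ch.isspace():
            continue
        elif state in ("start", "value") and ch == "{":
            obj = {}
            if stack:
                stack[-1][key] = obj      # key appears once its value type is known
            else:
                root = obj
            stack.append(obj)
            state = "key_or_end"
        elif state == "value" and ch == '"':
            stack[-1][key] = ""           # key appears once its value type is known
            state = "in_value"
        elif state == "key_or_end" and ch == '"':
            state = "in_key"
        elif state == "colon" and ch == ":":
            state = "value"
        elif state == "comma_or_end" and ch == ",":
            state = "key_or_end"
        elif state in ("key_or_end", "comma_or_end") and ch == "}":
            stack.pop()
            if not stack:
                break                     # top-level object is complete
            state = "comma_or_end"
        else:
            break                         # unexpected character: keep what we have

    return root if root is not None else {}


print(parse_partial('{"key": "val'))   # partial string value
print(parse_partial('{"key": '))       # key withheld until its value type is known
print(parse_partial('{"a": {"b'))      # nested object started, inner key incomplete
```

In this sketch a key is added to the result only at the moment a quote or opening brace follows its colon, which mirrors the key-handling rule described above.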

Implementation Approach

  1. Buffering: The consume() method appends incoming data chunks to an internal string buffer after escaping potentially invalid control characters.
  2. Parsing (get()):
    • The buffer is first cleaned by removing leading whitespace and any characters before the first {.
    • It attempts parsing using json.JSONDecoder.raw_decode (the raw_decode method of the standard library's decoder) for speed and standard compliance. If a dictionary is successfully decoded, it's returned, and the consumed portion is removed from the buffer.
    • If raw_decode fails (due to incomplete data, syntax errors, or non-standard features), it falls back to the IterativeStateMachine.
    • The IterativeStateMachine parses the buffer character by character, maintaining state to handle nested structures, different value types (including non-standard ones like unquoted keys), and partial inputs.
    • The get() method returns the dictionary parsed by either method and updates the buffer, removing a fully parsed object and any leading garbage before the next potential object. If nothing can be parsed yet, not even a partial object, an empty dictionary is returned.
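
The fast path described above can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: the helper name get_complete_object is hypothetical, and the fallback state machine is only indicated by a comment.

```python
import json

_decoder = json.JSONDecoder()


def get_complete_object(buffer: str):
    """Return (obj, remaining_buffer); obj is None when no complete object is ready."""
    start = buffer.find("{")
    if start == -1:
        return None, ""                  # nothing that could begin an object
    candidate = buffer[start:]           # drop leading garbage before the first {
    try:
        # raw_decode parses one JSON value and reports where it ended,
        # so trailing data (such as the next object) is not an error.
        obj, end = _decoder.raw_decode(candidate)
    except json.JSONDecodeError:
        # Incomplete or non-standard input: a lenient fallback parser would run here.
        return None, candidate
    return obj, candidate[end:]


obj, rest = get_complete_object('noise {"a": "b"}{"c": "d"}')
print(obj, repr(rest))
obj2, rest2 = get_complete_object(rest)
print(obj2, repr(rest2))
```

Because raw_decode returns the end index of the decoded value, the remaining buffer can be kept for the next call, which is how several objects in one buffer can be returned one at a time.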

Assumptions and Extensions

The implementation makes the following assumptions or extends the requirements:

  1. Handling of Additional Primitive Types: Supports numbers (int, float), booleans (true, false), and null as values, beyond the specified strings and objects.
  2. Handling of Arrays: Supports JSON arrays ([...]) as values within objects and can parse them, although get() only returns top-level objects (dict).
  3. Non-Standard JSON Support: Tolerates and parses:
    • Unquoted object keys (e.g., {key: "value"}).
    • Single-quoted strings (e.g., {'key': 'value'}).
  4. Escape Sequence Handling: Actively handles standard JSON escape sequences (e.g., \n, \") and Unicode escapes (\uXXXX) within strings, although they were "not expected".
  5. Control Character Handling: Escapes invalid JSON control characters (U+0000 to U+001F) found outside of strings in the input buffer using \uXXXX format during consume.
  6. Error Recovery/Robustness: Discards leading non-JSON data before the first { and attempts to parse the first valid object found. Handles multiple objects in the buffer sequentially across get() calls.
  7. Duplicate Keys: Does not explicitly prevent duplicate keys; standard Python dictionary behavior (last key wins) likely applies.
  8. Efficiency Strategy: Uses json.JSONDecoder.raw_decode first, falling back to a custom parser only when necessary.
  9. Input Type: consume expects string input; other types are ignored.
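
Several of these points are easier to see against the behavior of Python's standard json module, which is why a lenient fallback is needed at all. The short check below uses only the standard library; the helper is_standard_json is a hypothetical name introduced for illustration.

```python
import json


def is_standard_json(text: str) -> bool:
    """True if the strict standard-library decoder accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Duplicate keys: the standard decoder keeps the last value.
print(json.loads('{"a": "first", "a": "second"}'))

# Non-standard syntax is rejected by the strict decoder.
print(is_standard_json('{key: "value"}'))        # unquoted key
print(is_standard_json("{'key': 'value'}"))      # single-quoted strings

# A raw control character inside a string is rejected; its escaped form is accepted.
print(is_standard_json('{"key": "line1\nline2"}'))    # real newline character
print(is_standard_json('{"key": "line1\\nline2"}'))   # backslash-n escape sequence
```

Inputs that the strict decoder rejects here are the ones that the fallback parser or the control-character escaping step would have to deal with.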

Algorithmic Complexity

The efficiency of the StreamingJsonParser depends on the method being called and the nature of the input data stream.

  • consume(buffer: str):

    • Time Complexity: Primarily involves appending the new buffer (length k) to the internal buffer and performing basic character escaping. This is typically O(k). String concatenation in Python can sometimes be O(N+k) where N is the current buffer size, but often optimized closer to O(k) amortized.
    • Space Complexity: Increases the internal buffer size by O(k).
  • get():

    • Time Complexity:
      • Fast Path (json.JSONDecoder.raw_decode): If the buffer starts with a complete, standard JSON object of size P, Python's built-in decoder is used. This is generally efficient, expected to be around O(P).
      • Fallback Path (IterativeStateMachine): If raw_decode fails (due to incomplete data or non-standard syntax), the custom state machine parses the buffer character by character. In the worst case, it might need to scan a significant portion of the buffer (size B'). The complexity is dominated by this scan and subsequent buffer slicing, making it roughly O(B').
      • Overall: The complexity varies. It's close to O(P) when complete objects are readily available and standard, and approaches O(B') when parsing incomplete or non-standard streams requires the iterative fallback.
    • Space Complexity: Does not inherently allocate significant additional space beyond the internal representation of the parsed object being returned. The main space usage comes from the internal buffer managed by consume.
  • Overall Space Complexity: The primary factor is the internal buffer. In the worst case (e.g., a very large stream is consumed without any complete objects being parsed and removed by get()), the space complexity can be O(T), where T is the total size of the streamed data received so far. In typical usage where get() successfully parses and removes objects, the buffer size stays manageable.
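
As a rough illustration of how the fallback cost adds up, the sketch below counts the characters examined when get() is called after every chunk. It assumes that each call rescans the whole buffer from the start and that no complete object arrives until the end of the stream; both are simplifying assumptions, and chars_scanned is a hypothetical helper rather than part of the package.

```python
def chars_scanned(total_size: int, chunk_size: int) -> int:
    """Characters examined if the whole buffer is rescanned after every chunk."""
    scanned = 0
    buffered = 0
    while buffered < total_size:
        buffered = min(buffered + chunk_size, total_size)  # consume() one chunk
        scanned += buffered                                # get() rescans the buffer
    return scanned


# 1000 characters delivered in 10-character chunks versus in a single chunk.
print(chars_scanned(1000, 10))
print(chars_scanned(1000, 1000))
```

Under these assumptions the total work grows roughly quadratically in the stream size for a fixed chunk size, so calling get() less often on long incomplete streams reduces the cost.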

Usage

# Import the class
from streaming_json_parser import StreamingJsonParser

# Initialize the parser
parser = StreamingJsonParser()

# Consume JSON data chunks
parser.consume('{"name": "Example", "data": {"val') # Chunk ends in the middle of a nested key
parser.consume('ue": "stream"}}')                 # Close the nested and the outer object

# Get the current state of the parsed object
# This will return the first complete object found.
current_object = parser.get()
print(current_object)
# Output: {'name': 'Example', 'data': {'value': 'stream'}}

# The buffer is cleared/updated after get(), ready for the next object
parser.consume('{"next": "object"}')
next_object = parser.get()
print(next_object)
# Output: {'next': 'object'}

# Example with partial string value
parser = StreamingJsonParser()
parser.consume('{"key": "partial string')
partial_state = parser.get()
print(partial_state)
# Output: {'key': 'partial string'}

parser.consume(' complete"}')
complete_state = parser.get()
print(complete_state)
# Output: {'key': 'partial string complete'}

Setup

To use this parser and run the tests, you need to install the dependencies:

pip install -r requirements.txt

The requirements.txt file includes:

  • pytest
  • pytest-cov

Testing

Unit tests are provided in test_streaming_json_parser.py. You can run them using pytest:

pytest

Provenance

The following attestation bundles were made for streaming_json_parser-0.1.0.tar.gz:

Publisher: publish.yml on aramisfacchinetti/streaming-json-parser