A streaming JSON parser that processes JSON data incrementally, handling partial states. Useful for incrementally parsing partial responses from streaming outputs of Large Language Models (LLMs).

Maintainers

aramisfacchinetti

Project description

Streaming JSON Parser

Objective

This Python module implements a streaming JSON parser designed to process JSON data incrementally. The primary goal is to handle potentially incomplete JSON data streams, such as those produced by Large Language Models (LLMs), and return the current state of the parsed object at any time.

Requirements Subset

The parser is specifically designed for a subset of JSON where:

Values consist solely of strings and objects.
Escape sequences in strings are not expected (though the implementation handles them).
Duplicate keys in objects are not expected (though the implementation may tolerate them, typically keeping the last value).
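
For reference, the "last value wins" behavior mentioned above is what Python's standard json module does with duplicate keys. The snippet below only exercises the standard library; whether the package's fallback parser behaves identically is an assumption.

```python
import json

# Standard-library behavior, shown for reference only; the package's
# fallback state machine is not exercised here.
parsed = json.loads('{"key": "first", "key": "second"}')
print(parsed)  # {'key': 'second'}
```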

Features

Incremental Parsing: Consumes JSON data in chunks via the consume() method.
Partial State Retrieval: The get() method returns the currently parsed JSON object state, even if the input stream is incomplete.
Partial String Values: Returns partial string values as they are received (e.g., the truncated input {"key": "val is a valid partial state and yields {'key': 'val'}).
Key Handling: Keys are only included in the returned object once their value type (string or object start) is identified.
Robustness: Attempts to parse standard JSON efficiently and falls back to a more lenient state-machine parser for incomplete or slightly non-standard input.
Non-Standard JSON: Tolerates some non-standard features like unquoted keys and single-quoted strings.
Error Handling: Attempts to recover from invalid characters or find the first valid JSON object within the buffer.
Support for Primitives & Arrays: Although the requirements focused on strings and objects, the implementation also handles numbers, booleans, null, and arrays as values within objects.
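
To make the partial-string and key-handling rules concrete, here is a minimal sketch of a parser for the strings-and-objects subset only. It is not the package's implementation: the function name parse_partial is hypothetical, and the sketch ignores escape sequences, arrays, primitives, and error recovery.

```python
from typing import Any, Dict, List, Optional


def parse_partial(text: str) -> Dict[str, Any]:
    """Parse a possibly truncated JSON object of strings and nested objects.

    Illustrative sketch: no escapes, numbers, arrays, or error recovery.
    """
    start = text.find("{")
    if start == -1:
        return {}
    root: Dict[str, Any] = {}
    stack: List[Dict[str, Any]] = [root]  # objects that are still open
    key: Optional[str] = None             # complete key waiting for its value
    i = start + 1
    while i < len(text) and stack:
        ch = text[i]
        if ch == "{":
            if key is not None:
                # Value type identified as an object: the key becomes visible.
                child: Dict[str, Any] = {}
                stack[-1][key] = child
                stack.append(child)
                key = None
            i += 1
        elif ch == "}":
            stack.pop()
            i += 1
        elif ch == '"':
            end = text.find('"', i + 1)
            body = text[i + 1:] if end == -1 else text[i + 1:end]
            if key is None:
                if end == -1:
                    break  # truncated key: not exposed yet
                key = body
            else:
                # String value: exposed even when the closing quote is missing.
                stack[-1][key] = body
                key = None
                if end == -1:
                    break
            i = end + 1
        else:
            i += 1  # whitespace, ':' and ','
    return root
```

With this sketch, parse_partial('{"key": "val') returns {'key': 'val'}, while parse_partial('{"a": "x", "b": ') returns only {'a': 'x'} because the type of the value for "b" is not yet known.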

Implementation Approach

Buffering: The consume() method appends incoming data chunks to an internal string buffer after escaping potentially invalid control characters.
Parsing (get()):
- The buffer is first cleaned by removing leading whitespace and any characters before the first {.
- It attempts parsing using json.JSONDecoder.raw_decode for speed and standard compliance. If a dictionary is successfully decoded, it's returned, and the consumed portion is removed from the buffer.
- If raw_decode fails (due to incomplete data, syntax errors, or non-standard features), it falls back to the IterativeStateMachine.
- The IterativeStateMachine parses the buffer character by character, maintaining state to handle nested structures, different value types (including non-standard ones like unquoted keys), and partial inputs.
- The get() method returns the dictionary parsed by either method and updates the buffer, removing the parsed object and any leading garbage before the next potential object. If no object (complete or partial) can be parsed from the buffer, an empty dictionary is returned.
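
The fast-path and fallback flow described above can be sketched as follows. This is an illustration under assumptions, not the package's code: decode_first_object and the fallback callable are hypothetical names, and the fallback stands in for the IterativeStateMachine. Only json.JSONDecoder.raw_decode is a real standard-library call; it returns the decoded value together with the index where decoding stopped.

```python
import json
from typing import Any, Callable, Dict, Tuple


def decode_first_object(
    buffer: str,
    fallback: Callable[[str], Dict[str, Any]],
) -> Tuple[Dict[str, Any], str]:
    """Return (parsed object, remaining buffer) for the first object found."""
    start = buffer.find("{")
    if start == -1:
        return {}, ""  # assumption: nothing before a '{' is worth keeping
    candidate = buffer[start:]  # drop leading garbage before the first '{'
    try:
        # Fast path: succeeds only for a complete, standard JSON object.
        obj, end = json.JSONDecoder().raw_decode(candidate)
    except json.JSONDecodeError:
        # Incomplete or non-standard input: defer to the lenient parser
        # and keep the buffer so later chunks can complete the object.
        return fallback(candidate), candidate
    return obj, candidate[end:]
```

Passing the remaining buffer back into the function on the next call is one way to handle several objects sequentially, which is the behavior the description attributes to repeated get() calls.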

Assumptions and Extensions

The implementation makes the following assumptions or extends the requirements:

Handling of Additional Primitive Types: Supports numbers (int, float), booleans (true, false), and null as values, beyond the specified strings and objects.
Handling of Arrays: Supports JSON arrays ([...]) as values within objects and can parse them, although get() only returns top-level objects (dict).
Non-Standard JSON Support: Tolerates and parses:
- Unquoted object keys (e.g., {key: "value"}).
- Single-quoted strings (e.g., {'key': 'value'}).
Escape Sequence Handling: Actively handles standard JSON escape sequences (e.g., \n, \") and Unicode escapes (\uXXXX) within strings, although they were "not expected".
Control Character Handling: Escapes invalid JSON control characters (U+0000 to U+001F) found outside of strings in the input buffer using \uXXXX format during consume.
Error Recovery/Robustness: Discards leading non-JSON data before the first { and attempts to parse the first valid object found. Handles multiple objects in the buffer sequentially across get() calls.
Duplicate Keys: Does not explicitly prevent duplicate keys; standard Python dictionary behavior (last key wins) likely applies.
Efficiency Strategy: Uses json.JSONDecoder.raw_decode first, falling back to a custom parser only when necessary.
Input Type: consume expects string input; other types are ignored.
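
As an illustration of the control-character handling listed above, the sketch below rewrites control characters in a chunk as \uXXXX sequences. The helper name escape_control_chars is hypothetical, and two simplifications are assumptions of this sketch rather than documented behavior: it does not track whether a character sits inside or outside a string, and it leaves tab, line feed, and carriage return untouched because JSON permits them as whitespace between tokens.

```python
import re

# Control characters U+0000 to U+001F, excluding \t, \n and \r
# (assumption: those are left alone because JSON allows them as whitespace).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_control_chars(chunk: str) -> str:
    """Replace each matched control character with its \\uXXXX form."""
    return _CONTROL_CHARS.sub(lambda m: f"\\u{ord(m.group()):04x}", chunk)
```

For example, escape_control_chars('a\x00b') yields the six-character escape in place of the NUL character, giving 'a\\u0000b'.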

Algorithmic Complexity

The efficiency of the StreamingJsonParser depends on the method being called and the nature of the input data stream.

consume(buffer: str):
- Time Complexity: Primarily involves appending the new buffer (length k) to the internal buffer and performing basic character escaping. This is typically O(k). Because Python strings are immutable, concatenation is O(N + k) in the general case, where N is the current buffer size; CPython can sometimes extend the string in place, which brings the cost closer to O(k) amortized.
- Space Complexity: Increases the internal buffer size by O(k).
get():
- Time Complexity:
  - Fast Path (json.JSONDecoder.raw_decode): If the buffer starts with a complete, standard JSON object of size P, Python's built-in decoder is used. This is generally efficient, expected to be around O(P).
  - Fallback Path (IterativeStateMachine): If raw_decode fails (due to incomplete data or non-standard syntax), the custom state machine parses the buffer character by character. In the worst case, it might need to scan a significant portion of the buffer (size B'). The complexity is dominated by this scan and subsequent buffer slicing, making it roughly O(B').
  - Overall: The complexity varies. It's close to O(P) when complete objects are readily available and standard, and approaches O(B') when parsing incomplete or non-standard streams requires the iterative fallback.
- Space Complexity: Does not inherently allocate significant additional space beyond the internal representation of the parsed object being returned. The main space usage comes from the internal buffer managed by consume.
Overall Space Complexity: The primary factor is the internal buffer. In the worst case (e.g., a very large stream is consumed without any complete objects being parsed and removed by get()), the space complexity can be O(T), where T is the total size of the streamed data received so far. In typical usage where get() successfully parses and removes objects, the buffer size stays manageable.
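
A rough worked estimate of the fallback cost across a whole stream, under two assumptions that are not taken from the package: get() is called after every chunk, and each call rescans the entire buffer because the object is still incomplete. With total size T and chunk size k, the scans add up to about T²/(2k) characters. The function name below is hypothetical.

```python
def total_chars_scanned(total_size: int, chunk_size: int) -> int:
    """Characters scanned if the whole buffer is rescanned after each chunk."""
    scanned = 0
    buffered = 0
    while buffered < total_size:
        # A new chunk arrives, then the full buffer is scanned once.
        buffered = min(buffered + chunk_size, total_size)
        scanned += buffered
    return scanned


print(total_chars_scanned(1000, 100))  # 5500
print(total_chars_scanned(1000, 10))   # 50500
```

Under these assumptions, smaller chunks increase the total work for the same stream, so calling get() less often on long incomplete objects reduces the cost.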

Usage

# Import the class
from streaming_json_parser import StreamingJsonParser

# Initialize the parser
parser = StreamingJsonParser()

# Consume JSON data chunks
parser.consume('{"name": "Example", "data": {"val') # Chunk ends mid-key inside the nested object
parser.consume('ue": "stream"}}')                 # Close the nested and the outer object

# Get the current state of the parsed object
# This will return the first complete object found.
current_object = parser.get()
print(current_object)
# Output: {'name': 'Example', 'data': {'value': 'stream'}}

# The buffer is cleared/updated after get(), ready for the next object
parser.consume('{"next": "object"}')
next_object = parser.get()
print(next_object)
# Output: {'next': 'object'}

# Example with partial string value
parser = StreamingJsonParser()
parser.consume('{"key": "partial string')
partial_state = parser.get()
print(partial_state)
# Output: {'key': 'partial string'}

parser.consume(' complete"}')
complete_state = parser.get()
print(complete_state)
# Output: {'key': 'partial string complete'}

Setup

To use this parser and run the tests, you need to install the dependencies:

pip install -r requirements.txt

The requirements.txt file includes:

pytest
pytest-cov

Testing

Unit tests are provided in test_streaming_json_parser.py. You can run them using pytest:

pytest

Release history

This version

0.1.0

Apr 5, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

streaming_json_parser-0.1.0.tar.gz (19.9 kB)

Uploaded Apr 5, 2025 Source

Built Distribution

streaming_json_parser-0.1.0-py3-none-any.whl (13.5 kB)

Uploaded Apr 5, 2025 Python 3

File details

Details for the file streaming_json_parser-0.1.0.tar.gz.

File metadata

Download URL: streaming_json_parser-0.1.0.tar.gz
Upload date: Apr 5, 2025
Size: 19.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for streaming_json_parser-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ee6e67448fac2bfd1a4fe8511e6cbaa54e46e7b746284bf137bebd744efbdd6d`
MD5	`6f62d5e02eac03234c6006760dc4f88e`
BLAKE2b-256	`74abe64c774a5fc27c2238abffc517ee40feee5f61c4d561d44dc58b56f7b394`

Provenance

The following attestation bundles were made for streaming_json_parser-0.1.0.tar.gz:

Publisher: publish.yml on aramisfacchinetti/streaming-json-parser

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: streaming_json_parser-0.1.0.tar.gz
- Subject digest: ee6e67448fac2bfd1a4fe8511e6cbaa54e46e7b746284bf137bebd744efbdd6d
- Sigstore transparency entry: 192823794
- Sigstore integration time: Apr 5, 2025
Source repository:
- Permalink: aramisfacchinetti/streaming-json-parser@7e539b5b5fbffc1190a7a3517b899b50c0d63757
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/aramisfacchinetti
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7e539b5b5fbffc1190a7a3517b899b50c0d63757
- Trigger Event: release

File details

Details for the file streaming_json_parser-0.1.0-py3-none-any.whl.

File metadata

Download URL: streaming_json_parser-0.1.0-py3-none-any.whl
Upload date: Apr 5, 2025
Size: 13.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for streaming_json_parser-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f9e2129e058bdf0c03f450187d7afb78f15a6cee4974024e8f6fbc220b978e51`
MD5	`8022d56e40583063f88d2867806eb362`
BLAKE2b-256	`8ff5393f036e6174adb52629614f76a7d6fb5d3932f7f04de07ddac46c74581e`

Provenance

The following attestation bundles were made for streaming_json_parser-0.1.0-py3-none-any.whl:

Publisher: publish.yml on aramisfacchinetti/streaming-json-parser

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: streaming_json_parser-0.1.0-py3-none-any.whl
- Subject digest: f9e2129e058bdf0c03f450187d7afb78f15a6cee4974024e8f6fbc220b978e51
- Sigstore transparency entry: 192823795
- Sigstore integration time: Apr 5, 2025
Source repository:
- Permalink: aramisfacchinetti/streaming-json-parser@7e539b5b5fbffc1190a7a3517b899b50c0d63757
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/aramisfacchinetti
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7e539b5b5fbffc1190a7a3517b899b50c0d63757
- Trigger Event: release
