Robust JSON extraction and repair utilities for LLM-generated content.

robust-json

Parse JSON from messy LLM outputs with confidence. robust-json extracts and repairs JSON even when models mix commentary with structured data, use incorrect quotes, add trailing commas, include comments, or truncate responses mid-object.


Why robust-json?

Large Language Models are powerful but inconsistent when generating JSON. They might:

  • Mix text and JSON: Embed JSON inside markdown code blocks or conversational responses
  • Add comments: Include // or # comments that break standard JSON parsers
  • Use wrong quotes: Generate single quotes (') instead of double quotes (")
  • Add trailing commas: Place commas after the last item in arrays/objects
  • Truncate output: Stop mid-JSON due to token limits or errors

robust-json handles all these cases automatically, so you can focus on using the data instead of fighting with parser errors.
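
To make the problem concrete, the short sketch below feeds one sample per failure mode to Python's standard json module, which rejects every one of them. It uses only the standard library; the sample strings and the fails_strict_json helper are illustrative assumptions and are not part of robust-json.

```python
import json

# Illustrative inputs, one per failure mode listed above (not taken from the library).
samples = {
    "comment": '{"a": 1} // note',
    "single_quotes": "{'a': 1}",
    "trailing_comma": '{"a": [1, 2,]}',
    "truncated": '{"a": {"b": 1',
    "mixed_text": 'Sure! Here it is: {"a": 1}',
}


def fails_strict_json(text: str) -> bool:
    """Return True if the standard-library parser rejects the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return True
    return False


for name, text in samples.items():
    print(f"{name}: rejected={fails_strict_json(text)}")
```

Each sample is reported as rejected, which is the gap a tolerant parser is meant to close.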


Features

  • Smart extraction: Automatically finds JSON objects and arrays within free-form text
  • Auto-repair: Fixes common LLM errors including:
    • Single-quoted strings → double quotes
    • Mixed quote types (e.g., 'text" → 'text')
    • Inline comments (// and #)
    • Trailing commas
    • Unclosed braces and brackets
  • Multiple parsers: Falls back through json → ast.literal_eval for maximum compatibility
  • Performance: Optional speedups with regex (enhanced regex engine) and numba (JIT-compiled bracket scanning)
  • Unicode support: Handles international characters and emoji seamlessly
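
The json → ast.literal_eval fallback mentioned above can be sketched with the standard library alone. This is a minimal illustration of the idea, not the library's actual code; parse_with_fallback is a hypothetical helper name.

```python
import ast
import json
from typing import Any


def parse_with_fallback(text: str) -> Any:
    """Try strict JSON first, then fall back to Python literal syntax."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # ast.literal_eval accepts single-quoted strings and trailing commas,
    # but expects Python spellings (True/False/None), not true/false/null.
    return ast.literal_eval(text)


print(parse_with_fallback('{"ok": true}'))                 # handled by json.loads
print(parse_with_fallback("{'ok': True, 'n': [1, 2,],}"))  # handled by ast.literal_eval
```

The fallback only helps when the text is a valid Python literal, which is why a repair pass that normalizes quotes and strips comments would still be needed for most messy inputs.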

Installation

Basic installation:

pip install robust-json-parser

With performance optimizations (numba JIT):

pip install robust-json-parser[speedups]

With regex (enhanced regex engine with better Unicode support):

pip install robust-json-parser[regex]

All extras:

pip install robust-json-parser[speedups,regex]

Requirements: Python 3.9+


Quick Start

Basic Usage

from robust_json import loads

# LLM output with mixed formatting
llm_response = """
Sure! Here's the data you requested:
```json
{
  "name": "Alice",
  "age": 30,
  "hobbies": ["reading", "coding",],  // trailing comma
  "active": true,  # Python-style comment
}

Hope this helps!
"""

data = loads(llm_response)
print(data)
# {'name': 'Alice', 'age': 30, 'hobbies': ['reading', 'coding'], 'active': True}

Handling Malformed JSON

from robust_json import loads

# Mixed quotes, comments, and multilingual text
message = """
Hello, I'm a recruitment consultant. Here's the job description for your matching assessment:
```json
{"id": "algo", "position": "Large Language Model Algorithm Engineer",
# this is the keywords list used to analyze the candidate
 "keywords": {"positive": ["PEFT", "RLHF"], "negative": ["CNN", "RNN"]}, # negative keywords is supported
 "summary": 'The candidate has some AI background, but lacks experience."
 }
"""

data = loads(message)
print(data["keywords"]["positive"])
# ['PEFT', 'RLHF']

Truncated/Partial JSON

from robust_json import loads

# JSON cut off mid-object
incomplete = '{"user": {"name": "Bob", "email": "bob@example.com"'

data = loads(incomplete)
print(data)
# {'user': {'name': 'Bob', 'email': 'bob@example.com'}}
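
One plausible way to complete truncated input like this is to track open strings, braces, and brackets with a stack and append the missing closers. The sketch below shows that idea with the standard library only; close_open_brackets is a hypothetical helper and may differ from what robust-json does internally.

```python
import json


def close_open_brackets(text: str) -> str:
    """Append the closers needed to balance unclosed strings, braces and brackets."""
    closers = {"{": "}", "[": "]"}
    stack = []          # expected closing characters, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in closers:
            stack.append(closers[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
    suffix = '"' if in_string else ""
    return text + suffix + "".join(reversed(stack))


incomplete = '{"user": {"name": "Bob", "email": "bob@example.com"'
print(json.loads(close_open_brackets(incomplete)))
```

This sketch does not handle input that stops right after a key, a colon, or a comma; a fuller repair step would have to drop or complete the dangling item before closing.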

Extract Multiple JSON Objects

from robust_json import extract_all, RobustJSONParser

text = """
First result: {"a": 1, "b": 2}
Some text in between...
Second result: {"x": 10, "y": 20}
"""

# Get all extractions with metadata
extractions = extract_all(text)
for extraction in extractions:
    print(f"Found at position {extraction.start}: {extraction.text}")

# Or just get the parsed objects
parser = RobustJSONParser()
objects = parser.parse_all(text)
print(objects)
# [{'a': 1, 'b': 2}, {'x': 10, 'y': 20}]
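
For comparison, when every embedded fragment is already well-formed JSON, the standard library can pull several values out of free text with json.JSONDecoder.raw_decode. The sketch below is a standard-library baseline under that assumption, not robust-json's implementation; find_json_values is a hypothetical name.

```python
import json
from typing import Any, List


def find_json_values(text: str) -> List[Any]:
    """Collect every well-formed JSON object or array embedded in text, in order."""
    decoder = json.JSONDecoder()
    results = []
    pos = 0
    while pos < len(text):
        if text[pos] in "{[":
            try:
                value, end = decoder.raw_decode(text, pos)
            except json.JSONDecodeError:
                pos += 1      # not valid JSON here; keep scanning
                continue
            results.append(value)
            pos = end         # skip past the value so nested parts are not re-reported
        else:
            pos += 1
    return results


text = """
First result: {"a": 1, "b": 2}
Some text in between...
Second result: {"x": 10, "y": 20}
"""
print(find_json_values(text))
```

Unlike a repairing parser, this baseline silently skips any fragment with comments, single quotes, trailing commas, or missing closers.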

API Reference

loads(source, *, allow_partial=True, default=None, strict=False)

Parse the first JSON object found in the source text.

Parameters:

  • source (str): Text containing JSON
  • allow_partial (bool): If True, auto-complete truncated JSON (default: True)
  • default (Any, optional): Value to return if no JSON is found (default: None, in which case a ValueError is raised instead)
  • strict (bool): If True, only extract from code blocks and brace-delimited content (default: False)

Returns: Parsed Python object (dict, list, etc.)

Raises: ValueError if no JSON found and no default provided


extract(source, *, allow_partial=True)

Extract the first JSON-like fragment with metadata.

Returns: Extraction object or None


extract_all(source, *, allow_partial=True)

Extract all JSON-like fragments from text.

Returns: List of Extraction objects


RobustJSONParser

Main parser class for advanced usage.

Methods:

  • extract(source, limit=None): Find JSON fragments (returns list of Extraction objects)
  • parse_first(source): Parse first JSON object (returns parsed object or None)
  • parse_all(source): Parse all JSON objects (returns list of parsed objects)

Constructor parameters:

  • allow_partial (bool): Auto-complete truncated JSON (default: True)
  • strict (bool): Only extract from explicit JSON contexts (default: False)

Extraction

Dataclass representing an extracted JSON candidate.

Attributes:

  • text (str): The extracted text
  • start (int): Starting position in source
  • end (int): Ending position in source
  • is_partial (bool): Whether the extraction appears truncated
  • repaired (Optional[str]): The repaired version after processing
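
To illustrate how these attributes relate to the source text, here is a small sketch that finds fenced code blocks with a regular expression and records their positions. Candidate is a hypothetical stand-in with the same field names as Extraction; treating end as exclusive (so that source[start:end] equals text) is an assumption, as is the exact pattern used to find fences.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed


@dataclass
class Candidate:
    """Hypothetical stand-in with the same fields as the documented Extraction."""
    text: str
    start: int
    end: int                      # assumed exclusive, so source[start:end] == text
    is_partial: bool = False
    repaired: Optional[str] = None


# Matches the body of a fenced block, with an optional "json" language tag.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def fenced_candidates(source: str) -> List[Candidate]:
    """Return one Candidate per fenced code block found in source."""
    return [
        Candidate(text=m.group(1), start=m.start(1), end=m.end(1))
        for m in FENCED_BLOCK.finditer(source)
    ]


source = "Here you go:\n" + FENCE + 'json\n{"a": 1}\n' + FENCE + "\nDone."
for c in fenced_candidates(source):
    print(c.start, c.end, repr(c.text))
```

Keeping the offsets alongside the text makes it possible to report where a fragment came from or to replace it in place.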

How It Works

  1. Extraction: Scans text for JSON patterns using:

    • Markdown code blocks (```json ... ```)
    • Brace-balanced regions ({...}, [...])
  2. Repair: Applies fixes in order:

    • Strip // and # comments
    • Fix mixed quote types (e.g., 'text" → 'text')
    • Normalize single quotes to double quotes
    • Remove trailing commas
    • Balance unclosed braces (if allow_partial=True)
  3. Parse: Attempts parsing with:

    • json.loads() (standard JSON)
    • ast.literal_eval() (Python literals)
  4. Return: Returns the first successful parse; if a candidate fails to parse, moves on to the next candidate
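
Two of the repair steps listed above, comment stripping followed by trailing-comma removal, can be sketched as follows. This is a simplified standard-library illustration under stated assumptions, not the library's code: strip_comments, repair, and TRAILING_COMMA are hypothetical names, and only double-quoted strings are protected from comment stripping.

```python
import json
import re


def strip_comments(text: str) -> str:
    """Drop // and # comments, leaving double-quoted string contents untouched."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                out.append(text[i + 1])   # keep the escaped character verbatim
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch == "#" or text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < n and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Simplification: this pattern is not string-aware, so a literal ",}" inside a
# string value would also be altered.
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def repair(text: str) -> str:
    """Apply the two repairs in the order listed above: comments, then commas."""
    return TRAILING_COMMA.sub(r"\1", strip_comments(text))


messy = """{
  "url": "https://example.com/#top",
  "hobbies": ["reading", "coding",],  // trailing comma
  "active": true,  # Python-style comment
}"""
print(json.loads(repair(messy)))
```

Stripping comments first matters here: the trailing comma after "active": true is only adjacent to the closing brace once the # comment between them has been removed.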


Use Cases

  • LLM Integration: Parse structured output from ChatGPT, Claude, Llama, etc.
  • Data Extraction: Extract JSON from logs, documentation, or mixed-format files
  • API Responses: Handle malformed API responses gracefully
  • Testing: Validate and repair JSON in test fixtures
  • Data Migration: Clean up inconsistent JSON during migrations

Performance Tips

  1. Install speedups for large-scale processing:

    pip install robust-json-parser[speedups]  # numba JIT compilation
    pip install robust-json-parser[regex]  # enhanced regex engine with better Unicode support
    
  2. Use strict mode when JSON is always in code blocks:

    loads(text, strict=True)  # Faster, skips fallback attempts
    
  3. Disable partial completion if you know JSON is complete:

    loads(text, allow_partial=False)  # Skips brace-balancing step
    
  4. Reuse parser instance for multiple parses:

    parser = RobustJSONParser()
    for text in texts:
        data = parser.parse_first(text)
    

Test Status

Overall test pass rate: 97.9% (139/142 tests passing)

Category             Test File                   Passed  Failed  Total  Pass Rate  Status
Core Functionality   test_parser.py                   5       0      5     100.0%  ✅
Comprehensive Tests  test_comprehensive.py           49       2     51      96.1%  ✅
Edge Cases           test_edge_cases.py              38       1     39      97.4%  ✅
LLM Scenarios        test_llm_scenarios.py           30       1     31      96.8%  ✅
Performance          test_performance.py             11       0     11     100.0%  ✅
Batch Processing     test_batch_performance.py        5       0      5     100.0%  ✅

Test Categories Breakdown

  • ✅ Core Functionality (100%): Basic parsing, extraction, and repair features
  • ✅ Comprehensive Tests (96.1%): Real-world scenarios, complex nested structures, multilingual content
  • ✅ Edge Cases (97.4%): Unicode handling, malformed JSON, bracket matching, error recovery
  • ✅ LLM Scenarios (96.8%): ChatGPT/Claude-style outputs, conversational text extraction
  • ✅ Performance (100%): Large datasets, memory usage, parsing speed benchmarks
  • ✅ Batch Processing (100%): Parallel processing, multiprocessing, error handling

Known Issues (3 failing tests)

  • Complex Incomplete JSON: Token-limited LLM outputs with deeply nested incomplete structures
  • Extraction Order: extract_all function needs to preserve proper ordering
  • Deep Nesting: Complex nested structures with mismatched brackets need enhanced repair

Contributing

We welcome contributions from developers of all skill levels! Whether you're fixing bugs, adding features, or improving documentation, your help makes this project better for everyone.

How to Contribute

  1. Bug Reports: Found an issue? Open a GitHub issue with:

    • Clear description of the problem
    • Minimal reproducible example
    • Expected vs actual behavior
  2. Feature Requests: Have an idea? We'd love to hear it! Open an issue to discuss:

    • Use case and motivation
    • Proposed implementation approach
    • Any breaking changes
  3. Code Contributions: Ready to code? Here's how:

    # Fork and clone the repository
    git clone https://github.com/your-username/robust-json.git
    cd robust-json
    
    # Install in development mode
    pip install -e ".[speedups,regex,dev]"
    
    # Run tests to ensure everything works
    pytest tests/
    
    # Make your changes and test them
    pytest tests/ -v
    
    # Submit a pull request
    

Testing Your Changes

# Run all tests
pytest tests/

# Run specific test categories
pytest tests/test_parser.py          # Core functionality
pytest tests/test_comprehensive.py   # Comprehensive scenarios
pytest tests/test_llm_scenarios.py   # LLM-specific cases
pytest tests/test_edge_cases.py      # Edge cases and error handling
pytest tests/test_performance.py     # Performance benchmarks

# Run with coverage
pytest tests/ --cov=robust_json --cov-report=html

Areas We'd Love Help With

  • Internationalization: Better support for non-Latin scripts and RTL languages
  • Performance: Optimize parsing speed for very large JSON objects
  • LLM Integration: Improve extraction from more LLM output formats
  • Documentation: Examples, tutorials, and API documentation
  • Test Coverage: Add more edge cases and real-world scenarios
  • Bug Fixes: Help us get to 100% test pass rate!

Development Guidelines

  • Code Style: Follow PEP 8, use type hints, and add docstrings
  • Testing: Add tests for new features and bug fixes
  • Documentation: Update README and docstrings as needed
  • Performance: Consider performance impact of changes
  • Compatibility: Maintain Python 3.9+ compatibility

Recognition

Contributors will be recognized in our README and release notes. We appreciate every contribution, no matter how small!

Ready to get started? Check out our open issues or start with the failing tests above!


License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

Built for developers working with LLM-generated content who need reliability without sacrificing flexibility.


Made with ❤️ for the AI/LLM community
