Extract and parse JSON from unstructured text outputs from LLMs

LLM Output Parser

A robust utility for extracting and parsing structured data (JSON and XML) from unstructured text outputs generated by Large Language Models (LLMs).

Features

  • Extracts JSON and XML from plain text, code blocks, and mixed content
  • Handles various JSON formats: objects, arrays, and nested structures
  • Converts XML to JSON-compatible dictionary format
  • Advanced extraction strategies for multiple JSON/XML objects in text
  • Provides robust error handling and recovery strategies
  • Works with markdown code blocks (```json ... ``` and ```xml ... ```)
  • Intelligently selects the most comprehensive structure when multiple are found

Installation

Install from PyPI:

pip install llm-output-parser

Or install from source:

git clone https://github.com/KameniAlexNea/llm-output-parser.git
cd llm-output-parser
pip install -e .

Usage

JSON Parsing

from llm_output_parser import parse_json

# Parse JSON from an LLM response
llm_response = """
Here's the data you requested:


{
    "name": "John Doe",
    "age": 30,
    "skills": ["Python", "Machine Learning", "NLP"]
}


Let me know if you need anything else!
"""

data = parse_json(llm_response)
print(data["name"])  # John Doe
print(data["skills"])  # ['Python', 'Machine Learning', 'NLP']

XML Parsing

from llm_output_parser import parse_xml

# Parse XML from an LLM response and convert to JSON
llm_response = """
Here's the user data in XML format:

```xml
<user id="123">
    <name>Jane Smith</name>
    <email>jane@example.com</email>
    <roles>
        <role>admin</role>
        <role>editor</role>
    </roles>
</user>
```

Let me know if you need any other information.
"""

data = parse_xml(llm_response)
print(data["@id"])  # 123
print(data["name"])  # Jane Smith
print(data["roles"]["role"])  # ['admin', 'editor']


Handling Complex Cases

The library can handle various complex scenarios:

JSON Within Text

text = 'The user profile is: {"name": "John", "email": "john@example.com"}'
data = parse_json(text)  # -> {"name": "John", "email": "john@example.com"}

XML Within Text

text = 'The configuration is: <config><server>localhost</server><port>8080</port></config>'
data = parse_xml(text)  # -> {"server": "localhost", "port": "8080"}

Multiple JSON/XML Objects

When multiple valid objects are present, the parser returns the most comprehensive one:

# For JSON
text = '''
Small object: {"id": 123}

Larger object:
{
    "user": {
        "id": 123,
        "name": "John",
        "email": "john@example.com",
        "preferences": {
            "theme": "dark",
            "notifications": true
        }
    }
}
'''
data = parse_json(text)  # Returns the larger, more complex object

# For XML
text = '''
Simple: <item>value</item>

Complex:
<product category="electronics">
    <name>Smartphone</name>
    <price currency="USD">999.99</price>
    <features>
        <feature>5G</feature>
        <feature>High-res camera</feature>
    </features>
</product>
'''
data = parse_xml(text)  # Returns the more complex XML converted to JSON
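
To make the "most comprehensive" selection concrete, here is a minimal standard-library sketch of one way such a selection could work for JSON. It is not the library's implementation: the helper names `find_json_candidates` and `most_comprehensive` are hypothetical, and ranking candidates by the length of their serialized form is an assumed heuristic, since the README does not say how `parse_json` ranks them.

```python
import json


def find_json_candidates(text):
    """Return every JSON object or array that decodes at some offset in text."""
    decoder = json.JSONDecoder()
    candidates = []
    pos = 0
    while pos < len(text):
        # Only '{' and '[' can start a JSON object or array.
        if text[pos] not in "{[":
            pos += 1
            continue
        try:
            value, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON at this offset; try the next character.
            pos += 1
            continue
        candidates.append(value)
        pos = end  # continue after the decoded value so its nested parts are skipped
    return candidates


def most_comprehensive(text):
    """Pick the candidate with the longest serialized form (assumed heuristic)."""
    candidates = find_json_candidates(text)
    if not candidates:
        raise ValueError("no JSON object or array found")
    return max(candidates, key=lambda value: len(json.dumps(value)))


sample = 'Small: {"id": 123}\nLarger: {"user": {"id": 123, "name": "John"}}'
best = most_comprehensive(sample)
print(best)
```

Other rankings (nesting depth, number of keys) would be equally plausible; the sketch only illustrates the scan-then-select idea.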

XML to JSON Conversion Details

When parsing XML, the library converts it to a JSON-compatible dictionary with the following conventions:

  • XML attributes are prefixed with @ (e.g., <item id="123"> becomes {"@id": "123"})
  • Text content of elements that also have attributes or children is stored under the #text key
  • Simple elements with only text become key-value pairs
  • Repeated elements are automatically converted to arrays

Example:

xml_str = '''
<library>
    <book category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
    </book>
    <book category="non-fiction">
        <title>Sapiens</title>
        <author>Yuval Noah Harari</author>
    </book>
</library>
'''
data = parse_xml(xml_str)
# Results in:
# {
#     "book": [
#         {
#             "@category": "fiction",
#             "title": "The Great Gatsby",
#             "author": "F. Scott Fitzgerald"
#         },
#         {
#             "@category": "non-fiction",
#             "title": "Sapiens",
#             "author": "Yuval Noah Harari"
#         }
#     ]
# }
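
The conventions above can be illustrated with a short sketch built on the standard library's `xml.etree.ElementTree`. This is an approximation written for this README, not the code `parse_xml` actually runs: the function name `element_to_value` is hypothetical, and details such as whitespace stripping and how empty elements are represented are assumptions.

```python
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Convert an element to a str (text only) or a dict (attributes/children)."""
    children = list(element)
    text = (element.text or "").strip()
    # Simple element with only text: becomes a plain string value.
    if not children and not element.attrib:
        return text
    result = {}
    # Attributes are stored under "@name".
    for name, value in element.attrib.items():
        result["@" + name] = value
    for child in children:
        value = element_to_value(child)
        if child.tag in result:
            # Repeated tag: promote the existing entry to a list, then append.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    # Text that sits next to attributes or children goes under "#text".
    if text:
        result["#text"] = text
    return result


price = element_to_value(ET.fromstring('<price currency="USD">999.99</price>'))
print(price)  # {'@currency': 'USD', '#text': '999.99'}
```

As in the library example, the root element's own tag is not kept as a key; only its attributes and children appear in the result.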

Error Handling

If no valid structure can be found, a ValueError is raised:

try:
    data = parse_json("No JSON here!")
except ValueError as e:
    print(f"Error: {e}")  # "Error: Failed to parse JSON from the input string."

try:
    data = parse_xml("No XML here!")
except ValueError as e:
    print(f"Error: {e}")  # "Error: Failed to parse XML from the input string."
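
When a missing structure should not abort the program, the `ValueError` can be turned into a fallback value. The wrapper below is a small sketch, not part of the library: `parse_or_default` is a hypothetical helper, and `json.loads` stands in for `parse_json` so the example runs without the package installed (`json.JSONDecodeError` is a subclass of `ValueError`).

```python
import json


def parse_or_default(parser, text, default=None):
    """Call parser(text) and return default if it raises ValueError."""
    try:
        return parser(text)
    except ValueError:
        return default


# json.loads is a stand-in for parse_json / parse_xml in this sketch.
ok = parse_or_default(json.loads, '{"ok": true}')
fallback = parse_or_default(json.loads, "No JSON here!", default={})
print(ok)        # {'ok': True}
print(fallback)  # {}
```

With the library installed, `parse_json` or `parse_xml` could be passed as `parser`, since both are documented above as raising `ValueError` when nothing can be parsed.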

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
