Extract and parse JSON from unstructured text outputs from LLMs
LLM Output Parser
A robust utility for extracting and parsing structured data (JSON and XML) from unstructured text outputs generated by Large Language Models (LLMs).
Features
- Extracts JSON and XML from plain text, code blocks, and mixed content
- Handles various JSON formats: objects, arrays, and nested structures
- Converts XML to JSON-compatible dictionary format
- Advanced extraction strategies for multiple JSON/XML objects in text
- Provides robust error handling and recovery strategies
- Works with markdown code blocks (fenced blocks tagged `json` or `xml`)
- Intelligently selects the most comprehensive structure when multiple are found
Installation
Install from PyPI:
pip install llm-output-parser
Or install from source:
git clone https://github.com/KameniAlexNea/llm-output-parser.git
cd llm-output-parser
pip install -e .
Usage
JSON Parsing
from llm_output_parser import parse_json
# Parse JSON from an LLM response
llm_response = """
Here's the data you requested:
{
"name": "John Doe",
"age": 30,
"skills": ["Python", "Machine Learning", "NLP"]
}
Let me know if you need anything else!
"""
data = parse_json(llm_response)
print(data["name"]) # John Doe
print(data["skills"]) # ['Python', 'Machine Learning', 'NLP']
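For readers curious how JSON embedded in free text can be located at all, here is a minimal standard-library sketch. It is not the library's implementation; the helper name `find_json_values` is hypothetical. It scans for `{` or `[` and asks `json.JSONDecoder.raw_decode` to decode a value starting at that position.

```python
import json


def find_json_values(text):
    """Return every top-level JSON object or array embedded in text.

    Illustrative only: this is NOT how llm-output-parser is implemented.
    """
    decoder = json.JSONDecoder()
    values = []
    pos = 0
    while pos < len(text):
        # Jump to the next character that could open an object or array.
        starts = [i for i in (text.find("{", pos), text.find("[", pos)) if i != -1]
        if not starts:
            break
        start = min(starts)
        try:
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here; keep scanning
            continue
        values.append(value)
        pos = end  # continue after the decoded value so its nested parts are not re-reported
    return values


reply = 'Here you go: {"name": "John Doe", "skills": ["Python", "NLP"]} Anything else?'
print(find_json_values(reply))
```

A scan like this ignores surrounding prose and tolerates stray brackets, but it only accepts strictly valid JSON; the recovery strategies mentioned under Features go beyond what this sketch does.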
XML Parsing
from llm_output_parser import parse_xml
# Parse XML from an LLM response and convert to JSON
llm_response = """
Here's the user data in XML format:
```xml
<user id="123">
<name>Jane Smith</name>
<email>jane@example.com</email>
<roles>
<role>admin</role>
<role>editor</role>
</roles>
</user>
```
Let me know if you need any other information.
"""
data = parse_xml(llm_response)
print(data["@id"]) # 123
print(data["name"]) # Jane Smith
print(data["roles"]["role"]) # ['admin', 'editor']
### Handling Complex Cases
The library can handle various complex scenarios:
#### JSON Within Text
```python
text = 'The user profile is: {"name": "John", "email": "john@example.com"}'
data = parse_json(text) # -> {"name": "John", "email": "john@example.com"}
```
XML Within Text
text = 'The configuration is: <config><server>localhost</server><port>8080</port></config>'
data = parse_xml(text) # -> {"server": "localhost", "port": "8080"}
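As a rough illustration of how an XML fragment might be pulled out of surrounding prose, the sketch below uses a regular expression to find candidate `<tag>...</tag>` spans and keeps only those that `xml.etree.ElementTree` accepts. This is an assumption-laden sketch, not the library's approach; `find_xml_elements` is a hypothetical name, and the non-greedy pattern does not handle an element nested inside another element of the same name.

```python
import re
import xml.etree.ElementTree as ET

# Matches <tag ...> ... </tag> where the closing tag repeats the opening name.
# Non-greedy, so same-name nesting is not supported (known limitation of this sketch).
_ELEMENT = re.compile(r"<([A-Za-z_][\w.-]*)\b[^>]*>.*?</\1>", re.DOTALL)


def find_xml_elements(text):
    """Return parsed Elements for every well-formed candidate found in text."""
    elements = []
    for match in _ELEMENT.finditer(text):
        try:
            elements.append(ET.fromstring(match.group(0)))
        except ET.ParseError:
            continue  # looked like XML but was not well-formed
    return elements


text = "The configuration is: <config><server>localhost</server><port>8080</port></config>"
root = find_xml_elements(text)[0]
print(root.tag, root.find("port").text)
```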
Multiple JSON/XML Objects
When multiple valid objects are present, the parser returns the most comprehensive one:
# For JSON
text = '''
Small object: {"id": 123}
Larger object:
{
"user": {
"id": 123,
"name": "John",
"email": "john@example.com",
"preferences": {
"theme": "dark",
"notifications": true
}
}
}
'''
data = parse_json(text) # Returns the larger, more complex object
# For XML
text = '''
Simple: <item>value</item>
Complex:
<product category="electronics">
<name>Smartphone</name>
<price currency="USD">999.99</price>
<features>
<feature>5G</feature>
<feature>High-res camera</feature>
</features>
</product>
'''
data = parse_xml(text) # Returns the more complex XML converted to JSON
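The README does not say how "most comprehensive" is measured. One plausible scoring rule, shown here purely as an assumption, is to count every container and scalar in each parsed candidate and keep the candidate with the highest count. The names `count_nodes` and `most_comprehensive` are hypothetical.

```python
import json


def count_nodes(value):
    """Count every dict, list, and scalar reachable from value."""
    if isinstance(value, dict):
        return 1 + sum(count_nodes(v) for v in value.values())
    if isinstance(value, list):
        return 1 + sum(count_nodes(v) for v in value)
    return 1


def most_comprehensive(candidates):
    """Return the candidate with the most nodes (an assumed scoring rule)."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=count_nodes)


small = json.loads('{"id": 123}')
large = json.loads('{"user": {"id": 123, "name": "John", "preferences": {"theme": "dark"}}}')
print(most_comprehensive([small, large]) is large)  # True
```

Other rules, such as serialized length or nesting depth, would also pick the larger object in the examples above; which one the library uses is not stated here.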
XML to JSON Conversion Details
When parsing XML, the library converts it to a JSON-compatible dictionary with the following conventions:
- XML attributes are prefixed with `@` (e.g., `<item id="123">` becomes `{"@id": "123"}`)
- Text content of elements with attributes or children is stored under the `#text` key
- Simple elements with only text become key-value pairs
- Repeated elements are automatically converted to arrays
Example:
xml_str = '''
<library>
<book category="fiction">
<title>The Great Gatsby</title>
<author>F. Scott Fitzgerald</author>
</book>
<book category="non-fiction">
<title>Sapiens</title>
<author>Yuval Noah Harari</author>
</book>
</library>
'''
data = parse_xml(xml_str)
# Results in:
# {
# "book": [
# {
# "@category": "fiction",
# "title": "The Great Gatsby",
# "author": "F. Scott Fitzgerald"
# },
# {
# "@category": "non-fiction",
# "title": "Sapiens",
# "author": "Yuval Noah Harari"
# }
# ]
# }
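To make the conventions above concrete, here is a small standard-library sketch that applies them with `xml.etree.ElementTree`. It is an independent illustration, not the library's code: `element_to_dict` is a hypothetical name, and details such as returning an empty string for an empty element are assumptions.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an Element using the @attribute / #text conventions."""
    children = list(element)
    text = (element.text or "").strip()
    # A leaf with no attributes collapses to its text (assumed: "" when empty).
    if not children and not element.attrib:
        return text
    result = {f"@{name}": value for name, value in element.attrib.items()}
    for child in children:
        converted = element_to_dict(child)
        if child.tag in result:
            # Repeated tag: promote the existing entry to a list, then append.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(converted)
        else:
            result[child.tag] = converted
    if text:
        result["#text"] = text
    return result


price = ET.fromstring('<price currency="USD">999.99</price>')
print(element_to_dict(price))  # {'@currency': 'USD', '#text': '999.99'}
```

As in the example output above, the root element's own tag name is not included in the result; only its attributes, children, and text are.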
Error Handling
If no valid structure can be found, a ValueError is raised:
try:
    data = parse_json("No JSON here!")
except ValueError as e:
    print(f"Error: {e}") # "Error: Failed to parse JSON from the input string."
try:
    data = parse_xml("No XML here!")
except ValueError as e:
    print(f"Error: {e}") # "Error: Failed to parse XML from the input string."
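When a missing structure should not stop the program, the `ValueError` can be turned into a default value. The wrapper below is a hypothetical helper, not part of the package; it accepts any parser callable, so `parse_json` or `parse_xml` could be passed in. To keep the snippet runnable without the package installed, `json.loads` stands in as the parser.

```python
import json


def parse_or_default(parser, text, default=None):
    """Call parser(text); return default if it raises ValueError."""
    try:
        return parser(text)
    except ValueError:
        return default


# json.loads is only a stand-in parser here; json.JSONDecodeError is a
# subclass of ValueError, so it is caught the same way.
print(parse_or_default(json.loads, "No JSON here!", default={}))  # {}
print(parse_or_default(json.loads, '{"ok": true}'))  # {'ok': True}
```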
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.