perse converts HTML content into structured JSON data
Perse
Perse converts HTML to JSON using a mix of traditional HTML parsing and LLM-based data extraction. After fetching the HTML, it performs a few optimizations without accidentally removing any important data.
These optimizations include:
- Removing styling, scripting, and SVG tags
- Collapsing tags (e.g. divs) that have only one child
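The two optimizations above can be sketched with the standard library alone. This is an illustration, not Perse's actual implementation: the names `TreeBuilder`, `collapse`, and `simplify` are hypothetical, and it assumes that only `div` and `span` wrappers are collapsed and that a collapsed wrapper's attributes are discarded.

```python
from html import escape
from html.parser import HTMLParser

DROPPED = {"style", "script", "svg"}  # removed together with their subtree
COLLAPSIBLE = {"div", "span"}         # assumption: which wrappers may be collapsed
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a small tree and skips everything inside DROPPED tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in DROPPED:
            if tag not in VOID:
                self.skip_depth += 1
            return
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags never open a new level.
        if not self.skip_depth and tag not in DROPPED:
            self.stack[-1].children.append(Node(tag, attrs))

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
            return
        # Pop back to the matching open tag; stray closers are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.stack[-1].children.append(data.strip())


def collapse(node):
    """Replace a collapsible wrapper whose only child is an element with that child."""
    node.children = [collapse(c) if isinstance(c, Node) else c for c in node.children]
    only_child = len(node.children) == 1 and isinstance(node.children[0], Node)
    if node.tag in COLLAPSIBLE and only_child:
        return node.children[0]
    return node


def render(node):
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(render(c) for c in node.children)
    if node.tag == "root":
        return inner
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
        for k, v in node.attrs.items()
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def simplify(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return render(collapse(builder.root))
```

A wrapper that contains text or more than one child is left alone, which is one way to avoid dropping content while still shrinking the markup sent to the LLM. The sketch assumes reasonably well-formed markup inside the dropped tags.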
Installation
pip install zf-perse
Usage
export PERSE_OPENAI_API_KEY="your-openai-api-key"
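When calling Perse from a script, it can help to fail early if the variable is missing. The helper below is a hypothetical convenience, not part of Perse; it only assumes the variable name shown in the export line above.

```python
import os

ENV_VAR = "PERSE_OPENAI_API_KEY"  # name taken from the export line above


def require_api_key(environ=None):
    """Return the key, or raise a readable error if it is missing or blank."""
    environ = os.environ if environ is None else environ
    key = environ.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; export it before running perse")
    return key
```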
CLI
perse --url https://example.com
Python
import requests
from perse import perse

# Fetch the page yourself, then hand the raw HTML to perse
url = "https://example.com"
html = requests.get(url).text
result = perse(html)
print(result)
Example
Google's Homepage
$ perse --url https://google.com
{'image': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png', 'title': 'Google', 'navigation_links': [{'link_name': 'About', 'href': 'https://about.google/?fg=1&utm_source=google-SG&utm_medium=referral&utm_campaign=hp-header'}, {'link_name': 'Store', 'href': 'https://store.google.com/SG?utm_source=hp_header&utm_medium=google_ooo&utm_campaign=GS100042&hl=en-SG'}], 'logo': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png', 'search_form': {'action': '/search', 'method': 'GET', 'autocomplete': 'off', 'search_field': 'q', 'buttons': [{'button_text': 'Google Search', 'button_action': 'submit'}, {'button_text': "I'm Feeling Lucky", 'button_action': 'submit'}]}}
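The result above can be consumed like any nested Python dictionary. The sketch below uses an excerpt of that output; `link_names` is a hypothetical helper. Since the extraction is LLM based, it is probably safest to assume that key names such as `navigation_links` are not guaranteed for every page, so the helper tolerates a missing key.

```python
# Excerpt of the Google homepage result shown above
result = {
    "title": "Google",
    "navigation_links": [
        {
            "link_name": "About",
            "href": "https://about.google/?fg=1&utm_source=google-SG&utm_medium=referral&utm_campaign=hp-header",
        },
        {
            "link_name": "Store",
            "href": "https://store.google.com/SG?utm_source=hp_header&utm_medium=google_ooo&utm_campaign=GS100042&hl=en-SG",
        },
    ],
}


def link_names(data):
    """Collect link names, tolerating a missing or differently shaped key."""
    links = data.get("navigation_links", [])
    return [
        item["link_name"]
        for item in links
        if isinstance(item, dict) and "link_name" in item
    ]


print(link_names(result))  # ['About', 'Store']
```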
Input
$ perse --url https://zeffmuks.com
{
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"og": {
"type": "website",
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"url": "https://www.zeffmuks.com/",
"image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
"site_name": "Zeff Muks",
},
"twitter": {
"card": "summary_large_image",
"site": "@zeffmuks",
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
},
"main_header": "Antifragile Entropy Assassin 🥷🏻",
"header_link": "https://x.com/zeffmuks",
"builds": [
{
"date": "08/30/2024",
"project": {
"name": "Cursor Git",
"description": "Enhanced Git for Cursor AI Editor",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png",
"download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
"external_link": "",
},
},
{
"date": "08/18/2024",
"project": {
"name": "PyZF",
"description": "Enhancements for Python",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png",
"download_link": "",
"external_link": "https://pypi.org/project/PyZF",
},
},
{
"date": "08/05/2024",
"project": {
"name": "Xanthus",
"description": "X (formerly Twitter) Assistant",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png",
"download_link": "",
"external_link": "https://pypi.org/project/zf-xanthus",
},
},
{
"date": "07/24/2024",
"project": {
"name": "Jenga",
"description": "Fast JSON5 Python Library",
"logo_url": "",
"download_link": "https://pypi.org/project/zf-jenga",
"external_link": "",
},
},
...
License