perse converts HTML content into structured JSON data
Perse
Perse converts HTML to JSON using a mix of traditional HTML parsing and LLM-based data extraction. After fetching the HTML, it performs a few optimizations without accidentally removing any important data.
These optimizations include:
- Removal of styling, scripting and svg tags
- Collapsing tags (e.g. divs) that have only one child
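The two optimizations above can be sketched with the standard library alone. This is an illustration, not Perse's actual implementation: the names `Node`, `TreeBuilder`, `simplify`, `render` and `minimize_html` are hypothetical, and the rule of collapsing only attribute-less divs is an assumption chosen so that no attribute data is lost.

```python
from html.parser import HTMLParser

DROP_TAGS = {"style", "script", "svg"}  # removed together with their contents
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal element node; children are Node instances or text strings."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree under a synthetic 'root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def simplify(node):
    """Drop unwanted tags, then collapse divs that wrap a single element."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
        elif child.tag not in DROP_TAGS:
            kept.append(simplify(child))
    node.children = kept
    # Assumption: only attribute-less divs are collapsed, so nothing is lost.
    if (node.tag == "div" and not node.attrs
            and len(kept) == 1 and isinstance(kept[0], Node)):
        return kept[0]
    return node


def render(node):
    """Serialize the tree back to an HTML string."""
    if isinstance(node, str):
        return node
    inner = "".join(render(child) for child in node.children)
    if node.tag == "root":
        return inner
    attrs = "".join(f" {k}" if v is None else f' {k}="{v}"'
                    for k, v in node.attrs.items())
    if node.tag in VOID_TAGS:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def minimize_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return render(simplify(builder.root))
```

For example, `minimize_html('<div><div><p>Hello</p></div><script>var x = 1;</script></div>')` reduces the input to the single paragraph element, which is a much smaller payload to hand to an LLM.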
Installation
pip install zf-perse
Usage
export PERSE_OPENAI_API_KEY="your-openai-api-key"
CLI
perse --url https://example.com
Python
import requests
from perse import perse
url = "https://example.com"
# Fetch the page yourself, then hand the raw HTML to perse
html = requests.get(url).text
j = perse(html)
print(j)
Example
Google's Homepage
$ perse --url https://google.com
{'image': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png', 'title': 'Google', 'navigation_links': [{'link_name': 'About', 'href': 'https://about.google/?fg=1&utm_source=google-SG&utm_medium=referral&utm_campaign=hp-header'}, {'link_name': 'Store', 'href': 'https://store.google.com/SG?utm_source=hp_header&utm_medium=google_ooo&utm_campaign=GS100042&hl=en-SG'}], 'logo': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png', 'search_form': {'action': '/search', 'method': 'GET', 'autocomplete': 'off', 'search_field': 'q', 'buttons': [{'button_text': 'Google Search', 'button_action': 'submit'}, {'button_text': "I'm Feeling Lucky", 'button_action': 'submit'}]}}
Input
$ perse --url https://zeffmuks.com
{
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"og": {
"type": "website",
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"url": "https://www.zeffmuks.com/",
"image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
"site_name": "Zeff Muks",
},
"twitter": {
"card": "summary_large_image",
"site": "@zeffmuks",
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
},
"main_header": "Antifragile Entropy Assassin 🥷🏻",
"header_link": "https://x.com/zeffmuks",
"builds": [
{
"date": "08/30/2024",
"project": {
"name": "Cursor Git",
"description": "Enhanced Git for Cursor AI Editor",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png",
"download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
"external_link": "",
},
},
{
"date": "08/18/2024",
"project": {
"name": "PyZF",
"description": "Enhancements for Python",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png",
"download_link": "",
"external_link": "https://pypi.org/project/PyZF",
},
},
{
"date": "08/05/2024",
"project": {
"name": "Xanthus",
"description": "X (formerly Twitter) Assistant",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png",
"download_link": "",
"external_link": "https://pypi.org/project/zf-xanthus",
},
},
{
"date": "07/24/2024",
"project": {
"name": "Jenga",
"description": "Fast JSON5 Python Library",
"logo_url": "",
"download_link": "https://pypi.org/project/zf-jenga",
"external_link": "",
},
},
...
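Once the data is extracted, it can be consumed like any other nested structure. The sketch below assumes `perse` returns a Python dict (the Google example above prints like one) and uses a trimmed copy of the output shown here in place of a live call. The helper `project_links` is hypothetical, and since the key names come from the LLM and may differ from page to page, it reads them defensively with `.get`.

```python
# Trimmed copy of the output above, standing in for the value returned by perse(html).
result = {
    "title": "Zeff Muks",
    "builds": [
        {
            "date": "08/30/2024",
            "project": {
                "name": "Cursor Git",
                "download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
                "external_link": "",
            },
        },
        {
            "date": "08/18/2024",
            "project": {
                "name": "PyZF",
                "download_link": "",
                "external_link": "https://pypi.org/project/PyZF",
            },
        },
    ],
}


def project_links(data):
    """Map each project name to its first non-empty link, or None if it has none."""
    links = {}
    for build in data.get("builds", []):
        project = build.get("project", {})
        link = project.get("download_link") or project.get("external_link") or None
        links[project.get("name", "")] = link
    return links


for name, link in project_links(result).items():
    print(f"{name}: {link}")
```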
License