perse converts HTML content into structured JSON data
Perse
Perse converts HTML to JSON using a mix of traditional HTML parsing and LLM-based data extraction. It performs a few optimizations after fetching the HTML without accidentally removing any important data.
These optimizations include:
- Removal of styling, scripting and SVG tags
- Collapsing tags (e.g. divs) with only one child
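The two optimizations above can be sketched with the standard library alone. This is not Perse's actual implementation, only a minimal illustration under stated assumptions: the names `TreeBuilder`, `collapse`, `render` and `simplify` are hypothetical, only `div` and `span` are treated as collapsible wrappers, and a collapsed wrapper's own attributes are discarded.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"style", "script", "svg"}  # subtrees removed entirely
COLLAPSIBLE = {"div", "span"}           # assumption: wrappers that are safe to unwrap
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain strings


class TreeBuilder(HTMLParser):
    """Builds a small tree while skipping everything inside DROP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("root")
        self.stack = [self.root]
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>; never opens a new level.
        if tag in DROP_TAGS or self.skip_depth:
            return
        self.stack[-1].children.append(Node(tag, attrs))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.stack[-1].children.append(data.strip())


def collapse(node):
    """Bottom-up: replace a wrapper whose only child is an element with that child."""
    node.children = [collapse(c) if isinstance(c, Node) else c for c in node.children]
    if (node.tag in COLLAPSIBLE and len(node.children) == 1
            and isinstance(node.children[0], Node)):
        return node.children[0]
    return node


def render(node):
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(render(c) for c in node.children)
    if node.tag == "root":
        return inner
    attrs = "".join(f" {k}" if v is None else f' {k}="{escape(v)}"'
                    for k, v in node.attrs.items())
    if node.tag in VOID_TAGS:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def simplify(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return render(collapse(builder.root))
```

Calling `simplify` on a page before handing it to an LLM shrinks the input considerably on markup-heavy sites, since nested wrapper `div`s and inline SVG icons usually carry no extractable data.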
Installation
pip install zf-perse
Usage
export PERSE_OPENAI_API_KEY="your-openai-api-key"
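When calling Perse from a script, it can help to fail early with a readable message if the variable is missing. The helper below is a hypothetical sketch, not part of the `perse` package; only the variable name `PERSE_OPENAI_API_KEY` comes from the export line above.

```python
import os

ENV_VAR = "PERSE_OPENAI_API_KEY"  # variable name taken from the export line above


def require_api_key(environ=None):
    """Return the API key, or raise with a readable message if it is unset."""
    environ = os.environ if environ is None else environ
    key = environ.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; export it before running perse")
    return key
```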
CLI
perse --url https://example.com
Python
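import requests  # third-party HTTP client, used here to fetch the page

from perse import perse

url = "https://example.com"
html = requests.get(url).text  # download the raw HTML
j = perse(html)  # convert the HTML into structured data
print(j)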
Example
Google's Homepage
$ perse --url https://google.com
{
    'image': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png',
    'title': 'Google',
    'navigation_links': [
        {'link_name': 'About', 'href': 'https://about.google/?fg=1&utm_source=google-SG&utm_medium=referral&utm_campaign=hp-header'},
        {'link_name': 'Store', 'href': 'https://store.google.com/SG?utm_source=hp_header&utm_medium=google_ooo&utm_campaign=GS100042&hl=en-SG'}
    ],
    'logo': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png',
    'search_form': {
        'action': '/search',
        'method': 'GET',
        'autocomplete': 'off',
        'search_field': 'q',
        'buttons': [
            {'button_text': 'Google Search', 'button_action': 'submit'},
            {'button_text': "I'm Feeling Lucky", 'button_action': 'submit'}
        ]
    }
}
Input
<!-- taken from https://zeffmuks.com -->
<html lang="en" data-theme="light" style="color-scheme: light;">
<head>
<meta charset="utf-8">
<link rel="icon" href="/favicon.ico">
<meta name="viewport" content="width=device-width,initial-scale=1">
<meta name="theme-color" content="#000000">
<meta name="description" content="Antifragile Entropy Assassin 🥷">
<link rel="apple-touch-icon" href="/images/logo192.png">
<link rel="manifest" href="/manifest.json">
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-GNB6LQMFW3"></script>
<script>function gtag() { dataLayer.push(arguments) } window.dataLayer = window.dataLayer || [], gtag("js", new Date), gtag("config", "G-GNB6LQMFW3")</script>
<title>Zeff Muks</title>
<script defer="defer" src="/static/js/main.4de0eae9.js"></script>
<link href="/static/css/main.f6a8a2d9.css" rel="stylesheet">
<style data-emotion="css-global" data-s=""></style>
<style data-emotion="css-global" data-s=""></style>
<style data-emotion="css-global" data-s=""></style>
<style data-emotion="css" data-s=""></style>
<meta property="og:type" content="website" data-rh="true">
<meta property="og:title" content="Zeff Muks" data-rh="true">
<meta property="og:description" content="Antifragile Entropy Assassin 🥷" data-rh="true">
<meta property="og:url" content="https://www.zeffmuks.com/" data-rh="true">
<meta property="og:image" content="https://www.zeffmuks.com/images/ZeffMuks-1920.png" data-rh="true">
<meta property="og:site_name" content="Zeff Muks" data-rh="true">
<meta name="twitter:card" content="summary_large_image" data-rh="true">
<meta name="twitter:site" content="@zeffmuks" data-rh="true">
<meta name="twitter:title" content="Zeff Muks" data-rh="true">
<meta name="twitter:description" content="Antifragile Entropy Assassin 🥷" data-rh="true">
<meta name="twitter:image" content="https://www.zeffmuks.com/images/ZeffMuks-1920.png" data-rh="true">
</head>
<body class="chakra-ui-light" cz-shortcut-listen="true"><noscript>You need to enable JavaScript to run this
app.</noscript>
<div id="root">
<div class="css-0">
<div class="css-lt6aye">
<div class="chakra-stack css-sqtrbi"><img src="/images/ZeffMuks-6912.png" class="chakra-image css-0">
<h1 class="chakra-heading css-1g6enkz">Antifragile Entropy Assassin 🥷🏻</h1>
<h2 class="chakra-heading css-shu5if"><a class="chakra-link css-spn4bz"
href="https://x.com/zeffmuks">𝕏</a></h2>
</div>
</div>
<div class="css-1hielw0">
<div class="chakra-stack css-5kt1vw">
<h1 class="chakra-heading css-eh1ywz">Builds</h1>
<div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
<div class="css-10fvfu7">
<p class="chakra-text css-1wrsef2">08/30/2024</p>
</div>
<div class="chakra-stack css-399av8">
<div class="min-w-full h-auto">
<div class="css-0">
<h1
class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
<span class="text-2xl inline-flex items-center"><img
src="https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png"
alt="Cursor Git" class="h-8 w-8 mr-2">Cursor Git</span>
</h1>
<p class="chakra-text css-17vaxo2">Enhanced Git for Cursor AI Editor</p>
</div>
</div>
<div class="min-w-full h-auto">
<div class="css-0">
<div class="flex flex-row gap-2">
<div><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
stroke-linecap="round" stroke-linejoin="round"
class="lucide lucide-images">
<path d="M18 22H4a2 2 0 0 1-2-2V6"></path>
<path d="m22 13-1.296-1.296a2.41 2.41 0 0 0-3.408 0L11 18"></path>
<circle cx="12" cy="8" r="2"></circle>
<rect width="16" height="16" x="6" y="2" rx="2"></rect>
</svg></div><a target="_blank" rel="noopener" class="chakra-link css-4a6x12"
href="https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix"><svg
xmlns="http://www.w3.org/2000/svg" width="24" height="24"
viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
stroke-linecap="round" stroke-linejoin="round"
class="lucide lucide-external-link">
<path d="M15 3h6v6"></path>
<path d="M10 14 21 3"></path>
<path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
</path>
</svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
<path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
<polyline points="16 6 12 2 8 6"></polyline>
<line x1="12" x2="12" y1="2" y2="15"></line>
</svg>
</div>
</div>
</div>
</div>
</div>
<div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
<div class="css-10fvfu7">
<p class="chakra-text css-1wrsef2">08/18/2024</p>
</div>
<div class="chakra-stack css-399av8">
<div class="min-w-full h-auto">
<div class="css-0">
<h1
class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
<span class="text-2xl inline-flex items-center"><img
src="https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png"
alt="PyZF" class="h-8 w-8 mr-2">PyZF</span>
</h1>
<p class="chakra-text css-17vaxo2">Enhancements for Python</p>
</div>
</div>
<div class="min-w-full h-auto">
<div class="css-0">
<div class="flex flex-row gap-2"><a target="_blank" rel="noopener"
class="chakra-link css-4a6x12" href="https://pypi.org/project/PyZF"><svg
xmlns="http://www.w3.org/2000/svg" width="24" height="24"
viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
stroke-linecap="round" stroke-linejoin="round"
class="lucide lucide-external-link">
<path d="M15 3h6v6"></path>
<path d="M10 14 21 3"></path>
<path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
</path>
</svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
<path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
<polyline points="16 6 12 2 8 6"></polyline>
<line x1="12" x2="12" y1="2" y2="15"></line>
</svg></div>
</div>
</div>
</div>
</div>
<div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
<div class="css-10fvfu7">
<p class="chakra-text css-1wrsef2">08/05/2024</p>
</div>
<div class="chakra-stack css-399av8">
<div class="min-w-full h-auto">
<div class="css-0">
<h1
class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
<span class="text-2xl inline-flex items-center"><img
src="https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png"
alt="Xanthus" class="h-8 w-8 mr-2">Xanthus</span>
</h1>
<p class="chakra-text css-17vaxo2">X (formerly Twitter) Assistant</p>
</div>
</div>
<div class="min-w-full h-auto">
<div class="css-0">
<div class="flex flex-row gap-2"><a target="_blank" rel="noopener"
class="chakra-link css-4a6x12"
href="https://pypi.org/project/zf-xanthus"><svg
xmlns="http://www.w3.org/2000/svg" width="24" height="24"
viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
stroke-linecap="round" stroke-linejoin="round"
class="lucide lucide-external-link">
<path d="M15 3h6v6"></path>
<path d="M10 14 21 3"></path>
<path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
</path>
</svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
<path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
<polyline points="16 6 12 2 8 6"></polyline>
<line x1="12" x2="12" y1="2" y2="15"></line>
</svg></div>
</div>
</div>
</div>
</div>
<div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
<div class="css-10fvfu7">
<p class="chakra-text css-1wrsef2">07/24/2024</p>
</div>
<div class="chakra-stack css-399av8">
<div class="min-w-full h-...
Output
{
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"og": {
"type": "website",
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"url": "https://www.zeffmuks.com/",
"image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
"site_name": "Zeff Muks",
},
"twitter": {
"card": "summary_large_image",
"site": "@zeffmuks",
"title": "Zeff Muks",
"description": "Antifragile Entropy Assassin 🥷",
"image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
},
"main_header": "Antifragile Entropy Assassin 🥷🏻",
"header_link": "https://x.com/zeffmuks",
"builds": [
{
"date": "08/30/2024",
"project": {
"name": "Cursor Git",
"description": "Enhanced Git for Cursor AI Editor",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png",
"download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
"external_link": "",
},
},
{
"date": "08/18/2024",
"project": {
"name": "PyZF",
"description": "Enhancements for Python",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png",
"download_link": "",
"external_link": "https://pypi.org/project/PyZF",
},
},
{
"date": "08/05/2024",
"project": {
"name": "Xanthus",
"description": "X (formerly Twitter) Assistant",
"logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png",
"download_link": "",
"external_link": "https://pypi.org/project/zf-xanthus",
},
},
{
"date": "07/24/2024",
"project": {
"name": "Jenga",
"description": "Fast JSON5 Python Library",
"logo_url": "",
"download_link": "https://pypi.org/project/zf-jenga",
"external_link": "",
},
},
...
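Because the extraction is LLM-based, the exact keys may differ between pages and runs, so downstream code is safer when it reads the result defensively. The sketch below assumes the result is a Python dict shaped like the output above; `project_links` is a hypothetical helper, and the sample data is a trimmed copy of that output.

```python
def project_links(result):
    """Map each project name to its first non-empty link, skipping incomplete entries."""
    links = {}
    for build in result.get("builds", []):
        project = build.get("project", {})
        url = project.get("external_link") or project.get("download_link")
        if project.get("name") and url:
            links[project["name"]] = url
    return links


# Trimmed copy of the example output above.
sample = {
    "builds": [
        {
            "date": "08/30/2024",
            "project": {
                "name": "Cursor Git",
                "download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
                "external_link": "",
            },
        },
        {
            "date": "08/18/2024",
            "project": {
                "name": "PyZF",
                "download_link": "",
                "external_link": "https://pypi.org/project/PyZF",
            },
        },
    ]
}

print(project_links(sample))
```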
License