perse converts HTML content into structured JSON data

Project description

Perse

Perse converts HTML to JSON using a mix of traditional HTML parsing and LLM-based data extraction. After fetching the HTML, it performs a few optimizations without accidentally removing any important data.

These optimizations include:

  • Removal of styling, scripting and SVG tags
  • Collapsing tags (e.g. divs) that have only one child
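
The two optimizations listed above can be sketched with the standard library alone. This is a minimal illustration, not Perse's actual implementation: the names optimize, TreeBuilder, Node, DROP and COLLAPSIBLE are hypothetical, and the choice to collapse only div and span wrappers (discarding the wrapper's own attributes) is an assumption made for the example.

```python
from html import escape
from html.parser import HTMLParser

DROP = {"style", "script", "svg"}   # tags removed together with their content
COLLAPSIBLE = {"div", "span"}       # assumption: wrappers that are safe to collapse
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # each child is a Node or a text string


class TreeBuilder(HTMLParser):
    """Builds a small tree while skipping everything inside DROP tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("root")
        self.stack = [self.root]
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in DROP:
            if tag not in VOID:
                self.skip_depth += 1
            return
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
            return
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Whitespace-only text is dropped; other text is stored stripped.
        if not self.skip_depth and data.strip():
            self.stack[-1].children.append(data.strip())


def collapse(node):
    """Replace a collapsible wrapper that has exactly one element child by that child."""
    node.children = [collapse(c) if isinstance(c, Node) else c for c in node.children]
    if (node.tag in COLLAPSIBLE and len(node.children) == 1
            and isinstance(node.children[0], Node)):
        return node.children[0]
    return node


def render(node):
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(render(child) for child in node.children)
    if node.tag == "root":
        return inner
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{escape(v)}"' for k, v in node.attrs.items()
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def optimize(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return render(collapse(builder.root))


sample = ('<div><div><p>Hello</p></div></div>'
          '<script>var x = 1;</script><svg><path d="M0 0"></path></svg>')
print(optimize(sample))  # <p>Hello</p>
```

A wrapper with two or more children, or with direct text content, is left in place, so sibling structure and text survive the pass.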

Installation

pip install zf-perse

Usage

export PERSE_OPENAI_API_KEY="your-openai-api-key"

CLI

perse --url https://example.com

Python

import requests

from perse import perse

# Fetch the page HTML, then let perse convert it to structured data
url = "https://example.com"
html = requests.get(url).text
j = perse(html)
print(j)

Example

Google's Homepage

$ perse --url https://google.com

{'image': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png', 'title': 'Google', 'navigation_links': [{'link_name': 'About', 'href': 'https://about.google/?fg=1&utm_source=google-SG&utm_medium=referral&utm_campaign=hp-header'}, {'link_name': 'Store', 'href': 'https://store.google.com/SG?utm_source=hp_header&utm_medium=google_ooo&utm_campaign=GS100042&hl=en-SG'}], 'logo': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png', 'search_form': {'action': '/search', 'method': 'GET', 'autocomplete': 'off', 'search_field': 'q', 'buttons': [{'button_text': 'Google Search', 'button_action': 'submit'}, {'button_text': "I'm Feeling Lucky", 'button_action': 'submit'}]}}
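
The output above is printed with single quotes, which looks like a Python dict literal rather than strict JSON. If a downstream tool needs strict JSON, one possible approach is sketched below. It assumes the printed text really is a Python literal as shown; dict_repr_to_json is a hypothetical helper and not part of Perse, and the sample string is a shortened copy of the output above.

```python
import ast
import json


def dict_repr_to_json(text: str) -> str:
    """Parse a printed Python dict literal and re-serialize it as JSON."""
    data = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    return json.dumps(data, indent=2, ensure_ascii=False)


# Shortened sample in the same shape as the CLI output above.
printed = ("{'title': 'Google', 'search_form': {'action': '/search', "
           "'buttons': [{'button_text': \"I'm Feeling Lucky\"}]}}")
print(dict_repr_to_json(printed))
```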

Input

<!-- taken from https://zeffmuks.com -->

 <html lang="en" data-theme="light" style="color-scheme: light;">

<head>
    <meta charset="utf-8">
    <link rel="icon" href="/favicon.ico">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <meta name="theme-color" content="#000000">
    <meta name="description" content="Antifragile Entropy Assassin 🥷">
    <link rel="apple-touch-icon" href="/images/logo192.png">
    <link rel="manifest" href="/manifest.json">
    <script async="" src="https://www.googletagmanager.com/gtag/js?id=G-GNB6LQMFW3"></script>
    <script>function gtag() { dataLayer.push(arguments) } window.dataLayer = window.dataLayer || [], gtag("js", new Date), gtag("config", "G-GNB6LQMFW3")</script>
    <title>Zeff Muks</title>
    <script defer="defer" src="/static/js/main.4de0eae9.js"></script>
    <link href="/static/css/main.f6a8a2d9.css" rel="stylesheet">
    <style data-emotion="css-global" data-s=""></style>
    <style data-emotion="css-global" data-s=""></style>
    <style data-emotion="css-global" data-s=""></style>
    <style data-emotion="css" data-s=""></style>
    <meta property="og:type" content="website" data-rh="true">
    <meta property="og:title" content="Zeff Muks" data-rh="true">
    <meta property="og:description" content="Antifragile Entropy Assassin 🥷" data-rh="true">
    <meta property="og:url" content="https://www.zeffmuks.com/" data-rh="true">
    <meta property="og:image" content="https://www.zeffmuks.com/images/ZeffMuks-1920.png" data-rh="true">
    <meta property="og:site_name" content="Zeff Muks" data-rh="true">
    <meta name="twitter:card" content="summary_large_image" data-rh="true">
    <meta name="twitter:site" content="@zeffmuks" data-rh="true">
    <meta name="twitter:title" content="Zeff Muks" data-rh="true">
    <meta name="twitter:description" content="Antifragile Entropy Assassin 🥷" data-rh="true">
    <meta name="twitter:image" content="https://www.zeffmuks.com/images/ZeffMuks-1920.png" data-rh="true">
</head>

<body class="chakra-ui-light" cz-shortcut-listen="true"><noscript>You need to enable JavaScript to run this
        app.</noscript>
<div id="root">
    <div class="css-0">
        <div class="css-lt6aye">
            <div class="chakra-stack css-sqtrbi"><img src="/images/ZeffMuks-6912.png" class="chakra-image css-0">
                <h1 class="chakra-heading css-1g6enkz">Antifragile Entropy Assassin 🥷🏻</h1>
                <h2 class="chakra-heading css-shu5if"><a class="chakra-link css-spn4bz"
                        href="https://x.com/zeffmuks">𝕏</a></h2>
            </div>
        </div>
        <div class="css-1hielw0">
            <div class="chakra-stack css-5kt1vw">
                <h1 class="chakra-heading css-eh1ywz">Builds</h1>
                <div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
                    <div class="css-10fvfu7">
                        <p class="chakra-text css-1wrsef2">08/30/2024</p>
                    </div>
                    <div class="chakra-stack css-399av8">
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <h1
                                    class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
                                    <span class="text-2xl inline-flex items-center"><img
                                            src="https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png"
                                            alt="Cursor Git" class="h-8 w-8 mr-2">Cursor Git</span>
                                </h1>
                                <p class="chakra-text css-17vaxo2">Enhanced Git for Cursor AI Editor</p>
                            </div>
                        </div>
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <div class="flex flex-row gap-2">
                                    <div><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                            viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                            stroke-linecap="round" stroke-linejoin="round"
                                            class="lucide lucide-images">
                                            <path d="M18 22H4a2 2 0 0 1-2-2V6"></path>
                                            <path d="m22 13-1.296-1.296a2.41 2.41 0 0 0-3.408 0L11 18"></path>
                                            <circle cx="12" cy="8" r="2"></circle>
                                            <rect width="16" height="16" x="6" y="2" rx="2"></rect>
                                        </svg></div><a target="_blank" rel="noopener" class="chakra-link css-4a6x12"
                                        href="https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix"><svg
                                            xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                            viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                            stroke-linecap="round" stroke-linejoin="round"
                                            class="lucide lucide-external-link">
                                            <path d="M15 3h6v6"></path>
                                            <path d="M10 14 21 3"></path>
                                            <path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
                                            </path>
                                        </svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                        viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                        stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
                                        <path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
                                        <polyline points="16 6 12 2 8 6"></polyline>
                                        <line x1="12" x2="12" y1="2" y2="15"></line>
                                    </svg>
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
                <div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
                    <div class="css-10fvfu7">
                        <p class="chakra-text css-1wrsef2">08/18/2024</p>
                    </div>
                    <div class="chakra-stack css-399av8">
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <h1
                                    class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
                                    <span class="text-2xl inline-flex items-center"><img
                                            src="https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png"
                                            alt="PyZF" class="h-8 w-8 mr-2">PyZF</span>
                                </h1>
                                <p class="chakra-text css-17vaxo2">Enhancements for Python</p>
                            </div>
                        </div>
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <div class="flex flex-row gap-2"><a target="_blank" rel="noopener"
                                        class="chakra-link css-4a6x12" href="https://pypi.org/project/PyZF"><svg
                                            xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                            viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                            stroke-linecap="round" stroke-linejoin="round"
                                            class="lucide lucide-external-link">
                                            <path d="M15 3h6v6"></path>
                                            <path d="M10 14 21 3"></path>
                                            <path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
                                            </path>
                                        </svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                        viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                        stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
                                        <path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
                                        <polyline points="16 6 12 2 8 6"></polyline>
                                        <line x1="12" x2="12" y1="2" y2="15"></line>
                                    </svg></div>
                            </div>
                        </div>
                    </div>
                </div>
                <div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
                    <div class="css-10fvfu7">
                        <p class="chakra-text css-1wrsef2">08/05/2024</p>
                    </div>
                    <div class="chakra-stack css-399av8">
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <h1
                                    class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
                                    <span class="text-2xl inline-flex items-center"><img
                                            src="https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png"
                                            alt="Xanthus" class="h-8 w-8 mr-2">Xanthus</span>
                                </h1>
                                <p class="chakra-text css-17vaxo2">X (formerly Twitter) Assistant</p>
                            </div>
                        </div>
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <div class="flex flex-row gap-2"><a target="_blank" rel="noopener"
                                        class="chakra-link css-4a6x12"
                                        href="https://pypi.org/project/zf-xanthus"><svg
                                            xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                            viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                            stroke-linecap="round" stroke-linejoin="round"
                                            class="lucide lucide-external-link">
                                            <path d="M15 3h6v6"></path>
                                            <path d="M10 14 21 3"></path>
                                            <path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
                                            </path>
                                        </svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                        viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                        stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
                                        <path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
                                        <polyline points="16 6 12 2 8 6"></polyline>
                                        <line x1="12" x2="12" y1="2" y2="15"></line>
                                    </svg></div>
                            </div>
                        </div>
                    </div>
                </div>
                <div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
                    <div class="css-10fvfu7">
                        <p class="chakra-text css-1wrsef2">07/24/2024</p>
                    </div>
                    <div class="chakra-stack css-399av8">
                        <div class="min-w-full h-...

Output

{
    "title": "Zeff Muks",
    "description": "Antifragile Entropy Assassin 🥷",
    "og": {
        "type": "website",
        "title": "Zeff Muks",
        "description": "Antifragile Entropy Assassin 🥷",
        "url": "https://www.zeffmuks.com/",
        "image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
        "site_name": "Zeff Muks",
    },
    "twitter": {
        "card": "summary_large_image",
        "site": "@zeffmuks",
        "title": "Zeff Muks",
        "description": "Antifragile Entropy Assassin 🥷",
        "image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
    },
    "main_header": "Antifragile Entropy Assassin 🥷🏻",
    "header_link": "https://x.com/zeffmuks",
    "builds": [
        {
            "date": "08/30/2024",
            "project": {
                "name": "Cursor Git",
                "description": "Enhanced Git for Cursor AI Editor",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png",
                "download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
                "external_link": "",
            },
        },
        {
            "date": "08/18/2024",
            "project": {
                "name": "PyZF",
                "description": "Enhancements for Python",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png",
                "download_link": "",
                "external_link": "https://pypi.org/project/PyZF",
            },
        },
        {
            "date": "08/05/2024",
            "project": {
                "name": "Xanthus",
                "description": "X (formerly Twitter) Assistant",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png",
                "download_link": "",
                "external_link": "https://pypi.org/project/zf-xanthus",
            },
        },
        {
            "date": "07/24/2024",
            "project": {
                "name": "Jenga",
                "description": "Fast JSON5 Python Library",
                "logo_url": "",
                "download_link": "https://pypi.org/project/zf-jenga",
                "external_link": "",
            },
        },
        ...

License

MIT License

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

zf-perse-0.1.3.tar.gz (12.4 kB)

Uploaded Source

Built Distribution

zf_perse-0.1.3-py3-none-any.whl (9.7 kB)

Uploaded Python 3

File details

Details for the file zf-perse-0.1.3.tar.gz.

File metadata

  • Download URL: zf-perse-0.1.3.tar.gz
  • Upload date:
  • Size: 12.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.9

File hashes

Hashes for zf-perse-0.1.3.tar.gz
Algorithm    Hash digest
SHA256       a33d04882c17a502f38b253d11ae70cfb639b1eeb6af875bfba5e9a5f9f12adb
MD5          5411d34a7f414d1f6e092b065446a6ce
BLAKE2b-256  487c877a01131cb71159c77505316fca424c088dde5d5536b9ade9cf787a90c4
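
A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's hashlib; sha256_of and matches are hypothetical helper names, and the file path in the usage note assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be matches("zf-perse-0.1.3.tar.gz", "a33d04882c17a502f38b253d11ae70cfb639b1eeb6af875bfba5e9a5f9f12adb"), which should return True for an intact download.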

File details

Details for the file zf_perse-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: zf_perse-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 9.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.9

File hashes

Hashes for zf_perse-0.1.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       ce17c218183171c668ebf5ae0304552371aa7a87fec690134512ee2a06fc7270
MD5          c685f73693dc30caaebed6400144fe7f
BLAKE2b-256  7990ff5e9312f7c61b5fa630574d2d9e34c3894ca7b9466f6e570d7788753ade
