perse converts HTML content into structured JSON data

Perse

Perse converts HTML to JSON using a mix of traditional HTML parsing and LLM-based data extraction. After fetching the HTML, it applies a few optimizations that shrink the markup without accidentally removing important data.

These optimizations include:

  • Removing style, script, and svg tags
  • Collapsing tags (e.g. divs) that have only one child
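
The two optimizations above can be sketched with the standard library alone. This is an illustration of the idea, not Perse's actual implementation: the names `Node`, `TreeBuilder`, `simplify`, `render`, and `clean_html` are hypothetical, and the rule of collapsing only attribute-free wrappers is an assumption made here to avoid losing data.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"style", "script", "svg"}  # tags removed together with their contents
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal element node: a tag, its attributes, and child nodes or text."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # items are Node instances or plain strings


class TreeBuilder(HTMLParser):
    """Builds a small tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def simplify(node):
    """Drop style/script/svg subtrees, then collapse single-child wrappers."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
        elif child.tag not in DROP_TAGS:
            kept.append(simplify(child))
    node.children = kept
    only_child_is_element = len(kept) == 1 and isinstance(kept[0], Node)
    # Assumption: only wrappers without attributes are safe to collapse.
    if only_child_is_element and not node.attrs and node.tag != "root":
        return kept[0]
    return node


def render(node):
    """Serialize the tree back to an HTML string."""
    if isinstance(node, str):
        return escape(node, quote=False)
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
        for k, v in node.attrs.items()
    )
    if node.tag in VOID_TAGS:
        return f"<{node.tag}{attrs}>"
    inner = "".join(render(c) for c in node.children)
    if node.tag == "root":
        return inner
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def clean_html(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return render(simplify(builder.root))


page = ('<div><style>p{color:red}</style><div><div><a href="/x">Link</a>'
        '</div></div><script>alert(1)</script></div>')
print(clean_html(page))  # prints: <a href="/x">Link</a>
```

Shrinking the markup this way reduces how much text has to be sent to the LLM, while links and text content survive intact.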

Installation

pip install zf-perse

Usage

Set your OpenAI API key in the environment before running perse:

export PERSE_OPENAI_API_KEY="your-openai-api-key"

CLI

perse --url https://example.com

Python

import requests

from perse import perse

# Fetch the page yourself, then hand the raw HTML to perse.
url = "https://example.com"
html = requests.get(url).text
j = perse(html)
print(j)

Example

Google's Homepage

$ perse --url https://google.com

{'image': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png', 'title': 'Google', 'navigation_links': [{'link_name': 'About', 'href': 'https://about.google/?fg=1&utm_source=google-SG&utm_medium=referral&utm_campaign=hp-header'}, {'link_name': 'Store', 'href': 'https://store.google.com/SG?utm_source=hp_header&utm_medium=google_ooo&utm_campaign=GS100042&hl=en-SG'}], 'logo': 'https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png', 'search_form': {'action': '/search', 'method': 'GET', 'autocomplete': 'off', 'search_field': 'q', 'buttons': [{'button_text': 'Google Search', 'button_action': 'submit'}, {'button_text': "I'm Feeling Lucky", 'button_action': 'submit'}]}}
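
The output above is printed as a Python dict literal (single-quoted strings), which strict JSON parsers reject. A minimal sketch for turning such text into standard JSON, assuming the output keeps this shape; the helper name `dict_repr_to_json` is hypothetical and not part of perse:

```python
import ast
import json


def dict_repr_to_json(text: str) -> str:
    """Parse a Python-literal dict safely and re-serialize it as strict JSON."""
    data = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    return json.dumps(data, indent=2, ensure_ascii=False)


# Shortened excerpt of the output shown above.
sample = ("{'title': 'Google', 'search_form': {'action': '/search', 'method': 'GET', "
          "'buttons': [{'button_text': \"I'm Feeling Lucky\", 'button_action': 'submit'}]}}")
print(dict_repr_to_json(sample))
```

`ast.literal_eval` is used instead of `eval` because it accepts only literal structures such as dicts, lists, strings, and numbers.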

Input

$ perse --url https://zeffmuks.com

{
    "title": "Zeff Muks",
    "description": "Antifragile Entropy Assassin 🥷",
    "og": {
        "type": "website",
        "title": "Zeff Muks",
        "description": "Antifragile Entropy Assassin 🥷",
        "url": "https://www.zeffmuks.com/",
        "image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
        "site_name": "Zeff Muks",
    },
    "twitter": {
        "card": "summary_large_image",
        "site": "@zeffmuks",
        "title": "Zeff Muks",
        "description": "Antifragile Entropy Assassin 🥷",
        "image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
    },
    "main_header": "Antifragile Entropy Assassin 🥷🏻",
    "header_link": "https://x.com/zeffmuks",
    "builds": [
        {
            "date": "08/30/2024",
            "project": {
                "name": "Cursor Git",
                "description": "Enhanced Git for Cursor AI Editor",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png",
                "download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
                "external_link": "",
            },
        },
        {
            "date": "08/18/2024",
            "project": {
                "name": "PyZF",
                "description": "Enhancements for Python",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png",
                "download_link": "",
                "external_link": "https://pypi.org/project/PyZF",
            },
        },
        {
            "date": "08/05/2024",
            "project": {
                "name": "Xanthus",
                "description": "X (formerly Twitter) Assistant",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png",
                "download_link": "",
                "external_link": "https://pypi.org/project/zf-xanthus",
            },
        },
        {
            "date": "07/24/2024",
            "project": {
                "name": "Jenga",
                "description": "Fast JSON5 Python Library",
                "logo_url": "",
                "download_link": "https://pypi.org/project/zf-jenga",
                "external_link": "",
            },
        },
        ...
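
Once the result is a Python dict, it can be consumed like any nested structure. The sketch below walks a `builds` list shaped like the output above; the helper `list_builds` is hypothetical, and because the keys come from LLM extraction they may differ from page to page, so every lookup uses a default.

```python
def list_builds(result: dict) -> list[tuple[str, str]]:
    """Return (project name, first non-empty link) for each entry under "builds"."""
    rows = []
    for build in result.get("builds", []):
        project = build.get("project", {})
        # Prefer the download link and fall back to the external link.
        link = project.get("download_link") or project.get("external_link") or ""
        rows.append((project.get("name", ""), link))
    return rows


# Two entries copied from the example output above.
result = {
    "builds": [
        {
            "date": "08/30/2024",
            "project": {
                "name": "Cursor Git",
                "download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
                "external_link": "",
            },
        },
        {
            "date": "08/18/2024",
            "project": {
                "name": "PyZF",
                "download_link": "",
                "external_link": "https://pypi.org/project/PyZF",
            },
        },
    ]
}

for name, link in list_builds(result):
    print(f"{name}: {link}")
```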

License

MIT License

Download files

Source Distribution

zf-perse-0.1.4.tar.gz (8.6 kB)

Uploaded Source

Built Distribution

zf_perse-0.1.4-py3-none-any.whl (8.0 kB)

Uploaded Python 3

File hashes

Both files were uploaded via twine/4.0.2 CPython/3.11.9, without Trusted Publishing.

Hashes for zf-perse-0.1.4.tar.gz

  • SHA256: 3e30ac2e976e58de8eb9afb1ed0783dca9f288eaf3106de5c317be13ea325af6
  • MD5: a4340de516cd6793b4f72ba1da5adf1e
  • BLAKE2b-256: 805c56fe6341e7492f89659f592c9ec063985306f68b97368d01b04e3e3e9947

Hashes for zf_perse-0.1.4-py3-none-any.whl

  • SHA256: bfd4a462fab8a808b28089b753e0bbba839f89ee57de10d9308e0c3e6e696f88
  • MD5: 06d1307e7d2c96746c198ab36e916468
  • BLAKE2b-256: 6e0920bf8dacb1ecfd64c4c231183552e3780e15bb158a474dc2f5f05c435106
