Webson 🕸️

Turn any webpage into structured outputs! ⚡️
Extract data from any website with the power of AI.


✨ Overview

Webson transforms webpages into structured data models with just a few lines of code. No more manual scraping or complex parsers! It converts HTML into typed, ready-to-use data using large language models (LLMs) accessed through IntelliBricks and browser automation powered by Playwright.


🎯 Key Features

  • 🦾 Intelligent Data Extraction:
    Convert webpages into structured data using your own defined models.
    (Say goodbye to messy HTML!)

  • 💬 Chat Casting:
    Simply tell Webson what you need in plain language, and it will extract and structure the data for you.
    (Example: "Extract product details from https://amazon.com and shopee.com including title, price, and rating.")

  • ⚡️ Seamless Integration:
    Built on top of IntelliBricks and Playwright — enjoy a Python-first approach without the boilerplate.

  • 📊 Structured Outputs:
    Define your output schemas with msgspec.Struct and get data back in a ready-to-use, strongly typed format.


🚀 Installation

Install Webson and its dependencies via pip:

pip install webson

Important: Webson relies on Playwright for web automation, because many pages only render fully in a real browser that loads their scripts, styles, and other resources. Follow these steps to install Playwright and its browser binaries:

  1. Install Playwright:

    pip install playwright
    
  2. Install Browser Binaries:

    playwright install
    

Now you’re all set to transform any webpage into structured intelligence!
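
If the examples below fail at import time, a quick check like the following can confirm that both packages are visible to your interpreter. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the check does not verify that the Playwright browser binaries were downloaded.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Module names assumed to match the pip package names used above.
    missing = missing_modules(["webson", "playwright"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("webson and playwright are importable.")
```

If anything is reported as missing, re-run the pip commands above in the same environment you use to run your script.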


🔧 Usage Examples

1. Casting a Webpage into a Structured Model

Define your own data model and cast a webpage’s content into it:

import msgspec
from intellibricks.llms import Synapse
from webson import Webson
from typing import Annotated

# Define your desired structured model
class PageSummary(msgspec.Struct):
    title: str
    summary: Annotated[
        str,
        msgspec.Meta(description="A short summary of the page"),
    ]

# Initialize your LLM (using IntelliBricks Synapse) and Webson
llm = Synapse.of("google/genai/gemini-pro-experimental")
webson = Webson(llm=llm, timeout=5000)

# Cast the webpage content into your structured model
structured_data = webson.cast("https://example.com", to=PageSummary)
print(f"Title: {structured_data.title}")
print(f"Summary: {structured_data.summary}")

2. High-Level Query to Struct

Simply describe what you need and let Webson do the heavy lifting:

from intellibricks.llms import Synapse
from webson import Webson

# Initialize your LLM and Webson instance
llm = Synapse.of("google/genai/gemini-pro-experimental")
webson = Webson(llm=llm, timeout=5000)

# Use natural language to instruct Webson on what data to extract
results = webson.query_to_struct(
    "Extract product info from https://amazon.com and https://www.walmart.com/ including title, price, and rating."
)
for url, output in results:
    print(url, output)
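
A query like the one above mixes target URLs with a description of the fields to extract. The sketch below shows one simple way URLs could be pulled out of such a query with a regular expression. It is only an illustration: `extract_urls` is a hypothetical helper, and Webson may well resolve URLs differently (for example, by asking the LLM).

```python
import re

# Matches http(s) URLs up to the next whitespace or common delimiter.
URL_PATTERN = re.compile(r"https?://[^\s,;\"'<>)]+")


def extract_urls(query: str) -> list[str]:
    """Return the unique http(s) URLs in `query`, in order of appearance."""
    urls: list[str] = []
    for match in URL_PATTERN.findall(query):
        url = match.rstrip(".!?")  # drop sentence punctuation stuck to the URL
        if url not in urls:
            urls.append(url)
    return urls


query = (
    "Extract product info from https://amazon.com and "
    "https://www.walmart.com/ including title, price, and rating."
)
print(extract_urls(query))
```

Note that this pattern only finds URLs that include a scheme, so a bare domain such as `shopee.com` would be missed.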

⚙️ How It Works

  1. Webpage Automation:
    Webson uses Playwright to open webpages in a headless browser and retrieve the HTML content.

  2. Markdown Conversion:
    The raw HTML is converted into Markdown for improved text processing and parsing.

  3. LLM-Powered Casting:
    The transformed Markdown is sent to your LLM (via IntelliBricks) which then returns structured data based on your specified schema.
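
The three steps above can be sketched roughly as follows. This is not Webson's implementation: the Markdown converter is a deliberately tiny stand-in that handles only headings and text, a dataclass replaces `msgspec.Struct` to keep the sketch dependency-free, and `llm` is a hypothetical callable standing in for an IntelliBricks model. Only `fetch_html` needs Playwright, which is imported lazily.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable


def fetch_html(url: str, timeout_ms: int = 5000) -> str:
    """Step 1: load the page in headless Chromium and return its HTML."""
    # Imported lazily so the rest of the sketch runs without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, timeout=timeout_ms)
        html = page.content()
        browser.close()
    return html


class _MarkdownBuilder(HTMLParser):
    """Collects text, prefixing h1-h3 content with Markdown '#' markers."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []
        self._prefix = ""
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("h1", "h2", "h3"):
            self._prefix = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in ("h1", "h2", "h3"):
            self._prefix = ""

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip:
            self.lines.append(self._prefix + text)


def html_to_markdown(html: str) -> str:
    """Step 2: convert HTML to a very rough Markdown rendering."""
    builder = _MarkdownBuilder()
    builder.feed(html)
    builder.close()
    return "\n\n".join(builder.lines)


@dataclass
class PageSummary:
    title: str
    summary: str


def cast_markdown(markdown: str, llm: Callable[[str], str]) -> PageSummary:
    """Step 3: ask the model for JSON and load it into the schema."""
    prompt = (
        "Return JSON with the keys 'title' and 'summary' for this page:\n\n"
        + markdown
    )
    data = json.loads(llm(prompt))
    return PageSummary(title=data["title"], summary=data["summary"])
```

In a real run you would call `cast_markdown(html_to_markdown(fetch_html(url)), llm)`; steps 2 and 3 can be exercised offline by passing a literal HTML string and a stub `llm` function that returns canned JSON.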


🤝 Contributing

We welcome contributions to make Webson even more awesome!
If you encounter any issues or have ideas for new features, please open an issue or submit a pull request on our GitHub repository.


📜 License

This project is licensed under the Apache License 2.0.

