
HTML DOM Tree Leaf Structure Identification Package

🌿 WebLeaf - A Graph-Based HTML Parsing and Comparison Tool

License: MIT

WebLeaf is a Python package that applies graph neural networks (GNNs) to HTML parsing and element comparison. It encodes HTML elements as graph embeddings, which support tasks such as element extraction, structural comparison, and distance measurement between elements. WebLeaf is well suited to web scraping, semantic HTML analysis, and automated web page comparison.

Key Features

  • 🌟 Graph-Based HTML Representation: Treats the HTML structure as a graph, encoding elements as nodes and relationships as edges.
  • 📄 Tag and Text Embeddings: Leverages embeddings for both HTML tags and textual content to capture meaningful semantic and structural representations.
  • 🔍 Element Extraction: Retrieve elements using XPath or CSS selectors.
  • 🛠️ Element Comparison: Measure similarity between elements based on their content and structure using graph embeddings.
  • 📈 Pretrained GCN Model: Built on top of a pretrained Graph Convolutional Network (GCN), enabling rich semantic and structural analysis out of the box.

Installation

You can install WebLeaf using pip:

pip install webleaf

How It Works

WebLeaf represents an HTML document as a graph, where each HTML element is a node, and the parent-child relationships between elements form the edges of the graph. The graph is then processed by a GCN (Graph Convolutional Network) that creates embeddings for each HTML element. These embeddings capture both the semantic content and structural relationships of the elements, allowing for tasks like element comparison, similarity measurement, and extraction.
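The graph construction described above can be sketched with the standard library alone. This is a minimal illustration, not WebLeaf's own parser: the `GraphBuilder` class and `html_to_graph` function are hypothetical names, and the sketch only records tag names and parent-child edges.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class GraphBuilder(HTMLParser):
    """Collects one node per element and one edge per parent-child pair."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # nodes[i] is the tag name of node i
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        index = len(self.nodes)
        self.nodes.append(tag)
        if self._stack:
            self.edges.append((self._stack[-1], index))
        if tag not in VOID_TAGS:
            self._stack.append(index)

    def handle_endtag(self, tag):
        # Pop back to the most recent open element with this tag name.
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.nodes[self._stack[pos]] == tag:
                del self._stack[pos:]
                break


def html_to_graph(html):
    """Return (nodes, edges) for an HTML string (hypothetical helper)."""
    builder = GraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


nodes, edges = html_to_graph(
    "<html><body><div><p>Hello</p><p>World</p></div></body></html>"
)
print(nodes)  # ['html', 'body', 'div', 'p', 'p']
print(edges)  # [(0, 1), (1, 2), (2, 3), (2, 4)]
```

Each edge points from a parent's index to a child's index, which is the structure a graph network would then consume.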

The model also combines tag embeddings (representing HTML tags) with text embeddings (representing the textual content of elements), so each element's representation reflects both its markup and its text.
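One way to picture the combination of tag and text information is to concatenate a tag vector with a text vector per element. The sketch below is a toy stand-in under stated assumptions: WebLeaf's real embeddings are learned, whereas this example uses a one-hot tag vector and a hashed bag-of-words, and `TAG_VOCAB`, `TEXT_DIM`, and all function names are hypothetical.

```python
import hashlib

TAG_VOCAB = ["div", "p", "span", "a", "h1"]  # hypothetical tag vocabulary
TEXT_DIM = 8                                  # hypothetical text vector size


def tag_vector(tag):
    """One-hot vector over TAG_VOCAB; all zeros for an unknown tag."""
    return [1.0 if tag == known else 0.0 for known in TAG_VOCAB]


def text_vector(text):
    """Hashed bag-of-words: each word increments one of TEXT_DIM buckets."""
    vec = [0.0] * TEXT_DIM
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[digest[0] % TEXT_DIM] += 1.0
    return vec


def node_features(tag, text):
    """Concatenate the tag part and the text part into one feature vector."""
    return tag_vector(tag) + text_vector(text)


features = node_features("p", "Free shipping on all orders")
print(len(features))  # 13
print(features[:5])   # [0.0, 1.0, 0.0, 0.0, 0.0]
```

The first five entries identify the tag and the remaining eight summarize the text, so two elements can differ in either part independently.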

Basic Usage

Here's a quick example of how to use WebLeaf:

from webleaf import Web

# Load your HTML content
with open('example.html', encoding='utf-8') as f:
    html_content = f.read()

# Create a Web object
web = Web(html_content)

# Extract an element using XPath
leaf = web.leaf(xpath=".//p")

# Extract an element using CSS selectors
leaf_css = web.leaf(css_select="div.card:nth-child(1) > div:nth-child(2) > p:nth-child(1)")

# Compare two elements
similarity = leaf.similarity(leaf_css)
print(f"Similarity: {similarity}")
# Output: Similarity: 1.0

# Find the closest match for an element
path = web.find(leaf)
print(f"Element found at: {path}")
# Output: Element found at: /html/body/div/div/div[1]/div[1]/p

Advanced Features

  • Find Similar Elements: You can also find the top n most similar elements to a given one:

    similar_paths = web.find_n(leaf, n=3)
    print(f"Top 3 similar elements: {similar_paths}")
    # Output: Top 3 similar elements: ['/html/body/div/div/div[1]/div[1]/p', '/html/body/div/div/div[2]/div[1]/p', '/html/body/div/div/div[3]/div[1]/span']
    
  • Distance Measurement: Measure how unique or similar two elements are using mdist():

    distance = leaf.mdist(leaf_css)
    print(f"Distance: {distance}")
    # Output: Distance: 0.0
    

API Documentation

Web(html)

  • Description: Initializes the WebLeaf model with the HTML content, parses the document, and encodes it into a graph representation.
  • Arguments:
    • html (str): The HTML content as a string.

leaf(xpath=None, css_select=None)

  • Description: Retrieves an HTML element as a Leaf object using either an XPath or CSS selector.
  • Arguments:
    • xpath (str): The XPath of the desired element.
    • css_select (str): The CSS selector for the desired element.
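For readers unfamiliar with the `.//p` expression used in the examples, the sketch below shows what it matches, using the limited XPath subset in Python's `xml.etree.ElementTree`. It illustrates the selector only and makes no claim about how WebLeaf resolves selectors internally; the sample markup is made up.

```python
import xml.etree.ElementTree as ET

# ElementTree needs well-formed markup; real-world HTML often is not.
doc = ET.fromstring(
    "<html><body>"
    "<div class='card'><p>First</p></div>"
    "<div class='card'><p>Second</p></div>"
    "</body></html>"
)

first_p = doc.find(".//p")     # first <p> anywhere below the root
all_p = doc.findall(".//p")    # every <p> below the root

print(first_p.text)             # First
print([p.text for p in all_p])  # ['First', 'Second']
```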

similarity(leaf)

  • Description: Computes the similarity score between two Leaf objects based on their embeddings.
  • Returns: A similarity score between 0 and 1.
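A common way to score two embedding vectors is cosine similarity. The source does not state which metric `similarity` uses, so treat the following as an illustrative assumption; `cosine_similarity` is a hypothetical helper, not part of the WebLeaf API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which matches the intuition that an element compared with itself is maximally similar.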

mdist(leaf)

  • Description: Measures the "distance" between two Leaf objects, representing how unique or different they are.

find(leaf)

  • Description: Finds the closest match for a given Leaf object within the HTML structure.
  • Returns: The XPath of the closest matching element.

find_n(leaf, n)

  • Description: Finds the top n most similar elements to a given Leaf object, sorted by similarity.
  • Returns: A list of XPaths for the top n most similar elements.
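Conceptually, a top-n search ranks every candidate element by its similarity to the target and keeps the first n. The sketch below assumes cosine similarity and a plain dictionary of XPath to embedding; `top_n_similar`, the paths, and the vectors are all hypothetical and do not reflect WebLeaf's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_similar(target, candidates, n):
    """Return the n paths whose vectors are most similar to target."""
    ranked = sorted(
        candidates,
        key=lambda path: cosine_similarity(target, candidates[path]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical XPath -> embedding table for a small page.
embeddings = {
    "/html/body/div[1]/p": [1.0, 0.0],
    "/html/body/div[2]/p": [0.8, 0.6],
    "/html/body/div[3]/span": [0.0, 1.0],
}

print(top_n_similar([1.0, 0.0], embeddings, 2))
# ['/html/body/div[1]/p', '/html/body/div[2]/p']
```

With n=1 the same ranking yields the single closest match.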

Running Tests

WebLeaf comes with a suite of unit tests to ensure everything works as expected. These tests cover basic operations like element extraction, similarity comparisons, and graph encoding. To run the tests:

  1. Clone this repository.
  2. Install the required dependencies using pip install -r requirements.txt.
  3. Run the tests using pytest:
pytest

Example Test

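from pathlib import Path

from webleaf import Web

# Assumes an example.html fixture in the working directory.
example_html = Path("example.html").read_text(encoding="utf-8")

def test_leaf_extraction():
    # The selected element should come back as a truthy Leaf object.
    web = Web(example_html)
    leaf = web.leaf(xpath=".//p")
    assert leaf

def test_element_comparison():
    # The same element reached via XPath and via CSS should score as near-identical.
    web = Web(example_html)
    leaf1 = web.leaf(xpath=".//p")
    leaf2 = web.leaf(css_select="div.card:nth-child(1) > div:nth-child(2) > p:nth-child(1)")
    assert leaf1.similarity(leaf2) > 0.9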

Pretrained Model

The WebLeaf model uses a pretrained Graph Convolutional Network (GCN) that has been trained on a diverse set of web pages to learn the structure and semantic relationships within HTML. The model is loaded from product_page_model_4_80.torch and is used to encode HTML elements into embeddings.
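To give a feel for what a graph convolution does, here is a single layer of the standard GCN propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 H W), written in plain Python. Whether the pretrained WebLeaf model uses exactly this variant is an assumption; `gcn_layer`, the tiny graph, and the identity weights are illustrative only.

```python
import math


def gcn_layer(edges, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(features)
    # Undirected neighbourhoods with a self-loop on every node.
    neighbours = [{i} for i in range(n)]
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)
    degree = [len(neighbours[i]) for i in range(n)]

    in_dim = len(features[0])
    out_dim = len(weights[0])
    output = []
    for i in range(n):
        # Symmetrically normalised sum over node i and its neighbours.
        agg = [0.0] * in_dim
        for j in neighbours[i]:
            scale = 1.0 / math.sqrt(degree[i] * degree[j])
            for k, value in enumerate(features[j]):
                agg[k] += scale * value
        # Linear projection followed by ReLU.
        row = [
            max(0.0, sum(agg[k] * weights[k][c] for k in range(in_dim)))
            for c in range(out_dim)
        ]
        output.append(row)
    return output


edges = [(0, 1), (0, 2)]                          # a <div> with two <p> children
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # toy per-node input features
weights = [[1.0, 0.0], [0.0, 1.0]]               # identity, so mixing is visible

for row in gcn_layer(edges, features, weights):
    print([round(v, 3) for v in row])
# [0.333, 0.816]
# [0.408, 0.5]
# [0.408, 0.5]
```

After one layer, the parent's vector carries information from both children and each child's vector carries information from the parent, which is how structural context ends up in an element's embedding.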

Performance

The t-SNE (t-Distributed Stochastic Neighbor Embedding) plot below is a 2D visualization of WebLeaf-encoded web elements. t-SNE projects high-dimensional data, such as the embeddings generated by WebLeaf, into two dimensions, which makes relationships and groupings among different types of web elements easier to see.

[Figure: WebLeaf Performance (t-SNE plot of WebLeaf-encoded web elements)]

Contributing

We welcome contributions! Feel free to submit issues, feature requests, or pull requests. Here's how you can contribute:

  1. Fork the repository.
  2. Create your feature branch: git checkout -b feature/new-feature.
  3. Commit your changes: git commit -m 'Add new feature'.
  4. Push to the branch: git push origin feature/new-feature.
  5. Open a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.


🌿 WebLeaf is a powerful and flexible tool for working with HTML as structured graph data. Give it a try and start leveraging the power of graph neural networks for your web scraping and analysis needs!
