A Python package for token-aware HTML chunking that preserves structure and attributes, with optional cleaning and attribute length control.

The most practical HTML chunking

Our HTML chunking algorithm works in several stages that split and merge HTML content while keeping each chunk within a token limit. It is suited to scenarios where the token budget is tight and the HTML must still parse correctly, such as web automation or navigation tasks that take HTML as input.

Key Features

  • Token-Aware Splitting: By leveraging token counts, the algorithm ensures compatibility with models like GPT-3.5-turbo, allowing efficient processing of large HTML documents.
  • Optional Content-Aware Cleaning: The optional cleaning step minimizes irrelevant content, improving token usage efficiency. This flexibility allows users to adapt the algorithm to their specific use case.
  • DOM Structure Preservation: The algorithm respects the hierarchical structure of the DOM, ensuring that HTML chunks remain contextually valid and mergeable. Retaining the path structure is particularly beneficial for challenging tasks like web understanding and navigation that require full HTML syntax for accurate parsing.
  • Efficient Greedy Merging: The greedy merging process combines chunks in a straightforward, sequential manner, optimizing for larger yet manageable segments of HTML while staying within token constraints.
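
The greedy merging idea can be sketched as follows. This is a minimal illustration, not the package's implementation: `count_tokens` is a stand-in that counts whitespace-separated words (a real setup would use the target model's tokenizer), and `greedy_merge` is a hypothetical helper name.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())


def greedy_merge(
    pieces: List[str],
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Merge adjacent pieces left to right while the running total fits.

    A single piece larger than max_tokens is emitted on its own; this
    sketch does not try to split it further.
    """
    merged: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for piece in pieces:
        n = counter(piece)
        # Close the current chunk when adding this piece would overflow.
        if current and current_tokens + n > max_tokens:
            merged.append("".join(current))
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
    if current:
        merged.append("".join(current))
    return merged


pieces = ["<p>a b</p>", "<p>c</p>", "<p>d e f</p>", "<p>g</p>"]
print(greedy_merge(pieces, max_tokens=3))
# ['<p>a b</p><p>c</p>', '<p>d e f</p>', '<p>g</p>']
```

Because pieces are only ever combined with their immediate neighbours, document order is preserved and the pass is linear in the number of pieces.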

Usage

from html_chunking import get_html_chunks

html = """
<html darker-dark-theme="" darker-dark-theme-deprecate="" lang="en" style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" system-icons="" typography="" typography-spacing=""><body><ytd-app><ytd-masthead class="shell" id="masthead" logo-type="YOUTUBE_LOGO" slot="masthead"><div class="ytd-searchbox-spt" id="search-container" slot="search-container"></div><div class="ytd-searchbox-spt" id="search-input" slot="search-input"><input autocapitalize="none" autocomplete="off" autocorrect="off" hidden="" id="search" name="search_query" spellcheck="false" tabindex="0" type="text"/></div><svg class="external-icon" id="menu-icon" preserveaspectratio="xMidYMid meet"><g class="yt-icons-ext" id="menu" viewbox="0 0 24 24"><path d="M21,6H3V5h18V6z M21,11H3v1h18V11z M21,17H3v1h18V17z"></path></g></svg><div id="masthead-logo" slot="masthead-logo"><span id="country-code"></span></div><div id="masthead-skeleton-icons" slot="masthead-skeleton"><div class="masthead-skeleton-icon"></div><div class="masthead-skeleton-icon"></div><div class="masthead-skeleton-icon"></div></div></ytd-masthead></ytd-app><link href="https://www.youtube.com/s/desktop/536ed9a8/cssbin/www-main-desktop-watch-page-skeleton.css" name="www-main-desktop-watch-page-skeleton" nonce="2kzKHraEELEaWexSX3PyNg" rel="stylesheet"/></body></html>
"""

merged_chunks = get_html_chunks(html, max_tokens=200, is_clean_html=True, attr_cutoff_len=25)
print(merged_chunks)

The get_html_chunks function splits HTML content and merges it into chunks that fit within a token limit, optionally cleaning the HTML and trimming long attributes before chunking. It is particularly useful when working with language models that impose token limits or when preserving the full HTML structure is crucial.

Parameters:

  • html: A string containing the HTML code to be chunked.
  • max_tokens: The maximum token length allowed for each chunk.
  • is_clean_html: A boolean parameter specifying whether to clean the HTML before chunking. If set to True, the function applies a cleaning process that removes hidden elements, styles, scripts, and trims attributes to a specified length. By default, it is set to True, but you can disable this to retain the original HTML without cleaning.
  • attr_cutoff_len: An integer that sets the cutoff length for attributes in the HTML tags, based on string length. This parameter is particularly useful for shortening overly long attributes (such as URLs) by retaining only the essential parts (e.g., domain names) without affecting the core meaning of the attribute. You can also disable this feature by setting it to 0 or not providing it.
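
To make the effect of attr_cutoff_len concrete, here is a small sketch of length-based attribute truncation. It is an assumption-laden illustration, not the package's code: `truncate_attrs` and its `suffix` argument are hypothetical, and the package may treat individual attributes differently (in the sample output in this document, for instance, the `style` attribute is left intact).

```python
from typing import Dict


def truncate_attrs(
    attrs: Dict[str, str], cutoff: int, suffix: str = "..."
) -> Dict[str, str]:
    """Shorten attribute values longer than `cutoff` characters.

    A cutoff of 0 disables truncation, mirroring the documented
    behaviour of attr_cutoff_len. The suffix is an assumption.
    """
    if cutoff <= 0:
        return dict(attrs)
    return {
        name: value if len(value) <= cutoff else value[:cutoff] + suffix
        for name, value in attrs.items()
    }


link_attrs = {
    "href": "https://www.youtube.com/s/desktop/536ed9a8/cssbin/www-main-desktop-watch-page-skeleton.css",
    "rel": "stylesheet",
}
print(truncate_attrs(link_attrs, cutoff=25))
# {'href': 'https://www.youtube.com/s...', 'rel': 'stylesheet'}
```

Short values such as `rel="stylesheet"` pass through unchanged, so only long URLs and similar values lose characters.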

The output consists of several HTML chunks. Each chunk contains valid HTML with its structure and attributes preserved, and any excessively long attributes are truncated to the specified length. For the example above, the output is:

[
    '<html darker-dark-theme="" darker-dark-theme-deprecate="" lang="en" style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" system-icons="" typography="" typography-spacing=""><body><ytd-app><ytd-masthead class="shell" id="masthead" logo-type="YOUTUBE_LOGO" slot="masthead"><div class="ytd-searchbox-spt" id="search-container" slot="search-container"></div><div class="ytd-searchbox-spt" id="search-input" slot="search-input"><input autocapitalize="none" autocomplete="off" autocorrect="off" hidden="" id="search" name="search_query" spellcheck="false" tabindex="0" type="text"/></div></ytd-masthead></ytd-app></body></html>', 

    '<html darker-dark-theme="" darker-dark-theme-deprecate="" lang="en" style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" system-icons="" typography="" typography-spacing=""><body><ytd-app><ytd-masthead class="shell" id="masthead" logo-type="YOUTUBE_LOGO" slot="masthead"><svg class="external-icon" id="menu-icon" preserveaspectratio="xMidYMid meet"><g class="yt-icons-ext" id="menu" viewbox="0 0 24 24"><path d="M21,6H3V5h18V6z M21,11H3v"></path></g></svg><div id="masthead-logo" slot="masthead-logo"><span id="country-code"></span></div></ytd-masthead></ytd-app></body></html>', 
    
    '<html darker-dark-theme="" darker-dark-theme-deprecate="" lang="en" style="font-size: 10px;font-family: Roboto, Arial, sans-serif;" system-icons="" typography="" typography-spacing=""><body><ytd-app><ytd-masthead class="shell" id="masthead" logo-type="YOUTUBE_LOGO" slot="masthead"><div id="masthead-skeleton-icons" slot="masthead-skeleton"><div class="masthead-skeleton-icon"></div><div class="masthead-skeleton-icon"></div><div class="masthead-skeleton-icon"></div></div></ytd-masthead></ytd-app><link href="https://www.youtube.com/s..." name="www-main-desktop-watch-page-skeleton" nonce="2kzKHraEELEaWexSX3PyNg" rel="stylesheet"/></body></html>'
]
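
Each chunk above repeats the opening and closing tags of its ancestors, which is what keeps it parseable on its own. A rough sketch of that wrapping step is shown here; the names `Ancestor`, `open_tag`, and `wrap_in_path` are hypothetical and this is not taken from the package's source.

```python
from html import escape
from typing import Dict, List, Tuple

Ancestor = Tuple[str, Dict[str, str]]  # (tag name, attributes)


def open_tag(tag: str, attrs: Dict[str, str]) -> str:
    """Render an opening tag with its attributes."""
    rendered = "".join(
        f' {name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<{tag}{rendered}>"


def wrap_in_path(path: List[Ancestor], inner_html: str) -> str:
    """Wrap a fragment in its ancestor chain, outermost ancestor first."""
    opening = "".join(open_tag(tag, attrs) for tag, attrs in path)
    # Close the ancestors in reverse order so the tags nest correctly.
    closing = "".join(f"</{tag}>" for tag, _ in reversed(path))
    return opening + inner_html + closing


path = [("html", {"lang": "en"}), ("body", {}), ("ytd-app", {})]
print(wrap_in_path(path, '<div id="masthead-logo"></div>'))
# <html lang="en"><body><ytd-app><div id="masthead-logo"></div></ytd-app></body></html>
```

The repeated ancestor tags cost tokens in every chunk, which is one reason shortening long attributes on those ancestors can matter.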

Comparison with Existing Methods

LangChain (HTMLHeaderTextSplitter & HTMLSectionSplitter) and LlamaIndex (HTMLNodeParser):

  • Limitations: These methods split text at the element level and add metadata for each header relevant to the chunk. However, they extract only the text content and exclude the HTML structure, attributes, and other non-text elements, limiting their use for tasks requiring the full HTML context.
  • Advantage of this Method: Our algorithm preserves the full HTML structure, including the DOM path, tags, and attributes. This makes it far more suitable for tasks like web understanding, where the entire HTML syntax is essential for accurate processing.
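
To illustrate what is lost with text-only extraction in general, the sketch below uses Python's standard html.parser. It is not how LangChain or LlamaIndex are implemented; the `TextOnly` class is a hypothetical example.

```python
from html.parser import HTMLParser
from typing import List


class TextOnly(HTMLParser):
    """Collect visible text and ignore tags and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.parts.append(text)


snippet = '<div id="nav"><a href="/login" class="btn">Sign in</a></div>'
parser = TextOnly()
parser.feed(snippet)
parser.close()
print(parser.parts)
# ['Sign in']
```

The link target, the element ids, and the nesting are all gone from the result, which is the information a web navigation task typically needs.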

google-labs-html-chunker:

  • Limitations: This method also uses BeautifulSoup to parse HTML into a DOM tree. It aggregates text content from leaf nodes and merges it until a word limit is reached. Although it uses a greedy merging approach similar to ours, its output is restricted to text rather than complete HTML.
  • Advantage of this Method: In contrast, our greedy merging algorithm combines HTML chunks while retaining the entire HTML syntax. This allows our method to generate full HTML chunks, making it far more versatile for use cases that require HTML as input, not just the text content.

Contact Us

If you are interested in collaborating with us on this project or have any questions, please feel free to reach out to us. We are open to discussing potential applications, data sharing, and other opportunities for collaboration.

Find Jiarun Liu on his GitHub.
