A high-performance HTML ad cleaner using Adblock rules (Pure Python + lxml).

Renovation-Ad

Renovation-Ad is a high-performance Python library designed to clean HTML by removing ad elements based on standard Adblock rules (e.g., EasyList).

Many libraries slow down badly once they have to handle tens of thousands of rules. Renovation-Ad instead combines a "Content-Aware Filtering" strategy with lxml, which lets it process complex pages against 13,000+ rules in under 0.2 seconds.


✨ Features

  • Extreme Performance: Optimized with a DOM-content-aware pre-filter (Bloom Filter strategy).
  • Lightweight: Pure Python rule engine. No Rust or C++ compiler required for installation.
  • EasyList Support: Supports standard Adblock Plus / EasyList cosmetic rules (##selector).
  • Domain Intelligence: Correctly handles domain-specific rules (example.com##.ad) and exclusions (~example.com##.ad).
  • Flexible Input: Automatically handles rule lists from URLs, local files, or raw strings.
  • Hybrid Parser: Uses lxml for maximum speed with an automatic fallback to BeautifulSoup4.
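
As an illustration of the "Flexible Input" idea, here is a minimal sketch of how a mixed list of entries could be sorted into URLs, raw rules, and file paths. The function name classify_rule_source and the order of the checks are assumptions made for this example; the README does not show how the library actually tells the three apart.

```python
def classify_rule_source(entry):
    """Guess whether an entry is a remote URL, a raw rule, or a file path.

    Hypothetical helper, not part of the renovation_ad API.
    """
    entry = entry.strip()
    if entry.lower().startswith(("http://", "https://")):
        return "url"   # fetch over the network
    if "##" in entry:
        return "rule"  # cosmetic rule given inline
    return "file"      # anything else is treated as a local path


entries = [
    "https://easylist-downloads.adblockplus.org/easylist.txt",
    "./my_custom_rules.txt",
    "##.top-banner-ads",
]
kinds = [classify_rule_source(e) for e in entries]
```

Checking for a URL scheme first keeps the test cheap and avoids touching the filesystem for entries that are clearly not paths.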

🚀 Performance Comparison

In real-world testing on highly commercialized news pages (e.g., Yahoo News) with 13,000+ active rules:

Method                                  Time
Standard BeautifulSoup + Naive Loop     ~115.0 seconds
Renovation-Ad (LXML + Content-Aware)    0.14 seconds

Optimization: By scanning the DOM for existing IDs and Classes first, we reduce the number of CSS queries by over 98%.
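
The pre-scan idea can be illustrated with a small sketch. This is not the library's actual code: it uses the standard library's html.parser instead of lxml so that it has no dependencies, and the names TokenCollector and prune_selectors are hypothetical. It only prunes selectors that are a bare .class or #id and keeps anything more complex, since those cannot be ruled out with a simple set lookup.

```python
import re
from html.parser import HTMLParser

# Matches selectors that are exactly one class or one id, e.g. ".ad" or "#ad".
SIMPLE_SELECTOR = re.compile(r"([#.])([A-Za-z0-9_-]+)")


class TokenCollector(HTMLParser):
    """Collect every id and class name that appears in a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "id":
                self.ids.add(value.strip())
            elif name == "class":
                self.classes.update(value.split())


def prune_selectors(html, selectors):
    """Drop simple selectors whose class or id never occurs in the page."""
    collector = TokenCollector()
    collector.feed(html)
    kept = []
    for selector in selectors:
        match = SIMPLE_SELECTOR.fullmatch(selector)
        if match is None:
            kept.append(selector)  # complex selector: keep it to be safe
            continue
        kind, token = match.groups()
        present = collector.ids if kind == "#" else collector.classes
        if token in present:
            kept.append(selector)
    return kept


html = "<div id='main'><div class='top-banner-ads big'>Ad</div><p>Content</p></div>"
selectors = [".top-banner-ads", ".sidebar-ad", "#main", "#popup", "div[data-ad]"]
kept = prune_selectors(html, selectors)
```

Each lookup is a constant-time set membership test, so the cost of pruning grows with the number of rules but stays far cheaper than running one CSS query per rule.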


📦 Installation

pip install renovation-ad

Note: lxml and cssselect are highly recommended for the best performance:

pip install lxml cssselect

🛠 Usage

Quick Start (Function Interface)

from renovation_ad import clean_html

rules = [
    "https://easylist-downloads.adblockplus.org/easylist.txt", # Remote URL
    "./my_custom_rules.txt",                                  # Local file
    "##.top-banner-ads"                                       # Raw rule string
]

html_content = "<html><body><div class='top-banner-ads'>Ad</div><p>Content</p></body></html>"
page_url = "https://example.com/article"

cleaned_html = clean_html(html_content, page_url, rules)

Advanced Usage (Class Interface)

Initializing the Renovator once is more efficient if you are processing multiple pages with the same rule set.

from renovation_ad import Renovator

# Initialize and load rules (downloads and parses)
renovator = Renovator(
    rules_list=["https://easylist-downloads.adblockplus.org/easylist.txt"],
    dom_parser="lxml" # Default is lxml
)

# Clean multiple pages with the same loaded rules.
# raw_html_1 and raw_html_2 are HTML strings you have already fetched.
html_1 = renovator.clean(raw_html_1, "https://site-a.com")
html_2 = renovator.clean(raw_html_2, "https://site-b.com")

🔍 How it Works

  1. Rule Parsing: The library parses EasyList files into an internal map of domain-specific and generic cosmetic rules.
  2. Content-Aware Filtering: Before running CSS selectors, Renovation-Ad scans the HTML for all present id and class attributes.
  3. Selector Pruning: Rules targeting classes or IDs not present in the current document are skipped entirely.
  4. Batch Execution: Remaining selectors are bundled into large batches (e.g., 500 per group) and executed via lxml's highly optimized C engine.
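
The batching in step 4 can be sketched in a few lines. The helper name batch_selectors is hypothetical, and the default of 500 simply mirrors the example figure above; how the library actually groups its selectors is not shown in this README.

```python
def batch_selectors(selectors, batch_size=500):
    """Join selectors into comma-separated groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        ", ".join(selectors[start:start + batch_size])
        for start in range(0, len(selectors), batch_size)
    ]


# 1,200 surviving selectors become three grouped queries instead of 1,200.
surviving = [f".ad-{i}" for i in range(1200)]
batches = batch_selectors(surviving)
```

A CSS selector group such as ".a, .b" matches any element that matches at least one of its members, so one query per batch replaces up to 500 individual queries. With lxml and cssselect installed, each batch string could be passed to the cssselect() method of an lxml.html element. One caveat of grouping is that a single invalid selector makes the whole group fail to parse, so malformed rules would need to be filtered out beforehand.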

📜 Supported Rule Syntax

Syntax                           Description
##.ad-class                      Hide all elements with class ad-class (generic)
###ad-id                         Hide the element with ID ad-id
example.com##.sidebar-ad         Hide only on example.com
~example.com##.global-ad         Hide everywhere except example.com
domain1.com,domain2.com##.ad     Hide on multiple specific domains
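
To make the syntax above concrete, here is a minimal sketch of parsing one cosmetic rule and deciding whether it applies to a page. The function names are hypothetical and this is not the library's parser. It assumes that a rule for example.com also applies to its subdomains, which is how Adblock Plus treats domains, and it skips comments and exception rules (#@#).

```python
from urllib.parse import urlparse


def parse_cosmetic_rule(line):
    """Split 'domains##selector' into (include, exclude, selector), else None."""
    line = line.strip()
    if not line or line.startswith("!") or "#@#" in line or "##" not in line:
        return None  # blank line, comment, exception rule, or non-cosmetic rule
    domains, selector = line.split("##", 1)
    include, exclude = set(), set()
    for domain in filter(None, domains.split(",")):
        domain = domain.strip().lower()
        if domain.startswith("~"):
            exclude.add(domain[1:])  # "~example.com" means "not on example.com"
        else:
            include.add(domain)
    return include, exclude, selector


def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def rule_applies(rule, page_url):
    """Decide whether a parsed rule should run on the given page URL."""
    include, exclude, _selector = rule
    host = (urlparse(page_url).hostname or "").lower()
    if any(host_matches(host, d) for d in exclude):
        return False
    # No include list means the rule is generic and applies everywhere else.
    return not include or any(host_matches(host, d) for d in include)


sidebar = parse_cosmetic_rule("example.com##.sidebar-ad")
global_ad = parse_cosmetic_rule("~example.com##.global-ad")
```

Note that "###ad-id" parses to an empty domain list and the selector "#ad-id": the first two "#" characters are the separator and the third belongs to the ID selector.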

🛠 Dependencies

  • requests: For fetching remote rule lists.
  • lxml: For high-speed DOM manipulation.
  • cssselect: For translating CSS selectors to XPath.
  • beautifulsoup4: Provided as a fallback parser.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
