
🚀 AI-powered web scraping with modern CSS support. Extract content from any website using GPT-4; handles CSS Grid/Flexbox layouts, Tailwind CSS, and complex selectors automatically.


html2rss-ai

🚀 Smart web scraping meets AI intelligence - Extract structured content from any website using OpenAI's GPT models. Intelligently identifies content patterns and extracts articles, blog posts, job listings, news items, and other repeating content with modern CSS layout support.

โญ Latest Update: Enhanced CSS Grid/Flexbox recognition and automatic CSS selector sanitization for modern websites like Satispay, Tailwind CSS sites, and complex layouts.

Features

  • 🤖 AI-Powered Pattern Recognition: Uses OpenAI GPT-4 to intelligently identify content patterns
  • 🎨 Modern CSS Layout Support: Recognizes CSS Grid, Flexbox, and Tailwind CSS structures
  • 🛠️ Automatic CSS Sanitization: Handles malformed selectors (e.g., my-1.5 class names)
  • 🔄 Smart Caching: Caches pattern analysis to avoid repeated API calls
  • 📅 Advanced Date Extraction: Extracts publication dates with fallback strategies
  • 🎯 Universal Compatibility: Works with any website structure, from legacy HTML to modern SPAs
  • 📊 Confidence Scoring: Provides accuracy metrics for extraction reliability
  • 🚀 Async Support: Built with asyncio for efficient processing
  • 🔍 Multiple Link Extraction: Finds all content items in container elements

Installation

Prerequisites

  • Python 3.8+
  • OpenAI API key
  • Playwright (for web content extraction)

Install the package

# Clone the repository
git clone https://github.com/your-username/html2rss-ai.git
cd html2rss-ai

# Install dependencies
pip install -e .

Install Playwright browsers

This project uses Playwright for web content extraction. You need to install the browser binaries:

# Install Playwright browsers
playwright install

Set up your OpenAI API key

export OPENAI_API_KEY="your-openai-api-key-here"

Quick Start

Basic Usage

import asyncio
from html2rss_ai.extractor import UniversalPatternExtractor

async def main():
    # Initialize the extractor
    extractor = UniversalPatternExtractor()
    
    # Extract articles from a blog
    url = "https://example-blog.com/posts/"
    result = await extractor.extract_pattern_links(url)
    
    # Print results
    print(f"Found {len(result.items)} articles")
    for item in result.items:
        print(f"- {item.title}: {item.url}")

# Run the extraction
asyncio.run(main())

Example: Extract from ordep.dev blog

We've included a complete example that demonstrates extracting articles from the ordep.dev blog:

# Run the example
python examples/extract_ordep_blog.py

This example will:

  • Extract all blog posts from ordep.dev
  • Display them in a formatted list
  • Save the results to ordep_blog_articles.json
  • Show extraction statistics and confidence scores

Sample output:

๐Ÿ” Extracting articles from: https://ordep.dev/posts/
============================================================
๐Ÿ“Š Extraction Results:
   Pattern Type: blog_posts
   Confidence Score: 0.85
   Total Items Found: 25
   Page Title: Writing - ordep.dev

๐Ÿ“ Articles Found:
------------------------------------------------------------
 1. Writing Code Was Never The Bottleneck
    URL: https://ordep.dev/posts/writing-code-was-never-the-bottleneck/
    Date: 2025-06-30

 2. Writing More Often
    URL: https://ordep.dev/posts/writing-more-often/
    Date: 2025-06-26
...

Modern Website Support

html2rss-ai is specifically designed to handle modern web layouts that traditional scrapers struggle with:

✅ CSS Grid & Flexbox Layouts

  • Automatically detects: grid-template-columns, grid-cols-*, flex patterns
  • Example: Job listings on Satispay, product cards, article grids
  • Works with: Tailwind CSS, Bootstrap, custom CSS frameworks

✅ Complex CSS Selectors

  • Auto-sanitizes: Problematic selectors like li.my-1.5.text-md .date
  • Converts to: Valid attribute selectors [class~="my-1.5"]
  • Handles: Tailwind's decimal classes, custom naming conventions
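The sanitization idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the library's actual implementation: class tokens containing a decimal point are rewritten as attribute selectors, so the dot is no longer parsed as a class boundary.

```python
import re

# Tokens like `.my-1.5`: a leading dot, a class name, then `.digits`,
# which a CSS parser would otherwise split into two separate classes.
DECIMAL_CLASS = re.compile(r"\.([\w-]+\.\d+)")

def sanitize_selector(selector: str) -> str:
    """Rewrite decimal-point class tokens into [class~="..."] form."""
    return DECIMAL_CLASS.sub(lambda m: f'[class~="{m.group(1)}"]', selector)

print(sanitize_selector("li.my-1.5.text-md .date"))
# li[class~="my-1.5"].text-md .date
```

The `~=` attribute selector matches a whitespace-separated class token, so `[class~="my-1.5"]` selects the same elements the broken `.my-1.5` was meant to.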

✅ Container-based Content

  • Finds all links: In grid containers, card layouts, list structures
  • Before: Only extracted first item per container
  • Now: Extracts all items (e.g., 20 job listings instead of 1)
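The container behavior can be illustrated with BeautifulSoup, which the library uses for HTML parsing. A sketch with made-up Tailwind-style markup: selecting every anchor inside the container yields all items, where a first-match query would return only one.

```python
from bs4 import BeautifulSoup

# Hypothetical grid markup in the style of a modern job board
html = """
<ul class="grid grid-cols-3">
  <li class="my-1.5"><a href="/jobs/1">Backend Engineer</a></li>
  <li class="my-1.5"><a href="/jobs/2">Data Analyst</a></li>
  <li class="my-1.5"><a href="/jobs/3">Product Designer</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.select_one("ul.grid")

# select_one("a") would return only the first listing;
# select("a[href]") walks the whole container.
links = [(a.get_text(strip=True), a["href"]) for a in container.select("a[href]")]
print(len(links))  # 3
```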

🎯 Real-world Examples

Job Listings (Satispay-style):

# Extracts all 20+ job positions from modern job boards
result = await extractor.extract_pattern_links("https://company.com/careers")
# Before: 1 job found
# After: 20+ jobs found ✅

E-commerce Product Grids:

# Handles CSS Grid product layouts
result = await extractor.extract_pattern_links("https://shop.com/products")
# Recognizes: grid-cols-3, flex-wrap, card layouts

Blog Post Lists:

# Works with modern CSS frameworks
result = await extractor.extract_pattern_links("https://blog.com/posts")  
# Handles: Tailwind, styled-components, CSS modules

API Reference

UniversalPatternExtractor

The main class for extracting content patterns from web pages.

Constructor

UniversalPatternExtractor(
    openai_api_key: str | None = None,
    cache_dir: str = "pattern_cache"
)

Parameters:

  • openai_api_key: Your OpenAI API key (defaults to OPENAI_API_KEY environment variable)
  • cache_dir: Directory to store cached pattern analysis (default: "pattern_cache")

Methods

extract_pattern_links(url: str, force_regenerate: bool = False) -> ExtractedPattern

Extract patterned links from a webpage.

Parameters:

  • url: The webpage URL to extract from
  • force_regenerate: Force regeneration of pattern analysis (default: False)

Returns: ExtractedPattern object containing:

  • page_title: Title of the webpage
  • feed_url: Original URL
  • pattern: Pattern analysis information
  • items: List of extracted FeedItem objects

to_json(result: ExtractedPattern) -> dict

Convert extraction result to JSON format.

FeedItem

Represents an extracted content item:

class FeedItem:
    url: str           # Full URL of the item
    title: str         # Title/heading of the item
    publication_date: str | None  # Publication date if found

Advanced Usage

Custom Cache Directory

extractor = UniversalPatternExtractor(cache_dir="my_cache")

Force Pattern Regeneration

# Force the AI to re-analyze the page structure
result = await extractor.extract_pattern_links(url, force_regenerate=True)

Direct JSON Output

import json

from html2rss_ai.extractor import extract_pattern_links

# Get JSON directly
json_result = await extract_pattern_links("https://example.com")
print(json.dumps(json_result, indent=2))

Configuration

Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key (required)

Logging

The library uses Python's logging module. Configure logging level:

import logging
logging.basicConfig(level=logging.INFO)

Troubleshooting

Playwright Issues

If you encounter issues with web content extraction:

  1. Make sure Playwright browsers are installed:

    playwright install
    
  2. Check browser installation:

    playwright --version
    
  3. For headless environments (Docker, CI):

    playwright install-deps
    
  4. Common issues:

    • Permission denied: Run sudo playwright install-deps on Linux
    • Browser not found: Ensure you've run playwright install
    • Timeout errors: Some sites may take longer to load, consider increasing timeouts
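For slow sites, one option is to wrap the extraction call with progressively longer deadlines. A minimal sketch using only asyncio; extract_with_retry is a hypothetical helper, not part of the library:

```python
import asyncio

async def extract_with_retry(make_coro, timeouts=(30, 60, 120)):
    """Run make_coro() under increasing timeouts, retrying on expiry."""
    last_exc = None
    for timeout in timeouts:
        try:
            return await asyncio.wait_for(make_coro(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_exc = exc  # too slow: try again with a longer deadline
    raise last_exc

# Usage (assuming an extractor instance):
# result = await extract_with_retry(
#     lambda: extractor.extract_pattern_links("https://slow-site.com")
# )
```

A factory is passed rather than a coroutine object because a coroutine cancelled by a timeout cannot be awaited again.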

Examples

Extract News Articles

async def extract_news():
    extractor = UniversalPatternExtractor()
    result = await extractor.extract_pattern_links("https://news-site.com")
    
    for item in result.items:
        print(f"📰 {item.title}")
        if item.publication_date:
            print(f"   📅 {item.publication_date}")

Extract Product Listings

async def extract_products():
    extractor = UniversalPatternExtractor()
    result = await extractor.extract_pattern_links("https://ecommerce-site.com/products")
    
    print(f"Found {len(result.items)} products")
    for item in result.items:
        print(f"🛍️ {item.title} - {item.url}")

How It Works

  1. 🌐 HTML Extraction: Downloads and parses webpage HTML with JavaScript support
  2. 🔍 Advanced Structure Analysis:
    • Analyzes link patterns and HTML structure
    • NEW: Detects CSS Grid/Flexbox layouts (grid-cols-*, flex, etc.)
    • NEW: Identifies modern CSS frameworks (Tailwind, Bootstrap)
  3. 🤖 Enhanced AI Pattern Recognition:
    • Uses OpenAI GPT-4 with improved prompts for modern layouts
    • NEW: Recognizes non-semantic structures (divs with CSS classes)
    • NEW: Understands container-based content organization
  4. 💾 Smart Pattern Caching: Caches successful patterns for 7-day reuse
  5. ⚡ Robust Content Extraction:
    • NEW: CSS selector sanitization (my-1.5 → [class~="my-1.5"])
    • NEW: Multiple link extraction per container
    • NEW: Fallback strategies for complex selectors
  6. 📅 Advanced Date Extraction:
    • NEW: Sanitized date selectors with retry logic
    • Multiple date format support with fallback patterns
  7. 📊 Structured Output: Returns JSON with URLs, titles, dates, and confidence scores
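The caching step (4) can be sketched as follows. This is an illustrative layout, assuming one JSON file per URL keyed by a hash of the URL; the library's actual cache format may differ.

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_TTL = 7 * 24 * 3600  # the 7-day reuse window described above

def cache_path(url: str, cache_dir: str = "pattern_cache") -> Path:
    # One JSON file per URL, named by a hash so any URL is a safe filename
    name = hashlib.sha256(url.encode()).hexdigest()
    return Path(cache_dir) / f"{name}.json"

def save_pattern(url: str, pattern: dict, cache_dir: str = "pattern_cache") -> None:
    path = cache_path(url, cache_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(pattern))

def load_cached_pattern(url: str, cache_dir: str = "pattern_cache"):
    """Return the cached analysis for url, or None if missing or stale."""
    path = cache_path(url, cache_dir)
    if not path.exists():
        return None
    if time.time() - path.stat().st_mtime > CACHE_TTL:
        return None  # stale entry: trigger a fresh AI analysis
    return json.loads(path.read_text())
```

A cache hit skips the GPT-4 call entirely, which is what makes repeated extractions from the same site cheap.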

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with OpenAI GPT models
  • Uses BeautifulSoup for HTML parsing
  • Uses Playwright for web content extraction
  • Inspired by RSS feed generation tools

Download files

Download the file for your platform.

Source Distribution

html2rss_ai-0.0.1.tar.gz (29.4 kB)

Built Distribution

html2rss_ai-0.0.1-py3-none-any.whl (16.9 kB)
File details for html2rss_ai-0.0.1.tar.gz

File metadata

  • Size: 29.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

  • SHA256: 212acca357d89866c34b54820a0f24baa0f1405d75ec3c50c015f5bcd5e2ae26
  • MD5: ceb2014d07fd2dc2617323a32ac1231b
  • BLAKE2b-256: 6f485fdbc2b5a6a96b98fd567f572915221b2a21a147716be8a2d6a9c1af775a

Provenance

The following attestation bundles were made for html2rss_ai-0.0.1.tar.gz:

  • Publisher: release.yml on mazzasaverio/html2rss-ai
  • Attestations reflect the state when the release was signed and may no longer be current.

File details for html2rss_ai-0.0.1-py3-none-any.whl

File metadata

  • Size: 16.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

  • SHA256: 50b742e0ca07aaceb905cc8e1ad0d836c41b373e7398dbfbc47d77b8e86542e3
  • MD5: 001e2162ee1d90f3985877db465bf462
  • BLAKE2b-256: 384911c84dbf15896396e75a238e9f3b9e68498e19245e5bcaafd4b7402ddf5f

Provenance

The following attestation bundles were made for html2rss_ai-0.0.1-py3-none-any.whl:

  • Publisher: release.yml on mazzasaverio/html2rss-ai
  • Attestations reflect the state when the release was signed and may no longer be current.
