AI-powered web scraping with modern CSS support. Extract content from any website using GPT-4; handles CSS Grid/Flexbox layouts, Tailwind CSS, and complex selectors automatically.
html2rss-ai
Smart web scraping meets AI intelligence: extract structured content from any website using OpenAI's GPT models. The extractor intelligently identifies content patterns and extracts articles, blog posts, job listings, news items, and other repeating content, with modern CSS layout support.
Latest Update: Enhanced CSS Grid/Flexbox recognition and automatic CSS selector sanitization for modern websites such as Satispay, Tailwind CSS sites, and complex layouts.
Features
- AI-Powered Pattern Recognition: Uses OpenAI GPT-4 to intelligently identify content patterns
- Modern CSS Layout Support: Recognizes CSS Grid, Flexbox, and Tailwind CSS structures
- Automatic CSS Sanitization: Handles malformed selectors (e.g., `my-1.5` class names)
- Smart Caching: Caches pattern analysis to avoid repeated API calls
- Advanced Date Extraction: Extracts publication dates with fallback strategies
- Universal Compatibility: Works with any website structure, from legacy HTML to modern SPAs
- Confidence Scoring: Provides accuracy metrics for extraction reliability
- Async Support: Built with asyncio for efficient processing
- Multiple Link Extraction: Finds all content items in container elements
Installation
Prerequisites
- Python 3.8+
- OpenAI API key
- Playwright (for web content extraction)
Install the package
# Clone the repository
git clone https://github.com/your-username/html2rss-ai.git
cd html2rss-ai
# Install dependencies
pip install -e .
Install Playwright browsers
This project uses Playwright for web content extraction. You need to install the browser binaries:
# Install Playwright browsers
playwright install
Set up your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key-here"
Quick Start
Basic Usage
import asyncio
from html2rss_ai.extractor import UniversalPatternExtractor

async def main():
    # Initialize the extractor
    extractor = UniversalPatternExtractor()

    # Extract articles from a blog
    url = "https://example-blog.com/posts/"
    result = await extractor.extract_pattern_links(url)

    # Print results
    print(f"Found {len(result.items)} articles")
    for item in result.items:
        print(f"- {item.title}: {item.url}")

# Run the extraction
asyncio.run(main())
Example: Extract from ordep.dev blog
We've included a complete example that demonstrates extracting articles from the ordep.dev blog:
# Run the example
python examples/extract_ordep_blog.py
This example will:
- Extract all blog posts from ordep.dev
- Display them in a formatted list
- Save the results to `ordep_blog_articles.json`
- Show extraction statistics and confidence scores
Sample output:
Extracting articles from: https://ordep.dev/posts/
============================================================
Extraction Results:
Pattern Type: blog_posts
Confidence Score: 0.85
Total Items Found: 25
Page Title: Writing - ordep.dev
Articles Found:
------------------------------------------------------------
1. Writing Code Was Never The Bottleneck
URL: https://ordep.dev/posts/writing-code-was-never-the-bottleneck/
Date: 2025-06-30
2. Writing More Often
URL: https://ordep.dev/posts/writing-more-often/
Date: 2025-06-26
...
Modern Website Support
html2rss-ai is specifically designed to handle modern web layouts that traditional scrapers struggle with:
CSS Grid & Flexbox Layouts
- Automatically detects: `grid-template-columns`, `grid-cols-*`, and `flex` patterns
- Example: job listings on Satispay, product cards, article grids
- Works with: Tailwind CSS, Bootstrap, and custom CSS frameworks
Complex CSS Selectors
- Auto-sanitizes: problematic selectors like `li.my-1.5.text-md .date`
- Converts to: valid attribute selectors such as `[class~="my-1.5"]`
- Handles: Tailwind's decimal classes and custom naming conventions
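The sanitization step can be sketched as a small regex pass: a class token whose tail continues with `.<digit>` (Tailwind's decimal utilities) must be one class name, because CSS class names cannot start with a digit; such tokens are rewritten as attribute selectors. This is an illustrative sketch under that assumption, not the library's actual implementation, and `sanitize_selector` is a hypothetical helper:

```python
import re

# A "." segment followed by ".<digit>" belongs to the previous class name:
# ".my-1.5" is one Tailwind class ("my-1.5"), not classes "my-1" and "5".
_DECIMAL_CLASS = re.compile(r"\.([A-Za-z_-][\w-]*(?:\.\d[\w-]*)+)")

def sanitize_selector(selector: str) -> str:
    # Rewrite each decimal-class token as a valid attribute selector.
    return _DECIMAL_CLASS.sub(lambda m: f'[class~="{m.group(1)}"]', selector)

print(sanitize_selector("li.my-1.5.text-md .date"))
# li[class~="my-1.5"].text-md .date
```

Ordinary classes like `.text-md` and `.date` pass through untouched, since they are never followed by a `.<digit>` continuation.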
Container-based Content
- Finds all links: in grid containers, card layouts, and list structures
- Before: only the first item per container was extracted
- Now: extracts all items (e.g., 20 job listings instead of 1)
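The "all links per container" behaviour can be illustrated with a stdlib-only sketch: once the parser enters an element whose class list hints at a grid/flex container, it collects every anchor inside it instead of stopping at the first. The class-name heuristic and sample HTML below are assumptions for the demo, not the library's real logic:

```python
from html.parser import HTMLParser

class ContainerLinkExtractor(HTMLParser):
    """Collect every <a href> inside grid/flex-like containers."""

    CONTAINER_HINTS = ("grid", "flex")  # assumed class-name heuristic

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a matching container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "")
        if self.depth or any(h in classes for h in self.CONTAINER_HINTS):
            self.depth += 1
            if tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

html = """
<div class="grid grid-cols-3">
  <a href="/job/1">Backend Engineer</a>
  <a href="/job/2">Data Scientist</a>
  <a href="/job/3">Product Manager</a>
</div>
<a href="/about">About</a>
"""
parser = ContainerLinkExtractor()
parser.feed(html)
print(parser.links)
# ['/job/1', '/job/2', '/job/3']
```

Note how the `/about` link outside the container is ignored, while all three items inside the grid are collected.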
Real-world Examples
Job Listings (Satispay-style):
# Extracts all 20+ job positions from modern job boards
result = await extractor.extract_pattern_links("https://company.com/careers")
# Before: 1 job found
# After: 20+ jobs found โ
E-commerce Product Grids:
# Handles CSS Grid product layouts
result = await extractor.extract_pattern_links("https://shop.com/products")
# Recognizes: grid-cols-3, flex-wrap, card layouts
Blog Post Lists:
# Works with modern CSS frameworks
result = await extractor.extract_pattern_links("https://blog.com/posts")
# Handles: Tailwind, styled-components, CSS modules
API Reference
UniversalPatternExtractor
The main class for extracting content patterns from web pages.
Constructor
UniversalPatternExtractor(
openai_api_key: str | None = None,
cache_dir: str = "pattern_cache"
)
Parameters:
- `openai_api_key`: Your OpenAI API key (defaults to the `OPENAI_API_KEY` environment variable)
- `cache_dir`: Directory to store cached pattern analysis (default: `"pattern_cache"`)
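The documented fallback can be mirrored in a couple of lines; `resolve_api_key` is a hypothetical helper showing the precedence (explicit argument first, then the environment), not the library's own code:

```python
import os

def resolve_api_key(openai_api_key=None):
    # Explicit argument wins; otherwise fall back to the environment variable.
    key = openai_api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("Provide openai_api_key or set OPENAI_API_KEY")
    return key

print(resolve_api_key("sk-example"))
# sk-example
```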
Methods
extract_pattern_links(url: str, force_regenerate: bool = False) -> ExtractedPattern
Extract patterned links from a webpage.
Parameters:
- `url`: The webpage URL to extract from
- `force_regenerate`: Force regeneration of the pattern analysis (default: False)
Returns: an `ExtractedPattern` object containing:
- `page_title`: Title of the webpage
- `feed_url`: Original URL
- `pattern`: Pattern analysis information
- `items`: List of extracted `FeedItem` objects
to_json(result: ExtractedPattern) -> dict
Convert extraction result to JSON format.
FeedItem
Represents an extracted content item:
class FeedItem:
    url: str                      # Full URL of the item
    title: str                    # Title/heading of the item
    publication_date: str | None  # Publication date if found
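For reference, the documented shapes of `FeedItem` and `ExtractedPattern` could be mirrored with plain dataclasses. This is a sketch of the data model described above; the library's own classes may differ in detail:

```python
from __future__ import annotations

from dataclasses import dataclass, asdict

@dataclass
class FeedItem:
    url: str
    title: str
    publication_date: str | None = None

@dataclass
class ExtractedPattern:
    page_title: str
    feed_url: str
    pattern: dict
    items: list  # list of FeedItem

def to_json(result: ExtractedPattern) -> dict:
    # Mirrors the documented to_json(result) helper (sketch).
    return {
        "page_title": result.page_title,
        "feed_url": result.feed_url,
        "pattern": result.pattern,
        "items": [asdict(item) for item in result.items],
    }

result = ExtractedPattern(
    page_title="Writing - ordep.dev",
    feed_url="https://ordep.dev/posts/",
    pattern={"pattern_type": "blog_posts", "confidence_score": 0.85},
    items=[FeedItem("https://ordep.dev/posts/writing-more-often/",
                    "Writing More Often", "2025-06-26")],
)
print(to_json(result)["items"][0]["title"])
# Writing More Often
```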
Advanced Usage
Custom Cache Directory
extractor = UniversalPatternExtractor(cache_dir="my_cache")
Force Pattern Regeneration
# Force the AI to re-analyze the page structure
result = await extractor.extract_pattern_links(url, force_regenerate=True)
Direct JSON Output
import json

from html2rss_ai.extractor import extract_pattern_links

# Get JSON directly
json_result = await extract_pattern_links("https://example.com")
print(json.dumps(json_result, indent=2))
Configuration
Environment Variables
OPENAI_API_KEY: Your OpenAI API key (required)
Logging
The library uses Python's logging module. Configure logging level:
import logging
logging.basicConfig(level=logging.INFO)
Troubleshooting
Playwright Issues
If you encounter issues with web content extraction:
- Make sure Playwright browsers are installed: `playwright install`
- Check your Playwright installation: `playwright --version`
- For headless environments (Docker, CI), install system dependencies: `playwright install-deps`

Common issues:
- Permission denied: run `sudo playwright install-deps` on Linux
- Browser not found: ensure you have run `playwright install`
- Timeout errors: some sites may take longer to load; consider increasing timeouts
Examples
Extract News Articles
async def extract_news():
    extractor = UniversalPatternExtractor()
    result = await extractor.extract_pattern_links("https://news-site.com")
    for item in result.items:
        print(f"Article: {item.title}")
        if item.publication_date:
            print(f"  Published: {item.publication_date}")
Extract Product Listings
async def extract_products():
    extractor = UniversalPatternExtractor()
    result = await extractor.extract_pattern_links("https://ecommerce-site.com/products")
    print(f"Found {len(result.items)} products")
    for item in result.items:
        print(f"- {item.title}: {item.url}")
How It Works
- HTML Extraction: Downloads and parses webpage HTML with JavaScript support
- Advanced Structure Analysis:
  - Analyzes link patterns and HTML structure
  - NEW: Detects CSS Grid/Flexbox layouts (`grid-cols-*`, `flex`, etc.)
  - NEW: Identifies modern CSS frameworks (Tailwind, Bootstrap)
- Enhanced AI Pattern Recognition:
  - Uses OpenAI GPT-4 with improved prompts for modern layouts
  - NEW: Recognizes non-semantic structures (divs with CSS classes)
  - NEW: Understands container-based content organization
- Smart Pattern Caching: Caches successful patterns for 7-day reuse
- Robust Content Extraction:
  - NEW: CSS selector sanitization (`my-1.5` → `[class~="my-1.5"]`)
  - NEW: Multiple link extraction per container
  - NEW: Fallback strategies for complex selectors
- Advanced Date Extraction:
  - NEW: Sanitized date selectors with retry logic
  - Multiple date format support with fallback patterns
- Structured Output: Returns JSON with URLs, titles, dates, and confidence scores
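The 7-day pattern reuse above can be sketched as a file-per-URL cache with an mtime-based TTL check. The file layout and JSON format here are illustrative assumptions, not the library's actual cache format:

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

CACHE_TTL = 7 * 24 * 3600  # seven days, matching the documented reuse window

def cache_path(cache_dir, url):
    # One JSON file per URL, keyed by a stable hash of the URL.
    return Path(cache_dir) / (hashlib.sha256(url.encode()).hexdigest() + ".json")

def save_pattern(cache_dir, url, pattern):
    path = cache_path(cache_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(pattern))

def load_cached_pattern(cache_dir, url):
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    if time.time() - path.stat().st_mtime > CACHE_TTL:
        return None  # stale entry: trigger a fresh AI analysis
    return json.loads(path.read_text())

cache_dir = tempfile.mkdtemp()
save_pattern(cache_dir, "https://example.com", {"pattern_type": "blog_posts"})
print(load_cached_pattern(cache_dir, "https://example.com"))
# {'pattern_type': 'blog_posts'}
```

Expiry is checked lazily on read, so a stale file simply falls through to a new analysis and gets overwritten.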
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with OpenAI GPT models
- Uses BeautifulSoup for HTML parsing
- Uses Playwright for web content extraction
- Inspired by RSS feed generation tools
File details
Details for the file html2rss_ai-0.0.1.tar.gz.
File metadata
- Download URL: html2rss_ai-0.0.1.tar.gz
- Upload date:
- Size: 29.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 212acca357d89866c34b54820a0f24baa0f1405d75ec3c50c015f5bcd5e2ae26 |
| MD5 | ceb2014d07fd2dc2617323a32ac1231b |
| BLAKE2b-256 | 6f485fdbc2b5a6a96b98fd567f572915221b2a21a147716be8a2d6a9c1af775a |
Provenance
The following attestation bundles were made for html2rss_ai-0.0.1.tar.gz:

Publisher: release.yml on mazzasaverio/html2rss-ai

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: html2rss_ai-0.0.1.tar.gz
- Subject digest: 212acca357d89866c34b54820a0f24baa0f1405d75ec3c50c015f5bcd5e2ae26
- Sigstore transparency entry: 264700305
- Sigstore integration time:
- Permalink: mazzasaverio/html2rss-ai@80209cf2946b8595e44d5bdd9d2c9c7a59f76e19
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/mazzasaverio
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@80209cf2946b8595e44d5bdd9d2c9c7a59f76e19
- Trigger Event: push
File details
Details for the file html2rss_ai-0.0.1-py3-none-any.whl.
File metadata
- Download URL: html2rss_ai-0.0.1-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 50b742e0ca07aaceb905cc8e1ad0d836c41b373e7398dbfbc47d77b8e86542e3 |
| MD5 | 001e2162ee1d90f3985877db465bf462 |
| BLAKE2b-256 | 384911c84dbf15896396e75a238e9f3b9e68498e19245e5bcaafd4b7402ddf5f |
Provenance
The following attestation bundles were made for html2rss_ai-0.0.1-py3-none-any.whl:

Publisher: release.yml on mazzasaverio/html2rss-ai

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: html2rss_ai-0.0.1-py3-none-any.whl
- Subject digest: 50b742e0ca07aaceb905cc8e1ad0d836c41b373e7398dbfbc47d77b8e86542e3
- Sigstore transparency entry: 264700307
- Sigstore integration time:
- Permalink: mazzasaverio/html2rss-ai@80209cf2946b8595e44d5bdd9d2c9c7a59f76e19
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/mazzasaverio
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@80209cf2946b8595e44d5bdd9d2c9c7a59f76e19
- Trigger Event: push