AI-powered web scraping with modern CSS support. Extract content from any website using GPT-4; handles CSS Grid/Flexbox layouts, Tailwind CSS, and complex selectors automatically.
html2rss-ai
Smart web scraping meets AI intelligence: extract structured content from any website using OpenAI's GPT models. The extractor intelligently identifies content patterns and extracts articles, blog posts, job listings, news items, and other repeating content, with modern CSS layout support.
Latest Update: Enhanced CSS Grid/Flexbox recognition and automatic CSS selector sanitization for modern websites such as Satispay, Tailwind CSS sites, and complex layouts.
Features
- AI-Powered Pattern Recognition: Uses OpenAI GPT-4 to intelligently identify content patterns
- Modern CSS Layout Support: Recognizes CSS Grid, Flexbox, and Tailwind CSS structures
- Automatic CSS Sanitization: Handles malformed selectors (e.g., `my-1.5` class names)
- Smart Caching: Caches pattern analysis to avoid repeated API calls
- Advanced Date Extraction: Extracts publication dates with fallback strategies
- Universal Compatibility: Works with any website structure, from legacy HTML to modern SPAs
- Confidence Scoring: Provides accuracy metrics for extraction reliability
- Async Support: Built with asyncio for efficient processing
- Multiple Link Extraction: Finds all content items in container elements
Installation
Prerequisites
- Python 3.8+
- OpenAI API key
- Playwright (for web content extraction)
Install the package
# Clone the repository
git clone https://github.com/your-username/html2rss-ai.git
cd html2rss-ai
# Install dependencies
pip install -e .
Install Playwright browsers
This project uses Playwright for web content extraction. You need to install the browser binaries:
# Install Playwright browsers
playwright install
Set up your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key-here"
Quick Start
Basic Usage
import asyncio
from html2rss_ai.extractor import UniversalPatternExtractor

async def main():
    # Initialize the extractor
    extractor = UniversalPatternExtractor()

    # Extract articles from a blog
    url = "https://example-blog.com/posts/"
    result = await extractor.extract_pattern_links(url)

    # Print results
    print(f"Found {len(result.items)} articles")
    for item in result.items:
        print(f"- {item.title}: {item.url}")

# Run the extraction
asyncio.run(main())
Example: Extract from ordep.dev blog
We've included a complete example that demonstrates extracting articles from the ordep.dev blog:
# Run the example
python examples/extract_ordep_blog.py
This example will:
- Extract all blog posts from ordep.dev
- Display them in a formatted list
- Save the results to `ordep_blog_articles.json`
- Show extraction statistics and confidence scores
Sample output:
Extracting articles from: https://ordep.dev/posts/
============================================================
Extraction Results:
Pattern Type: blog_posts
Confidence Score: 0.85
Total Items Found: 25
Page Title: Writing - ordep.dev
Articles Found:
------------------------------------------------------------
1. Writing Code Was Never The Bottleneck
URL: https://ordep.dev/posts/writing-code-was-never-the-bottleneck/
Date: 2025-06-30
2. Writing More Often
URL: https://ordep.dev/posts/writing-more-often/
Date: 2025-06-26
...
Modern Website Support
html2rss-ai is specifically designed to handle modern web layouts that traditional scrapers struggle with:
CSS Grid & Flexbox Layouts
- Automatically detects: `grid-template-columns`, `grid-cols-*`, and `flex` patterns
- Example: job listings on Satispay, product cards, article grids
- Works with: Tailwind CSS, Bootstrap, and custom CSS frameworks
Complex CSS Selectors
- Auto-sanitizes: problematic selectors like `li.my-1.5.text-md .date`
- Converts to: valid attribute selectors such as `[class~="my-1.5"]`
- Handles: Tailwind's decimal classes and custom naming conventions
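The sanitization step can be sketched as a small regex pass: a class token whose tail continues with `.<digit>` (Tailwind's decimal utilities) must be one class name, because CSS class names cannot start with a digit; such tokens are rewritten as attribute selectors. This is an illustrative sketch under that assumption, not the library's actual implementation, and `sanitize_selector` is a hypothetical helper:

```python
import re

# A "." segment followed by ".<digit>" belongs to the previous class name:
# ".my-1.5" is one Tailwind class ("my-1.5"), not classes "my-1" and "5".
_DECIMAL_CLASS = re.compile(r"\.([A-Za-z_-][\w-]*(?:\.\d[\w-]*)+)")

def sanitize_selector(selector: str) -> str:
    # Rewrite each decimal-class token as a valid attribute selector.
    return _DECIMAL_CLASS.sub(lambda m: f'[class~="{m.group(1)}"]', selector)

print(sanitize_selector("li.my-1.5.text-md .date"))
# li[class~="my-1.5"].text-md .date
```

Ordinary classes like `.text-md` and `.date` pass through untouched, since they are never followed by a `.<digit>` continuation.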
Container-based Content
- Finds all links: in grid containers, card layouts, and list structures
- Before: only the first item per container was extracted
- Now: extracts all items (e.g., 20 job listings instead of 1)
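The "all links per container" behaviour can be illustrated with a stdlib-only sketch: once the parser enters an element whose class list hints at a grid/flex container, it collects every anchor inside it instead of stopping at the first. The class-name heuristic and sample HTML below are assumptions for the demo, not the library's real logic:

```python
from html.parser import HTMLParser

class ContainerLinkExtractor(HTMLParser):
    """Collect every <a href> inside grid/flex-like containers."""

    CONTAINER_HINTS = ("grid", "flex")  # assumed class-name heuristic

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a matching container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "")
        if self.depth or any(h in classes for h in self.CONTAINER_HINTS):
            self.depth += 1
            if tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

html = """
<div class="grid grid-cols-3">
  <a href="/job/1">Backend Engineer</a>
  <a href="/job/2">Data Scientist</a>
  <a href="/job/3">Product Manager</a>
</div>
<a href="/about">About</a>
"""
parser = ContainerLinkExtractor()
parser.feed(html)
print(parser.links)
# ['/job/1', '/job/2', '/job/3']
```

Note how the `/about` link outside the container is ignored, while all three items inside the grid are collected.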
Real-world Examples
Job Listings (Satispay-style):
# Extracts all 20+ job positions from modern job boards
result = await extractor.extract_pattern_links("https://company.com/careers")
# Before: 1 job found
# After: 20+ jobs found โ
E-commerce Product Grids:
# Handles CSS Grid product layouts
result = await extractor.extract_pattern_links("https://shop.com/products")
# Recognizes: grid-cols-3, flex-wrap, card layouts
Blog Post Lists:
# Works with modern CSS frameworks
result = await extractor.extract_pattern_links("https://blog.com/posts")
# Handles: Tailwind, styled-components, CSS modules
API Reference
UniversalPatternExtractor
The main class for extracting content patterns from web pages.
Constructor
UniversalPatternExtractor(
openai_api_key: str | None = None,
cache_dir: str = "pattern_cache"
)
Parameters:
- `openai_api_key`: Your OpenAI API key (defaults to the `OPENAI_API_KEY` environment variable)
- `cache_dir`: Directory to store cached pattern analysis (default: `"pattern_cache"`)
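The documented fallback can be mirrored in a couple of lines; `resolve_api_key` is a hypothetical helper showing the precedence (explicit argument first, then the environment), not the library's own code:

```python
import os

def resolve_api_key(openai_api_key=None):
    # Explicit argument wins; otherwise fall back to the environment variable.
    key = openai_api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise ValueError("Provide openai_api_key or set OPENAI_API_KEY")
    return key

print(resolve_api_key("sk-example"))
# sk-example
```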
Methods
extract_pattern_links(url: str, force_regenerate: bool = False) -> ExtractedPattern
Extract patterned links from a webpage.
Parameters:
- `url`: The webpage URL to extract from
- `force_regenerate`: Force regeneration of the pattern analysis (default: False)
Returns: an `ExtractedPattern` object containing:
- `page_title`: Title of the webpage
- `feed_url`: Original URL
- `pattern`: Pattern analysis information
- `items`: List of extracted `FeedItem` objects
to_json(result: ExtractedPattern) -> dict
Convert extraction result to JSON format.
FeedItem
Represents an extracted content item:
class FeedItem:
    url: str                      # Full URL of the item
    title: str                    # Title/heading of the item
    publication_date: str | None  # Publication date if found
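For reference, the documented shapes of `FeedItem` and `ExtractedPattern` could be mirrored with plain dataclasses. This is a sketch of the data model described above; the library's own classes may differ in detail:

```python
from __future__ import annotations

from dataclasses import dataclass, asdict

@dataclass
class FeedItem:
    url: str
    title: str
    publication_date: str | None = None

@dataclass
class ExtractedPattern:
    page_title: str
    feed_url: str
    pattern: dict
    items: list  # list of FeedItem

def to_json(result: ExtractedPattern) -> dict:
    # Mirrors the documented to_json(result) helper (sketch).
    return {
        "page_title": result.page_title,
        "feed_url": result.feed_url,
        "pattern": result.pattern,
        "items": [asdict(item) for item in result.items],
    }

result = ExtractedPattern(
    page_title="Writing - ordep.dev",
    feed_url="https://ordep.dev/posts/",
    pattern={"pattern_type": "blog_posts", "confidence_score": 0.85},
    items=[FeedItem("https://ordep.dev/posts/writing-more-often/",
                    "Writing More Often", "2025-06-26")],
)
print(to_json(result)["items"][0]["title"])
# Writing More Often
```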
Advanced Usage
Custom Cache Directory
extractor = UniversalPatternExtractor(cache_dir="my_cache")
Force Pattern Regeneration
# Force the AI to re-analyze the page structure
result = await extractor.extract_pattern_links(url, force_regenerate=True)
Direct JSON Output
import json

from html2rss_ai.extractor import extract_pattern_links

# Get JSON directly
json_result = await extract_pattern_links("https://example.com")
print(json.dumps(json_result, indent=2))
Configuration
Environment Variables
OPENAI_API_KEY: Your OpenAI API key (required)
Logging
The library uses Python's logging module. Configure logging level:
import logging
logging.basicConfig(level=logging.INFO)
Troubleshooting
Playwright Issues
If you encounter issues with web content extraction:
- Make sure Playwright browsers are installed: `playwright install`
- Check your Playwright installation: `playwright --version`
- For headless environments (Docker, CI), install system dependencies: `playwright install-deps`

Common issues:
- Permission denied: run `sudo playwright install-deps` on Linux
- Browser not found: ensure you have run `playwright install`
- Timeout errors: some sites may take longer to load; consider increasing timeouts
Examples
Extract News Articles
async def extract_news():
    extractor = UniversalPatternExtractor()
    result = await extractor.extract_pattern_links("https://news-site.com")
    for item in result.items:
        print(f"Article: {item.title}")
        if item.publication_date:
            print(f"  Published: {item.publication_date}")
Extract Product Listings
async def extract_products():
    extractor = UniversalPatternExtractor()
    result = await extractor.extract_pattern_links("https://ecommerce-site.com/products")
    print(f"Found {len(result.items)} products")
    for item in result.items:
        print(f"- {item.title}: {item.url}")
How It Works
- HTML Extraction: Downloads and parses webpage HTML with JavaScript support
- Advanced Structure Analysis:
  - Analyzes link patterns and HTML structure
  - NEW: Detects CSS Grid/Flexbox layouts (`grid-cols-*`, `flex`, etc.)
  - NEW: Identifies modern CSS frameworks (Tailwind, Bootstrap)
- Enhanced AI Pattern Recognition:
  - Uses OpenAI GPT-4 with improved prompts for modern layouts
  - NEW: Recognizes non-semantic structures (divs with CSS classes)
  - NEW: Understands container-based content organization
- Smart Pattern Caching: Caches successful patterns for 7-day reuse
- Robust Content Extraction:
  - NEW: CSS selector sanitization (`my-1.5` → `[class~="my-1.5"]`)
  - NEW: Multiple link extraction per container
  - NEW: Fallback strategies for complex selectors
- Advanced Date Extraction:
  - NEW: Sanitized date selectors with retry logic
  - Multiple date format support with fallback patterns
- Structured Output: Returns JSON with URLs, titles, dates, and confidence scores
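The 7-day pattern reuse above can be sketched as a file-per-URL cache with an mtime-based TTL check. The file layout and JSON format here are illustrative assumptions, not the library's actual cache format:

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

CACHE_TTL = 7 * 24 * 3600  # seven days, matching the documented reuse window

def cache_path(cache_dir, url):
    # One JSON file per URL, keyed by a stable hash of the URL.
    return Path(cache_dir) / (hashlib.sha256(url.encode()).hexdigest() + ".json")

def save_pattern(cache_dir, url, pattern):
    path = cache_path(cache_dir, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(pattern))

def load_cached_pattern(cache_dir, url):
    path = cache_path(cache_dir, url)
    if not path.exists():
        return None
    if time.time() - path.stat().st_mtime > CACHE_TTL:
        return None  # stale entry: trigger a fresh AI analysis
    return json.loads(path.read_text())

cache_dir = tempfile.mkdtemp()
save_pattern(cache_dir, "https://example.com", {"pattern_type": "blog_posts"})
print(load_cached_pattern(cache_dir, "https://example.com"))
# {'pattern_type': 'blog_posts'}
```

Expiry is checked lazily on read, so a stale file simply falls through to a new analysis and gets overwritten.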
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with OpenAI GPT models
- Uses BeautifulSoup for HTML parsing
- Uses Playwright for web content extraction
- Inspired by RSS feed generation tools
File details
Details for the file html2rss_ai-0.0.1.tar.gz.
File metadata
- Download URL: html2rss_ai-0.0.1.tar.gz
- Upload date:
- Size: 29.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 212acca357d89866c34b54820a0f24baa0f1405d75ec3c50c015f5bcd5e2ae26 |
| MD5 | ceb2014d07fd2dc2617323a32ac1231b |
| BLAKE2b-256 | 6f485fdbc2b5a6a96b98fd567f572915221b2a21a147716be8a2d6a9c1af775a |
Provenance
The following attestation bundles were made for html2rss_ai-0.0.1.tar.gz:

Publisher: release.yml on mazzasaverio/html2rss-ai

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: html2rss_ai-0.0.1.tar.gz
- Subject digest: 212acca357d89866c34b54820a0f24baa0f1405d75ec3c50c015f5bcd5e2ae26
- Sigstore transparency entry: 264700305
- Sigstore integration time:
- Permalink: mazzasaverio/html2rss-ai@80209cf2946b8595e44d5bdd9d2c9c7a59f76e19
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/mazzasaverio
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@80209cf2946b8595e44d5bdd9d2c9c7a59f76e19
- Trigger Event: push
File details
Details for the file html2rss_ai-0.0.1-py3-none-any.whl.
File metadata
- Download URL: html2rss_ai-0.0.1-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 50b742e0ca07aaceb905cc8e1ad0d836c41b373e7398dbfbc47d77b8e86542e3 |
| MD5 | 001e2162ee1d90f3985877db465bf462 |
| BLAKE2b-256 | 384911c84dbf15896396e75a238e9f3b9e68498e19245e5bcaafd4b7402ddf5f |
Provenance
The following attestation bundles were made for html2rss_ai-0.0.1-py3-none-any.whl:

Publisher: release.yml on mazzasaverio/html2rss-ai

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: html2rss_ai-0.0.1-py3-none-any.whl
- Subject digest: 50b742e0ca07aaceb905cc8e1ad0d836c41b373e7398dbfbc47d77b8e86542e3
- Sigstore transparency entry: 264700307
- Sigstore integration time:
- Permalink: mazzasaverio/html2rss-ai@80209cf2946b8595e44d5bdd9d2c9c7a59f76e19
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/mazzasaverio
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@80209cf2946b8595e44d5bdd9d2c9c7a59f76e19
- Trigger Event: push