🚀 AI-powered web scraping with modern CSS support. Extracts content from any website using GPT-4 and automatically handles CSS Grid/Flexbox layouts, Tailwind CSS, and complex selectors.
HTML2RSS AI
AI-powered universal article extractor that automatically detects and extracts article patterns from any website using OpenAI's GPT models.
Features
- 🤖 AI-Powered Pattern Detection: Automatically analyzes webpage structure to find article links
- 💾 Smart Caching: Saves patterns for reuse, reducing API calls and improving performance
- 🐳 Docker Ready: Fully containerized with persistent storage
- 📊 Structured Output: Exports clean JSON with URLs, titles, and metadata
- ⚡ Fast & Reliable: Handles large article listings efficiently
- 🔄 Force Regeneration: Option to refresh patterns when websites change
Quick Start
🐳 Docker (Recommended)
- Clone and setup:
git clone <repository-url>
cd html2rss-ai
cp .env.example .env
# Edit .env and set your OPENAI_API_KEY
- Extract articles:
# Save articles to JSON file
docker compose run --rm html2rss-ai --save "https://example.com/blog"
# Print JSON to stdout (no file saved)
docker compose run --rm html2rss-ai "https://example.com/blog"
# Force pattern regeneration
docker compose run --rm html2rss-ai --save --regenerate "https://example.com/blog"
- Access results:
  - Output files: ./data/output/
  - Pattern cache: ./pattern_cache/
📦 Python Package
- Install:
pip install html2rss-ai
- Use:
export OPENAI_API_KEY="your-api-key"
html2rss-ai --save "https://example.com/blog"
Usage Examples
Basic Extraction
# Extract Paul Graham's essays
docker compose run --rm html2rss-ai --save "https://www.paulgraham.com/articles.html"
Batch Processing
# Multiple sites
for url in "https://blog.example.com" "https://news.example.org"; do
  docker compose run --rm html2rss-ai --save "$url"
done
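The same batch can be driven from Python, for example to collect an exit code per site instead of stopping at the first failure. This is a minimal sketch that assumes the Docker Compose service name and the `--save` and `--regenerate` flags shown in this README; `build_command` and `run_batch` are hypothetical helper names and are not part of html2rss-ai.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def build_command(url: str, regenerate: bool = False) -> List[str]:
    """Build the `docker compose run` invocation from this README for one URL."""
    cmd = ["docker", "compose", "run", "--rm", "html2rss-ai", "--save"]
    if regenerate:
        # Ask the tool to refresh its cached pattern for this site.
        cmd.append("--regenerate")
    cmd.append(url)
    return cmd


def run_batch(
    urls: Iterable[str],
    runner: Callable[[List[str]], int] = subprocess.call,
) -> Dict[str, int]:
    """Run one extraction per URL and return the exit code for each URL.

    `runner` is injectable so the loop can be exercised without Docker.
    """
    return {url: runner(build_command(url)) for url in urls}
```

Calling `run_batch(["https://blog.example.com", "https://news.example.org"])` runs both extractions in sequence; any URL whose exit code is non-zero can then be retried with `build_command(url, regenerate=True)`.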
Custom Directories
Option 1: CLI Arguments (Recommended)
# Docker with custom paths
docker compose run --rm html2rss-ai \
  --output-dir /app/custom/output \
  --pattern-cache-dir /app/custom/cache \
  --save "https://example.com"
# Local Python with custom paths
html2rss-ai \
  --output-dir ./my-output \
  --pattern-cache-dir ./my-cache \
  --save "https://example.com"
Option 2: Environment Variables
# Override default paths via environment
OUTPUT_DIR=/custom/output PATTERN_CACHE_DIR=/custom/cache \
html2rss-ai --save "https://example.com"
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | Your OpenAI API key |
| OUTPUT_DIR | data/output | Directory for JSON output files |
| PATTERN_CACHE_DIR | pattern_cache | Directory for cached patterns |
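A wrapper script may want to resolve these settings the same way before invoking the tool. The sketch below is illustrative only and is not html2rss-ai's actual code: `resolve_dir` and `require_api_key` are hypothetical helpers, and the assumption that a CLI value takes precedence over the environment variable is not stated in this README.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Defaults copied from the table above.
DEFAULTS = {"OUTPUT_DIR": "data/output", "PATTERN_CACHE_DIR": "pattern_cache"}


def resolve_dir(
    name: str,
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick a directory: CLI value first, then environment, then the default.

    The CLI-over-environment order is an assumption made for illustration.
    """
    env = os.environ if env is None else env
    return Path(cli_value or env.get(name) or DEFAULTS[name])


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY, failing early if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is required but not set")
    return key
```

Passing `env` explicitly keeps the helpers easy to test; omitting it falls back to the real process environment.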
CLI Arguments
# See all available options
docker compose run --rm html2rss-ai --help
# Main arguments:
--output-dir TEXT Directory to save extracted JSON output files
--pattern-cache-dir TEXT Directory to store pattern cache files
--regenerate Force regeneration of pattern analysis
--save Save output to file instead of printing to stdout
Docker Environment
The Docker setup uses:
- Host directories: ./data/output/ and ./pattern_cache/
- Container paths: /app/data/output/ and /app/pattern_cache/
- User mapping: Runs as UID/GID 1000 to avoid permission issues
Output Format
{
  "links": [
    {
      "url": "https://example.com/article-1",
      "title": "Article Title",
      "selector_used": "h2 > a"
    }
  ],
  "total_found": 42,
  "pattern_used": "articles",
  "confidence": 0.95,
  "base_url": "https://example.com/blog",
  "pattern_analysis": {
    "pattern_type": "articles",
    "primary_selectors": ["h2 > a"],
    "confidence_score": 0.95
  }
}
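Since the output is plain JSON, it can be turned into a feed with the Python standard library alone. The sketch below assumes only the field names shown in the example above (`links`, `url`, `title`, `total_found`, `pattern_used`, `base_url`); `to_rss` is a hypothetical helper, not a function provided by html2rss-ai, and `SAMPLE` is a trimmed stand-in for a saved output file.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed stand-in for a file saved with --save (shape taken from this README).
SAMPLE = """
{
  "links": [
    {"url": "https://example.com/article-1",
     "title": "Article Title",
     "selector_used": "h2 > a"}
  ],
  "total_found": 1,
  "pattern_used": "articles",
  "confidence": 0.95,
  "base_url": "https://example.com/blog"
}
"""


def to_rss(result: dict, feed_title: str = "Extracted articles") -> str:
    """Render an extraction result as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = result["base_url"]
    ET.SubElement(channel, "description").text = (
        f"{result['total_found']} links, pattern '{result['pattern_used']}'"
    )
    for link in result["links"]:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = link["title"]
        ET.SubElement(item, "link").text = link["url"]
        # The article URL doubles as a stable identifier for feed readers.
        ET.SubElement(item, "guid").text = link["url"]
    return ET.tostring(rss, encoding="unicode")


feed_xml = to_rss(json.loads(SAMPLE))
```

For a real run, replace `json.loads(SAMPLE)` with `json.load(...)` on a file from the output directory. `ElementTree` escapes titles and URLs, so characters such as `&` in article titles do not break the feed.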
Development
Build Docker Image
# Build with Docker Compose (creates html2rss-ai:latest)
docker compose build
# Or build directly with custom tag
docker build -t html2rss-ai:v1.0 .
Install for Development
pip install -e ".[playwright]"
playwright install chromium
Run Tests
pytest tests/
Requirements
- OpenAI API Key: GPT-3.5/4 access for pattern analysis
- Docker (recommended) or Python 3.8+
- Internet connection: For webpage scraping and API calls
License
MIT License - see LICENSE file.
Support
- 🐛 Issues: Report bugs via GitHub Issues
- 💡 Features: Suggest improvements via GitHub Discussions
- 📧 Contact: [Your contact info]
Powered by OpenAI GPT and built with ❤️ for the RSS community.