🚀 AI-powered web scraping with modern CSS support. Extracts content from websites using GPT-4 and automatically handles CSS Grid/Flexbox layouts, Tailwind CSS, and complex selectors.

HTML2RSS AI

AI-powered universal article extractor that automatically detects and extracts article patterns from any website using OpenAI's GPT models.

Features

  • 🤖 AI-Powered Pattern Detection: Automatically analyzes webpage structure to find article links
  • 💾 Smart Caching: Saves patterns for reuse, reducing API calls and improving performance
  • 🐳 Docker Ready: Fully containerized with persistent storage
  • 📊 Structured Output: Exports clean JSON with URLs, titles, and metadata
  • ⚡ Fast & Reliable: Handles large article listings efficiently
  • 🔄 Force Regeneration: Option to refresh patterns when websites change
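
The caching and force-regeneration features above can be pictured with a minimal sketch. It is not the project's implementation: the `PatternCache` class, the one-JSON-file-per-host layout, and the `analyze` callback (standing in for the GPT-based analysis) are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse


class PatternCache:
    """Hypothetical on-disk cache: one JSON file per host (assumed layout)."""

    def __init__(self, cache_dir: str) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url: str) -> Path:
        # Key the cache on the host so every page of a site shares a pattern.
        host = urlparse(url).netloc.replace(":", "_")
        return self.cache_dir / f"{host}.json"

    def get_or_analyze(
        self,
        url: str,
        analyze: Callable[[str], dict],
        regenerate: bool = False,
    ) -> dict:
        path = self._path_for(url)
        # Cache hit: reuse the stored pattern and skip the (paid) analysis.
        if path.exists() and not regenerate:
            return json.loads(path.read_text(encoding="utf-8"))
        # Cache miss or forced refresh: analyze again and overwrite the file.
        pattern = analyze(url)
        path.write_text(json.dumps(pattern), encoding="utf-8")
        return pattern
```

In this sketch, passing `regenerate=True` plays the role of the `--regenerate` flag: the stored pattern is ignored and replaced.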

Quick Start

🐳 Docker (Recommended)

  1. Clone and setup:
git clone <repository-url>
cd html2rss-ai
cp .env.example .env
# Edit .env and set your OPENAI_API_KEY
  2. Extract articles:
# Save articles to JSON file
docker compose run --rm html2rss-ai --save "https://example.com/blog"

# Print JSON to stdout (no file saved)
docker compose run --rm html2rss-ai "https://example.com/blog"

# Force pattern regeneration
docker compose run --rm html2rss-ai --save --regenerate "https://example.com/blog"
  3. Access results:
  • Output files: ./data/output/
  • Pattern cache: ./pattern_cache/

📦 Python Package

  1. Install:
pip install html2rss-ai
  2. Use:
export OPENAI_API_KEY="your-api-key"
html2rss-ai --save "https://example.com/blog"

Usage Examples

Basic Extraction

# Extract Paul Graham's essays
docker compose run --rm html2rss-ai --save "https://www.paulgraham.com/articles.html"

Batch Processing

# Multiple sites
for url in "https://blog.example.com" "https://news.example.org"; do
  docker compose run --rm html2rss-ai --save "$url"
done
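
For longer URL lists, the same loop can be driven from Python. The sketch below relies only on the `docker compose run --rm html2rss-ai --save <url>` invocation shown above. The helper names `build_command` and `run_batch` are hypothetical, and continuing past a failed URL is a choice made for this sketch, not documented behaviour.

```python
import subprocess
from typing import Dict, Iterable, List


def build_command(url: str, save: bool = True) -> List[str]:
    """Build the argument list for one extraction run."""
    cmd = ["docker", "compose", "run", "--rm", "html2rss-ai"]
    if save:
        cmd.append("--save")
    cmd.append(url)
    return cmd


def run_batch(urls: Iterable[str]) -> Dict[str, int]:
    """Run each URL in turn and record its exit code instead of aborting."""
    results: Dict[str, int] = {}
    for url in urls:
        # check=False: one failing site should not stop the remaining runs.
        completed = subprocess.run(build_command(url), check=False)
        results[url] = completed.returncode
    return results
```

Calling `run_batch(["https://blog.example.com", "https://news.example.org"])` from the project directory would mirror the shell loop and return a mapping of URL to exit code.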

Custom Directories

Option 1: CLI Arguments (Recommended)

# Docker with custom paths
docker compose run --rm html2rss-ai \
  --output-dir /app/custom/output \
  --pattern-cache-dir /app/custom/cache \
  --save "https://example.com"

# Local Python with custom paths  
html2rss-ai \
  --output-dir ./my-output \
  --pattern-cache-dir ./my-cache \
  --save "https://example.com"

Option 2: Environment Variables

# Override default paths via environment
OUTPUT_DIR=/custom/output PATTERN_CACHE_DIR=/custom/cache \
  html2rss-ai --save "https://example.com"

Configuration

Environment Variables

Variable            Default         Description
OPENAI_API_KEY      (required)      Your OpenAI API key
OUTPUT_DIR          data/output     Directory for JSON output files
PATTERN_CACHE_DIR   pattern_cache   Directory for cached patterns
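
As a rough illustration of how these settings could be resolved, the sketch below falls back from a CLI value to the environment variable to the documented default. The precedence order (CLI over environment) is an assumption, as are the helper names `resolve_dir` and `require_api_key`; the tool's actual logic may differ.

```python
import os
from typing import Mapping, Optional

# Defaults taken from the table above.
DEFAULTS = {"OUTPUT_DIR": "data/output", "PATTERN_CACHE_DIR": "pattern_cache"}


def resolve_dir(
    name: str,
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a directory: CLI value, then environment, then default (assumed order)."""
    env = os.environ if env is None else env
    if cli_value:
        return cli_value
    return env.get(name) or DEFAULTS[name]


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Fail early with a clear message when the required key is missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```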

CLI Arguments

# See all available options
docker compose run --rm html2rss-ai --help

# Main arguments:
--output-dir TEXT           Directory to save extracted JSON output files
--pattern-cache-dir TEXT    Directory to store pattern cache files
--regenerate                Force regeneration of pattern analysis
--save                      Save output to file instead of printing to stdout
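
To make the flag semantics concrete, here is an `argparse` mirror of the four documented options plus the positional URL used in the examples. This is only a sketch: the real CLI may be built with a different library and may accept further options, so treat `build_parser` as a hypothetical stand-in.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror of the documented options (not the project's actual parser)."""
    parser = argparse.ArgumentParser(prog="html2rss-ai")
    parser.add_argument("url", help="Page that lists the articles")
    parser.add_argument("--output-dir", default=None,
                        help="Directory to save extracted JSON output files")
    parser.add_argument("--pattern-cache-dir", default=None,
                        help="Directory to store pattern cache files")
    parser.add_argument("--regenerate", action="store_true",
                        help="Force regeneration of pattern analysis")
    parser.add_argument("--save", action="store_true",
                        help="Save output to file instead of printing to stdout")
    return parser
```

Both `--regenerate` and `--save` are plain switches that default to off, which matches the examples where omitting `--save` prints JSON to stdout.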

Docker Environment

The Docker setup uses:

  • Host directories: ./data/output/ and ./pattern_cache/
  • Container paths: /app/data/output/ and /app/pattern_cache/
  • User mapping: Runs as UID/GID 1000 to avoid permission issues on the mounted host directories
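
The host/container pairing above can be expressed as a small lookup, which helps when locating a file the container reports by its internal path. The sketch assumes only the two mounts listed here; `to_host_path` is a hypothetical helper, not part of the package.

```python
from pathlib import PurePosixPath

# Container path -> host path, as listed above.
MOUNTS = {
    PurePosixPath("/app/data/output"): PurePosixPath("./data/output"),
    PurePosixPath("/app/pattern_cache"): PurePosixPath("./pattern_cache"),
}


def to_host_path(container_path: str) -> str:
    """Translate a path inside the container to its location on the host."""
    path = PurePosixPath(container_path)
    for container_root, host_root in MOUNTS.items():
        try:
            relative = path.relative_to(container_root)
        except ValueError:
            continue  # Not under this mount; try the next one.
        return str(host_root / relative)
    raise ValueError(f"{container_path} is not under a mounted directory")
```

A path that falls outside both mounts raises `ValueError` here. With `--rm`, files written to such a path would be discarded along with the container unless the Compose file mounts that location as well.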

Output Format

{
  "links": [
    {
      "url": "https://example.com/article-1",
      "title": "Article Title",
      "selector_used": "h2 > a"
    }
  ],
  "total_found": 42,
  "pattern_used": "articles",
  "confidence": 0.95,
  "base_url": "https://example.com/blog",
  "pattern_analysis": {
    "pattern_type": "articles",
    "primary_selectors": ["h2 > a"],
    "confidence_score": 0.95
  }
}
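
A short sketch of consuming this output from Python follows. It uses only fields shown in the example above (`links`, `url`, `title`, `confidence`). The `load_links` helper and its `min_confidence` threshold are illustrative additions, and `SAMPLE` is a trimmed copy of the example rather than real tool output.

```python
import json
from typing import Dict, List

# Trimmed copy of the documented example, used as stand-in input.
SAMPLE = """
{
  "links": [
    {
      "url": "https://example.com/article-1",
      "title": "Article Title",
      "selector_used": "h2 > a"
    }
  ],
  "confidence": 0.95,
  "base_url": "https://example.com/blog"
}
"""


def load_links(raw: str, min_confidence: float = 0.0) -> List[Dict[str, str]]:
    """Return unique {url, title} pairs, or nothing if confidence is too low."""
    data = json.loads(raw)
    if data.get("confidence", 0.0) < min_confidence:
        return []
    seen = set()
    links: List[Dict[str, str]] = []
    for link in data.get("links", []):
        url = link.get("url")
        # Skip entries without a URL and drop duplicates, keeping first-seen order.
        if url and url not in seen:
            seen.add(url)
            links.append({"url": url, "title": link.get("title", "")})
    return links
```

To process a saved file instead, read its text (for example with `pathlib.Path(...).read_text()`) and pass it to `load_links`.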

Development

Build Docker Image

# Build with Docker Compose (creates html2rss-ai:latest)
docker compose build

# Or build directly with custom tag
docker build -t html2rss-ai:v1.0 .

Install for Development

pip install -e ".[playwright]"
playwright install chromium

Run Tests

pytest tests/

Requirements

  • OpenAI API key: Must have access to GPT-3.5 or GPT-4 models, which are used for pattern analysis
  • Docker (recommended) or Python 3.8+
  • Internet connection: For webpage scraping and API calls

License

MIT License - see LICENSE file.

Support

  • 🐛 Issues: Report bugs via GitHub Issues
  • 💡 Features: Suggest improvements via GitHub Discussions
  • 📧 Contact: [Your contact info]

Powered by OpenAI GPT and built with ❤️ for the RSS community.

Download files

Source Distribution

html2rss_ai-0.3.0.tar.gz (21.7 kB)

Uploaded Source

Built Distribution

html2rss_ai-0.3.0-py3-none-any.whl (15.8 kB)

Uploaded Python 3

File details

Details for the file html2rss_ai-0.3.0.tar.gz.

File metadata

  • Download URL: html2rss_ai-0.3.0.tar.gz
  • Size: 21.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for html2rss_ai-0.3.0.tar.gz
Algorithm     Hash digest
SHA256        0bf17802de685bc6e1f39110681eae8751f7f58ce0b0d27a2f09b25da0fb31c4
MD5           64d2419027a11802784283a0ea42bafa
BLAKE2b-256   a7c843d34580339042cefe22931344cc1f4427bc767df013a20b72802e821c89

Provenance

The following attestation bundles were made for html2rss_ai-0.3.0.tar.gz:

Publisher: release.yml on mazzasaverio/html2rss-ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file html2rss_ai-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: html2rss_ai-0.3.0-py3-none-any.whl
  • Size: 15.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for html2rss_ai-0.3.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        9ace4b02ac3487cab226f916aaa2f0ea162ac45c64c50ceeb542d15e8089b112
MD5           b3cdcb4fb23ad988a60572a0853e04a2
BLAKE2b-256   52162d888646f44d4fb10ec5b675b7179d1f17b3f74b15b6305ee0962f3645d9

Provenance

The following attestation bundles were made for html2rss_ai-0.3.0-py3-none-any.whl:

Publisher: release.yml on mazzasaverio/html2rss-ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
