
🚀 AI-powered web scraping with modern CSS support. Extract content from any website using GPT-4; handles CSS Grid/Flexbox layouts, Tailwind CSS, and complex selectors automatically.

Project description

HTML2RSS AI

AI-powered universal article extractor that automatically detects and extracts article patterns from any website using OpenAI's GPT models.

Features

  • 🤖 AI-Powered Pattern Detection: Automatically analyzes webpage structure to find article links
  • 💾 Smart Caching: Saves patterns for reuse, reducing API calls and improving performance
  • 🐳 Docker Ready: Fully containerized with persistent storage
  • 📊 Structured Output: Exports clean JSON with URLs, titles, and metadata
  • ⚡ Fast & Reliable: Handles large article listings efficiently
  • 🔄 Force Regeneration: Option to refresh patterns when websites change
  • 📋 Batch Processing: Process multiple URLs with JSON configuration files

Quick Start

🐳 Docker (Recommended)

  1. Clone and setup:
git clone <repository-url>
cd html2rss-ai
cp .env.example .env
# Edit .env and set your OPENAI_API_KEY
  2. Create batch configuration:
# Create config/batch_config.json with your URLs
{
  "urls": [
    {
      "url": "https://example.com/blog",
      "output_dir": "data/output/example",
      "force_regenerate": false
    }
  ]
}
  3. Run batch processing:
# Process all URLs in batch_config.json
docker compose run --rm html2rss-ai

# Use custom configuration file
docker compose run --rm html2rss-ai /app/config/my_config.json
  4. Access results:
  • Output files: ./data/output/
  • Pattern cache: ./pattern_cache/

📦 Python Package

  1. Install:
pip install html2rss-ai
  2. Create a JSON configuration file (config.json):
{
  "urls": [
    {
      "url": "https://example.com/blog",
      "output_dir": "output",
      "force_regenerate": false
    }
  ]
}
  3. Run batch processing:
export OPENAI_API_KEY="your-api-key"
python -m html2rss_ai.batch_processor config.json
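The configuration file from the steps above can also be generated programmatically. A minimal sketch using only the standard library (the file name config.json matches the step above; the entry values are the documented example values):

```python
import json
from pathlib import Path

# Build the batch configuration described above. Every entry needs a
# "url"; "output_dir" and "force_regenerate" mirror the documented keys.
config = {
    "urls": [
        {
            "url": "https://example.com/blog",
            "output_dir": "output",
            "force_regenerate": False,
        }
    ]
}

# Write it where the batch processor expects to find it.
Path("config.json").write_text(json.dumps(config, indent=2))
```

After writing the file, run `python -m html2rss_ai.batch_processor config.json` as shown above.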

Usage Examples

Basic Batch Processing

# Create configuration with Paul Graham's essays
echo '{
  "urls": [
    {
      "url": "https://www.paulgraham.com/articles.html",
      "output_dir": "data/output/paulgraham",
      "force_regenerate": false
    }
  ]
}' > config/paulgraham.json

# Process with Docker
docker compose run --rm html2rss-ai /app/config/paulgraham.json

Batch Processing with JSON Configuration

JSON-based Batch Processing (recommended for multiple URLs)

  1. Create a configuration file (config/batch_config.json):
{
  "urls": [
    {
      "url": "https://www.paulgraham.com/articles.html",
      "output_dir": "data/output",
      "force_regenerate": false
    },
    {
      "url": "https://news.ycombinator.com",
      "output_dir": "data/output/hn",
      "force_regenerate": true
    }
  ]
}
  2. Run batch processing:
# Build and run the batch processor
docker compose build html2rss-ai
docker compose run --rm html2rss-ai

# With custom configuration
docker compose run --rm html2rss-ai /app/config/my_config.json

# With error handling options
docker compose run --rm html2rss-ai /app/config/batch_config.json --continue-on-error

📖 Complete Batch Processing Guide (docs/BATCH_PROCESSING.md) - detailed documentation with all configuration options.

Custom Configuration

All settings are configured through the JSON configuration file:

{
  "urls": [
    {
      "url": "https://example.com/blog",
      "output_dir": "data/output/custom",
      "pattern_cache_dir": "pattern_cache/custom", 
      "force_regenerate": false,
      "save_output": true
    }
  ]
}

Configuration

Batch Processing Configuration

For processing multiple URLs, create a JSON configuration file:

{
  "urls": [
    {
      "url": "https://example.com/blog",
      "output_dir": "data/output/example", 
      "pattern_cache_dir": "pattern_cache/example",
      "force_regenerate": false,
      "save_output": true
    }
  ]
}

See docs/BATCH_PROCESSING.md for complete configuration options.
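A small validator for this configuration shape can catch mistakes before a batch run. The sketch below is illustrative and not part of the package; the key names follow the example above, and the fallback defaults are assumptions (check docs/BATCH_PROCESSING.md for the authoritative ones):

```python
def validate_batch_config(config: dict) -> list[dict]:
    """Return normalized URL entries, raising on malformed input.

    Only "url" is required; the remaining keys fall back to defaults
    (assumed here -- see docs/BATCH_PROCESSING.md for the real ones).
    """
    entries = config.get("urls")
    if not isinstance(entries, list) or not entries:
        raise ValueError('config must contain a non-empty "urls" list')
    normalized = []
    for entry in entries:
        if "url" not in entry:
            raise ValueError(f"entry missing required 'url': {entry}")
        normalized.append({
            "url": entry["url"],
            "output_dir": entry.get("output_dir", "data/output"),
            "pattern_cache_dir": entry.get("pattern_cache_dir", "pattern_cache"),
            "force_regenerate": entry.get("force_regenerate", False),
            "save_output": entry.get("save_output", True),
        })
    return normalized
```

Running the validator before handing the file to the batch processor turns a mid-run failure into an immediate, descriptive error.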

Environment Variables

Variable            Default         Description
------------------  --------------  -------------------------------
OPENAI_API_KEY      (required)      Your OpenAI API key
OUTPUT_DIR          data/output     Directory for JSON output files
PATTERN_CACHE_DIR   pattern_cache   Directory for cached patterns
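Reading these variables with their documented defaults takes one line each; a sketch:

```python
import os

# Defaults mirror the table above; OPENAI_API_KEY has no default.
api_key = os.environ.get("OPENAI_API_KEY")
output_dir = os.environ.get("OUTPUT_DIR", "data/output")
pattern_cache_dir = os.environ.get("PATTERN_CACHE_DIR", "pattern_cache")

if api_key is None:
    # The key is required; a real run should fail fast here.
    print("warning: OPENAI_API_KEY is not set")
```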

Batch Processor Arguments

# See all available options
docker compose run --rm html2rss-ai --help

# Main arguments:
config_file                Path to JSON configuration file (required)
--continue-on-error        Continue processing even if some URLs fail  
--log-level LEVEL          Set logging level (DEBUG, INFO, WARNING, ERROR)
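A hypothetical argparse definition matching the documented arguments; this illustrates the CLI surface and is not the package's actual source:

```python
import argparse

# Mirror the arguments documented above (sketch, not the real parser).
parser = argparse.ArgumentParser(prog="html2rss_ai.batch_processor")
parser.add_argument("config_file",
                    help="Path to JSON configuration file (required)")
parser.add_argument("--continue-on-error", action="store_true",
                    help="Continue processing even if some URLs fail")
parser.add_argument("--log-level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                    help="Set logging level")

# Example invocation, matching the Docker command shown earlier:
args = parser.parse_args(["config/batch_config.json", "--continue-on-error"])
```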

Docker Environment

The Docker setup uses:

  • Host directories: ./data/output/ and ./pattern_cache/
  • Container paths: /app/data/output/ and /app/pattern_cache/
  • User mapping: Runs as UID/GID 1000 to avoid permission issues

Output Format

{
  "links": [
    {
      "url": "https://example.com/article-1",
      "title": "Article Title",
      "selector_used": "h2 > a"
    }
  ],
  "total_found": 42,
  "pattern_used": "articles",
  "confidence": 0.95,
  "base_url": "https://example.com/blog",
  "pattern_analysis": {
    "pattern_type": "articles",
    "primary_selectors": ["h2 > a"],
    "confidence_score": 0.95
  }
}
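Downstream code can consume this output with nothing but the standard library. A sketch that keeps results only when the pattern confidence clears a threshold (the 0.9 cutoff is an arbitrary choice for illustration):

```python
import json

# Sample output in the shape documented above (abridged).
raw = """
{
  "links": [
    {"url": "https://example.com/article-1",
     "title": "Article Title",
     "selector_used": "h2 > a"}
  ],
  "total_found": 42,
  "pattern_used": "articles",
  "confidence": 0.95,
  "base_url": "https://example.com/blog"
}
"""

result = json.loads(raw)
if result["confidence"] >= 0.9:  # arbitrary quality gate
    titles = [link["title"] for link in result["links"]]
else:
    titles = []
```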

Development

Build Docker Image

# Build with Docker Compose (creates html2rss-ai:latest)
docker compose build

# Or build directly with custom tag
docker build -t html2rss-ai:v1.0 .

Install for Development

pip install -e ".[playwright]"
playwright install chromium

Run Tests

pytest tests/

Requirements

  • OpenAI API Key: GPT-3.5/4 access for pattern analysis
  • Docker (recommended) or Python 3.8+
  • Internet connection: For webpage scraping and API calls

License

MIT License - see LICENSE file.

Support

  • 🐛 Issues: Report bugs via GitHub Issues
  • 💡 Features: Suggest improvements via GitHub Discussions
  • 📧 Contact: [Your contact info]


Download files

Download the file for your platform.

Source Distribution

html2rss_ai-0.3.1.tar.gz (26.5 kB)

Built Distribution


html2rss_ai-0.3.1-py3-none-any.whl (17.0 kB)

File details

Details for the file html2rss_ai-0.3.1.tar.gz.

File metadata

  • Download URL: html2rss_ai-0.3.1.tar.gz
  • Size: 26.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for html2rss_ai-0.3.1.tar.gz
Algorithm     Hash digest
SHA256        034f363aa8d1ad2b6900d7e06e2fdd29170636df102911c0ecdba97991ea79c3
MD5           990807b86f33438d04270a378c52216b
BLAKE2b-256   ecd9bdd7f552fc1c591aea0540fdb16c4057c1e5e6cc4b95c3608d564cb4a05d

See the PyPI documentation for details on using file hashes.
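Verifying a downloaded archive against the SHA256 digest above can be done with hashlib; a sketch (the archive path in the comment is a placeholder for wherever you saved the file):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large archives don't load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage against the downloaded archive (placeholder path):
# expected = "034f363aa8d1ad2b6900d7e06e2fdd29170636df102911c0ecdba97991ea79c3"
# assert sha256_of(Path("html2rss_ai-0.3.1.tar.gz")) == expected
```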

Provenance

The following attestation bundles were made for html2rss_ai-0.3.1.tar.gz:

Publisher: release.yml on mazzasaverio/html2rss-ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file html2rss_ai-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: html2rss_ai-0.3.1-py3-none-any.whl
  • Size: 17.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for html2rss_ai-0.3.1-py3-none-any.whl
Algorithm     Hash digest
SHA256        b1a7e261b1a7cb00f95819f36926dbdfec9e7053e5149dc7a4aa767565a05a55
MD5           3faf8489649446e822a6f6cbd09d3fdf
BLAKE2b-256   46f3d049d1e568b94648ce4fc8300da09e27d4e38a8d6c39fef5a7aca8dfbfc9

See the PyPI documentation for details on using file hashes.

Provenance

The following attestation bundles were made for html2rss_ai-0.3.1-py3-none-any.whl:

Publisher: release.yml on mazzasaverio/html2rss-ai

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
