
A Python library to scrape web data using LLMs and Selenium


Scrape-AI

Scrape-AI is a Python library designed to intelligently scrape data from websites using a combination of LLMs (Large Language Models) and Selenium for dynamic web interactions. It allows you to configure the library to use a specific LLM (such as OpenAI, Anthropic, Azure OpenAI, etc.) and fetch data from websites in real time based on a user query. At its core, it provides agent-like scraping driven by natural-language queries.

Key Features

  • LLM Integration: Supports multiple LLM models (OpenAI, Anthropic, Azure, Google, etc.) through a flexible factory pattern.
  • Dynamic Web Scraping: Utilizes Selenium WebDriver to interact with dynamic content on websites.
  • Agent-Like Functionality: Acts as an intelligent agent that can process user queries and fetch relevant data from the web.
  • Configurable: Customizable LLM model settings, verbosity, headless browsing, and target URLs.
  • Modular Design: Structured in a modular way to extend scraping strategies and LLM integrations.
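The factory-pattern idea behind the LLM integration can be sketched roughly as follows. This is a minimal illustration, not scrapeAI's actual classes: `BaseLLM`, `OpenAILLM`, `AnthropicLLM`, and `llm_factory` here are simplified stand-ins.

```python
# Minimal sketch of an LLM factory pattern; all names here are
# illustrative stand-ins, not scrapeAI's actual API.
from abc import ABC, abstractmethod


class BaseLLM(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class OpenAILLM(BaseLLM):
    def complete(self, prompt: str) -> str:
        return f"[openai] {prompt}"  # real code would call the OpenAI API


class AnthropicLLM(BaseLLM):
    def complete(self, prompt: str) -> str:
        return f"[anthropic] {prompt}"  # real code would call Anthropic


_REGISTRY = {"openai": OpenAILLM, "anthropic": AnthropicLLM}


def llm_factory(provider: str) -> BaseLLM:
    """Return an LLM client for the given provider name."""
    try:
        return _REGISTRY[provider]()
    except KeyError:
        raise ValueError(f"Unsupported provider: {provider}")
```

The benefit of this design is that adding a new provider only means registering one new class, which matches the "Modular Design" goal above.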

Installation

  1. Clone the repository or install via pip (if available on PyPI):
pip install scrapeAI
  2. Selenium WebDriver Dependencies: Selenium requires a browser driver to interact with the chosen browser. For example, if you're using Chrome, you need to install ChromeDriver. Make sure the driver is placed in your system's PATH (e.g. /usr/bin or /usr/local/bin). If this step is skipped, you'll encounter the following error:
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH.

Each browser ships its own driver: ChromeDriver for Chrome, geckodriver for Firefox, and msedgedriver for Microsoft Edge.

  3. Set up your preferred LLM API keys: Ensure you have API keys ready for the LLM model you intend to use (e.g., OpenAI, Azure, Google, Anthropic).
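Rather than hard-coding keys into your script, it is common to read them from environment variables. The variable names below are conventions used for illustration, not requirements of scrapeAI:

```python
import os


def get_api_key(env=os.environ):
    """Read an LLM API key from the environment; raise if none is set."""
    # Illustrative variable names; use whichever matches your provider.
    key = env.get("AZURE_OPENAI_API_KEY") or env.get("OPENAI_API_KEY")
    if key is None:
        raise RuntimeError(
            "Set AZURE_OPENAI_API_KEY (or OPENAI_API_KEY) before running the scraper"
        )
    return key
```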

Usage

Here is a basic usage example of how to use scrapeAI to scrape data from a website based on a user query:

from scrapeAI import WebScraper

config = {
    "llm": {
        "api_key": '<Azure OpenAI API KEY>',
        "model": "<Azure OpenAI Deployment Name>",
        "api_version": "<Azure OpenAI API Version>",
        "endpoint": '<Azure OpenAI Endpoint Name>'
    },
    "verbose": False,
    "headless": False,
    "url" : "https://pypi.org/search/?q=genai",
    "prompt" : "Provide all the libraries and their installation commands"
}

scraper = WebScraper(config)

# Invoke the scraping process
result = scraper.invoke()

# Output the result
print(result)

The output is a list of dictionaries, for example:

[
  {
    'library': 'genai',
    'installation_command': 'pip install genai'
  },
  {
    'library': 'bookworm_genai',
    'installation_command': 'pip install bookworm_genai'
  },
  {
    'library': 'ada-genai',
    'installation_command': 'pip install ada-genai'
  },
  ...
  {
    'library': 'semantix-genai-serve',
    'installation_command': 'pip install semantix-genai-serve'
  },
  {
    'library': 'platform-gen-ai',
    'installation_command': 'pip install platform-gen-ai'
  }
]
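Assuming `result` has the list-of-dicts shape shown above, you can post-process it with ordinary Python. The sample data below stands in for a real scrape:

```python
# Sample data standing in for a real scraper result.
result = [
    {"library": "genai", "installation_command": "pip install genai"},
    {"library": "bookworm_genai", "installation_command": "pip install bookworm_genai"},
]

# Collect just the library names, then build a single install command.
names = [entry["library"] for entry in result]
one_liner = "pip install " + " ".join(names)
print(names)      # ['genai', 'bookworm_genai']
print(one_liner)  # pip install genai bookworm_genai
```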

Configuration Options

  • llm: Defines the configuration for the LLM model, including the API key, model version, and endpoint.
  • verbose: If set to True, enables detailed logging of operations.
  • headless: If set to True, runs the web scraping in headless mode (without opening a browser window).
  • url: The target URL for scraping.
  • prompt: The natural language query to ask the LLM and fetch relevant content from the page.
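A quick sanity check over the keys described above can catch mistakes before a browser session is started. This helper is illustrative, not part of scrapeAI's API:

```python
# Illustrative validator for a scrapeAI-style config dict; not part of the library.
REQUIRED_KEYS = {"llm", "url", "prompt"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in the config (empty list means OK)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - config.keys())]
    if "url" in config and not str(config["url"]).startswith(("http://", "https://")):
        problems.append("url should start with http:// or https://")
    return problems
```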

Project Structure


The project is organized as follows:

├── README.md
├── scrapeAI/
│   ├── __init__.py
│   ├── core/
│   │   ├── __init__.py
│   │   ├── base_scraper.py
│   │   ├── direct_scraper.py
│   │   ├── scraper_factory.py
│   │   └── search_scraper.py
│   ├── llms/
│   │   ├── __init__.py
│   │   ├── anthropic_llm.py
│   │   ├── azure_openai_llm.py
│   │   ├── base.py
│   │   ├── google_llm.py
│   │   ├── llm_factory.py
│   │   └── openai_llm.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── html_utils.py
│   │   └── logging.py
│   └── web_scraper.py
├── setup.py
└── tests/
    └── tests_operations.py

Core Components

  • core/: Contains the base scraper classes and factory design patterns for scraping strategies.
  • llms/: Includes different LLM integration classes such as OpenAI, Anthropic, and Google.
  • utils/: Utility functions for HTML parsing and logging.

Contributing


We welcome contributions! If you'd like to improve the project, feel free to fork the repository and submit a pull request. Please follow the existing code structure and ensure that your contribution includes proper tests.

License


This project is licensed under the MIT License - see the LICENSE file for details.

