A Python library to scrape web data using LLMs and Selenium

Project description

Scrape-AI

Scrape-AI is a Python library for intelligently scraping data from websites by combining LLMs (Large Language Models) with Selenium for dynamic web interactions. You configure it to use a specific LLM (such as OpenAI, Anthropic, or Azure OpenAI), and it fetches data from websites in real time based on a natural-language user query, giving it agent-like scraping capabilities.

Key Features

  • LLM Integration: Supports multiple LLM providers (OpenAI, Anthropic, Azure, Google, etc.) through a flexible factory pattern (see the sketch after this list).
  • Dynamic Web Scraping: Utilizes Selenium WebDriver to interact with dynamic content on websites.
  • Agent-Like Functionality: Acts as an intelligent agent that can process user queries and fetch relevant data from the web.
  • Configurable: Customizable LLM model settings, verbosity, headless browsing, and target URLs.
  • Modular Design: Structured in a modular way to extend scraping strategies and LLM integrations.
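
The factory pattern behind the LLM integration is a standard way to map a provider name to a concrete client class. Below is a minimal sketch of that idea; the class and function names are assumptions for illustration, not scrapeAI's actual API:

class BaseLLM:
    """Interface every provider-specific client implements (illustrative)."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class AzureOpenAILLM(BaseLLM):
    def __init__(self, api_key: str, endpoint: str, api_version: str, model: str):
        self.api_key, self.endpoint = api_key, endpoint
        self.api_version, self.model = api_version, model
    def complete(self, prompt: str) -> str:
        return "<response from Azure OpenAI>"  # real code would call the API here

class OpenAILLM(BaseLLM):
    def __init__(self, api_key: str, model: str):
        self.api_key, self.model = api_key, model
    def complete(self, prompt: str) -> str:
        return "<response from OpenAI>"  # real code would call the API here

def llm_factory(provider: str, **kwargs) -> BaseLLM:
    """Map a provider name to a concrete client, passing config through."""
    registry = {"azure": AzureOpenAILLM, "openai": OpenAILLM}
    if provider not in registry:
        raise ValueError(f"Unsupported provider: {provider}")
    return registry[provider](**kwargs)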

Installation

  1. Clone the repository or install via pip (if available on PyPI):

git clone https://github.com/yourusername/scrapeAI.git
cd scrapeAI
pip install -r requirements.txt

  2. Selenium WebDriver dependencies: Selenium requires a browser driver to interact with your chosen browser. For example, if you're using Chrome, you need to install ChromeDriver and make sure it is placed in your system's PATH (e.g., /usr/bin or /usr/local/bin). If this step is skipped, you'll encounter the following error:

selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH.
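
Once the driver is installed, you can confirm that Selenium can find it with a short check that is independent of scrapeAI. This uses only the standard Selenium 4 API; note that Selenium 4.6+ can also download a matching driver automatically via Selenium Manager:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")      # no visible browser window
driver = webdriver.Chrome(options=options)  # raises WebDriverException if the driver is missing
driver.get("https://example.com")
print(driver.title)                         # "Example Domain" means the setup works
driver.quit()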

Here are links to some of the popular browser drivers (well-known upstream download pages):

  • ChromeDriver (Chrome): https://chromedriver.chromium.org/
  • GeckoDriver (Firefox): https://github.com/mozilla/geckodriver/releases
  • Microsoft Edge WebDriver (Edge): https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

  3. Set up your preferred LLM API keys: Ensure you have API keys ready for the LLM model you intend to use (e.g., OpenAI, Azure, Google, Anthropic).
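
Rather than hard-coding keys in your scripts, you can load them from environment variables. Here is a minimal sketch using the same config fields as the usage example below; the AZURE_OPENAI_* variable names are arbitrary choices for illustration, not names the library requires:

import os

llm_config = {
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],          # export these in your shell
    "model": os.environ["AZURE_OPENAI_DEPLOYMENT"],         # rather than committing them
    "api_version": os.environ["AZURE_OPENAI_API_VERSION"],
    "endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
}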

Usage

Here is a basic example of using scrapeAI to scrape data from a website based on a user query:

from scrapeAI import WebScraper

config = {
    "llm": {
        "api_key": '<Azure OpenAI API KEY>',
        "model": "<Azure OpenAI Deplyement Name>",
        "api_version": "<Azure Open AI API Version>",
        "endpoint": '<Azure OpenAI Endpoint Name>'
    },
    "verbose": False,
    "headless": False,
    "url" : "https://pypi.org/search/?q=genai",
    "prompt" : "Provide all the libraries and their installation commands"
}

scraper = WebScraper(config)

# Invoke the scraping process
result = scraper.invoke()

# Output the result
print(result)

The output is a list of records like the following:

[
  {
    'library': 'genai',
    'installation_command': 'pip install genai'
  },
  {
    'library': 'bookworm_genai',
    'installation_command': 'pip install bookworm_genai'
  },
  {
    'library': 'ada-genai',
    'installation_command': 'pip install ada-genai'
  },
  ...
  {
    'library': 'semantix-genai-serve',
    'installation_command': 'pip install semantix-genai-serve'
  },
  {
    'library': 'platform-gen-ai',
    'installation_command': 'pip install platform-gen-ai'
  },
  
]
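
Because the result comes back as Python data, it can be post-processed directly. For example, assuming the list-of-dicts shape shown above:

for item in result:
    print(f"{item['library']}: {item['installation_command']}")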

Configuration Options

  • llm: Defines the configuration for the LLM model, including the API key, model/deployment name, API version, and endpoint.
  • verbose: If set to True, enables detailed logging of operations.
  • headless: If set to True, runs the web scraping in headless mode (without opening a browser window).
  • url: The target URL for scraping.
  • prompt: The natural language query to ask the LLM and fetch relevant content from the page.
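
Putting these options together, a config that enables detailed logging and headless browsing could look like the following (the field names follow the usage example above; the values are placeholders, not real credentials):

config = {
    "llm": {
        "api_key": "<Azure OpenAI API key>",
        "model": "<Azure OpenAI deployment name>",
        "api_version": "<Azure OpenAI API version>",
        "endpoint": "<Azure OpenAI endpoint>"
    },
    "verbose": True,    # log each step of the scrape
    "headless": True,   # no visible browser window
    "url": "https://pypi.org/search/?q=genai",
    "prompt": "Provide all the libraries and their installation commands"
}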

Project Structure


The project is organized as follows:

├── README.md
├── scrapeAI/
│   ├── __init__.py
│   ├── core/
│   │   ├── __init__.py
│   │   ├── base_scraper.py
│   │   ├── direct_scraper.py
│   │   ├── scraper_factory.py
│   │   └── search_scraper.py
│   ├── llms/
│   │   ├── __init__.py
│   │   ├── anthropic_llm.py
│   │   ├── azure_openai_llm.py
│   │   ├── base.py
│   │   ├── google_llm.py
│   │   ├── llm_factory.py
│   │   └── openai_llm.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── html_utils.py
│   │   └── logging.py
│   └── web_scraper.py
├── setup.py
└── tests/
    └── tests_operations.py

Core Components

  • core/: Contains the base scraper classes and factory design patterns for scraping strategies (a rough sketch follows this list).
  • llms/: Includes different LLM integration classes such as OpenAI, Anthropic, and Google.
  • utils/: Utility functions for HTML parsing and logging.
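
As an illustration of how a scraper factory can dispatch between strategies, here is a sketch modeled on the module names in the tree above (direct_scraper.py, search_scraper.py, scraper_factory.py). The class names and selection logic are assumptions, not the project's actual code:

class BaseScraper:
    """Interface shared by all scraping strategies (illustrative)."""
    def scrape(self, url: str, prompt: str) -> list:
        raise NotImplementedError

class DirectScraper(BaseScraper):
    """Scrape a single page whose URL is supplied directly."""
    def scrape(self, url, prompt):
        return []  # placeholder: load `url` with Selenium, extract data via the LLM

class SearchScraper(BaseScraper):
    """Run a search first, then scrape the result pages."""
    def scrape(self, url, prompt):
        return []  # placeholder

def scraper_factory(has_direct_url: bool) -> BaseScraper:
    # A direct URL skips the search step; otherwise fall back to searching.
    return DirectScraper() if has_direct_url else SearchScraper()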

Contributing


We welcome contributions! If you'd like to improve the project, feel free to fork the repository and submit a pull request. Please follow the existing code structure and ensure that your contribution includes proper tests.

License


This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Download the file for your platform.

Source Distribution

scrapeAI-0.2.0.tar.gz (11.2 kB)

Uploaded Source

Built Distribution

scrapeAI-0.2.0-py3-none-any.whl (12.8 kB)

Uploaded Python 3

File details

Details for the file scrapeAI-0.2.0.tar.gz.

File metadata

  • Download URL: scrapeAI-0.2.0.tar.gz
  • Size: 11.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.2

File hashes

Hashes for scrapeAI-0.2.0.tar.gz:

  • SHA256: b21d33f0eb3650cccc4373d3b31251225ff959ebb6c051c0ab0f389f977cc352
  • MD5: 48488d00defdd8eb6b5d57241201da0d
  • BLAKE2b-256: 6c937d50e81c893cdbe49ecb679ec3a73f0ad20da1f3272591477de17cdd7602
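
To check a downloaded file against the published SHA256 digest using only the standard library (the filename assumes the archive is in the current directory):

import hashlib

expected = "b21d33f0eb3650cccc4373d3b31251225ff959ebb6c051c0ab0f389f977cc352"
with open("scrapeAI-0.2.0.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "hash mismatch")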


File details

Details for the file scrapeAI-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: scrapeAI-0.2.0-py3-none-any.whl
  • Size: 12.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.2

File hashes

Hashes for scrapeAI-0.2.0-py3-none-any.whl:

  • SHA256: e2eec1565aedfe56c08a1954cbc7bbde3b05b835220423b68e8dbd9e000aed0e
  • MD5: 04f939c3ad4fa1312ffb69a4ede28ae1
  • BLAKE2b-256: 85f165c7df2b316d6972bc2301916dcaa038f06ae21c0d7ba6ac4c555c332050

