A Python library to scrape web data using LLMs and Selenium
Project description
Scrape-AI is a Python library designed to intelligently scrape data from websites by combining LLMs (Large Language Models) with Selenium for dynamic web interactions. You configure it to use a specific LLM provider (such as OpenAI, Anthropic, or Azure OpenAI) and it fetches data from websites in real time based on a user query. At its core, it provides agent-like scraping driven by natural language queries.
Key Features
- LLM Integration: Supports multiple LLM models (OpenAI, Anthropic, Azure, Google, etc.) through a flexible factory pattern.
- Dynamic Web Scraping: Utilizes Selenium WebDriver to interact with dynamic content on websites.
- Agent-Like Functionality: Acts as an intelligent agent that can process user queries and fetch relevant data from the web.
- Configurable: Customizable LLM model settings, verbosity, headless browsing, and target URLs.
- Modular Design: Structured in a modular way to extend scraping strategies and LLM integrations.
Installation
- Clone the repository or install via pip (if available on PyPI):
pip install scrapeAI
- Selenium WebDriver Dependencies:
Selenium requires a browser driver to interact with the chosen browser. For example, if you're using Chrome, you need to install ChromeDriver. Make sure the driver is placed in a directory on your system's PATH (e.g. /usr/bin or /usr/local/bin); a quick way to verify this from Python is sketched after this list. If this step is skipped, you'll encounter the following error:
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH.
Here are links to some of the popular browser drivers:
- Chrome: Download ChromeDriver
- Edge: Download EdgeDriver
- Firefox: Download GeckoDriver
- Safari: WebDriver support in Safari
- Set up your preferred LLM API keys: Ensure you have API keys ready for the LLM model you intend to use (e.g., OpenAI, Azure, Google, Anthropic).
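As referenced above, here is a minimal sketch (standard Selenium APIs only, assuming Chrome and ChromeDriver) for verifying that the driver is reachable before running the scraper; adjust the driver name and path for your browser:

import shutil
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Check whether chromedriver is discoverable on PATH.
driver_path = shutil.which("chromedriver")
if driver_path is None:
    raise RuntimeError("chromedriver not found on PATH; install it (e.g. under /usr/local/bin) or pass an explicit path")

# Point Selenium at the driver explicitly and open a test page.
driver = webdriver.Chrome(service=Service(driver_path))
driver.get("https://pypi.org")
print(driver.title)
driver.quit()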
Usage
Here is a basic example of how to use scrapeAI to scrape data from a website based on a user query:
from scrapeAI import WebScraper

config = {
    "llm": {
        "api_key": "<Azure OpenAI API Key>",
        "model": "<Azure OpenAI Deployment Name>",
        "api_version": "<Azure OpenAI API Version>",
        "endpoint": "<Azure OpenAI Endpoint Name>"
    },
    "verbose": False,
    "headless": False,
    "url": "https://pypi.org/search/?q=genai",
    "prompt": "Provide all the libraries and their installation commands"
}
scraper = WebScraper(config)
# Invoke the scraping process
result = scraper.invoke()
# Output the result
print(result)
The output will be JSON like the following:
[
    {
        'library': 'genai',
        'installation_command': 'pip install genai'
    },
    {
        'library': 'bookworm_genai',
        'installation_command': 'pip install bookworm_genai'
    },
    {
        'library': 'ada-genai',
        'installation_command': 'pip install ada-genai'
    },
    ...
    {
        'library': 'semantix-genai-serve',
        'installation_command': 'pip install semantix-genai-serve'
    },
    {
        'library': 'platform-gen-ai',
        'installation_command': 'pip install platform-gen-ai'
    }
]
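Assuming the result is returned as a Python list of dictionaries, as shown above, it can be processed directly; for example, to print each installation command:

for item in result:
    # Each entry holds a library name and its pip install command.
    print(f"{item['library']}: {item['installation_command']}")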
Configuration Options
- llm: Defines the configuration for the LLM model, including the API key, model version, and endpoint.
- verbose: If set to True, enables detailed logging of operations.
- headless: If set to True, runs the web scraping in headless mode (without opening a browser window).
- url: The target URL for scraping.
- prompt: The natural language query to ask the LLM and fetch relevant content from the page.
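For example, to run the same scrape as in the usage section above without opening a browser window and with detailed logging, only the verbose and headless flags change (the Azure OpenAI placeholders are the same as before):

config = {
    "llm": {
        "api_key": "<Azure OpenAI API Key>",
        "model": "<Azure OpenAI Deployment Name>",
        "api_version": "<Azure OpenAI API Version>",
        "endpoint": "<Azure OpenAI Endpoint Name>"
    },
    "verbose": True,    # log each step of the scraping run
    "headless": True,   # run the browser without a visible window
    "url": "https://pypi.org/search/?q=genai",
    "prompt": "Provide all the libraries and their installation commands"
}

result = WebScraper(config).invoke()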
Project Structure
The project is organized as follows:
├── README.md
├── scrapeAI/
│ ├── __init__.py
│ ├── core/
│ │ ├── __init__.py
│ │ ├── base_scraper.py
│ │ ├── direct_scraper.py
│ │ ├── scraper_factory.py
│ │ └── search_scraper.py
│ ├── llms/
│ │ ├── __init__.py
│ │ ├── anthropic_llm.py
│ │ ├── azure_openai_llm.py
│ │ ├── base.py
│ │ ├── google_llm.py
│ │ ├── llm_factory.py
│ │ └── openai_llm.py
│ ├── utils/
│ │ ├── __init__.py
│ │ ├── html_utils.py
│ │ └── logging.py
│ └── web_scraper.py
├── setup.py
├── tests/
│ └── tests_operations.py
Core Components
- core/: Contains the base scraper classes and factory design patterns for scraping strategies.
- llms/: Includes different LLM integration classes such as OpenAI, Anthropic, and Google.
- utils/: Utility functions for HTML parsing and logging.
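To illustrate how the factory-based, modular design can be extended with a new LLM integration, here is a minimal hypothetical sketch; the names below (BaseLLM, create_llm, _LLM_REGISTRY) are illustrative only and do not necessarily match the library's internal API:

# Hypothetical sketch of an LLM factory; names are illustrative,
# not the library's actual internals.
class BaseLLM:
    def generate(self, prompt: str) -> str:
        raise NotImplementedError

class AzureOpenAILLM(BaseLLM):
    def __init__(self, api_key: str, endpoint: str, model: str, api_version: str):
        self.api_key = api_key
        self.endpoint = endpoint
        self.model = model
        self.api_version = api_version

    def generate(self, prompt: str) -> str:
        # Call the Azure OpenAI chat completions endpoint here.
        raise NotImplementedError

_LLM_REGISTRY = {"azure_openai": AzureOpenAILLM}

def create_llm(provider: str, **kwargs) -> BaseLLM:
    # Look up the provider name and instantiate the matching integration.
    try:
        return _LLM_REGISTRY[provider](**kwargs)
    except KeyError:
        raise ValueError(f"Unsupported LLM provider: {provider}")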
Contributing
We welcome contributions! If you'd like to improve the project, feel free to fork the repository and submit a pull request. Please follow the existing code structure and ensure that your contribution includes proper tests.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file scrapeAI-0.3.0.tar.gz.
File metadata
- Download URL: scrapeAI-0.3.0.tar.gz
- Upload date:
- Size: 11.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7dcced5e60a27ee1f4776b8548fce1b67951b3c9f7925fce47a8c32d72e3d303
MD5 | 8b1e545d0a60f2690442e2cb3c8d6047
BLAKE2b-256 | 723bee869527d9f821fd879ed01a05e0d04d0e5b0ff30fb905e9855721ab9cff
File details
Details for the file scrapeAI-0.3.0-py3-none-any.whl.
File metadata
- Download URL: scrapeAI-0.3.0-py3-none-any.whl
- Upload date:
- Size: 12.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | b19c3bc3c608548d29fa17f38a851f70f449f45db674eb5439c715b6543f494a
MD5 | d1b6c9632735c6544b7e71ed81576f99
BLAKE2b-256 | a8707d09ebf05eb9f12843ab7486f3d0854cc4f4ddfef08f4088c430d017a32b