ScraperAI
⚡ Scraping has never been easier ⚡
What is ScraperAI
ScraperAI is an open-source, AI-powered tool designed to simplify web scraping for users of all skill levels. By leveraging Large Language Models, such as ChatGPT, ScraperAI extracts data from web pages and generates reusable and shareable scraping recipes.
Features
- Serializable & reusable Scraper Configs
- Automatic data detection
- Automatic XPath detection
- Automatic pagination & page type detection
- HTML minification
- ChatGPT support
- Custom LLMs support
- Selenium support
- Custom crawlers support
Installation
Install ScraperAI easily using pip or from the source.
With pip:
pip install scraperai
From source:
git clone https://github.com/scraperai/scraperai.git
pip install ./scraperai
Getting Started
Page Type Detector
Web pages are categorized into four types:
- Catalog: Pages with similar repeating elements, such as product lists, articles, companies or table rows.
- Details: Pages detailing information about a single product.
- Captcha: Captcha pages that hinder scraping efforts. Currently, we do not provide solutions to circumvent captchas.
- Other: All other page types not currently supported.
ScraperAI primarily uses page screenshots and the GPT-4 Vision model for page type determination, with a fallback algorithm for cases where screenshots or Vision model access is unavailable. Users can manually set the page type if known.
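The four categories and the manual override can be pictured with a small sketch. The `PageType` enum and the `resolve_page_type` helper below are hypothetical names chosen for illustration, not ScraperAI's actual API; the precedence (manual value first, then the detected value, then "other") is an assumption.

```python
from enum import Enum
from typing import Optional


class PageType(str, Enum):
    # The four categories listed above; identifiers are illustrative only.
    CATALOG = "catalog"
    DETAILS = "details"
    CAPTCHA = "captcha"
    OTHER = "other"


def resolve_page_type(detected: Optional[PageType],
                      manual: Optional[PageType] = None) -> PageType:
    """Prefer a user-supplied type, then the detected one, else OTHER."""
    if manual is not None:
        return manual
    if detected is not None:
        return detected
    return PageType.OTHER
```

With this shape, a caller who already knows the page is a catalog could pass `manual=PageType.CATALOG` and skip detection entirely.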
Pagination Detector
This feature applies to catalog-type web pages and supports three pagination types:
- xpath: XPath of pagination buttons such as "Next page", "More", etc.
- scroll: Infinite scrolling.
- url_param: URL parameter-based pagination (e.g., website.com/?page=1).
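For URL parameter-based pagination, generating page URLs might look like the sketch below. It uses only the Python standard library; the `page_url` helper is a hypothetical name and not part of ScraperAI.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url: str, param: str, page: int) -> str:
    """Return base_url with the query parameter `param` set to `page`."""
    parts = urlsplit(base_url)
    # Keep the other query parameters and overwrite only the page number.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# URLs for the first three pages of the example address from the list above.
urls = [page_url("https://website.com/?page=1", "page", n) for n in range(1, 4)]
```

Note that collecting the query into a `dict` drops repeated parameter names, which is acceptable for a sketch but may matter on some sites.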
Catalog Item Detector
This feature is specifically designed for catalog-type web pages. It identifies repeating elements that typically represent individual data items, such as products, articles, or companies. These elements may appear as visually distinct cards or as rows within a table, facilitating the organized display of information.
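One simple way to think about "repeating elements" is to count sibling elements that share the same tag and class. The sketch below is a plain heuristic written for illustration; it is not the algorithm ScraperAI uses, it assumes reasonably well-formed HTML, and all names in it are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RepeatCounter(HTMLParser):
    """Counts children with the same (tag, class) signature under each parent."""

    def __init__(self):
        super().__init__()
        self.stack = []          # ids of the currently open elements
        self.next_id = 0
        self.counts = Counter()  # (parent_id, tag, class) -> occurrences

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        css_class = dict(attrs).get("class") or ""
        self.counts[(parent, tag, css_class)] += 1
        if tag not in VOID_TAGS:
            self.stack.append(self.next_id)
            self.next_id += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def most_repeated(html: str):
    """Return ((tag, class), count) for the most repeated sibling signature."""
    parser = RepeatCounter()
    parser.feed(html)
    parser.close()
    (_, tag, css_class), count = parser.counts.most_common(1)[0]
    return (tag, css_class), count


sample = ('<ul class="list">'
          '<li class="card"><a>A</a></li>'
          '<li class="card"><a>B</a></li>'
          '<li class="card"><a>C</a></li>'
          '</ul>')
result = most_repeated(sample)
```

For the sample list, the three `li` elements with class `card` form the largest group of siblings, which is the kind of signal a catalog item detector could start from.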
Fields Extractor
The Fields Extractor detects relevant information on the page and then finds XPaths that extract this information efficiently. It can be used to retrieve information from individual catalog item cards or from nested details pages. We define two types of data fields within an HTML page:
- Static fields: Fields without explicit names, containing single or multiple values (e.g., product names or prices).
- Dynamic fields: Fields with both names and values, typically formatted like table entries.
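The difference between the two field types can be sketched as follows. The `StaticField` and `DynamicField` classes and the sample card are hypothetical and do not reflect ScraperAI's data model. The example uses the standard library's `xml.etree.ElementTree`, which handles only well-formed markup and a subset of XPath; real-world HTML would likely need a fuller parser such as lxml.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class StaticField:
    name: str    # label assigned by the extractor; it does not appear on the page
    xpath: str   # where the value(s) live


@dataclass
class DynamicField:
    section: str       # label for the whole group of name/value pairs
    name_xpath: str    # elements holding the field names
    value_xpath: str   # elements holding the matching values


def extract_static(root, field):
    """Return every value matched by the field's XPath."""
    return [(el.text or "").strip() for el in root.findall(field.xpath)]


def extract_dynamic(root, field):
    """Pair names with values in document order."""
    names = [(el.text or "").strip() for el in root.findall(field.name_xpath)]
    values = [(el.text or "").strip() for el in root.findall(field.value_xpath)]
    return dict(zip(names, values))


CARD = """
<div class="card">
  <h2 class="title">Blue Kettle</h2>
  <span class="price">19.99</span>
  <table>
    <tr><th>Color</th><td>Blue</td></tr>
    <tr><th>Volume</th><td>1.7 l</td></tr>
  </table>
</div>
"""

root = ET.fromstring(CARD.strip())
title = extract_static(root, StaticField("name", ".//h2[@class='title']"))
specs = extract_dynamic(root, DynamicField("specs", ".//tr/th", ".//tr/td"))
```

Here the product title is a static field (the page never says "name"), while the table rows are dynamic fields because each row carries its own name and value.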
Web Crawler
Our WebCrawler is engineered to:
- Access web pages.
- Simulate human actions (clicking, scrolling).
- Capture screenshots of web pages.
Selenium WebDriver is the default tool due to its convenience and ease of use, and it incorporates techniques to avoid most website blocks. Users can implement their own crawlers with other tools such as Playwright. The requests package is also supported, albeit with some limitations.
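A custom crawler could be structured around an abstract interface covering the three responsibilities above. The sketch below is an assumption about what such an interface might look like: `BaseCrawler`, its method names, and `InMemoryCrawler` are all hypothetical and are not taken from ScraperAI's code.

```python
from abc import ABC, abstractmethod


class BaseCrawler(ABC):
    """Hypothetical interface; method names are assumptions, not ScraperAI's API."""

    @abstractmethod
    def get(self, url: str) -> None:
        """Open a web page."""

    @property
    @abstractmethod
    def page_source(self) -> str:
        """HTML of the currently open page."""

    @abstractmethod
    def click(self, xpath: str) -> None:
        """Simulate a click on the element found by `xpath`."""

    @abstractmethod
    def scroll_to_bottom(self) -> None:
        """Simulate scrolling to the end of the page."""

    @abstractmethod
    def screenshot_png(self) -> bytes:
        """Return a PNG screenshot of the page."""


class InMemoryCrawler(BaseCrawler):
    """Serves canned HTML without a browser or network; useful in tests."""

    def __init__(self, pages):
        self._pages = pages   # mapping of url -> html
        self._source = ""
        self.actions = []     # log of simulated actions

    def get(self, url):
        self._source = self._pages[url]
        self.actions.append(("get", url))

    @property
    def page_source(self):
        return self._source

    def click(self, xpath):
        self.actions.append(("click", xpath))

    def scroll_to_bottom(self):
        self.actions.append(("scroll", None))

    def screenshot_png(self):
        # A backend that does not render pages cannot produce screenshots.
        raise NotImplementedError("no rendering backend")
```

A browser-based backend would implement the same methods on top of its driver, while a non-rendering backend has to give up screenshots, as the in-memory example does.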
Demo
Jupyter notebook
An example of basic scraper usage is available in example.ipynb.
The notebook presents two experiments.
CLI Application
ScraperAI has a built-in CLI application. Simply run:
scraperai --url https://www.ycombinator.com/companies
or simply
scraperai
Follow the interactive process: ScraperAI attempts to auto-detect page types, pagination, catalog cards, and data fields, and lets you correct its detections manually.
The CLI currently supports only the OpenAI chat model, which requires an openai_api_key.
The key can be provided via an environment variable, a .env file, or directly to the script.
Use scraperai --help for assistance.
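The three ways of supplying the key can be combined in a small lookup helper. The sketch below is illustrative only: the variable name `OPENAI_API_KEY`, the lookup order, and both function names are assumptions rather than the CLI's documented behavior.

```python
import os
from pathlib import Path
from typing import Optional


def read_dotenv(path: str) -> dict:
    """Parse simple KEY=VALUE lines; comments and blank lines are skipped."""
    values = {}
    file = Path(path)
    if not file.is_file():
        return values
    for line in file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(explicit: Optional[str] = None,
                    env_name: str = "OPENAI_API_KEY",  # assumed variable name
                    dotenv_path: str = ".env") -> Optional[str]:
    """Explicit argument first, then the environment, then the .env file."""
    if explicit:
        return explicit
    return os.environ.get(env_name) or read_dotenv(dotenv_path).get(env_name)
```

Returning `None` when nothing is found lets the caller decide whether to prompt the user or exit with an error.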
Roadmap
Our vision for ScraperAI's future includes:
- Add httpx and aiohttp crawlers
- Improve recipes & prompts
- Release SaaS web app
- Add support of different LLMs
- Add gpt4all integration
- Add anti-captcha integration
We welcome feature requests and ideas from our community.
Contributing
Your contributions are highly appreciated! Feel free to submit pull requests or issues.