A web crawler for GPTs to build knowledge bases
Project description
Introduction
GPT-Web-Crawler is a web crawler based on Python and Puppeteer. It crawls web pages and extracts their content, including each page's title, URL, keywords, description, full text content, images, and a screenshot. It is easy to use: a few lines of code are enough to start a crawl, which makes it a good fit for people who are not familiar with web scraping but want to extract content from web pages.
The spider's output is a JSON file, which can easily be converted to CSV, imported into a database, or used to build an AI agent's knowledge base.
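For example, a JSON output file can be flattened into CSV with the standard library alone. A minimal sketch; the filenames are placeholders, and it assumes the output is a JSON array of records with uniform keys:

```python
import csv
import json

# Load the spider's JSON output (placeholder filename).
with open("output.json", encoding="utf-8") as f:
    pages = json.load(f)

# Write one CSV row per crawled page, using the first record's keys as headers.
if pages:
    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=pages[0].keys())
        writer.writeheader()
        writer.writerows(pages)
```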
Getting Started
Step 1. Install the package.
pip install gpt-web-crawler
Step 2. Copy config_template.py and rename the copy to config.py. If you want to use ProSpider, which relies on AI to extract content from web pages, edit config.py to set your OpenAI API key and any other settings. If you don't need AI-assisted extraction, you can leave config.py unchanged.
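What config.py contains is defined by config_template.py in the repository; as a rough illustration only (the setting name below is an assumption, not the template's documented name):

```python
# config.py -- hypothetical sketch; copy config_template.py for the real
# setting names, which may differ from the one shown here.
OPENAI_API_KEY = "sk-..."  # only required when using ProSpider's AI extraction
```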
Step 3. Run the following code to start a spider.
from gpt_web_crawler import run_spider, NoobSpider

run_spider(NoobSpider,
           max_page_count=10,                     # maximum number of pages to crawl
           start_urls="https://www.jiecang.cn/",  # seed URL where the crawl starts
           output_file="test_packages.json",      # JSON file to write results to
           extract_rules=r'.*\.html')             # regex rule for pages to extract
Spiders
Spider Type | Description
---|---
NoobSpider | Basic web page scraping
CatSpider | Web page scraping with screenshots
ProSpider | Web page scraping with AI-extracted content
LionSpider | Web page scraping with all images extracted
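Switching spiders is a matter of passing a different class to run_spider. A sketch for ProSpider, assuming it is importable from the package top level like NoobSpider and that your OpenAI API key is set in config.py (the output filename is a placeholder):

```python
from gpt_web_crawler import run_spider, ProSpider

# Same call signature as the NoobSpider example; ProSpider additionally
# uses the OpenAI API key from config.py for AI-assisted extraction.
run_spider(ProSpider,
           max_page_count=10,
           start_urls="https://www.jiecang.cn/",
           output_file="pro_output.json",
           extract_rules=r'.*\.html')
```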
Cat Spider
CatSpider is a spider that takes screenshots of web pages. It is based on NoobSpider and uses Puppeteer to simulate browser operations, capture a screenshot of the entire page, and save it as an image. To use CatSpider, you therefore need to install Puppeteer first:
npm install puppeteer
TODO
- Support running without configuring config.py
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file gpt-web-crawler-0.0.2.tar.gz
File metadata
- Download URL: gpt-web-crawler-0.0.2.tar.gz
- Upload date:
- Size: 16.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5df8005ce68ee51a3b74aef8c6e39b08e1fdb329cee3d3b625d301c13767d41f
MD5 | 1f86ed2bccbe337d7421fde190f6f212
BLAKE2b-256 | 08f25840ca1241368a1075e19e1eb4bb14d55d98815a1ebfeab079fcf3fac9dd
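To verify a downloaded archive against the SHA256 digest above, a minimal standard-library check:

```python
import hashlib

# Published SHA256 digest for gpt-web-crawler-0.0.2.tar.gz (from the table above).
expected = "5df8005ce68ee51a3b74aef8c6e39b08e1fdb329cee3d3b625d301c13767d41f"

with open("gpt-web-crawler-0.0.2.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

assert actual == expected, "SHA256 mismatch: file may be corrupted or tampered with"
```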
File details
Details for the file gpt_web_crawler-0.0.2-py3-none-any.whl
File metadata
- Download URL: gpt_web_crawler-0.0.2-py3-none-any.whl
- Upload date:
- Size: 21.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.18
File hashes
Algorithm | Hash digest
---|---
SHA256 | 668bb06722fe917ad74daec9cf21e394b8af7cc428d333d220d941a26fb5cc09
MD5 | 72d2089f7413b85311753db6327efb4a
BLAKE2b-256 | 65aa41830345bd326154fb994b8bce40d20dfa50bc83ece0f612fc259d7ac047