
A web crawler for GPTs to build knowledge bases

Project description


Introduction

GPT-Web-Crawler is a web crawler built on Python and Puppeteer. It crawls web pages and extracts their content, including each page's title, URL, keywords, description, full text, images, and a screenshot. It takes only a few lines of code to use, which makes it a good fit for people who are not familiar with web crawling but want to extract content from web pages.

The output of the spider is a JSON file, which can easily be converted to a CSV file, imported into a database, or used to build an AI agent.
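For instance, here is a minimal conversion sketch, assuming the output file is a JSON array of per-page records (pandas and the file name test_packages.json are illustrative choices, not part of the package):

import pandas as pd

# Load the crawler's JSON output (assumed to be a list of page records)
df = pd.read_json("test_packages.json")
# Write the same records out as CSV
df.to_csv("test_packages.csv", index=False)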

Getting Started

Step1. Install the package.

pip install gpt-web-crawler

Step2. Copy config_template.py, rename it to config.py, and edit it to set your OpenAI API key and other settings if you want ProSpider to use AI to extract content from web pages. If you don't need AI-assisted extraction, you can leave config.py unchanged.
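For illustration only, config.py might look like the sketch below; the actual variable names are defined in config_template.py, so treat these as placeholders rather than the real settings:

# config.py -- hypothetical sketch; copy the real field names from config_template.py
OPENAI_API_KEY = "sk-..."  # placeholder field name; only needed when using ProSpider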

Step3. Run the following code to start a spider.

from gpt_web_crawler import run_spider, NoobSpider

run_spider(NoobSpider,
           max_page_count=10,                     # stop after crawling 10 pages
           start_urls="https://www.jiecang.cn/",  # seed URL for the crawl
           output_file="test_packages.json",      # JSON file for the results
           extract_rules=r'.*\.html')             # regex filter for crawled URLs

Spiders

Spider Type   Description
NoobSpider    Basic web page scraping
CatSpider     Web page scraping with screenshots
ProSpider     Web page scraping with AI-extracted content
LionSpider    Web page scraping with all images extracted

Cat Spider

The Cat spider is a spider that takes screenshots of web pages. It builds on the Noob spider and uses Puppeteer to simulate browser operations, capturing a screenshot of the entire page and saving it as an image. To use the Cat spider, you therefore need to install Puppeteer first.

npm install puppeteer
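Once Puppeteer is installed, the Cat spider can be started the same way as the Noob spider. A sketch, assuming CatSpider accepts the same run_spider arguments as the example above (the output file name is illustrative):

from gpt_web_crawler import run_spider, CatSpider

# Same call as the NoobSpider example, but with a screenshot taken per page
run_spider(CatSpider,
           max_page_count=10,
           start_urls="https://www.jiecang.cn/",
           output_file="test_cat.json",
           extract_rules=r'.*\.html')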

TODO

  • Support running without configuring config.py

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gpt-web-crawler-0.0.2.tar.gz (16.1 kB)

Uploaded Source

Built Distribution

gpt_web_crawler-0.0.2-py3-none-any.whl (21.8 kB)

Uploaded Python 3

File details

Details for the file gpt-web-crawler-0.0.2.tar.gz.

File metadata

  • Download URL: gpt-web-crawler-0.0.2.tar.gz
  • Upload date:
  • Size: 16.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for gpt-web-crawler-0.0.2.tar.gz
Algorithm    Hash digest
SHA256       5df8005ce68ee51a3b74aef8c6e39b08e1fdb329cee3d3b625d301c13767d41f
MD5          1f86ed2bccbe337d7421fde190f6f212
BLAKE2b-256  08f25840ca1241368a1075e19e1eb4bb14d55d98815a1ebfeab079fcf3fac9dd

See more details on using hashes here.
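To check a download against these values, you can hash the file yourself. A minimal sketch using Python's standard hashlib, with the expected value being the SHA256 digest listed above:

import hashlib

expected = "5df8005ce68ee51a3b74aef8c6e39b08e1fdb329cee3d3b625d301c13767d41f"

# Hash the downloaded archive and compare against the published digest
with open("gpt-web-crawler-0.0.2.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print("OK" if digest == expected else "hash mismatch")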

File details

Details for the file gpt_web_crawler-0.0.2-py3-none-any.whl.

File hashes

Hashes for gpt_web_crawler-0.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       668bb06722fe917ad74daec9cf21e394b8af7cc428d333d220d941a26fb5cc09
MD5          72d2089f7413b85311753db6327efb4a
BLAKE2b-256  65aa41830345bd326154fb994b8bce40d20dfa50bc83ece0f612fc259d7ac047

See more details on using hashes here.
