Inherits from the requests module, adds XPath functionality to extend the API, and handles request failures and retries.
PrSpiders: a thread-pool crawler framework
Installing PrSpiders
pip install PrSpiders
Getting started
1.Demo
from PrSpider import PrSpiders

class Spider(PrSpiders):
    start_urls = 'https://www.runoob.com'

    def parse(self, response):
        # print(response.text)
        print(response, response.code, response.url)
        # <Response Code=200 Len=323273> 200 https://www.runoob.com/

if __name__ == '__main__':
    Spider()
2. Overriding the entry point: start_requests
start_requests is the framework's entry point, and PrSpiders.Requests is the method that sends requests; its parameters are listed below.
from PrSpider import PrSpiders

class Spider(PrSpiders):
    def start_requests(self, **kwargs):
        start_urls = 'https://www.runoob.com'
        PrSpiders.Requests(url=start_urls, callback=self.parse)

    def parse(self, response):
        # print(response.text)
        print(response, response.code, response.url)

if __name__ == '__main__':
    Spider()
3. Basic PrSpiders configuration
The framework is built on ThreadPoolExecutor.
workers: number of threads
retry: whether to retry failed requests (enabled by default)
download_delay: delay between request batches
download_num: number of requests sent per batch; the default is 5 requests per second
Usage:
from PrSpider import PrSpiders

class Spider(PrSpiders):
    workers = 5
    retry = False
    download_delay = 3
    download_num = 10

    def start_requests(self, **kwargs):
        start_urls = 'https://www.runoob.com'
        PrSpiders.Requests(url=start_urls, callback=self.parse)

    def parse(self, response):
        # print(response.text)
        print(response, response.code, response.url)

if __name__ == '__main__':
    Spider()
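The docs above say the framework is built on ThreadPoolExecutor. As a rough illustration of how the workers, download_delay, and download_num settings could interact, here is a minimal stdlib-only sketch (the `crawl` and `fetch` names are hypothetical, not part of PrSpiders):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def crawl(urls, fetch, workers=5, download_delay=0, download_num=5):
    """Illustrative scheduler: submit at most `download_num` URLs per
    batch to a pool of `workers` threads, sleeping `download_delay`
    seconds between batches. `fetch` is any request function."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(0, len(urls), download_num):
            batch = urls[i:i + download_num]
            # pool.map preserves input order within the batch
            results.extend(pool.map(fetch, batch))
            if download_delay and i + download_num < len(urls):
                time.sleep(download_delay)
    return results

urls = [f'url-{n}' for n in range(7)]
print(crawl(urls, fetch=lambda u: u.upper(), workers=2, download_num=3))
```

This is only a sketch of the batching idea; the real framework's scheduling details may differ.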
4. PrSpiders.Requests parameters
url: request URL
callback: callback function
headers: request headers
retry_time: number of retries after a failed request
method: HTTP method (GET by default)
meta: data passed through to the callback
encoding: response encoding (utf-8 by default)
retry_interval: interval between retries
timeout: request timeout (10 s by default)
**kwargs: parameters inherited from requests, e.g. data, params, proxies
PrSpiders.Requests(url=start_urls, headers={}, method='post', encoding='gbk', callback=self.parse,
retry_time=10, retry_interval=0.5, meta={'hhh': 'ggg'})
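To make the retry_time and retry_interval semantics concrete, here is a minimal stdlib-only sketch of a retry loop; `request_with_retry` and `fetch` are hypothetical names for illustration, not PrSpiders internals:

```python
import time

def request_with_retry(fetch, url, retry_time=3, retry_interval=0.5):
    """Try the request once, then up to `retry_time` more times,
    sleeping `retry_interval` seconds between attempts."""
    last_exc = None
    for attempt in range(retry_time + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_exc = exc
            if attempt < retry_time:
                time.sleep(retry_interval)
    raise last_exc

# A fetch function that fails twice, then succeeds on the third call
calls = []
def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('transient failure')
    return 200

print(request_with_retry(flaky, 'https://www.runoob.com',
                         retry_time=5, retry_interval=0))  # 200
```

The actual framework retries automatically when retry is enabled; this snippet only mirrors the documented parameter meanings.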
Api
GET Status Code
response.code
GET Text
response.text
GET Content
response.content
GET Url
response.url
GET History
response.history
GET Headers
response.headers
GET Text Length
response.len
GET Lxml Xpath
response.xpath
XPath API
- text(): converts the XPath result to text
- date(): converts the XPath result to a date
- get(): extracts the first XPath result
- getall(): extracts all XPath results; the results also support the text() and date() methods
from PrSpider import PrSpiders

class Spider(PrSpiders):
    def start_requests(self, **kwargs):
        start_urls = "https://www.runoob.com"
        PrSpiders.Requests(url=start_urls, callback=self.parse)

    def parse(self, response):
        label = response.xpath("//div[@class='navto-nav']")
        label_text = response.xpath("//div[@class='navto-nav']").text()
        label_get = response.xpath("//div[@class='navto-nav']").get()
        label_getall = response.xpath("//div[@class='navto-nav']").getall()
        print(label)
        print(label_text)
        print(label_get)
        print(label_getall)

if __name__ == "__main__":
    Spider()
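The semantics the XPath API describes (first match vs. all matches vs. joined text) can be modeled with a small stand-in class; `XPathResult` is a hypothetical illustration of the documented behavior, not PrSpiders' real wrapper:

```python
class XPathResult:
    """Illustrative stand-in for an XPath result wrapper: get() returns
    the first match, getall() returns every match, text() joins the
    text content of all matches."""

    def __init__(self, matches):
        self.matches = list(matches)

    def get(self):
        # First match, or None when nothing matched
        return self.matches[0] if self.matches else None

    def getall(self):
        # Every match, as a list
        return list(self.matches)

    def text(self):
        # Concatenated, whitespace-trimmed text of all matches
        return ' '.join(m.strip() for m in self.matches)

links = XPathResult(['  Home ', 'Tutorials', 'About'])
print(links.get())
print(links.getall())
print(links.text())  # Home Tutorials About
```

In the real framework these methods operate on lxml nodes; this sketch only shows the shape of the return values.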
Please contact me if you find any bugs
email -> 1944542244@qq.com