A web crawler that supports custom extensions
Project description
Description
- A web crawler that supports custom extensions
Examples
from manc.plugins import UserAgentPlugin
from manc.spider import BaseSpider

url = 'https://blog.csdn.net/MarkAdc'

# 1. Basic spider
s1 = BaseSpider()
r1 = s1.goto(url)  # the response object supports XPath and CSS selectors directly
print(type(r1))
print(r1.request.headers)
print(r1.xpath("//title/text()").get())
print()

# 2. Standard spider, equivalent to the basic spider plus the UA plugin
s2 = BaseSpider()
s2.add_plugins([UserAgentPlugin()])
r2 = s2.goto(url)  # the request now carries a User-Agent header
print(type(r2))
print(r2.request.headers)
print(r2.xpath("//title/text()").get())
print()
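To make the "basic spider plus UA plugin" equivalence more concrete, here is a minimal, self-contained sketch of what a User-Agent plugin could do. It does not import manc and is not the library's implementation: `FakeRequest`, `SimpleUserAgentPlugin`, and the UA strings are hypothetical stand-ins, and only the hook name `deal_request` is borrowed from the plugin example in this README.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request object; not manc's class."""
    url: str
    headers: dict = field(default_factory=dict)


class SimpleUserAgentPlugin:
    """Hypothetical plugin: fills in a User-Agent header when none is set."""

    # Example UA strings, for illustration only.
    USER_AGENTS = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    )

    def deal_request(self, request):
        # Keep a caller-supplied User-Agent; otherwise use a randomly chosen one.
        request.headers.setdefault("User-Agent", random.choice(self.USER_AGENTS))
        return request


req = SimpleUserAgentPlugin().deal_request(FakeRequest("https://example.com"))
print(req.headers["User-Agent"])
```

Using `setdefault` means a header set explicitly by the caller is never overwritten, which is one reasonable design choice for a plugin of this kind.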
from manc import Spider
from manc.plugins import SpiderPlugin


class ProxyPlugin(SpiderPlugin):
    def deal_request(self, request):
        # Route both HTTP and HTTPS traffic through a local proxy
        proxy = 'http://127.0.0.1:1082'
        request.proxies = {"http": proxy, "https": proxy}
        # Attach a custom attribute to the request
        request.name = "cMan"

    def deal_response(self, response):
        return response


s = Spider()
s.add_plugin(ProxyPlugin())
url = 'http://www.baidu.com'
r = s.goto(url)
print(type(r), type(r.request))
print(r.request.name)
print(r.request.headers)
print(r.request.proxies)
print(r.get_one("//title/text()"))
print(r.get_all("//title/text()"))
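The plugin example relies on two hooks, `deal_request` and `deal_response`. As a rough mental model of how such hooks might be dispatched, the sketch below builds a tiny spider with a stubbed transport, so it runs without network access. Everything in it (`Request`, `Response`, `Plugin`, `MiniSpider`, `fake_fetch`) is a hypothetical stand-in written for illustration; it is an assumption about the general pattern, not manc's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    url: str
    headers: dict = field(default_factory=dict)
    proxies: dict = field(default_factory=dict)


@dataclass
class Response:
    request: Request
    text: str


class Plugin:
    """Hypothetical base class whose hooks do nothing by default."""

    def deal_request(self, request):
        return request

    def deal_response(self, response):
        return response


class MiniSpider:
    """Hypothetical spider that runs every plugin around a fetch function."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable: Request -> Response
        self.plugins = []

    def add_plugin(self, plugin):
        self.plugins.append(plugin)

    def goto(self, url):
        request = Request(url)
        for plugin in self.plugins:
            # A hook may mutate the request in place and return None.
            result = plugin.deal_request(request)
            if result is not None:
                request = result
        response = self.fetch(request)
        for plugin in self.plugins:
            result = plugin.deal_response(response)
            if result is not None:
                response = result
        return response


class ProxyPlugin(Plugin):
    def deal_request(self, request):
        proxy = "http://127.0.0.1:1082"
        request.proxies = {"http": proxy, "https": proxy}


def fake_fetch(request):
    # Stubbed transport: returns a canned page instead of hitting the network.
    return Response(request=request, text="<title>stub</title>")


spider = MiniSpider(fetch=fake_fetch)
spider.add_plugin(ProxyPlugin())
resp = spider.goto("http://www.baidu.com")
print(resp.request.proxies)
```

Treating a `None` return value as "keep the current object" lets a hook either mutate its argument in place or return a replacement, which matches how `ProxyPlugin.deal_request` is written in the example.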
Download files
Source Distribution
manc-0.1.0.tar.gz (3.5 kB)
File details
Details for the file manc-0.1.0.tar.gz.
File metadata
- Download URL: manc-0.1.0.tar.gz
- Upload date:
- Size: 3.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | defe04dc27543b84daaf5401a1822207c3684049bb8a9836a5f22959678bc9e5 |
| MD5 | 05969d5a556f4a8c229053735cecb4f5 |
| BLAKE2b-256 | 44532d6953634bc3a348c0783981377b37b3ab789bc627147d376e9220d47044 |