
A web crawler that supports custom extensions

Project description

Description

  • A web crawler that supports custom extensions

Examples

  • Basic demo
from manc.plugins import UserAgentPlugin
from manc.spider import BaseSpider, Spider

url = 'https://blog.csdn.net/MarkAdc'

# 1. Base spider
s1 = BaseSpider()
r1 = s1.goto(url)  # The response object supports XPath and CSS selectors directly
print(type(r1))
print(r1.request.headers)
print(r1.xpath("//title/text()").get())
print()

# 2. Base spider + UA plugin
s2 = BaseSpider()
s2.add_plugins([UserAgentPlugin()])
r2 = s2.goto(url)  # The request now carries a User-Agent header
print(type(r2))
print(r2.request.headers)
print(r2.xpath("//title/text()").get())
print()

# 3. Standard spider, equivalent to the base spider + UA plugin
s3 = Spider()
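
The last comment above says the standard spider is equivalent to the base spider plus the UA plugin. One way such an equivalence could be arranged is sketched below with stand-in classes. `MiniBaseSpider`, `MiniSpider`, `MiniUserAgentPlugin`, and the `build_headers` method are hypothetical names used only for illustration; they are not manc's actual implementation, and no network request is made.

```python
# Hypothetical stand-ins; manc's real classes may be built differently.


class MiniUserAgentPlugin:
    """Adds a User-Agent header if the request does not already have one."""

    def __init__(self, user_agent="Mozilla/5.0 (sketch)"):
        self.user_agent = user_agent

    def deal_request(self, headers):
        headers.setdefault("User-Agent", self.user_agent)


class MiniBaseSpider:
    """Keeps a list of plugins and applies them when building headers."""

    def __init__(self):
        self.plugins = []

    def add_plugin(self, plugin):
        self.plugins.append(plugin)

    def build_headers(self):
        headers = {}
        for plugin in self.plugins:
            plugin.deal_request(headers)
        return headers


class MiniSpider(MiniBaseSpider):
    """A base spider with the UA plugin installed up front."""

    def __init__(self):
        super().__init__()
        self.add_plugin(MiniUserAgentPlugin())


base = MiniBaseSpider()
base.add_plugin(MiniUserAgentPlugin())
standard = MiniSpider()

# Compare the headers produced by the two construction routes.
print(base.build_headers() == standard.build_headers())
```

In this sketch the subclass only pre-installs a plugin, so anything the standard spider does can also be reproduced by configuring the base spider by hand.
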
  • Custom extension demo
from manc import Spider
from manc.plugins import SpiderPlugin


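class ProxyPlugin(SpiderPlugin):
    def deal_request(self, request):
        # Route the request through a local proxy
        proxy = 'http://127.0.0.1:1082'
        request.proxies = {"http": proxy, "https": proxy}
        # Attach a custom attribute to the request
        request.name = "cMan"

    def deal_response(self, response):
        # Return the response unchanged
        return response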


s = Spider()
s.add_plugin(ProxyPlugin())

url = 'http://www.baidu.com'
r = s.goto(url)
print(type(r), type(r.request))
print(r.request.name)
print(r.request.headers)
print(r.request.proxies)
print(r.get_one("//title/text()"))
print(r.get_all("//title/text()"))
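
The demo above relies on two hooks, `deal_request` and `deal_response`. A minimal, self-contained sketch of how a spider might call such hooks around a fetch is shown below. Everything prefixed with `Sketch`, plus `UpperCasePlugin` and `fake_transport`, is a hypothetical name invented for this illustration; the hook order and the in-memory transport are assumptions, not a description of manc's internals.

```python
from dataclasses import dataclass, field


@dataclass
class SketchRequest:
    url: str
    headers: dict = field(default_factory=dict)
    proxies: dict = field(default_factory=dict)


@dataclass
class SketchResponse:
    request: SketchRequest
    text: str


class SketchPlugin:
    """Base class: both hooks do nothing by default."""

    def deal_request(self, request):
        pass

    def deal_response(self, response):
        return response


class SketchProxyPlugin(SketchPlugin):
    """Sets the same proxy for http and https on every request."""

    def __init__(self, proxy):
        self.proxy = proxy

    def deal_request(self, request):
        request.proxies = {"http": self.proxy, "https": self.proxy}


class UpperCasePlugin(SketchPlugin):
    """Rewrites the response body, to show a response hook at work."""

    def deal_response(self, response):
        response.text = response.text.upper()
        return response


class SketchSpider:
    def __init__(self, transport):
        self.transport = transport  # callable: SketchRequest -> str
        self.plugins = []

    def add_plugin(self, plugin):
        self.plugins.append(plugin)

    def goto(self, url):
        request = SketchRequest(url)
        # Assumed order: every request hook runs before the fetch ...
        for plugin in self.plugins:
            plugin.deal_request(request)
        response = SketchResponse(request, self.transport(request))
        # ... and every response hook runs after it.
        for plugin in self.plugins:
            response = plugin.deal_response(response)
        return response


def fake_transport(request):
    # Stand-in for a real HTTP call, so the sketch runs offline.
    return "<title>hello</title>"


spider = SketchSpider(fake_transport)
spider.add_plugin(SketchProxyPlugin("http://127.0.0.1:1082"))
spider.add_plugin(UpperCasePlugin())
result = spider.goto("http://example.invalid")
print(result.request.proxies)
print(result.text)
```

Passing the transport in as a callable keeps the sketch testable without network access; a real spider would perform an HTTP request at that step.
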
