tiny web crawler

Project description

Biu

A tiny web crawler framework

Features

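  • Requires Python 3.10 or later.
  • Concurrency is built on Gevent, so you must import biu at the very start of your script, or apply the monkey patch yourself.
  • HTTP requests are built on Requests; request and response parameters are largely compatible with those of Requests.
  • Page parsing is built on Parsel, so selectors work the same way as in Scrapy.
  • It is essentially a stripped-down Scrapy, and its usage is very similar.
  • For more advanced features, read the source code and explore on your own.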
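
If you choose to monkey patch yourself rather than rely on `import biu` coming first, the ordering matters: the patch has to run before any module that uses sockets is imported. The following is a minimal sketch using gevent's own `monkey` module, assuming gevent is installed; it does not show what biu does internally.

```python
# Assumption: gevent is installed (biu's concurrency is built on it).
from gevent import monkey

# Patch the standard library first, before importing anything that
# opens sockets (for example requests).
monkey.patch_all()

import socket  # imported after patching, so it is the cooperative version

# is_module_patched reports whether gevent has replaced a given module.
patched = monkey.is_module_patched("socket")
print("socket patched:", patched)
```

Placing `import biu` on the first line of the script, as the example below does, makes this manual step unnecessary.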

Installation

pip install biu

Example

import biu  # Must be the first import: importing biu applies the gevent monkey patch.


class MySpider(biu.Project):
    def start_requests(self):
        for i in range(0, 301, 30):
            # Returning or yielding a biu.Request fetches a page; its parameters are largely compatible with those of requests.
            yield biu.Request(url="https://www.douban.com/group/explore/tech?start={}".format(i),
                              method="GET",
                              headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"},
                              callback=self.parse)

    def parse(self, resp):
        # biu.Response is much like the requests one, with a few selectors added.
        for item in resp.xpath('//*[@id="content"]/div/div[1]/div[1]/div'):
            yield {
                "title": item.xpath("div[2]/h3/a/text()").extract_first(),
                "url": item.xpath("div[2]/h3/a/@href").extract_first(),
                "abstract": item.css("p::text").extract_first()
            }
            # Returning or yielding a dict passes it to result_handler as a result.


    def result_handler(self, rv):
        print("get result:", rv)
        # Save your results here.

biu.run(MySpider(concurrent=3, interval=0.2, max_retry=5))
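
The example leaves persistence in `result_handler` up to you. One simple option is to append each result dict to a JSON Lines file. The sketch below uses only the standard library; `append_jsonl`, `read_jsonl`, and the demo records are hypothetical names and data for illustration, not part of biu.

```python
import json
import os
import tempfile


def append_jsonl(path, record):
    """Append one result dict to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII titles readable in the file.
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load every record back from a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Made-up records shaped like the dicts yielded by parse() above.
demo_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
append_jsonl(demo_path, {"title": "示例", "url": "https://example.com/1", "abstract": None})
append_jsonl(demo_path, {"title": "Second", "url": "https://example.com/2", "abstract": "text"})

rows = read_jsonl(demo_path)
print(len(rows), "records stored")
```

Inside the spider, `result_handler` could then call `append_jsonl("results.jsonl", rv)`. Selectors that match nothing return `None` from `extract_first()`, which is stored as JSON `null`.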
