
A Lightweight Spider (Web Crawler) System in Python


genius_lite

A lightweight crawler system built on top of the Python requests library.

Installation

pip install genius_lite

Usage

from genius_lite import GeniusLite

class MySpider(GeniusLite):

    def start_requests(self):
        yield self.crawl('https://www.google.com', self.parse_google_page)

    def parse_google_page(self, response):
        print(response.text)
        detail_urls = [...]
        for url in detail_urls:
            yield self.crawl(url, self.parse_detail_page)

    def parse_detail_page(self, response):
        ...

if __name__ == '__main__':
    my_spider = MySpider()
    my_spider.run()

start_requests

The entry point for all crawl requests. Every spider subclass must override this method to generate the initial request seeds.

from genius_lite import GeniusLite

class MySpider(GeniusLite):

    def start_requests(self):
        yield self.crawl(url='https://www.google.com', parser=self.parse_func)
    
    def parse_func(self, response):
        print(response.text)
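
To make the seed-and-callback flow concrete, here is a minimal, self-contained sketch of how a generator-driven crawl loop could work. It is not genius_lite's actual scheduler: `MiniSpider`, `Seed`, `FakeResponse`, and the in-memory `pages` dictionary are all hypothetical stand-ins so the example runs without network access.

```python
from collections import deque


class Seed:
    """A pending request: a URL plus the callback that parses its response."""

    def __init__(self, url, parser):
        self.url = url
        self.parser = parser


class FakeResponse:
    """Stand-in for requests.Response so the sketch needs no network."""

    def __init__(self, url, text):
        self.url = url
        self.text = text


class MiniSpider:
    # Hypothetical in-memory "web": URL -> page body.
    pages = {
        'https://example.com/': 'index',
        'https://example.com/a': 'page a',
        'https://example.com/b': 'page b',
    }

    def __init__(self):
        self.visited = []

    def crawl(self, url, parser):
        # Only describes the request; nothing is fetched here.
        return Seed(url, parser)

    def start_requests(self):
        yield self.crawl('https://example.com/', self.parse_index)

    def parse_index(self, response):
        # A generator parser: yields follow-up seeds.
        for path in ('a', 'b'):
            yield self.crawl(response.url + path, self.parse_detail)

    def parse_detail(self, response):
        # A plain parser: consumes the response and returns None.
        self.visited.append((response.url, response.text))

    def run(self):
        queue = deque(self.start_requests())
        while queue:
            seed = queue.popleft()
            response = FakeResponse(seed.url, self.pages[seed.url])
            result = seed.parser(response)
            # Generator parsers hand back more seeds; plain ones return None.
            if result is not None:
                queue.extend(result)


spider = MiniSpider()
spider.run()
print(spider.visited)
```

The point of the pattern is that `start_requests` and the parsers only describe work by yielding seeds, while a single loop decides when each request is actually performed.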

self.crawl

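Yield the result of this method to generate a request seed. Several of the parameters are passed through to requests; see the requests documentation for details.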

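  • url: the request URL
  • parser: the response-parsing function; it receives the response object as its argument
  • method: (default='GET') the HTTP request method
  • params: (optional) query parameters
  • data: (optional) POST request data
  • headers: (optional) request headers
  • payload: (optional) data carried through to the response-parsing function, readable as response.payload
  • encoding: (optional) the encoding to set on the response
  • unique: (default=True) whether the request must be unique; when True, requests with identical url, method, params, and data are filtered out
  • kwargs: (optional) the following keyword arguments are supported: cookies, files, json, auth, hooks, timeout, verify, stream, cert, allow_redirects, proxies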
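
The `unique` option filters requests by url, method, params, and data. One way such a filter could be built is sketched below, assuming a hash-based fingerprint kept in an in-memory set. The names `request_fingerprint` and `SeenFilter` are hypothetical, and genius_lite's real implementation may differ.

```python
import hashlib
import json


def request_fingerprint(url, method='GET', params=None, data=None):
    """Hash the four fields named above: url, method, params, data."""
    # sort_keys makes {'a': 1, 'b': 2} and {'b': 2, 'a': 1} hash the same.
    canonical = json.dumps(
        [url, method.upper(), params, data],
        sort_keys=True,
        default=str,
    )
    return hashlib.sha1(canonical.encode('utf-8')).hexdigest()


class SeenFilter:
    """Remembers fingerprints of requests that were already accepted."""

    def __init__(self):
        self._seen = set()

    def should_crawl(self, url, method='GET', params=None, data=None,
                     unique=True):
        if not unique:
            # Non-unique requests always pass and are not recorded.
            return True
        fingerprint = request_fingerprint(url, method, params, data)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


seen = SeenFilter()
first = seen.should_crawl('https://example.com', params={'q': 'a', 'page': 1})
repeat = seen.should_crawl('https://example.com', params={'page': 1, 'q': 'a'})
forced = seen.should_crawl('https://example.com', params={'page': 1, 'q': 'a'},
                           unique=False)
print(first, repeat, forced)
```

Under this scheme, passing `unique=False` is the way to re-fetch a page on purpose, for example when polling a URL whose content changes.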

response

See requests.Response.

GeniusLite config

from genius_lite import GeniusLite

class MySpider(GeniusLite):
    spider_name = 'MySpider'
    spider_config = {'timeout': 15}
    log_config = {'output': '/absolute/path'}

    ...

spider_name

The spider's name. If not set, it defaults to the name of the spider subclass being run.

spider_config

name       | type              | default
————————————————————————————————————————————
timeout    | num or (num, num) | 10

Global spider settings.
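
The two accepted shapes of timeout match the requests convention: a single number applies to both the connect and the read phase, while a pair is read as (connect, read). The helper below is only an illustrative sketch of that convention. `split_timeout` is a hypothetical name and not part of genius_lite; its default of 10 mirrors the table above.

```python
def split_timeout(timeout=10):
    """Return the timeout as a (connect, read) pair."""
    if isinstance(timeout, (int, float)):
        # One number covers both the connect and the read phase.
        return (timeout, timeout)
    # Otherwise expect exactly two values: connect first, read second.
    connect, read = timeout
    return (connect, read)


print(split_timeout())         # default from the table
print(split_timeout(15))       # e.g. spider_config = {'timeout': 15}
print(split_timeout((3, 30)))  # e.g. spider_config = {'timeout': (3, 30)}
```

A pair is useful when a server accepts connections quickly but is slow to send the body, since the read limit can then be raised without loosening the connect limit.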

log_config

name       | type              | default
————————————————————————————————————————————
enable     | bool              | False
level      | str               | 'DEBUG'
output     | str               | None

Logging configuration.
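
As a rough illustration of what these three options could control, the sketch below maps them onto Python's standard logging module. It rests on assumptions: `build_logger` is a hypothetical helper, and `output` is treated here as a file path, whereas genius_lite may interpret it differently (for example as a directory).

```python
import logging

# Defaults copied from the table above.
DEFAULT_LOG_CONFIG = {'enable': False, 'level': 'DEBUG', 'output': None}


def build_logger(name, log_config=None):
    """Build a logger from a log_config-style dict (illustrative only)."""
    config = {**DEFAULT_LOG_CONFIG, **(log_config or {})}
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False  # keep records away from the root logger

    if not config['enable']:
        # Logging is off: swallow every record.
        logger.addHandler(logging.NullHandler())
        return logger

    logger.setLevel(getattr(logging, config['level'].upper()))
    if config['output'] is None:
        handler = logging.StreamHandler()  # writes to stderr
    else:
        # Assumption: 'output' names a log file.
        handler = logging.FileHandler(config['output'], encoding='utf-8')
    handler.setFormatter(
        logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s')
    )
    logger.addHandler(handler)
    return logger


logger = build_logger('MySpider', {'enable': True, 'level': 'INFO'})
logger.info('spider started')
```

Because enable defaults to False, a spider that sets no log_config stays silent, and turning logging on is an explicit opt-in.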
