
A Lightweight Spider (Web Crawler) System in Python


genius_lite

A lightweight crawler system built on top of the Python requests library.

Installation

pip install genius_lite

Usage

from genius_lite import GeniusLite

class MySpider(GeniusLite):

    def start_requests(self):
        yield self.crawl('https://www.google.com', self.parse_google_page)

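    def parse_google_page(self, response):
        print(response.text)
        # Placeholder: collect the detail-page URLs extracted from the response.
        detail_urls = [...]
        for url in detail_urls:
            yield self.crawl(url, self.parse_detail_page)

    def parse_detail_page(self, response):
        # Placeholder: parse the detail page here.
        ...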

if __name__ == '__main__':
    my_spider = MySpider()
    my_spider.run()

start_requests

The entry point for all crawl requests. Every spider subclass must override this method to generate the seed requests.

from genius_lite import GeniusLite

class MySpider(GeniusLite):

    def start_requests(self):
        yield self.crawl(url='https://www.google.com', parser=self.parse_func)
    
    def parse_func(self, response):
        print(response.text)
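
To make the seed-and-parser flow concrete, here is a minimal, library-independent sketch of how a generator-driven crawl loop can work. It is not genius_lite's actual implementation: Seed, run_spider, and fake_fetch are hypothetical names, and the fetch step is a stub that returns canned text instead of issuing an HTTP request.

```python
from collections import deque
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable


@dataclass
class Seed:
    """Hypothetical stand-in for what self.crawl(...) produces."""
    url: str
    parser: Callable


def fake_fetch(url):
    # Stub: a real crawler would issue an HTTP request here.
    return SimpleNamespace(url=url, text=f"<html>{url}</html>")


def run_spider(start_requests, fetch=fake_fetch):
    """Drain seeds in FIFO order; parsers may yield further seeds."""
    queue = deque(start_requests())
    visited = []
    while queue:
        seed = queue.popleft()
        response = fetch(seed.url)
        visited.append(seed.url)
        # A parser may be a generator (yields more seeds) or return None.
        for new_seed in seed.parser(response) or ():
            queue.append(new_seed)
    return visited


def parse_detail(response):
    # Leaf parser: consumes the response and yields nothing.
    return None


def parse_index(response):
    # Yield two follow-up seeds derived from the index page URL.
    for n in (1, 2):
        yield Seed(f"{response.url}/item/{n}", parse_detail)


def start_requests():
    yield Seed("https://example.com", parse_index)


visited = run_spider(start_requests)
```

The point of the sketch is that start_requests and every parser share one contract: each may yield new seeds, and the loop keeps going until no seeds remain.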

self.crawl

Yield the result of this method to generate a crawl request seed. For some of the parameters, see the requests documentation.

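  • url: the request URL
  • parser: the response-parsing function; it receives the response object as its argument
  • method: (default='GET') the HTTP request method
  • params: (optional) query-string parameters
  • data: (optional) POST request body
  • headers: (optional) request headers
  • payload: (optional) data carried through to the response-parsing function, read as response.payload
  • encoding: (optional) the encoding to set on the response
  • unique: (default=True) whether the request must be unique; when True, duplicate requests are filtered out based on url, method, params, and data
  • kwargs: (optional) the following keyword arguments are supported: cookies, files, json, auth, hooks, timeout, verify, stream, cert, allow_redirects, proxies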
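
The unique option filters duplicates by url, method, params, and data. One common way to build such a filter is to hash those four fields into a fingerprint and keep the fingerprints in a set. The sketch below shows that technique using only the standard library; fingerprint and SeenFilter are hypothetical names, and this is an assumption about how such a filter could work, not genius_lite's own code.

```python
import hashlib
import json


def fingerprint(url, method="GET", params=None, data=None):
    """Hash the four fields that identify a request."""
    key = json.dumps(
        [url, method.upper(), params or {}, data or {}],
        sort_keys=True,   # make dict key order irrelevant
        default=str,      # fall back to str() for non-JSON values
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remember fingerprints and report whether a request is new."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url, method="GET", params=None, data=None, unique=True):
        if not unique:
            return True  # unique=False bypasses the filter entirely
        fp = fingerprint(url, method, params, data)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True


seen = SeenFilter()
first = seen.is_new("https://example.com", params={"a": 1, "b": 2})
repeat = seen.is_new("https://example.com", params={"b": 2, "a": 1})
forced = seen.is_new("https://example.com", params={"a": 1, "b": 2}, unique=False)
```

Here first is accepted, repeat is rejected because only the key order of params differs, and forced is accepted because unique=False skips the check.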

response

See requests.Response.

GeniusLite config

from genius_lite import GeniusLite

class MySpider(GeniusLite):
    spider_name = 'MySpider'
    spider_config = {'timeout': 15}
    log_config = {'output': '/absolute/path'}

    ...

spider_name

The name of the spider. If not set, it defaults to the class name of the running spider subclass.
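
A class-name fallback like this is usually a one-line check in the base class constructor. The following is a minimal sketch of that pattern; BaseSpider, NewsSpider, and NamedSpider are hypothetical classes used only for illustration and do not reflect genius_lite's internals.

```python
class BaseSpider:
    """Hypothetical base class; not genius_lite's real code."""
    spider_name = None

    def __init__(self):
        # Fall back to the concrete subclass name when none is set.
        if not self.spider_name:
            self.spider_name = type(self).__name__


class NewsSpider(BaseSpider):
    pass


class NamedSpider(BaseSpider):
    spider_name = "custom-name"


default_name = NewsSpider().spider_name
explicit_name = NamedSpider().spider_name
```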

spider_config

name       | type              | default
————————————————————————————————————————————
timeout    | num or (num, num) | 10

Global spider settings.
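
The type num or (num, num) matches the timeout convention of requests, where a single number applies to both the connect and the read phase and a pair sets them separately. Assuming the value is handed to requests unchanged (the source does not say so explicitly), the two forms can be normalized as sketched below. normalize_timeout is a hypothetical helper, not part of genius_lite.

```python
def normalize_timeout(timeout=10):
    """Return a (connect, read) tuple from a number or a 2-item pair."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(timeout, (int, float)) and not isinstance(timeout, bool):
        return (float(timeout), float(timeout))
    if isinstance(timeout, (tuple, list)) and len(timeout) == 2:
        connect, read = timeout
        return (float(connect), float(read))
    raise TypeError("timeout must be a number or a (connect, read) pair")


default_pair = normalize_timeout()         # uses the documented default of 10
split_pair = normalize_timeout((3, 15))    # 3 s to connect, 15 s to read
```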

log_config

name       | type              | default
————————————————————————————————————————————
enable     | bool              | False
level      | str               | 'DEBUG'
output     | str               | None

Logging configuration.
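
To show how the three options could interact, here is a sketch that maps a dict of this shape onto the standard logging module. It rests on assumptions the source does not confirm: build_logger is a hypothetical helper, output is treated as a directory, and the log file name and message format are arbitrary choices. genius_lite may behave differently.

```python
import logging
import os
import tempfile

# Defaults taken from the table above.
LOG_DEFAULTS = {"enable": False, "level": "DEBUG", "output": None}


def build_logger(name, log_config=None):
    """Create a logger from a dict shaped like log_config."""
    config = {**LOG_DEFAULTS, **(log_config or {})}
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    if not config["enable"]:
        # Disabled: swallow every record.
        logger.addHandler(logging.NullHandler())
        return logger
    logger.setLevel(config["level"])  # accepts names such as 'DEBUG'
    if config["output"] is None:
        handler = logging.StreamHandler()
    else:
        # Assumption: 'output' is a directory; the file name is our choice.
        handler = logging.FileHandler(
            os.path.join(config["output"], name + ".log"), encoding="utf-8"
        )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


with tempfile.TemporaryDirectory() as log_dir:
    logger = build_logger(
        "MySpider", {"enable": True, "level": "INFO", "output": log_dir}
    )
    logger.info("spider started")
    logger.debug("hidden at INFO level")
    for handler in logger.handlers:
        handler.close()  # release the file before the directory is removed
    with open(os.path.join(log_dir, "MySpider.log"), encoding="utf-8") as f:
        log_text = f.read()
```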


