
A Lightweight Spider (Web Crawler) System in Python


genius_lite

A lightweight crawler system built on the Python requests library

Installation

pip install genius_lite

Usage

from genius_lite import GeniusLite

class MySpider(GeniusLite):
    spider_name = 'MySpider'  # spider name; defaults to the class name if not set
    spider_config = {'timeout': 15}
    log_config = {'output': '/absolute/path'}
    
    def start_requests(self):
        pages = [1, 2, 3, 4]
        for page in pages:
            yield self.crawl(
                'http://xxx/list',
                self.parse_list_page,
                params={'page': page}
            )

    def parse_list_page(self, response):
        print(response.text)
        ... # do something
        detail_urls = [...]
        for url in detail_urls:
            yield self.crawl(
                url,
                self.parse_detail_page,
                payload='some data'
            )

    def parse_detail_page(self, response):
        print(response.payload) # output: some data
        ... # do something


my_spider = MySpider()
my_spider.run()

spider_config

name       | type              | default
————————————————————————————————————————————
timeout    | num or (num, num) | 10
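
The `num or (num, num)` type follows the requests convention: a single number applies to both the connect and read phases, while a pair sets them separately. The following is a minimal sketch of how such a value could be normalized. `normalize_timeout` is a hypothetical helper written for illustration and is not part of genius_lite.

```python
from numbers import Real


def normalize_timeout(timeout=10):
    """Return a (connect, read) tuple from a number or a pair of numbers.

    Hypothetical helper for illustration; not genius_lite's own code.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(timeout, Real) and not isinstance(timeout, bool):
        return (float(timeout), float(timeout))
    if isinstance(timeout, (tuple, list)) and len(timeout) == 2:
        connect, read = timeout
        return (float(connect), float(read))
    raise ValueError('timeout must be a number or a (connect, read) pair')


print(normalize_timeout(15))          # (15.0, 15.0)
print(normalize_timeout((3.05, 27)))  # (3.05, 27.0)
```

In a spider this would correspond to `spider_config = {'timeout': 15}` or `spider_config = {'timeout': (3.05, 27)}`.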

log_config

name       | type              | default
————————————————————————————————————————————
enable     | bool              | False
level      | str               | 'DEBUG'
format     | str               | '[%(levelname)s] %(asctime)s -> %(filename)s (line:%(lineno)d) -> %(name)s: %(message)s'
output     | str               | None
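
To make the four options concrete, here is a rough sketch of how a dict of this shape could be mapped onto Python's standard `logging` module. `build_logger` and `DEFAULT_LOG_CONFIG` are hypothetical names, and the sketch assumes that `output` is a directory in which a `<name>.log` file is created; genius_lite itself may interpret these options differently.

```python
import logging
import os

# Defaults copied from the log_config table above.
DEFAULT_LOG_CONFIG = {
    'enable': False,
    'level': 'DEBUG',
    'format': '[%(levelname)s] %(asctime)s -> %(filename)s '
              '(line:%(lineno)d) -> %(name)s: %(message)s',
    'output': None,
}


def build_logger(name, log_config=None):
    """Hypothetical helper: build a stdlib logger from a log_config-style dict."""
    config = {**DEFAULT_LOG_CONFIG, **(log_config or {})}
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.propagate = False
    if not config['enable']:
        # Logging disabled: swallow every record silently.
        logger.addHandler(logging.NullHandler())
        return logger
    logger.setLevel(getattr(logging, config['level'].upper()))
    if config['output']:
        # Assumption: `output` is a directory; write <name>.log inside it.
        os.makedirs(config['output'], exist_ok=True)
        path = os.path.join(config['output'], name + '.log')
        handler = logging.FileHandler(path, encoding='utf-8')
    else:
        handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(config['format']))
    logger.addHandler(handler)
    return logger


logger = build_logger('MySpider', {'enable': True, 'level': 'INFO'})
logger.info('spider started')
```

Merging the user's dict over the defaults means a spider only needs to list the options it changes, which matches the usage example where `log_config` sets `output` alone.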

start_requests

The entry point for all crawl requests. Every spider subclass must override this method.

def start_requests(self):
    yield self.crawl(url='http://...', parser=self.parse_func)

def parse_func(self, response):
    print(response.text)
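
The pattern above, where `start_requests` and parser callbacks yield further requests, can be illustrated with a toy, offline scheduler. This is only a sketch of the general generator-driven technique: `TinySpider`, its `fetch` stub, and the `example.com` URLs are all hypothetical, and genius_lite's actual scheduling may work differently.

```python
from collections import deque


class TinySpider:
    """Toy stand-in for a spider base class; not genius_lite's code."""

    def crawl(self, url, parser):
        # In this sketch a "seed" is just a (url, parser) pair.
        return (url, parser)

    def fetch(self, url):
        # Stand-in for an HTTP request so the sketch runs offline.
        return 'fake body of ' + url

    def start_requests(self):
        raise NotImplementedError('subclasses must override start_requests')

    def run(self):
        queue = deque(self.start_requests())
        while queue:
            url, parser = queue.popleft()
            result = parser(self.fetch(url))
            # A parser is either a generator of new seeds or returns None.
            if result is not None:
                queue.extend(result)


class Demo(TinySpider):
    def __init__(self):
        self.seen = []

    def start_requests(self):
        yield self.crawl('http://example.com/list', self.parse_list)

    def parse_list(self, text):
        self.seen.append(text)
        yield self.crawl('http://example.com/detail/1', self.parse_detail)

    def parse_detail(self, text):
        self.seen.append(text)


demo = Demo()
demo.run()
print(demo.seen)
```

Because every callback is a generator, new requests are produced lazily and the run ends naturally once no parser yields anything further.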

self.crawl

Configures the seed for a request that is about to be crawled.

def crawl(self, url, parser, method='GET', data=None, params=None,
          headers=None, payload=None, encoding=None, **kwargs):
    """Configure the seed for a request that is about to be crawled.

    :param url: URL for the new :class:`Request` object.
    :param parser: a callback function to handle response
    :param method: (optional) method for the new :class:`Request` object,
        default 'GET'
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param params: (optional) Dictionary or bytes to be sent in the query
        string for the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the
        :class:`Request`.
    :param payload: (optional) the payload data to the parser function
    :param encoding: (optional) set response encoding
    
    :param kwargs:
        :param cookies: (optional) Dict or CookieJar object to send with the
            :class:`Request`.
        :param files: (optional) Dictionary of ``'filename': file-like-objects``
            for multipart encoding upload.
        :param auth: (optional) Auth tuple or callable to enable
            Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How long to wait for the server to send
            data before giving up, as a float, or a :ref:`(connect timeout,
            read timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Set to True by default.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol or protocol and
            hostname to the URL of the proxy.
        :param stream: (optional) whether to immediately download the response
            content. Defaults to ``False``.
        :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``. When set to
            ``False``, requests will accept any TLS certificate presented by
            the server, and will ignore hostname mismatches and/or expired
            certificates, which will make your application vulnerable to
            man-in-the-middle (MitM) attacks. Setting verify to ``False`` 
            may be useful during local development or testing.
        :param cert: (optional) if String, path to ssl client cert file (.pem).
            If Tuple, ('cert', 'key') pair.
    :return: Seed
    """
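
The signature separates arguments that describe the HTTP request (`method`, `data`, `params`, `headers`, and the extra `kwargs`) from arguments meant for the parser side (`payload`, `encoding`). As a sketch of that split, the hypothetical `Seed` dataclass below stores the arguments and exposes only the request-related ones. This is an assumption about what a seed might hold, not genius_lite's real `Seed` class, and the standalone `crawl` function here merely mirrors the documented signature.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Seed:
    """Hypothetical seed record; genius_lite's real Seed may differ."""
    url: str
    parser: Callable
    method: str = 'GET'
    data: Any = None
    params: Optional[dict] = None
    headers: Optional[dict] = None
    payload: Any = None            # handed to the parser, never sent over HTTP
    encoding: Optional[str] = None
    kwargs: Dict[str, Any] = field(default_factory=dict)

    def request_kwargs(self):
        """Keyword arguments shaped for requests.request(method, url, **kw)."""
        base = {'data': self.data, 'params': self.params,
                'headers': self.headers}
        return {**base, **self.kwargs}


def crawl(url, parser, method='GET', data=None, params=None,
          headers=None, payload=None, encoding=None, **kwargs):
    # Mirrors the documented signature; returns the hypothetical Seed.
    return Seed(url, parser, method, data, params, headers,
                payload, encoding, kwargs)


seed = crawl('http://example.com/list', print, params={'page': 1},
             payload='some data', timeout=(3, 10))
print(seed.method, seed.request_kwargs())
print(seed.payload)
```

Keeping `payload` out of `request_kwargs()` reflects the documented behaviour that it is data for the parser function rather than part of the request.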

response

The Response object from the requests library, with an additional payload attribute carrying the value passed to the crawl method.

