A Lightweight Spider (Web Crawler) System in Python

genius_lite

A lightweight web crawler system built on the Python requests library.

Installation

pip install genius_lite

Usage

from genius_lite import GeniusLite

class MySpider(GeniusLite):
    spider_name = 'MySpider' # spider name; defaults to the class name if not set
    spider_config = {'timeout': 15}
    log_config = {'output': '/absolute/path'}
    
    def start_requests(self):
        pages = [1, 2, 3, 4]
        for page in pages:
            yield self.crawl(
                'http://xxx/list',
                self.parse_list_page,
                params={'page': page}
            )

    def parse_list_page(self, response):
        print(response.text)
        ... # do something
        detail_urls = [...]
        for url in detail_urls:
            yield self.crawl(
                url,
                self.parse_detail_page,
                payload='some data'
            )

    def parse_detail_page(self, response):
        print(response.payload) # output: some data
        ... # do something


my_spider = MySpider()
my_spider.run()

spider_config

name       | type              | default
————————————————————————————————————————————
timeout    | num or (num, num) | 10
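
The `num or (num, num)` type matches the requests convention, in which a single number applies to both the connect and the read timeout and a tuple sets the two separately. A minimal sketch of that interpretation follows; `normalize_timeout` is a hypothetical helper for illustration only, not part of genius_lite, and it assumes the value is handed to requests unchanged.

```python
from typing import Tuple, Union

Number = Union[int, float]


def normalize_timeout(
    timeout: Union[Number, Tuple[Number, Number]]
) -> Tuple[Number, Number]:
    """Return an explicit (connect, read) pair for a spider_config timeout.

    Hypothetical helper: it only illustrates how the two accepted shapes
    relate to each other.
    """
    if isinstance(timeout, tuple):
        connect, read = timeout  # raises ValueError unless exactly two items
        return connect, read
    # A single number is used for both the connect and the read timeout.
    return timeout, timeout


print(normalize_timeout(15))       # single value, as in the usage example
print(normalize_timeout((3, 10)))  # separate connect and read timeouts
```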

log_config

name       | type              | default
————————————————————————————————————————————
enable     | bool              | False
level      | str               | 'DEBUG'
output     | str               | None
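
The usage example passes only `output`, which suggests that omitted keys fall back to the defaults in the table. A minimal sketch of that layering, assuming a plain dictionary merge; `LOG_DEFAULTS` and `resolve_log_config` are hypothetical names, and the library's actual merging logic may differ.

```python
# Defaults copied from the log_config table above.
LOG_DEFAULTS = {"enable": False, "level": "DEBUG", "output": None}


def resolve_log_config(user_config=None):
    """Overlay user-supplied keys on the documented defaults (assumed behavior)."""
    user_config = user_config or {}
    unknown = set(user_config) - set(LOG_DEFAULTS)
    if unknown:
        # Rejecting unknown keys is a choice made for this sketch only.
        raise KeyError(f"unknown log_config keys: {sorted(unknown)}")
    return {**LOG_DEFAULTS, **user_config}


print(resolve_log_config({"output": "/absolute/path"}))
```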

start_requests

The entry point for all crawl requests; every spider subclass must override this method.

def start_requests(self):
    yield self.crawl(url='http://...', parser=self.parse_func)

def parse_func(self, response):
    print(response.text)
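
The pattern above is callback-driven: `start_requests` yields seeds, and each parser may yield further seeds. The following is a minimal sketch of how such a loop could be driven using only the standard library. `Seed`, `fake_fetch`, and `run` are hypothetical stand-ins rather than genius_lite internals, the FIFO ordering is an assumption, the URL is a placeholder, and no network access takes place.

```python
from collections import deque
from types import SimpleNamespace


class Seed:
    """Hypothetical stand-in for the object returned by self.crawl."""

    def __init__(self, url, parser):
        self.url = url
        self.parser = parser


def fake_fetch(url):
    """Pretend to download a page; returns a response-like object."""
    return SimpleNamespace(url=url, text=f"<html>{url}</html>")


def run(start_requests):
    """Drain seeds in FIFO order, queueing any seeds the parsers yield."""
    visited = []
    queue = deque(start_requests())
    while queue:
        seed = queue.popleft()
        response = fake_fetch(seed.url)
        visited.append(seed.url)
        # A parser may be a generator (yields seeds) or return None.
        for new_seed in seed.parser(response) or ():
            queue.append(new_seed)
    return visited


def parse_detail(response):
    print("detail:", response.text)


def parse_list(response):
    yield Seed(response.url + "/item-1", parse_detail)
    yield Seed(response.url + "/item-2", parse_detail)


def start_requests():
    yield Seed("http://example.invalid/list", parse_list)


visited = run(start_requests)
```

Because parsers are generators, a list page can fan out into any number of detail requests without the caller needing to know the depth of the crawl in advance.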

self.crawl

Configures the seed for a request that is about to be crawled.

def crawl(self, url, parser, method='GET', data=None, params=None,
          headers=None, payload=None, encoding=None, **kwargs):
    """Configure the seed for an upcoming crawl request.

    :param url: URL for the new :class:`Request` object.
    :param parser: a callback function to handle response
    :param method: (optional) method for the new :class:`Request` object,
        default 'GET'
    :param data: (optional) Dictionary, list of tuples, bytes, or file-like
        object to send in the body of the :class:`Request`.
    :param params: (optional) Dictionary or bytes to be sent in the query
        string for the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the
        :class:`Request`.
    :param payload: (optional) the payload data to the parser function
    :param encoding: (optional) set response encoding
    
    :param kwargs:
        :param cookies: (optional) Dict or CookieJar object to send with the
            :class:`Request`.
        :param files: (optional) Dictionary of ``'filename': file-like-objects``
            for multipart encoding upload.
        :param auth: (optional) Auth tuple or callable to enable
            Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How long to wait for the server to send
            data before giving up, as a float, or a :ref:`(connect timeout,
            read timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Set to True by default.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol or protocol and
            hostname to the URL of the proxy.
        :param stream: (optional) whether to immediately download the response
            content. Defaults to ``False``.
        :param verify: (optional) Either a boolean, in which case it controls whether we verify
            the server's TLS certificate, or a string, in which case it must be a path
            to a CA bundle to use. Defaults to ``True``. When set to
            ``False``, requests will accept any TLS certificate presented by
            the server, and will ignore hostname mismatches and/or expired
            certificates, which will make your application vulnerable to
            man-in-the-middle (MitM) attacks. Setting verify to ``False`` 
            may be useful during local development or testing.
        :param cert: (optional) if String, path to ssl client cert file (.pem).
            If Tuple, ('cert', 'key') pair.
    :return: Seed
    """
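
The docstring separates arguments handled by the crawler (`parser`, `payload`, `encoding`) from those that mirror `requests` parameters. The sketch below illustrates that split under the assumption that everything else is forwarded to requests unchanged. `split_crawl_args` and `parse_result` are hypothetical names, and the URL is a placeholder.

```python
def split_crawl_args(url, parser, method="GET", data=None, params=None,
                     headers=None, payload=None, encoding=None, **kwargs):
    """Split crawl-style arguments into crawler fields and request kwargs."""
    crawler_fields = {"parser": parser, "payload": payload, "encoding": encoding}
    request_kwargs = {"method": method, "url": url, "data": data,
                      "params": params, "headers": headers}
    # Extra keyword arguments (cookies, timeout, proxies, verify, ...) are
    # assumed to pass through to requests untouched.
    request_kwargs.update(kwargs)
    return crawler_fields, request_kwargs


def parse_result(response):
    """Placeholder parser callback."""


crawler_fields, request_kwargs = split_crawl_args(
    "http://example.invalid/search",  # placeholder URL
    parse_result,
    method="POST",
    data={"q": "python"},
    payload={"page": 1},
    timeout=(3, 10),
    allow_redirects=False,
)
print(request_kwargs)
```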

response

The Response object from the requests library, extended with the payload attribute set through the crawl method.
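
A minimal sketch of how a payload could travel from `crawl` to the parser: the value is attached to the response as an attribute before the callback runs. `FakeResponse` and `deliver` are hypothetical stand-ins; the real object is a requests Response, and how genius_lite attaches the attribute is an assumption here.

```python
class FakeResponse:
    """Stand-in for a requests Response object."""

    def __init__(self, text):
        self.text = text


def deliver(response, parser, payload=None):
    """Attach the seed's payload to the response, then call the parser."""
    response.payload = payload
    return parser(response)


def parse_detail_page(response):
    # The parser reads back whatever was passed as payload to crawl.
    return response.payload


result = deliver(FakeResponse("<html></html>"), parse_detail_page,
                 payload="some data")
print(result)
```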

