
Building a modular crawler template system based on Jinja2.

Project description


I originally planned to write only what the title says, but I later decided to put everything into this project, so it turned into a summary of my one-month internship. It collects many of the helper functions I wrote at work, including web-page table parsers and text-processing utilities, and it also records solutions to some of the more unusual problems I ran into.

A new version that supports freely combining modules outside the templates has been released; see the v2 quick-start guide for details. This basically realizes my original idea, yet somehow I can't bring myself to celebrate...

  • Installation
pip install -U spider-renderer
  • Simple template file examples

header.tmpl

'''Rendered on {{datetime}}'''

import re
import scrapy

class NewspiderSpider(scrapy.Spider):

    name = '{{spider}}'
    source = '{{source}}'
    url = '{{home_url}}'
    author = '{{author}}'
    all_page = {{all_page}}

requests.tmpl

    def start_requests(self):
        url = '{{page_url}}'
        all_page = self.all_page or 10
        for page in range(1, all_page + 1):  # +1 so the last page is included
            yield scrapy.Request(url % page, callback=self.parse)

parser.tmpl

{% include "header.tmpl" %}
{% include "requests.tmpl" %}

    def parse(self, response):
        response.string = re.sub('[\r\n\t\v\f]', '', response.text)
        rows = re.findall(r'''{{regex}}''', response.string)
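Under the hood, rendering these templates is ordinary Jinja2. The following is a minimal, self-contained sketch of the include-and-render step; the in-memory DictLoader and the shortened template bodies are illustrative stand-ins for the package's actual template files, not its API:

```python
# Sketch only: render a parser.tmpl that includes header.tmpl, with templates
# held in memory instead of on disk.
from jinja2 import Environment, DictLoader

templates = {
    'header.tmpl': "name = '{{spider}}'\nall_page = {{all_page}}",
    'parser.tmpl': '{% include "header.tmpl" %}\nregex = r"""{{regex}}"""',
}

env = Environment(loader=DictLoader(templates))
code = env.get_template('parser.tmpl').render(
    spider='fonts_spider',
    all_page=20,
    regex=r'''href=['"](\S+?html?)['"]''',
)
print(code)
```

The `{% include %}` directives are resolved by the loader at render time, which is what lets header.tmpl and requests.tmpl be shared across many spider templates.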
  • Example generator script
import os

from renderer import genspider

basepath = os.path.abspath(os.path.dirname(__file__))
dst = os.path.join(basepath, 'spiders')
templates_folder = os.path.join(basepath, 'templates')

if not os.path.isdir(dst):
    os.mkdir(dst)

templatefile = 'parser.tmpl'
spider = 'fonts_spider'

home_url = '''
http://fonts.mobanwang.com/fangzheng/
'''.strip()

page_url = '''
http://fonts.mobanwang.com/fangzheng/List_%d.html
'''.strip()

regex = r'''
href=['"](\S+?html?)['"][^<>]*?title=['"]
'''.strip()


kwargs = {
    'all_page': 20,
    'page_url': page_url,
    'regex': regex,
    'templates_folder': templates_folder,
    'author': 'White Turing',
}

genspider(home_url, templatefile, dst, spider, **kwargs)

This example doesn't use any of Jinja2's more advanced syntax, but in practice you can add conditionals to make the templates accommodate a wider range of sites.
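As a sketch of what such a conditional could look like, here is a template fragment that branches on whether `all_page` was supplied; the template text is my own illustration, not one shipped with the package:

```python
# Illustrative only: a template that falls back to a single page when
# all_page is not provided at render time.
from jinja2 import Template

tmpl = Template(
    "{% if all_page %}"
    "pages = range(1, {{all_page}} + 1)"
    "{% else %}"
    "pages = [1]  # single-page fallback"
    "{% endif %}"
)

print(tmpl.render(all_page=20))  # pages = range(1, 20 + 1)
print(tmpl.render())             # pages = [1]  # single-page fallback
```

An undefined variable is falsy in Jinja2's default environment, so the `{% else %}` branch is taken whenever the caller omits `all_page`.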

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spider-renderer-0.2.3.tar.gz (9.9 kB)

Uploaded Source

Built Distribution

spider_renderer-0.2.3-py3-none-any.whl (15.9 kB)

Uploaded Python 3

File details

Details for the file spider-renderer-0.2.3.tar.gz.

File metadata

  • Download URL: spider-renderer-0.2.3.tar.gz
  • Upload date:
  • Size: 9.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.7

File hashes

Hashes for spider-renderer-0.2.3.tar.gz
  • SHA256: 325adb2192d609df4aba9a3eed93c03cc69e45443968ac01c40f700b3dbd53ff
  • MD5: 8d893a4acb81bc13a83a982d6f09da7f
  • BLAKE2b-256: d90d14ccdb017daedb120fa7567ff375e3d4dd74fbaa50ed4ac19bd46a04bddc

See more details on using hashes here.
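A digest like the SHA256 above can be checked against a downloaded file with nothing but the standard library; `sha256_of` is a hypothetical helper name for this sketch, not a PyPI tool:

```python
# Sketch: stream a file through hashlib so large downloads are not read
# into memory at once, then compare the hex digest to the published one.
import hashlib

def sha256_of(path, chunk_size=8192):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# expected = '325adb2192d609df4aba9a3eed93c03cc69e45443968ac01c40f700b3dbd53ff'
# assert sha256_of('spider-renderer-0.2.3.tar.gz') == expected
```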

File details

Details for the file spider_renderer-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: spider_renderer-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 15.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.7

File hashes

Hashes for spider_renderer-0.2.3-py3-none-any.whl
  • SHA256: 5dcb163a1abac4c4e2302f4c1b135de8a766622686b7c254d86258790fe3381d
  • MD5: 3504a193d1bd187b1e72ee08f39ada70
  • BLAKE2b-256: 079e822a96fba5a2907eb0cfe83d7b34fdcf0e26b6cc3635a4d0cdc59a9826b7

See more details on using hashes here.
