Building a modular crawler template system based on Jinja2.
Project description
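Building a modular crawler template system with Jinja2
- Background and motivation behind the modular crawler template system
- Two ways to parse HTML tables
- Cracking MD5-signed parameter validation
- A comparison of three ways to strip HTML tags and extract all the text
- Handling sites that require executing JS to compute a cookie
- Notes on an unusual kind of POST request
- Handling sites that use VIEWSTATE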
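I originally planned to write only about what the title says, but in the end I decided to put everything into this project, so it became a summary of what I learned during a one-month internship. It includes many of the helper functions I wrote for my own work, such as HTML table parsing and text-processing functions, along with notes on how I solved some of the more unusual problems I ran into.
A new version that allows free composition outside the templates has been released; see the v2 quick tutorial for details. With it, my original idea is basically realized, yet I can't bring myself to be happy about it...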
- Installation:
pip install -U spider-renderer
- A simple template file example:
header.tmpl
'''Rendered on {{datetime}}'''
import re
import scrapy


class NewspiderSpider(scrapy.Spider):
    name = '{{spider}}'
    source = '{{source}}'
    url = '{{home_url}}'
    author = '{{author}}'
    all_page = {{all_page}}
requests.tmpl
    def start_requests(self):
        url = '{{page_url}}'
        all_page = self.all_page or 10
        for page in range(1, all_page):
            yield scrapy.Request(url % page, callback=self.parse)
parser.tmpl
{% include "header.tmpl" %}
{% include "requests.tmpl" %}
    def parse(self, response):
        response.string = re.sub('[\r\n\t\v\f]', '', response.text)
        rows = re.findall(r'''{{regex}}''', response.string)
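To see how the three files compose, here is a minimal sketch that renders them with plain Jinja2 rather than with the package's own `genspider`. The in-memory `DictLoader`, the shortened template bodies, and the sample values are assumptions made for illustration; the package itself may load and render templates differently.

```python
from jinja2 import DictLoader, Environment

# Shortened copies of the templates above, held in memory for the demo.
templates = {
    "header.tmpl": (
        "import scrapy\n"
        "\n"
        "\n"
        "class NewspiderSpider(scrapy.Spider):\n"
        "    name = '{{spider}}'\n"
        "    all_page = {{all_page}}\n"
    ),
    "requests.tmpl": (
        "    def start_requests(self):\n"
        "        url = '{{page_url}}'\n"
    ),
    "parser.tmpl": (
        '{% include "header.tmpl" %}\n'
        '{% include "requests.tmpl" %}\n'
        "    def parse(self, response):\n"
        "        rows = []\n"
    ),
}

env = Environment(loader=DictLoader(templates))

# Included templates see the same variables as the template that includes them.
source = env.get_template("parser.tmpl").render(
    spider="fonts_spider",
    all_page=20,
    page_url="http://fonts.mobanwang.com/fangzheng/List_%d.html",
)
print(source)
```

Because `include` pastes the rendered text in place, the method templates have to carry their own class-level indentation, which is why `requests.tmpl` starts four spaces in.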
- Example rendering script:
import os
import os.path

from renderer import genspider

# Resolve the output and template folders relative to this script.
basepath = os.path.abspath(os.path.dirname(__file__))
dst = os.path.join(basepath, 'spiders')
templates_folder = os.path.join(basepath, 'templates')
if not os.path.isdir(dst):
    os.mkdir(dst)

# Entry template and the name of the spider to generate.
templatefile = 'parser.tmpl'
spider = 'fonts_spider'

home_url = '''
http://fonts.mobanwang.com/fangzheng/
'''.strip()
page_url = '''
http://fonts.mobanwang.com/fangzheng/List_%d.html
'''.strip()
regex = r'''
href=['"](\S+?html?)['"][^<>]*?title=['"]
'''.strip()

# Extra keyword arguments passed through to genspider.
kwargs = {
    'all_page': 20,
    'page_url': page_url,
    'regex': regex,
    'templates_folder': templates_folder,
    'author': 'White Turing',
}
genspider(home_url, templatefile, dst, spider, **kwargs)
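As a rough check of what the generated spider would do with these values, the sketch below expands `page_url` the same way `start_requests` does and applies `regex` to a small HTML fragment using only the standard library. The HTML fragment and the small `all_page` value are made up for illustration; the real listing pages may look different.

```python
import re

page_url = 'http://fonts.mobanwang.com/fangzheng/List_%d.html'
regex = r'''href=['"](\S+?html?)['"][^<>]*?title=['"]'''

# Same expansion as start_requests. range() stops before its end value,
# so all_page = 4 produces pages 1, 2 and 3.
all_page = 4
urls = [page_url % page for page in range(1, all_page)]

# Hypothetical listing markup, with one tag split across two lines.
html = (
    '<ul>\r\n'
    '\t<li><a href="/fangzheng/1001.html" target="_blank"\n'
    '\t   title="Font A">Font A</a></li>\r\n'
    "\t<li><a href='/fangzheng/1002.htm' title='Font B'>Font B</a></li>\r\n"
    '</ul>'
)

# parse() first removes line breaks and tabs so that a tag split across
# lines can still be matched by a single-line pattern.
flat = re.sub('[\r\n\t\v\f]', '', html)
rows = re.findall(regex, flat)

print(urls)
print(rows)
```

Note that with `range(1, all_page)` the page numbered `all_page` itself is never requested, so setting `'all_page': 20` yields requests for pages 1 through 19.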
This example does not use any of the more advanced Jinja2 syntax, but in practice you can add conditionals so that a single template covers more cases.
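One way such a conditional could look is sketched below: a hypothetical variant of `requests.tmpl` that emits the paging loop only when `page_url` is supplied and otherwise falls back to a single request for the home URL. This variant is an illustration, not a template shipped with the package.

```python
from jinja2 import Template

# Hypothetical variant of requests.tmpl in which `page_url` is optional.
requests_tmpl = Template(
    "    def start_requests(self):\n"
    "{% if page_url %}"
    "        url = '{{page_url}}'\n"
    "        for page in range(1, self.all_page or 10):\n"
    "            yield scrapy.Request(url % page, callback=self.parse)\n"
    "{% else %}"
    "        yield scrapy.Request(self.url, callback=self.parse)\n"
    "{% endif %}"
)

# With page_url: the paging loop is rendered.
paged = requests_tmpl.render(
    page_url="http://fonts.mobanwang.com/fangzheng/List_%d.html"
)
# Without page_url: the undefined variable is falsy, so the else branch is used.
single = requests_tmpl.render()

print(paged)
print(single)
```

The same template can then serve both paginated listings and single-page sites, depending only on which keyword arguments are passed at render time.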
Download files
Source Distribution
spider-renderer-0.2.3.tar.gz (9.9 kB)
Built Distribution
spider_renderer-0.2.3-py3-none-any.whl (15.9 kB)
File details
Details for the file spider-renderer-0.2.3.tar.gz.
File metadata
- Download URL: spider-renderer-0.2.3.tar.gz
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 325adb2192d609df4aba9a3eed93c03cc69e45443968ac01c40f700b3dbd53ff
MD5 | 8d893a4acb81bc13a83a982d6f09da7f
BLAKE2b-256 | d90d14ccdb017daedb120fa7567ff375e3d4dd74fbaa50ed4ac19bd46a04bddc
File details
Details for the file spider_renderer-0.2.3-py3-none-any.whl.
File metadata
- Download URL: spider_renderer-0.2.3-py3-none-any.whl
- Size: 15.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5dcb163a1abac4c4e2302f4c1b135de8a766622686b7c254d86258790fe3381d
MD5 | 3504a193d1bd187b1e72ee08f39ada70
BLAKE2b-256 | 079e822a96fba5a2907eb0cfe83d7b34fdcf0e26b6cc3635a4d0cdc59a9826b7