Building a modular crawler template system based on Jinja2.
Project description
Building a modular crawler template system based on Jinja2
- Background and motivation behind the modular crawler template system
- Two methods for parsing HTML tables
- Cracking MD5-signed request parameter validation
- A comparison of three methods for stripping HTML tags to get the full text (a minimal sketch follows this list)
- Handling sites that require executing JavaScript to compute cookies
- Notes on an unusual style of POST request
- Handling VIEWSTATE-based (ASP.NET) websites
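The write-ups behind these bullet points live in the project itself. As a taste of the tag-stripping comparison, here is a minimal sketch (my own illustration, not code from the project) of three common ways to drop HTML tags and keep only the text:

```python
import re

from bs4 import BeautifulSoup
from lxml import html as lxml_html

raw = '<div><p>Hello <b>world</b></p></div>'

# 1. Regular expression: fast, but fragile with comments, scripts and broken markup.
text_re = re.sub(r'<[^>]+>', '', raw)

# 2. lxml: parse the document and concatenate its text nodes.
text_lxml = lxml_html.fromstring(raw).text_content()

# 3. BeautifulSoup: slower, but very tolerant of malformed HTML.
text_bs = BeautifulSoup(raw, 'html.parser').get_text()

print(text_re, text_lxml, text_bs)
```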
I originally planned to write only about the topic in the title, but I eventually decided to put everything into this project, so it has turned into a summary of what I learned during a one-month internship. It contains many of the helper functions I wrote for my day-to-day work, including HTML table parsing and text processing utilities, along with notes on how I solved some of the more unusual problems I ran into.
A new version that supports freely combining templates from outside the template files has been released; see the v2 quick-start guide for details. With it, my original idea is essentially realized, and yet I can't quite bring myself to be happy about it...
- Installation: `pip install -U spider-renderer`
- Simple template file examples:
header.tmpl

```
'''Rendered on {{datetime}}'''

import re

import scrapy


class NewspiderSpider(scrapy.Spider):

    name = '{{spider}}'
    source = '{{source}}'
    url = '{{home_url}}'
    author = '{{author}}'
    all_page = {{all_page}}
```
requests.tmpl

```
    def start_requests(self):
        url = '{{page_url}}'
        all_page = self.all_page or 10
        for page in range(1, all_page):
            yield scrapy.Request(url % page, callback=self.parse)
```
parser.tmpl

```
{% include "header.tmpl" %}
{% include "requests.tmpl" %}

    def parse(self, response):
        response.string = re.sub('[\r\n\t\v\f]', '', response.text)
        rows = re.findall(r'''{{regex}}''', response.string)
```
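Under the hood this is ordinary Jinja2 rendering: `{% include %}` stitches the three files together and the `{{ ... }}` placeholders are filled from keyword arguments. As a rough illustration, plain Jinja2 would render `parser.tmpl` like this (the values below are made up; the package's own `genspider` helper, shown next, wraps this step):

```python
from jinja2 import Environment, FileSystemLoader

# Load the three .tmpl files from a local "templates" directory.
env = Environment(loader=FileSystemLoader('templates'))
template = env.get_template('parser.tmpl')

spider_code = template.render(
    datetime='2020-07-05',                    # illustrative values only
    spider='example_spider',
    source='example',
    home_url='http://example.com/',
    author='example',
    all_page=10,
    page_url='http://example.com/List_%d.html',
    regex=r'''href=['"](\S+?html?)['"]''',
)
print(spider_code)  # the generated Scrapy spider source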
- Example generator script:
```python
import os
import os.path

from renderer import genspider

basepath = os.path.abspath(os.path.dirname(__file__))
dst = os.path.join(basepath, 'spiders')
templates_folder = os.path.join(basepath, 'templates')

if not os.path.isdir(dst):
    os.mkdir(dst)

templatefile = 'parser.tmpl'
spider = 'fonts_spider'

home_url = '''
http://fonts.mobanwang.com/fangzheng/
'''.strip()

page_url = '''
http://fonts.mobanwang.com/fangzheng/List_%d.html
'''.strip()

regex = r'''
href=['"](\S+?html?)['"][^<>]*?title=['"]
'''.strip()

kwargs = {
    'all_page': 20,
    'page_url': page_url,
    'regex': regex,
    'templates_folder': templates_folder,
    'author': 'White Turing',
}

genspider(home_url, templatefile, dst, spider, **kwargs)
```
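Assuming `genspider` simply writes the rendered text into `dst` under the spider's name (for example `spiders/fonts_spider.py`; the exact filename and the `datetime` value are up to the package), the generated module should look roughly like this:

```python
'''Rendered on 2020-07-05 12:00:00'''

import re

import scrapy


class NewspiderSpider(scrapy.Spider):

    name = 'fonts_spider'
    source = ''                 # 'source' was not passed in kwargs, so it renders empty
    url = 'http://fonts.mobanwang.com/fangzheng/'
    author = 'White Turing'
    all_page = 20

    def start_requests(self):
        url = 'http://fonts.mobanwang.com/fangzheng/List_%d.html'
        all_page = self.all_page or 10
        for page in range(1, all_page):
            yield scrapy.Request(url % page, callback=self.parse)

    def parse(self, response):
        response.string = re.sub('[\r\n\t\v\f]', '', response.text)
        rows = re.findall(r'''href=['"](\S+?html?)['"][^<>]*?title=['"]''', response.string)
```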
The templates in this example don't use any of Jinja2's more advanced syntax, but in practice adding a few conditionals makes them noticeably more flexible.
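For instance, a hypothetical variant of `requests.tmpl` (my own sketch, not shipped with the package) could fall back to a single GET of the home page whenever no `page_url` pattern is supplied:

```
{# hypothetical variant of requests.tmpl #}
    def start_requests(self):
{% if page_url %}
        url = '{{page_url}}'
        all_page = self.all_page or 10
        for page in range(1, all_page):
            yield scrapy.Request(url % page, callback=self.parse)
{% else %}
        yield scrapy.Request(self.url, callback=self.parse)
{% endif %}
```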
Download files
File details
Details for the file spider-renderer-0.2.3.tar.gz.
File metadata
- Download URL: spider-renderer-0.2.3.tar.gz
- Upload date:
- Size: 9.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 325adb2192d609df4aba9a3eed93c03cc69e45443968ac01c40f700b3dbd53ff |
| MD5 | 8d893a4acb81bc13a83a982d6f09da7f |
| BLAKE2b-256 | d90d14ccdb017daedb120fa7567ff375e3d4dd74fbaa50ed4ac19bd46a04bddc |
File details
Details for the file spider_renderer-0.2.3-py3-none-any.whl.
File metadata
- Download URL: spider_renderer-0.2.3-py3-none-any.whl
- Upload date:
- Size: 15.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1.post20200622 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5dcb163a1abac4c4e2302f4c1b135de8a766622686b7c254d86258790fe3381d |
| MD5 | 3504a193d1bd187b1e72ee08f39ada70 |
| BLAKE2b-256 | 079e822a96fba5a2907eb0cfe83d7b34fdcf0e26b6cc3635a4d0cdc59a9826b7 |