Combine multiple requests into one structured item.
Project description
Structured spider
Scrape structured data by building a request tree for each Item
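To make the idea of an Item request tree concrete, here is a minimal plain-Python sketch. It does not use structure_spider's API; `RequestNode`, `build_item`, `parse_pairs`, and the `PAGES` table are hypothetical names invented for illustration. Each node stands for one request that contributes some fields, and its children are follow-up requests that contribute more fields to the same item.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for fetched pages: URL -> page body.
PAGES = {
    "/job/1": "title=Sales Manager",
    "/job/1/company": "company=Example Ltd",
}


@dataclass
class RequestNode:
    """One request in the tree; it fills some fields, its children fill more."""
    url: str
    parse: Callable[[str], Dict[str, str]]
    children: List["RequestNode"] = field(default_factory=list)


def parse_pairs(body: str) -> Dict[str, str]:
    """Toy parser: treat the whole body as a single 'key=value' pair."""
    key, _, value = body.partition("=")
    return {key: value}


def build_item(node: RequestNode, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Walk the tree depth-first and merge every node's fields into one item."""
    item = dict(node.parse(fetch(node.url)))
    for child in node.children:
        item.update(build_item(child, fetch))
    return item


tree = RequestNode(
    "/job/1",
    parse_pairs,
    [RequestNode("/job/1/company", parse_pairs)],
)
item = build_item(tree, PAGES.__getitem__)
print(item)
```

The item is only complete once every request in the tree has been answered, which is the point of combining several requests into one structured record.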
USAGE
Install structure_spider
dev@ubuntu:~$ pip install structure-spider
Create a project
dev@ubuntu:~$ structure-spider create project -n myapp
New structure-spider project 'myapp', using template directory '/home/dev/.pyenv/versions/3.6.0/lib/python3.6/site-packages/structor/templates/project', created in:
/home/dev/myapp
You can start the spider with:
cd myapp
custom-redis-server -ll INFO -lf
scrapy crawl douban
Start the simple Redis server. To use a regular Redis server instead, just comment out CUSTOM_REDIS=True in settings.py.
dev@ubuntu:~$ custom-redis-server -ll INFO -lf
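Generate a custom spider and item
Use create spider to generate a ready-to-use spider. Pass the spider name (-n in the command below), then list the fields to scrape together with their rules, joining each field and rule with =. A rule can be a regular expression, an XPath expression, or a CSS selector.
To add more complex rules or to clean the extracted data, see the wiki.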
dev@ubuntu:~$ cd myapp/myapp/
dev@ubuntu:~/myapp/myapp$ ls
items settings.py spiders
dev@ubuntu:~/myapp/myapp$ structure-spider create spider -n zhaopin "product_id=/(\d+)\\.htm" "job=//h1/text()" "salary=//a/../../strong/text()" 'city=//ul[@class="terminal-ul clearfix"]//strong/a/text()' 'education=//span[contains(text(), "学历")]/following-sibling::strong/text()' "company=h2 > a" -ip '//td[@class="zwmc"]/div/a[1]/@href' -pp '//li[@class="pagesDown-pos"]/a/@href'
ZhaopinSpdier and ZhaopinItem have been created.
dev@ubuntu:~/myapp/myapp$
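The field arguments in the command above have the form name=rule. The sketch below shows one way such arguments could be split and classified. It is not the tool's own parser: `split_field_rule` and `guess_rule_kind` are hypothetical helpers, the classification heuristic is an assumption, and the sample URL is made up. Note that inside shell double quotes `\\.` reaches the program as `\.`, so the product_id rule is the regular expression `/(\d+)\.htm`.

```python
import re


def split_field_rule(arg: str):
    """Split 'name=rule' at the first '=' only, since XPath rules contain '=' too."""
    name, sep, rule = arg.partition("=")
    if not sep or not name:
        raise ValueError(f"expected name=rule, got {arg!r}")
    return name, rule


def guess_rule_kind(rule: str) -> str:
    """Crude, hypothetical heuristic; the real tool may decide differently."""
    if rule.startswith("//"):
        return "xpath"
    if "(" in rule and ")" in rule:  # a capture group suggests a regex
        return "regex"
    return "css"


args = [
    r"product_id=/(\d+)\.htm",
    "job=//h1/text()",
    'city=//ul[@class="terminal-ul clearfix"]//strong/a/text()',
    "company=h2 > a",
]
parsed = [(name, guess_rule_kind(rule), rule)
          for name, rule in map(split_field_rule, args)]
for name, kind, rule in parsed:
    print(f"{name:<11} {kind:<6} {rule}")

# Applying the regex rule to a made-up detail-page URL.
match = re.search(r"/(\d+)\.htm", "https://jobs.example.com/123456.htm")
product_id = match.group(1)
print(product_id)
```

Splitting at the first = matters because rules such as the city XPath contain = themselves (in `@class="..."`).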
Reference: Scraping structured data by combining multiple requests with structure_spider
Start the spider
dev@ubuntu:~/myapp/myapp$ scrapy crawl zhaopin
Feed a task
dev@ubuntu:~/myapp$ structure-spider feed -s zhaopin -u "https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E6%B5%8E%E5%8D%97&kw=%E9%94%80%E5%94%AE&sm=0&p=1" -c zhaopin --custom # --custom means the simple Redis server is in use
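The seed URL passed with -u carries percent-encoded UTF-8 query values: jl decodes to 济南 (Jinan) and kw to 销售 (sales). The following sketch uses only the Python standard library to rebuild and decode that query string. The variable names are illustrative, and reading jl and kw as city and keyword is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://sou.zhaopin.com/jobs/searchresult.ashx"

# Query parameters as they appear in the feed command, in the same order.
params = {"jl": "济南", "kw": "销售", "sm": 0, "p": 1}

# urlencode percent-encodes non-ASCII values as UTF-8 bytes.
seed_url = f"{BASE}?{urlencode(params)}"
print(seed_url)

# Decoding the query again recovers the original values (as lists of strings).
decoded = parse_qs(urlsplit(seed_url).query)
print(decoded)
```

Building the URL this way makes it easy to feed the same search for another city or keyword without encoding the values by hand.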
Check task status
dev@ubuntu:~/myapp$ structure-spider check zhaopin --custom
More resources:
Download files
Source Distribution
structure_spider-1.3.5.tar.gz (35.6 kB)
File details
Details for the file structure_spider-1.3.5.tar.gz.
File metadata
- Download URL: structure_spider-1.3.5.tar.gz
- Upload date:
- Size: 35.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3ccc55cb170017f9863bfc0bced5d773cb7729f0bd9a1c927899b1f131dad76d
MD5 | b1b0e090b083b95ec1b157323c6c2e79
BLAKE2b-256 | 2cdf28bc46579d984d3eeeb6adb20ba32aca38d355ea40d7a4cea6e6f88dc35d