Skip to main content

multi requests to combine a structure item.

Project description

结构化爬虫

通过组建Item请求树抓取结构化数据

USAGE

安装structure_spider

dev@ubuntu:~$ pip install structure-spider

生成项目

dev@ubuntu:~$ structure-spider create project -n myapp
New structure-spider project 'myapp', using template directory '/home/dev/.pyenv/versions/3.6.0/lib/python3.6/site-packages/structor/templates/project', created in:
    /home/dev/myapp

You can start the spider with:
    cd myapp
    custom-redis-server -ll INFO -lf
    scrapy crawl douban

开始简单redis,可以使用正式版redis,只需把settings.py中的CUSTOM_REDIS=True注释掉即可

dev@ubuntu:~$ custom-redis-server -ll INFO -lf

生成自定义spider及item

使用createspider可以生成直接可用的spider,-s指定spider名称,随后创建要抓取的字段及其规则 ,使用=连接。规则可以是正则表达式,xpath, css。

如需进一步增加复杂规则或进行数据清洗,请参考wiki。

dev@ubuntu:~$ cd myapp/myapp/
dev@ubuntu:~/myapp/myapp$ ls
items  settings.py  spiders
dev@ubuntu:~/myapp/myapp$ structure-spider create spider -n zhaopin "product_id=/(\d+)\\.htm" "job=//h1/text()" "salary=//a/../../strong/text()" 'city=//ul[@class="terminal-ul clearfix"]//strong/a/text()' 'education=//span[contains(text(), "学历")]/following-sibling::strong/text()' "company=h2 > a" -ip '//td[@class="zwmc"]/div/a[1]/@href' -pp '//li[@class="pagesDown-pos"]/a/@href'
ZhaopinSpdier and ZhaopinItem have been created.
dev@ubuntu:~/myapp/myapp$

参考资料:使用structure_spider多请求组合抓取结构化数据

启动爬虫

dev@ubuntu:~/myapp/myapp$ scrapy crawl zhaopin

投入任务

dev@ubuntu:~/myapp$ structure-spider feed -s zhaopin -u "https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E6%B5%8E%E5%8D%97&kw=%E9%94%80%E5%94%AE&sm=0&p=1" -c zhaopin --custom # --custom代表使用的是简单redis

查看任务状态

dev@ubuntu:~/myapp$ structure-spider check zhaopin --custom

更多资源:

[structure_spider每周一练]:一键下载百度mp3

个性化爬虫一键生成,想抓哪里点哪里!

scrapy进阶,组合多请求抓取Item利器ItemCollector详解!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

structure_spider-1.3.5.tar.gz (35.6 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page