Combine multiple requests into one structured item.

Project description

Structured spider

Scrape structured data by building a request tree for each Item.
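To make the idea of an Item request tree a little more concrete, here is a minimal conceptual sketch in plain Python. It is not structure_spider's API: `FAKE_PAGES`, `collect_item`, the URLs, and the field names are all made up for illustration, and a dictionary stands in for real HTTP responses.

```python
# Conceptual sketch only: this is NOT structure_spider's API.
# FAKE_PAGES stands in for HTTP responses; every URL and field is hypothetical.
FAKE_PAGES = {
    "/product/1": {
        "fields": {"title": "Widget"},
        "children": ["/product/1/price", "/product/1/reviews"],
    },
    "/product/1/price": {"fields": {"price": "9.99"}, "children": []},
    "/product/1/reviews": {"fields": {"review_count": 3}, "children": []},
}


def collect_item(url, pages):
    """Walk the request tree depth-first and merge each node's fields into one item."""
    page = pages[url]
    item = dict(page["fields"])
    for child_url in page["children"]:
        item.update(collect_item(child_url, pages))
    return item


item = collect_item("/product/1", FAKE_PAGES)
print(item)
```

Each node contributes some fields and may point to further pages, so one Item is only complete once the whole tree under its root request has been visited.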

USAGE

Install structure_spider

dev@ubuntu:~$ pip install structure-spider

Create a project

dev@ubuntu:~$ structure-spider create project -n myapp
New structure-spider project 'myapp', using template directory '/home/dev/.pyenv/versions/3.6.0/lib/python3.6/site-packages/structor/templates/project', created in:
    /home/dev/myapp

You can start the spider with:
    cd myapp
    custom-redis-server -ll INFO -lf
    scrapy crawl douban

Start the simple Redis server. To use a regular Redis server instead, just comment out CUSTOM_REDIS=True in settings.py.

dev@ubuntu:~$ custom-redis-server -ll INFO -lf
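The switch between the simple Redis server and a regular Redis server is the `CUSTOM_REDIS` line in settings.py. The excerpt below is a sketch of what that part of the file might look like; only the `CUSTOM_REDIS` name comes from this README, and the surrounding comments are illustrative.

```python
# settings.py (excerpt, illustrative)

# Use the simple Redis server started with custom-redis-server.
CUSTOM_REDIS = True

# To use a regular Redis server instead, comment the line out:
# CUSTOM_REDIS = True
```

With the line commented out, the `--custom` flag used in the later commands would presumably no longer apply.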

Generate a custom spider and item

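Use create spider to generate a ready-to-use spider. -n sets the spider name; it is followed by the fields to scrape and their rules, each written as field=rule. A rule can be a regular expression, an XPath expression, or a CSS selector.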

For more complex rules or data cleaning, see the wiki.
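As a rough illustration of the regular-expression rule type, the sketch below applies the `product_id` pattern from the command that follows, using Python's `re` module. The sample URL is made up, and whether structure_spider matches the pattern against the page URL or the page body is an assumption here.

```python
import re

# The rule as typed in bash: "product_id=/(\d+)\\.htm"
# Inside double quotes bash turns \\ into \ and leaves \d untouched,
# so the program receives the pattern below.
PRODUCT_ID_RULE = r"/(\d+)\.htm"

# Hypothetical detail-page URL, for illustration only.
sample_url = "https://jobs.example.com/123456789.htm"

match = re.search(PRODUCT_ID_RULE, sample_url)
product_id = match.group(1) if match else None
print(product_id)
```

The capture group is what ends up as the field value, which is why the rule wraps `\d+` in parentheses.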

dev@ubuntu:~$ cd myapp/myapp/
dev@ubuntu:~/myapp/myapp$ ls
items  settings.py  spiders
dev@ubuntu:~/myapp/myapp$ structure-spider create spider -n zhaopin "product_id=/(\d+)\\.htm" "job=//h1/text()" "salary=//a/../../strong/text()" 'city=//ul[@class="terminal-ul clearfix"]//strong/a/text()' 'education=//span[contains(text(), "学历")]/following-sibling::strong/text()' "company=h2 > a" -ip '//td[@class="zwmc"]/div/a[1]/@href' -pp '//li[@class="pagesDown-pos"]/a/@href'
ZhaopinSpdier and ZhaopinItem have been created.
dev@ubuntu:~/myapp/myapp$
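The field=rule arguments in the command above can themselves contain `=` (for example inside `[@class="..."]`), so a parser has to split on the first `=` only. The sketch below shows that split and a simple guess at the rule type. Both `split_rule` and `guess_rule_type` are hypothetical helpers; the heuristic is an assumption and not how structure_spider actually classifies rules.

```python
def split_rule(arg):
    """Split 'field=rule' on the first '=' only; the rule may contain '=' itself."""
    field, _, rule = arg.partition("=")
    return field, rule


def guess_rule_type(rule):
    """Hypothetical heuristic, not structure_spider's actual logic."""
    if rule.startswith("//"):
        return "xpath"
    if "(" in rule:  # a capture group suggests a regular expression
        return "regex"
    return "css"


# A subset of the arguments from the command, as the program receives them
# after shell quoting has been resolved.
args = [
    r"product_id=/(\d+)\.htm",
    "job=//h1/text()",
    'city=//ul[@class="terminal-ul clearfix"]//strong/a/text()',
    "company=h2 > a",
]

rules = {field: (rule, guess_rule_type(rule)) for field, rule in map(split_rule, args)}
for field, (rule, kind) in rules.items():
    print(f"{field:<12}{kind:<7}{rule}")
```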

Reference: Using structure_spider to scrape structured data by combining multiple requests

Start the spider

dev@ubuntu:~/myapp/myapp$ scrapy crawl zhaopin

Feed a task

dev@ubuntu:~/myapp$ structure-spider feed -s zhaopin -u "https://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E6%B5%8E%E5%8D%97&kw=%E9%94%80%E5%94%AE&sm=0&p=1" -c zhaopin --custom # --custom means the simple Redis server is in use
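The seed URL passed with -u carries percent-encoded Chinese query values. The sketch below decodes them with the standard library and rebuilds the same URL, which may help when preparing seed URLs for other searches. `build_seed` is a hypothetical helper, and reading `jl` as the city and `kw` as the keyword is an assumption based only on the decoded values.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

seed = (
    "https://sou.zhaopin.com/jobs/searchresult.ashx"
    "?jl=%E6%B5%8E%E5%8D%97&kw=%E9%94%80%E5%94%AE&sm=0&p=1"
)

# Decode the query string to see what the seed URL asks for.
params = {key: values[0] for key, values in parse_qs(urlsplit(seed).query).items()}
print(params)


def build_seed(jl, kw, page=1):
    """Hypothetical helper: rebuild a search URL with the same parameter layout."""
    query = urlencode({"jl": jl, "kw": kw, "sm": 0, "p": page})
    return "https://sou.zhaopin.com/jobs/searchresult.ashx?" + query


rebuilt = build_seed("济南", "销售")
print(rebuilt == seed)
```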

Check task status

dev@ubuntu:~/myapp$ structure-spider check zhaopin --custom

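More resources:

[structure_spider weekly exercise]: One-click download of Baidu MP3s

Generate a personalized spider in one click: just click on what you want to scrape!

Advanced Scrapy: a detailed look at ItemCollector, a tool for scraping an Item by combining multiple requests!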


Source Distribution

structure_spider-1.3.5.tar.gz (35.6 kB view details)

Uploaded Source

File details

Details for the file structure_spider-1.3.5.tar.gz.

File metadata

  • Download URL: structure_spider-1.3.5.tar.gz
  • Upload date:
  • Size: 35.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for structure_spider-1.3.5.tar.gz

  • SHA256: 3ccc55cb170017f9863bfc0bced5d773cb7729f0bd9a1c927899b1f131dad76d
  • MD5: b1b0e090b083b95ec1b157323c6c2e79
  • BLAKE2b-256: 2cdf28bc46579d984d3eeeb6adb20ba32aca38d355ea40d7a4cea6e6f88dc35d
