
A component library built on the Scrapy framework that bundles commonly used tools

Project description

scrapyer

Scrapyer is a library that packages reusable components built on the Scrapy framework.

  • Bundles a number of item pipelines

  • Bundles the item pipelines involved in PDF processing

  • Bundles a spider middleware that handles item requests


Contact the project team

Contact the author

Usage:

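1. Uninstall scrapy. Before uninstalling, back up any scrapy project files you have modified, as needed.

2. Install scrapyer.

3. Check that the installation succeeded and that the version number matches.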

pip uninstall scrapy 
pip install scrapyer
scrapy shell
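
The version check in step 3 can also be done from Python. This is a minimal sketch, assuming the package is registered under the distribution name `scrapyer` (the name passed to `pip install`); `installed_version` is a helper name introduced here for illustration only.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "scrapyer" is assumed to be the distribution name; compare the printed
    # value with the release you intended to install.
    print("scrapyer:", installed_version("scrapyer"))
    # Shows whether a separate scrapy distribution is still installed.
    print("scrapy:", installed_version("scrapy"))
```

A result of `None` means pip has no record of that distribution in the current environment.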

4. In the project files, locate uploadpackages\updatescrapy\cp_settings and run efficientpak.sh to import it into the settings template file under templates.

bash efficientpak.sh

5. Import the Item classes that define the table-structure fields into your project

from scrapy.xcc_items.factoryitems import FactoryMaterialItem
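
A minimal sketch of filling such an item, assuming it behaves like a standard Scrapy `Item` with dict-style field assignment. Only `raw_img_url` and `raw_pdf_url` are used, because they are the only field names this README mentions (see the 1.4.6 entry); `build_material_item` and the example URLs are hypothetical. The item class is passed in as an argument so the helper can be tried with a plain `dict`.

```python
def build_material_item(item_cls, img_url, pdf_url):
    """Create an item and fill the two URL fields named in the changelog."""
    item = item_cls()
    item["raw_img_url"] = img_url
    item["raw_pdf_url"] = pdf_url
    return item


# Stand-in usage with dict; in a spider you would pass FactoryMaterialItem.
example = build_material_item(
    dict,
    "https://example.com/part.png",
    "https://example.com/part.pdf",
)
print(example)
```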

6. Enable the pipelines in settings and configure their parameters

ITEM_PIPELINES = {
   # 'scrapy.xcc_pipelines.ossfiles.OssFilesPipeline': 300,
   # 'scrapy.xcc_pipelines.ossfiles.OssFilesPipelineBak': 301,
   # 'scrapy.xcc_pipelines.ossfiles.OssFilesPipelineBakBak': 302,

   # 'scrapy.xcc_pipelines.ossfiles.OssImagesPipeline': 303,
   # 'scrapy.xcc_pipelines.ossfiles.OssImagesPipelineBak': 304,

   # 'scrapy.xcc_pipelines.sqlspipelines.SQLSPipeline':400,
   # 'scrapy.xcc_pipelines.mysqlpipelines.MySQLPipeline':401,
   # 'scrapy.xcc_pipelines.mongopipelines.mongodbPipeline':402,
}
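
In Scrapy, the integer assigned to each pipeline sets its order: items pass through the pipelines from the lowest value to the highest, and a value of `None` disables an entry. The sketch below only illustrates that ordering rule using two of the class paths listed above; `pipeline_order` is a hypothetical helper, not part of scrapyer.

```python
def pipeline_order(item_pipelines):
    """Return pipeline paths in the order items pass through them (lowest value first)."""
    # Entries set to None are treated as disabled, as Scrapy does.
    enabled = {path: prio for path, prio in item_pipelines.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


ITEM_PIPELINES = {
    "scrapy.xcc_pipelines.mysqlpipelines.MySQLPipeline": 401,
    "scrapy.xcc_pipelines.ossfiles.OssFilesPipeline": 300,
}

for path in pipeline_order(ITEM_PIPELINES):
    print(path)
```

With these values the OSS file pipeline (300) handles each item before the MySQL pipeline (401).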

Or enable the middlewares and configure their parameters

DOWNLOADER_MIDDLEWARES = {
   'scrapy.xcc_downloadermiddlewares.randuamiddleware.RandomUserAgent': 543,
   'scrapy.xcc_downloadermiddlewares.proxymiddleware.ProxyMiddleware': 888,
}
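
For readers unfamiliar with downloader middlewares, here is a rough sketch of how a random User-Agent middleware typically works in Scrapy. It is not the code of `RandomUserAgent` in scrapyer, which this README does not show; the class name `SketchRandomUserAgent` and the setting name `USER_AGENT_LIST` are assumptions made for the example.

```python
import random


class SketchRandomUserAgent:
    """Illustrative downloader middleware: set a random User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical setting name chosen for this sketch.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request.
        return None
```

The numbers in `DOWNLOADER_MIDDLEWARES` (543 and 888 above) position each middleware in the download chain in the same way pipeline priorities order pipelines.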

Related Docker images:

scrapyerd replaces scrapyd

docker pull buliqioqiolibusdo/scrapyerd:1.4.9
docker run -d -v /etc/localtime:/etc/localtime --rm -p 0.0.0.0:8080:6800/tcp dee4a75674ec
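
Once the container is running, a spider could be scheduled over HTTP. The sketch below assumes that scrapyerd keeps scrapyd's JSON API (specifically the `schedule.json` endpoint), which this README does not confirm. Port 8080 comes from the `-p` mapping above; `myproject` and `myspider` are placeholder names, and both helper functions are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider):
    """Build a POST request for scrapyd's schedule.json endpoint."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{base_url.rstrip('/')}/schedule.json", data=data, method="POST")


def send(request):
    """Send the request and return the response body (needs a running server)."""
    with urlopen(request, timeout=10) as resp:
        return resp.read().decode("utf-8")


# Assumed host and placeholder names; send(req) is not called here.
req = build_schedule_request("http://localhost:8080", "myproject", "myspider")
print(req.full_url)
```

Building the request separately from sending it makes it easy to inspect the URL and form data before touching the server.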

Note: version 1.4.1 supports distributed crawling with scrapy-redis.

Note: version 1.4.1 does not support distributed crawling with RabbitMQ.

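Versions:

2021/11/13 Update 1.4.3

  • Fixed an exception in the DuplicateKeyError de-duplication handling of the mongoitem pipeline

  • Added the FactoryMaterialItemMongo class to the items file, for the material pipeline that stores to mongo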

2021/11/13 Update 1.4.4

  • Fixed the python-magic library incompatibility between Windows and Linux

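2021/11/16 Update 1.4.6

  • Changed: FactoryMaterialItem in the items file gains the raw_img_url and raw_pdf_url fields
  • Changed: the items file adds the FactoryMaterialItemMongo class, for the material pipeline that stores to mongo
  • Fixed: python-magic version selection problem in Docker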

2021/11/25 Update 1.4.8

  • Fixed: handling of an empty item received in the oss pipeline files, processed into something like: {'pdf_url':''}
