scrapy_plus: an essential toolkit of common utilities for Scrapy crawling
Scrapy+
A toolkit of extensions for Scrapy. For detailed usage and configuration, see the book 《虫术——Python绝技》.
Filters
Redis dedupe filter scrapy_plus.dupefilters.RedisDupeFilter
Stores previously visited URLs in a Redis Set.
Usage
Add the following to your settings file:
# Override the default dupe filter
DUPEFILTER_CLASS = 'scrapy_plus.dupefilters.RedisDupeFilter'
REDIS_PORT = 6379 # Redis server port
REDIS_HOST = '127.0.0.1' # Redis server address
REDIS_DB = 0 # Redis database number
Default configuration
REDIS_PORT = 6379 # Redis server port
REDIS_HOST = '127.0.0.1' # Redis server address
REDIS_DB = 0 # Redis database number
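For reference, the heart of a Redis-Set dupe filter is just an SADD per request fingerprint. A minimal sketch with redis-py (illustrative only, not the package's actual source):

import redis
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class RedisSetDupeFilterSketch(BaseDupeFilter):
    """Sketch: keep request fingerprints in a Redis Set."""

    def __init__(self, host='127.0.0.1', port=6379, db=0):
        self.server = redis.StrictRedis(host=host, port=port, db=db)
        self.key = 'dupefilter:seen'

    def request_seen(self, request):
        fp = request_fingerprint(request)
        # SADD returns 0 when the member already exists -> duplicate
        return self.server.sadd(self.key, fp) == 0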
Redis Bloom dedupe filter scrapy_plus.dupefilters.RedisBloomDupeFilter
Deduplicates URLs with a Redis-backed Bloom filter.
Usage
Add the following to your settings file:
# Override the default dupe filter
DUPEFILTER_CLASS = 'scrapy_plus.dupefilters.RedisBloomDupeFilter'
REDIS_PORT = 6379 # Redis server port
REDIS_HOST = '127.0.0.1' # Redis server address
REDIS_DB = 0 # Redis database number
Default configuration
REDIS_PORT = 6379 # Redis server port
REDIS_HOST = '127.0.0.1' # Redis server address
REDIS_DB = 0 # Redis database number
BLOOMFILTER_REDIS_KEY = 'bloomfilter' # Redis key holding the filter
BLOOMFILTER_BLOCK_NUMBER = 1 # number of bitmap blocks
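A Bloom filter trades a small false-positive rate for a fixed memory footprint: each URL is hashed by several independent functions and the resulting bit positions are set in a Redis bitmap. A rough sketch of the idea (the seeds and sizes are illustrative, not the package's values):

import hashlib
import redis

class BloomFilterSketch:
    """Sketch: k hash functions over a 2**30-bit Redis bitmap."""

    def __init__(self, server, key='bloomfilter', bits=1 << 30, seeds=(5, 7, 11, 13, 31)):
        self.server = server
        self.key = key
        self.bits = bits
        self.seeds = seeds

    def _offsets(self, value):
        for seed in self.seeds:
            digest = hashlib.md5(('%d:%s' % (seed, value)).encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def exists(self, value):
        # All k bits set -> probably seen before (may be a false positive)
        return all(self.server.getbit(self.key, off) for off in self._offsets(value))

    def insert(self, value):
        for off in self._offsets(value):
            self.server.setbit(self.key, off, 1)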
Middlewares
Auto-login middleware scrapy_plus.middlewares.LoginMiddleWare
LOGIN_URL = 'login URL of the target site'
LOGIN_USR = 'username'
LOGIN_PWD = 'password'
LOGIN_USR_ELE = 'name attribute of the username <input> element'
LOGIN_PWD_ELE = 'name attribute of the password <input> element'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_plus.middlewares.LoginMiddleWare': 330
}
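Internally, a middleware like this presumably performs the login once by posting a form built from the configured element names, after which Scrapy's cookie handling carries the session. A hypothetical sketch of that request (not the package's actual source):

from scrapy import FormRequest

# Hypothetical: how the settings above would map onto a login request.
def build_login_request(settings, callback):
    return FormRequest(
        url=settings['LOGIN_URL'],
        formdata={
            settings['LOGIN_USR_ELE']: settings['LOGIN_USR'],
            settings['LOGIN_PWD_ELE']: settings['LOGIN_PWD'],
        },
        callback=callback,
    )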
Chrome emulation middleware scrapy_plus.middlewares.ChromeMiddleware
Headless Chrome emulation middleware: the spider fetches each target URL through Chrome, which handles JavaScript-heavy pages cleanly.
SELENIUM_TIMEOUT = 30 # page-load timeout in seconds
CHROMEDRIVER = "/path/to/chrome" # path to the Chrome driver binary
DOWNLOADER_MIDDLEWARES = {
    'scrapy_plus.middlewares.ChromeMiddleware': 800
}
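A Selenium-backed middleware of this kind typically intercepts process_request, renders the page in headless Chrome, and returns the rendered HTML to Scrapy as an HtmlResponse. A minimal sketch under those assumptions (not the package's actual source):

from scrapy.http import HtmlResponse
from selenium import webdriver

class ChromeMiddlewareSketch:
    """Sketch: render every request in headless Chrome."""

    def __init__(self, driver_path, timeout=30):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        # Selenium 3-style constructor; Selenium 4 passes a Service object instead
        self.driver = webdriver.Chrome(executable_path=driver_path, options=options)
        self.driver.set_page_load_timeout(timeout)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning a Response from process_request short-circuits the downloader
        return HtmlResponse(url=request.url, body=self.driver.page_source,
                            encoding='utf-8', request=request)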
Splash middleware scrapy_plus.middlewares.SplashSpiderMiddleware
Forwards requests to a designated Splash service, giving the spider browser-emulation capability.
WAIT_FOR_ELEMENT = "selector" # the page counts as loaded once this element appears
DOWNLOADER_MIDDLEWARES = {
    'scrapy_plus.middlewares.SplashSpiderMiddleware': 800
}
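Mechanically, a Splash middleware rewrites each outgoing request so that it targets the Splash HTTP API instead of the site itself. A sketch of that rewrite, assuming a Splash instance at localhost:8050 (endpoint and wait time are illustrative):

import json
from scrapy.http import Request

SPLASH_ENDPOINT = 'http://localhost:8050/render.html'  # assumed local Splash service

def to_splash_request(request):
    """Illustrative: re-point a request at the Splash render endpoint."""
    payload = json.dumps({'url': request.url, 'wait': 2.0})
    return Request(url=SPLASH_ENDPOINT, method='POST',
                   headers={'Content-Type': 'application/json'},
                   body=payload, meta={'original_url': request.url},
                   dont_filter=True)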
Random User-Agent middleware scrapy_plus.middlewares.RandomUserAgentMiddleware
Picks a random User-Agent for each request.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_plus.middlewares.RandomUserAgentMiddleware': 500
}
# Add as many UA strings as you like; the middleware chooses one at random
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0',
'Mozilla/5.0 (Linux; U; Android 2.2) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
'Mozilla/5.0 (Linux; Android 6.0.1; SM-G532G Build/MMB29T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.83 Mobile Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0.3 Safari/604.5.6'
]
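The middleware itself need do little more than pick one of these strings per request; a sketch of the obvious implementation (not the package's actual source):

import random

class RandomUserAgentSketch:
    """Sketch: set a random User-Agent header on each outgoing request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)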
Tor middleware scrapy_plus.middlewares.TorProxyMiddleware
Onion-routing proxy middleware that keeps switching your spider's IP address.
Requires tor and privoxy to be installed first; see 《虫术——Python绝技》 for setup details.
# Tor proxy
TOR_PROXY = 'http://127.0.0.1:8118' # 8118 is Privoxy's default listening port
TOR_CTRL_PORT = 9051
TOR_PASSWORD = 'mypassword'
TOR_CHANGE_AFTER_TIMES = 50 # switch to a new IP after this many requests
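The README does not show the registration for this middleware; presumably it mirrors the others. The rotation itself is normally done by sending Tor's control port a NEWNYM signal once TOR_CHANGE_AFTER_TIMES requests have gone out. A sketch of both (the priority value and the stem-based helper are illustrative, not from the package):

from stem import Signal
from stem.control import Controller

# Assumed registration, by analogy with the other middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_plus.middlewares.TorProxyMiddleware': 750
}

def renew_tor_identity(password, ctrl_port=9051):
    """Ask Tor for a fresh circuit, which yields a new exit IP."""
    with Controller.from_port(port=ctrl_port) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)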
Pipelines
MongoDB storage pipeline scrapy_plus.pipelines.MongoDBPipeline
Writes Items straight into a MongoDB database.
Default configuration
ITEM_PIPELINES = {'scrapy_plus.pipelines.MongoDBPipeline': 2}
MONGODB_SERVER = "localhost" # MongoDB server address
MONGODB_PORT = 27017 # MongoDB server port
MONGODB_DB = "database name" # database to write to
MONGODB_COLLECTION = "collection name" # collection to write to
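Such a pipeline amounts to opening a pymongo connection and inserting each item as a document; a sketch against the settings above (not the package's actual source):

import pymongo

class MongoDBPipelineSketch:
    """Sketch: insert every scraped Item into a MongoDB collection."""

    def open_spider(self, spider):
        settings = spider.crawler.settings
        self.client = pymongo.MongoClient(settings.get('MONGODB_SERVER'),
                                          settings.getint('MONGODB_PORT'))
        db = self.client[settings.get('MONGODB_DB')]
        self.collection = db[settings.get('MONGODB_COLLECTION')]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item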
Storage backends
SQL database storage backend scrapy_plus.extensions.SQLFeedStorage
# Feed storage
ORM_MODULE = 'movies.entities'
ORM_METABASE = 'Base'
ORM_ENTITY = 'Movie'
FEED_FORMAT = 'entity' # use the SQL item exporter registered below
FEED_EXPORTERS = {
    'entity': 'scrapy_plus.extensions.SQLItemExporter'
}
FEED_URI = 'dialect+driver://username:password@host:port/database' # SQLAlchemy-style database URL
FEED_STORAGES = {
    'sqlite': 'scrapy_plus.extensions.SQLFeedStorage',
    'postgresql': 'scrapy_plus.extensions.SQLFeedStorage',
    'mysql': 'scrapy_plus.extensions.SQLFeedStorage'
}
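The ORM_* settings point at a SQLAlchemy-style declarative model: ORM_MODULE names the module to import, ORM_METABASE the declarative base, and ORM_ENTITY the mapped class. A movies/entities.py consistent with the values above might look like this (the columns are hypothetical):

# movies/entities.py -- hypothetical model matching the ORM_* settings
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()       # referenced by ORM_METABASE = 'Base'

class Movie(Base):              # referenced by ORM_ENTITY = 'Movie'
    __tablename__ = 'movies'
    id = Column(Integer, primary_key=True)
    title = Column(String(200))
    year = Column(Integer)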
Aliyun OSS storage backend scrapy_plus.extensions.OSSFeedStorage
OSS_ACCESS_KEY_ID = ''
OSS_SECRET_ACCESS_KEY = ''
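The OSS backend's wiring is not shown in this README; by analogy with the SQL backend it would presumably be registered under a URI scheme, along these lines (the scheme and URI format are assumptions, not documented values):

# Assumed wiring, by analogy with the SQL backend above:
FEED_STORAGES = {
    'oss': 'scrapy_plus.extensions.OSSFeedStorage'
}
FEED_URI = 'oss://your-bucket/items.json'  # hypothetical URI format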