Scrapy utils
scrapyu
UserAgentMiddleware
# settings.py
USERAGENT_TYPE = 'firefox'
DOWNLOADER_MIDDLEWARES = {
    'scrapyu.UserAgentMiddleware': 543,
}
MarkdownPipeline
# settings.py
MARKDOWNS_STORE = 'news'
ITEM_PIPELINES = {
    'scrapyu.MarkdownPipeline': 300,
}
# items.py
import scrapy


class MarkdownItem(scrapy.Item):
    html = scrapy.Field()
    filename = scrapy.Field()
FirefoxCookiesMiddleware
# settings.py
GECKODRIVER_PATH = 'geckodriver'
DOWNLOADER_MIDDLEWARES = {
    'scrapyu.FirefoxCookiesMiddleware': 543,
}
MongoDBPipeline
# settings.py
MONGODB_URI = 'mongodb://localhost:27017'
# or
# MONGODB_HOST = 'localhost'
# MONGODB_PORT = 27017
MONGODB_DATABASE = 'scrapyu'
MONGODB_COLLECTION = 'items'
MONGODB_BUFFER_LENGTH = 100
MONGODB_UNIQUE_KEY = 'title name' # use only if no buffer
# or
# MONGODB_UNIQUE_KEY = ['title', 'name']
# MONGODB_UNIQUE_KEY = ('title', 'name')
ITEM_PIPELINES = {
    'scrapyu.MongoDBPipeline': 300,
}
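The three spellings of MONGODB_UNIQUE_KEY shown above (space-separated string, list, tuple) presumably name the same set of fields. The sketch below shows one way such a value could be normalized and turned into a query filter for an upsert. `normalize_unique_key` and `build_filter` are hypothetical helpers written for illustration; they are not part of scrapyu's documented API, and the actual pipeline may behave differently.

```python
def normalize_unique_key(value):
    """Turn a MONGODB_UNIQUE_KEY-style value into a list of field names.

    Hypothetical helper: accepts a whitespace-separated string,
    a list, or a tuple.
    """
    if value is None:
        return []
    if isinstance(value, str):
        return value.split()
    return list(value)


def build_filter(item, unique_key):
    """Build a query filter from the unique-key fields of an item."""
    return {field: item[field] for field in normalize_unique_key(unique_key)}


item = {'title': 'Hello', 'name': 'world', 'body': '...'}
query = build_filter(item, 'title name')
# A filter like this could be passed to pymongo's
# collection.update_one(query, {'$set': item}, upsert=True).
```

With a filter of this shape, two items that share the same `title` and `name` would update one document instead of creating two.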
RedisDupeFilter
# settings.py
DUPEFILTER_CLASS = 'scrapyu.RedisDupeFilter'
REDIS_DUPE_HOST = 'localhost'
REDIS_DUPE_PORT = 6379
REDIS_DUPE_DATABASE = 0
REDIS_DUPE_PASSWORD = 'password'
REDIS_DUPE_KEY = 'requests'
REDIS_DUPE_IGNORE_URL = r'http://scrapytest.org/\d+'
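To see which URLs the REDIS_DUPE_IGNORE_URL pattern above covers, it can be tried against a few sample URLs with Python's `re` module. This is only a sketch: the source does not say whether scrapyu applies the pattern with `re.match` or `re.search`, or exactly what happens to matching requests, so `is_ignored` is a hypothetical helper that assumes matching from the start of the URL.

```python
import re

# Same pattern as the REDIS_DUPE_IGNORE_URL setting above.
IGNORE_URL = re.compile(r'http://scrapytest.org/\d+')


def is_ignored(url):
    """Return True if the URL matches the ignore pattern from its start.

    Hypothetical helper; assumes re.match semantics.
    """
    return IGNORE_URL.match(url) is not None


samples = [
    'http://scrapytest.org/123',     # matches: digits after the slash
    'http://scrapytest.org/about',   # no match: no digits after the slash
    'https://scrapytest.org/123',    # no match: scheme is https
    'http://scrapytestXorg/123',     # matches: the unescaped '.' accepts any character
]
results = {url: is_ignored(url) for url in samples}
```

Because the dot in the pattern is not escaped, it matches any character; writing `scrapytest\.org` would restrict the pattern to the literal host name.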
genspider
scrapyu genspider -l
results in:
Available templates:
  single
  single_splash
Generate a single-file spider:
scrapyu genspider python www.python.org -t single
Generate a single-file spider with Splash integration:
scrapyu genspider python www.python.org -t single_splash