Scrapy WARC I/O
Scrapy Warcio
A Web Archive WARC I/O module for Scrapy
Install
$ pip install scrapy-warcio
Usage
- Create a project and spider (see https://docs.scrapy.org/en/latest/intro/tutorial.html):
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
- Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000 # 10GB
collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~ # robots policy
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
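The `~` values in the template are YAML nulls, so a copy that has not been edited leaves most fields unset. A minimal sketch of a pre-crawl check, assuming the file has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`). The helper name `unset_settings` is hypothetical and not part of scrapy_warcio, and treating every one of these keys as required is an assumption:

```python
# Keys taken from the settings.yml template above.
EXPECTED_KEYS = (
    "collection",
    "description",
    "operator",
    "robots",
    "user_agent",
    "warc_prefix",
    "warc_dest",
)


def unset_settings(settings):
    """Return the expected keys that are missing or still null (hypothetical helper)."""
    return [key for key in EXPECTED_KEYS if settings.get(key) is None]


example = {
    "collection": "demo",
    "description": None,  # left as `~` in the YAML file
    "operator": "ops@example.com",
    "warc_prefix": "demo",
    "warc_dest": "/tmp/warcs",
}
print(unset_settings(example))  # ['description', 'robots', 'user_agent']
```

Running a check like this before starting the spider surfaces a half-edited settings file early instead of partway through a crawl.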
- Export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml
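If the variable is missing or points at the wrong file, the failure can be hard to trace from inside a crawl. A small sketch of an up-front check using only the standard library; `settings_path` is a hypothetical helper, not a scrapy_warcio function:

```python
import os


def settings_path(env=os.environ):
    """Return the path held in SCRAPY_WARCIO_SETTINGS, or raise a clear error."""
    path = env.get("SCRAPY_WARCIO_SETTINGS")
    if not path:
        raise RuntimeError("SCRAPY_WARCIO_SETTINGS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    return path
```

Calling `settings_path()` at the top of a launch script would stop early with a readable message rather than an error deep in the middleware.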
- Enable DOWNLOADER_MIDDLEWARES in <project>/<project>/settings.py:
DOWNLOADER_MIDDLEWARES = {
    # dotted path to your middleware class; adjust to match your project
    'warcio.middlewares.WarcioDownloaderMiddleware': 543,
}
- Import and use scrapy_warcio methods in <project>/<project>/middlewares.py:
import scrapy_warcio


class YourSpiderDownloaderMiddlewares:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # set WARC-Date for both request and response
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        # optional
        spider.logger.info('warcio request: %s', request.url)
        return None

    def process_response(self, request, response, spider):
        # write response and request
        self.warcio.write_response(response)
        # optional
        spider.logger.info('warcio response: %s', response.url)
        spider.logger.info('warc_count: %s', self.warcio.warc_count)
        spider.logger.info('warc_fname: %s', self.warcio.warc_fname)
        spider.logger.info('warc_size: %s', self.warcio.warc_size)
        return response
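The WARC 1.0 specification expects WARC-Date to be a UTC timestamp in the form YYYY-MM-DDThh:mm:ssZ. The sketch below shows one way to produce such a value with the standard library. It is an illustration of the format only, not necessarily how `scrapy_warcio.warc_date()` is implemented, and `utc_warc_date` is a hypothetical name:

```python
from datetime import datetime, timezone


def utc_warc_date(moment=None):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    moment = moment or datetime.now(timezone.utc)
    # Convert to UTC first so the trailing "Z" is accurate.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(utc_warc_date(datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
# 2020-01-02T03:04:05Z
```

Pass timezone-aware datetimes; a naive datetime would be interpreted as local time by `astimezone`.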
- Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
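Before running a full validation, a quick sanity check can confirm that a file is readable gzip and begins with a WARC version line. This is a rough standard-library sketch, not a substitute for warcvalid, and `looks_like_warc` is a hypothetical helper:

```python
import gzip


def looks_like_warc(path):
    """Return True if the gzip file's first line starts with 'WARC/'."""
    with gzip.open(path, "rb") as handle:
        first_line = handle.readline()
    return first_line.startswith(b"WARC/")
```

A file that is not gzip-compressed raises an `OSError` from the `gzip` module instead of returning False, which is itself a useful signal that the output is not a `.warc.gz`.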
- Upload your WARC(s) to your favorite web archive!
Help
$ pydoc scrapy_warcio
or
>>> import scrapy_warcio
>>> help(scrapy_warcio)
TODO
Making this a Scrapy extension may make it more useful: https://docs.scrapy.org/en/latest/topics/extensions.html
@internetarchive