Scrapy WARC I/O

Scrapy Warcio

A Web Archive WARC I/O module for Scrapy



$ pip install scrapy-warcio


  1. Create a project and spider:
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> <domain>
  2. Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:
max_warc_size: 10000000000  # 10GB

collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~  # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
  3. Export the path to your settings file:
$ export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml
  4. Add WarcioDownloaderMiddleware (distributed with scrapy_warcio) to the middlewares module in your <project>/<project>/ directory:

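import scrapy_warcio


class WarcioDownloaderMiddleware:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Record the request time for the WARC-Date header.
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        # Returning None lets Scrapy continue handling the request.
        return None

    def process_response(self, request, response, spider):
        # Write the response and its request to the WARC file.
        self.warcio.write(response, request)
        # Pass the response on unchanged.
        return response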
  5. Enable WarcioDownloaderMiddleware in the Scrapy settings module in your <project>/<project>/ directory:
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
  6. Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
  7. Upload your WARC(s) to your favorite web archive!
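
Before starting a crawl, it can help to confirm that the settings steps above took effect. The following is a minimal sketch, not part of scrapy_warcio: `settings_path` and `unset_keys` are hypothetical helper names, and the check assumes that every key in the distributed settings.yml should be filled in, which the package itself may not require.

```python
import os

# Keys taken from the distributed settings.yml shown in the steps above.
REQUIRED_KEYS = (
    "max_warc_size", "collection", "description", "operator",
    "robots", "user_agent", "warc_prefix", "warc_dest",
)


def settings_path(environ=os.environ):
    """Return the path held in SCRAPY_WARCIO_SETTINGS, or raise if unusable."""
    path = environ.get("SCRAPY_WARCIO_SETTINGS")
    if not path:
        raise RuntimeError("SCRAPY_WARCIO_SETTINGS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    return path


def unset_keys(settings):
    """List keys that are missing or still null (YAML `~` loads as None)."""
    # Assumption: an unset key is a mistake; scrapy_warcio may tolerate some.
    return [key for key in REQUIRED_KEYS if settings.get(key) is None]
```

With PyYAML installed, the settings dictionary could be obtained with `yaml.safe_load` on the file returned by `settings_path()` and then passed to `unset_keys`.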

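As a quick pre-check before running warcvalid, the sketch below uses only the Python standard library to test whether a file is readable gzip data whose first line is a WARC version line (such as `WARC/1.0`). The function name `looks_like_warc` is hypothetical, and this is far weaker than real validation; it only catches empty, truncated, or non-WARC output early.

```python
import gzip


def looks_like_warc(path):
    """Return True if a gzipped file begins with a WARC version line."""
    try:
        with gzip.open(path, "rb") as stream:
            first_line = stream.readline()
    except (OSError, EOFError):
        # Missing file, non-gzip data, or a truncated gzip stream.
        return False
    return first_line.startswith(b"WARC/")
```

A file that passes this check should still be run through warcvalid, which inspects every record rather than just the first line.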

$ pydoc scrapy_warcio


>>> import scrapy_warcio
>>> help(scrapy_warcio)


Making this a Scrapy extension may make it more useful.


Files for scrapy-warcio, version 0.0.8

Filename, size                                    File type  Python version
scrapy_warcio-0.0.8-py3-none-any.whl (6.3 kB)     Wheel      py3
scrapy-warcio-0.0.8.tar.gz (5.2 kB)               Source     None
