Scrapy WARC I/O

Scrapy Warcio

A Web Archive WARC I/O module for Scrapy

Install

$ pip install scrapy-warcio

Usage

  1. Create a project and spider:
    https://docs.scrapy.org/en/latest/intro/tutorial.html
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
  2. Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000  # 10GB

collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~  # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
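Every field set to `~` in the template is a YAML null, so a forgotten field loads as `None`. A minimal sketch of a pre-flight check follows. It assumes you have already parsed settings.yml into a dict with a YAML library of your choice, and it treats every `~` field as required, which is an assumption rather than documented scrapy_warcio behavior. The names `REQUIRED_KEYS`, `ROBOTS_POLICIES`, and `check_settings` are hypothetical helpers, not part of the package.

```python
# Fields left as "~" (YAML null) in the distributed template.
# Assumption: all of them should be filled in before crawling.
REQUIRED_KEYS = (
    "collection", "description", "operator", "robots",
    "user_agent", "warc_prefix", "warc_dest",
)
# The template comment lists these two robots policies.
ROBOTS_POLICIES = ("obey", "ignore")


def check_settings(settings):
    """Return a list of problems found; an empty list means none."""
    problems = []
    for key in REQUIRED_KEYS:
        if settings.get(key) is None:
            problems.append(f"{key} is not set")
    robots = settings.get("robots")
    if robots is not None and robots not in ROBOTS_POLICIES:
        problems.append(f"robots should be one of {ROBOTS_POLICIES}, got {robots!r}")
    return problems


# The untouched template as a YAML parser would load it: "~" becomes None.
template = {
    "warc_spec": "https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/",
    "max_warc_size": 10000000000,
    **dict.fromkeys(REQUIRED_KEYS),
}

for problem in check_settings(template):
    print(problem)
```

Running the check before starting a crawl surfaces an unedited template early, instead of partway through writing WARC files.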
  3. Export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml

  4. Add WarcioDownloaderMiddleware (distributed as middlewares.py) to your <project>/<project>/middlewares.py:

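import scrapy_warcio


class WarcioDownloaderMiddleware:
    """Downloader middleware that archives each response with scrapy_warcio."""

    def __init__(self):
        # Configured via the file named by SCRAPY_WARCIO_SETTINGS.
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Stamp the request with a WARC-Date before it is downloaded.
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        # Returning None lets Scrapy continue processing the request.
        return None

    def process_response(self, request, response, spider):
        # Archive the response together with its request, then pass it on.
        self.warcio.write(response, request)
        return response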
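
The middleware stamps each request with a WARC-Date. The WARC 1.0 specification defines that field as a UTC timestamp of the form YYYY-MM-DDThh:mm:ssZ. As an illustration of that format only, here is a standard-library sketch; `iso_warc_date` is a hypothetical name, and this is not necessarily how `scrapy_warcio.warc_date()` is implemented.

```python
from datetime import datetime, timezone


def iso_warc_date(moment=None):
    """Format a datetime as a WARC 1.0 style UTC timestamp.

    Defaults to the current time. Naive datetimes are interpreted
    as local time by astimezone().
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(iso_warc_date(datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
# 2020-01-02T03:04:05Z
```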
  5. Enable WarcioDownloaderMiddleware in <project>/<project>/settings.py:
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
  6. Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
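
Before running a full validation, a quick sanity check can catch files that are empty, truncated at the start, or not gzip at all. The sketch below uses only the standard library and checks that the decompressed data begins with a `WARC/` version line. It is not a substitute for warcvalid, and `looks_like_warc` is a hypothetical helper name.

```python
import gzip
import os
import tempfile


def looks_like_warc(path):
    """Return True if the gzip file at path starts with a WARC version line."""
    try:
        with gzip.open(path, "rb") as stream:
            return stream.readline().startswith(b"WARC/")
    except (OSError, EOFError):
        # Missing file, not gzip data, or a truncated gzip stream.
        return False


# Demo on a throwaway file; only the first line matters for this check,
# so the sample is not a complete WARC record.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.warc.gz")
    with gzip.open(sample, "wb") as out:
        out.write(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n")
    print(looks_like_warc(sample))  # True
```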
  7. Upload your WARC(s) to your favorite web archive!

Help

$ pydoc scrapy_warcio

or

>>> help(scrapy_warcio)

TODO

Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html

@internetarchive

Download files

Files for scrapy-warcio, version 0.0.8
scrapy_warcio-0.0.8-py3-none-any.whl (6.3 kB), Wheel, Python version py3
scrapy-warcio-0.0.8.tar.gz (5.2 kB), Source