Scrapy WARC I/O

Scrapy Warcio

A Web Archive WARC I/O module for Scrapy

Install

$ pip install scrapy-warcio

Usage

  1. Create a project and spider:
    https://docs.scrapy.org/en/latest/intro/tutorial.html
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
  2. Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000  # 10GB

collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~  # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
  3. Point the SCRAPY_WARCIO_SETTINGS environment variable at your settings file:
$ export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml

  4. Add WarcioDownloaderMiddleware (distributed as middlewares.py) to your <project>/<project>/middlewares.py:

import scrapy_warcio


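class WarcioDownloaderMiddleware:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Stamp the outgoing request with a WARC-formatted date.
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        # Returning None lets Scrapy continue processing the request.
        return None

    def process_response(self, request, response, spider):
        # Archive the response together with the request that produced it.
        self.warcio.write(response, request)
        # Pass the response on unchanged to the rest of the pipeline.
        return response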
  5. Enable WarcioDownloaderMiddleware in <project>/<project>/settings.py:
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
  6. Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
  7. Upload your WARC(s) to your favorite web archive!
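
Before running warcvalid or uploading, a quick local sanity check can catch empty or obviously wrong output files. The sketch below is not part of scrapy_warcio and is no substitute for warcvalid. It assumes the WARC files are gzip-compressed (as the WARC.warc.gz example suggests) and only checks that the file starts with the gzip magic bytes and that its first decompressed line is a WARC version line. The function name looks_like_warc_gz is hypothetical.

```python
import gzip
import zlib

# Every gzip member starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_warc_gz(path):
    """Return True if `path` is gzip data whose first line starts with b"WARC/".

    This is only a cheap plausibility check, not a validation of the
    records inside the file.
    """
    # Check the gzip magic number without decompressing anything.
    with open(path, "rb") as raw:
        if raw.read(2) != GZIP_MAGIC:
            return False

    # Decompress just enough to read the first line of the first record.
    try:
        with gzip.open(path, "rb") as stream:
            first_line = stream.readline()
    except (OSError, EOFError, zlib.error):
        # Corrupt or truncated gzip data.
        return False

    # A WARC record begins with a version line such as "WARC/1.0".
    return first_line.startswith(b"WARC/")
```

For example, looks_like_warc_gz("WARC.warc.gz") could be called on each file under your warc_dest directory, with any file that returns False set aside for a closer look.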

Help

$ pydoc scrapy_warcio

or

>>> help(scrapy_warcio)

TODO

Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html

@internetarchive

Download files

Files for scrapy-warcio, version 0.0.8

Filename                              Size    File type  Python version
scrapy_warcio-0.0.8-py3-none-any.whl  6.3 kB  Wheel      py3
scrapy-warcio-0.0.8.tar.gz            5.2 kB  Source     None