
Scrapy WARC I/O


Scrapy Warcio

A Web Archive WARC I/O module for Scrapy


Install

$ pip install scrapy-warcio

Usage

  1. Create a project and spider, following the Scrapy tutorial (https://docs.scrapy.org/en/latest/intro/tutorial.html):
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
  2. Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000  # 10GB

collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~  # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
  3. Point the SCRAPY_WARCIO_SETTINGS environment variable at your settings file:
$ export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml

  4. Add WarcioDownloaderMiddleware (distributed as middlewares.py) to your <project>/<project>/middlewares.py:

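import scrapy_warcio


class WarcioDownloaderMiddleware:
    """Downloader middleware that hands each response and its request to ScrapyWarcIo."""

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Stamp the request with a WARC-Date value before it is downloaded.
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        return None  # None tells Scrapy to continue processing this request

    def process_response(self, request, response, spider):
        # Write the exchange to the WARC, then pass the response through unchanged.
        self.warcio.write(response, request)
        return response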
  5. Enable WarcioDownloaderMiddleware in <project>/<project>/settings.py:
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
  6. Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
  7. Upload your WARC(s) to your favorite web archive!
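
Before starting a crawl, the settings steps above can be pre-checked with a few lines of Python. The following is a minimal sketch, not part of scrapy_warcio: the helper names `settings_path` and `unset_keys` are hypothetical, and treating every `~` key of the distributed settings.yml as required is an assumption about what the module expects.

```python
import os
from pathlib import Path

# Keys that ship as "~" (YAML null) in the distributed settings.yml.
# Assumption: all of them should be filled in before crawling.
REQUIRED_KEYS = (
    "collection", "description", "operator", "robots",
    "user_agent", "warc_prefix", "warc_dest",
)


def settings_path(environ=os.environ):
    """Return the path named by SCRAPY_WARCIO_SETTINGS, or raise if unusable."""
    value = environ.get("SCRAPY_WARCIO_SETTINGS")
    if not value:
        raise RuntimeError("SCRAPY_WARCIO_SETTINGS is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"settings file not found: {path}")
    return path


def unset_keys(settings):
    """List required keys that are missing or still null in a parsed settings mapping."""
    return [key for key in REQUIRED_KEYS if settings.get(key) is None]
```

With PyYAML installed, `unset_keys(yaml.safe_load(settings_path().read_text()))` would list the keys still left as `~`, since YAML parses `~` as null.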

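Before running warcvalid over a large batch, a quick readability check can catch files that are not gzip-compressed WARCs at all. This is a rough sketch using only the Python standard library. It relies on the fact that every WARC record begins with a version line such as `WARC/1.0`; the function name `looks_like_warc` is hypothetical, and the check does not replace full validation with warctools.

```python
import gzip
import zlib


def looks_like_warc(path):
    """Return True if the gzip file at `path` starts with a WARC version line."""
    try:
        with gzip.open(path, "rb") as stream:
            first_line = stream.readline()
    except (OSError, EOFError, zlib.error):
        # Not a gzip file, or the compressed data is truncated or corrupt.
        return False
    return first_line.startswith(b"WARC/")
```

A file that passes this check can still be invalid, so it is only useful for skipping obviously broken output before the slower validation step.
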
Help

$ pydoc scrapy_warcio

or

>>> help(scrapy_warcio)

TODO

Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html

@internetarchive
