Scrapy WARC I/O
Scrapy Warcio
A Web Archive WARC I/O module for Scrapy
Install
$ pip install scrapy-warcio
Usage
- Create a project and spider:
https://docs.scrapy.org/en/latest/intro/tutorial.html
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
- Copy and edit the settings.yml distributed with scrapy_warcio, filling in your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000 # 10GB
collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~ # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
- Export the path to your settings file:
$ export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml
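A crawl started without this variable set may fail in a way that is hard to trace back to configuration. The following is a minimal sketch of a pre-flight check, not part of scrapy_warcio itself; the helper name settings_path_from_env is hypothetical, and it only assumes that the variable should point at an existing file.

```python
import os
from pathlib import Path


def settings_path_from_env(var="SCRAPY_WARCIO_SETTINGS"):
    """Return the settings path named by the environment variable.

    Hypothetical helper: raises a descriptive error if the variable is
    unset or does not point at an existing file.
    """
    value = os.environ.get(var)
    if not value:
        raise RuntimeError(f"{var} is not set; export it before crawling")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"{var} points to a missing file: {path}")
    return path
```

Calling settings_path_from_env() at the top of a launch script surfaces a missing export before any requests are made.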
- Add WarcioDownloaderMiddleware (distributed as middlewares.py) to your <project>/<project>/middlewares.py:
import scrapy_warcio


class WarcioDownloaderMiddleware:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Stamp the outgoing request with a WARC-Date value
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        return None

    def process_response(self, request, response, spider):
        # Hand the response and its request to the WARC writer
        self.warcio.write(response, request)
        return response
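The middleware stores a WARC-Date value on each request. The WARC 1.0 specification defines WARC-Date as a UTC timestamp of the form YYYY-MM-DDThh:mm:ssZ. The sketch below produces a value in that format with the standard library. It is an illustration only: warc_date_sketch is a hypothetical name, and the actual output of scrapy_warcio.warc_date() is assumed, not confirmed, to follow the same layout.

```python
from datetime import datetime, timezone


def warc_date_sketch(now=None):
    """Format a timestamp as YYYY-MM-DDThh:mm:ssZ in UTC.

    Hypothetical stand-in for illustration; pass a timezone-aware
    datetime (a naive one is interpreted as local time by astimezone).
    """
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    print(warc_date_sketch())
```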
- Enable WarcioDownloaderMiddleware in <project>/<project>/settings.py:
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
- Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
- Upload your WARC(s) to your favorite web archive!
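Before running warcvalid or uploading, a quick sanity check can catch empty or obviously wrong output files. The sketch below is not a substitute for warcvalid: it only checks that the first line of the (optionally gzip-compressed) file starts with "WARC/", the version line that opens a WARC record. The function name looks_like_warc is hypothetical.

```python
import gzip


def looks_like_warc(path):
    """Return True if the file's first line starts with b'WARC/'.

    Hypothetical quick check; handles plain and gzip-compressed files
    by sniffing the two-byte gzip magic number.
    """
    with open(path, "rb") as raw:
        magic = raw.read(2)
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as stream:
        return stream.readline().startswith(b"WARC/")
```

A file that passes this check can still be invalid, so full validation remains the job of warcvalid.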
Help
$ pydoc scrapy_warcio
or
>>> help(scrapy_warcio)
TODO
Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html
@internetarchive