Scrapy WARC I/O
Scrapy Warcio
A Web Archive WARC I/O module for Scrapy
Install
$ pip install scrapy-warcio
Usage
- Create a project and spider:
https://docs.scrapy.org/en/latest/intro/tutorial.html
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
- Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000 # 10GB
collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~ # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
- Export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml
- Add WarcioDownloaderMiddleware (distributed as middlewares.py) to your <project>/<project>/middlewares.py:
import scrapy_warcio


class WarcioDownloaderMiddleware:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Stamp the request with a WARC-Date before it is downloaded.
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        return None

    def process_response(self, request, response, spider):
        # Write the response/request pair, then pass the response on unchanged.
        self.warcio.write(response, request)
        return response
- Enable WarcioDownloaderMiddleware in <project>/<project>/settings.py:
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
- Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
- Upload your WARC(s) to your favorite web archive!
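Before starting a crawl it can help to confirm that the settings template from the steps above has been filled in. The following is a minimal sketch and not part of scrapy_warcio. It assumes settings.yml has already been parsed into a dict by a YAML library (parsing is not shown), in which case every ~ placeholder arrives as None. The names EXPECTED_KEYS, unset_settings and settings_path are hypothetical helpers, and whether scrapy_warcio itself requires every key is an assumption.

```python
import os

# Keys copied from the settings.yml template shown in the Usage steps.
EXPECTED_KEYS = (
    "warc_spec", "max_warc_size", "collection", "description",
    "operator", "robots", "user_agent", "warc_prefix", "warc_dest",
)


def unset_settings(settings):
    """Return the expected keys that are absent or still null (YAML ~)."""
    return [key for key in EXPECTED_KEYS if settings.get(key) is None]


def settings_path(environ=os.environ):
    """Return the path exported as SCRAPY_WARCIO_SETTINGS, or raise if unset."""
    path = environ.get("SCRAPY_WARCIO_SETTINGS")
    if not path:
        raise RuntimeError("SCRAPY_WARCIO_SETTINGS is not set")
    return path


if __name__ == "__main__":
    # Stand-in for a parsed settings.yml in which only some values are filled in.
    parsed = {
        "warc_spec": "https://iipc.github.io/warc-specifications/"
                     "specifications/warc-format/warc-1.0/",
        "max_warc_size": 10000000000,
        "robots": "obey",
    }
    print("still unset:", unset_settings(parsed))
```

Running a check like this before the spider means a forgotten placeholder shows up as a short list of key names, not as a problem found later in the archived output.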
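For a quick look at a file before running a full validator, the sketch below uses only the Python standard library. It is not a substitute for warcvalid: it only checks that the file is readable gzip data and that the first record begins with a WARC version line such as WARC/1.0, which is how every WARC record starts. The function names first_record_version and looks_like_warc are hypothetical and not part of scrapy_warcio or warctools.

```python
import gzip


def first_record_version(path):
    """Return the first line of a gzipped WARC file, e.g. 'WARC/1.0'."""
    with gzip.open(path, "rb") as stream:
        return stream.readline().strip().decode("ascii", errors="replace")


def looks_like_warc(path):
    """Return True if the file is gzip data whose first line is a WARC version."""
    try:
        return first_record_version(path).startswith("WARC/")
    except (OSError, EOFError):
        # Not gzip data, truncated, or unreadable.
        return False


if __name__ == "__main__":
    import sys

    # Check every path given on the command line.
    for name in sys.argv[1:]:
        verdict = "ok" if looks_like_warc(name) else "not a gzipped WARC"
        print(name, verdict)
```

A file that passes this check can still be invalid in other ways, so the warcvalid step above remains the real test.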
Help
$ pydoc scrapy_warcio
or
>>> help(scrapy_warcio)
TODO
Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html
@internetarchive
File details
Details for the file scrapy-warcio-0.0.8.tar.gz.
File metadata
- Download URL: scrapy-warcio-0.0.8.tar.gz
- Upload date:
- Size: 5.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.5.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3d2ea376b17ea805e0e39792e5f6aa8c1ca70c171bbd875e1ed30a51954b9481 |
| MD5 | bae9876dd5d72abb74cee0bdf671f7b1 |
| BLAKE2b-256 | 0aaee8011acb33b4cb3b7bc671df3f1bcdccae9353ddb4a27fdd633f31bea2ec |
File details
Details for the file scrapy_warcio-0.0.8-py3-none-any.whl.
File metadata
- Download URL: scrapy_warcio-0.0.8-py3-none-any.whl
- Upload date:
- Size: 6.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.5.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 20ad1c5882054f64581579e32009f7975843463a45be7cdb5839a3666ebe5073 |
| MD5 | dc2800d184d032f7604410f8065741e8 |
| BLAKE2b-256 | 61bbce38cdb87e46af477965952862431c9f9c4bda7e38866ffa73cb17a4990e |