Scrapy Item Feed Storage Backend for archive.org
scrapy-feed-storage-internetarchive
This is a Storage Backend for Scrapy Item Feeds that uploads feed files to archive.org when a scrape job ends.
It was created to make it easy to archive data that you are authorised to distribute, such as public data, at the Internet Archive.
Usage
Install the custom storage backend
Register the backend in your project settings under FEED_STORAGES. We recommend the scheme internetarchive:
FEED_STORAGES = {
    "internetarchive": "feedstorage_internetarchive.storages.InternetArchiveStorage",
}
Configure the Internet Archive metadata template
Metadata values can be specified using the settings key FEED_STORAGE_INTERNETARCHIVE, e.g.
FEED_STORAGE_INTERNETARCHIVE = {
    "metadata": {
        "mediatype": "data",
        "coverage": "South Africa",
        "title": "eTender Portal %(name)s %(time)s %(filetype)s",
    }
}
Configure the storage for your feeds
Configure your feeds with the Scrapy feed exporter settings, using the URI scheme you registered when installing the backend.
The feed URI should have the hostname archive.org, with your Internet Archive S3 API access key and secret in the username and password positions.
Only one level of path is allowed. It is used as the filename and is transformed into a unique identifier, so it must be unique across the whole of the Internet Archive. Including the scrape job timestamp in the path is a useful way to ensure uniqueness.
Extra parameters can be provided as query string parameters, which will then be templated into the metadata values.
e.g.
FEEDS = {
    "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/south-africa-%(name)s-%(time)s.csv?time=%(time)s&name=%(name)s&filetype=csv": {
        "format": "csv",
    },
    "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/south-africa-%(name)s-%(time)s.jsonlines?time=%(time)s&name=%(name)s&filetype=jsonlines": {
        "format": "jsonlines",
    },
}
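To make the parts of such a feed URI concrete, here is a minimal sketch of how it breaks down and how the query string parameters could be templated into the metadata. It is an illustration using only the standard library, not the backend's actual code: the spider name etenders and the timestamp are made-up values standing in for what Scrapy fills in for %(name)s and %(time)s, and the assumption is that metadata values are expanded with ordinary Python %-formatting.

```python
from urllib.parse import parse_qsl, urlparse

# Example feed URI after %(name)s and %(time)s have been filled in.
# The spider name, timestamp and credentials are made-up placeholders.
uri = (
    "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/"
    "south-africa-etenders-2020-06-30T12-00-00.csv"
    "?time=2020-06-30T12-00-00&name=etenders&filetype=csv"
)

parsed = urlparse(uri)
access_key = parsed.username          # username position of the URI
secret_key = parsed.password          # password position of the URI
filename = parsed.path.lstrip("/")    # the single path level
params = dict(parse_qsl(parsed.query))

# Metadata template as configured in FEED_STORAGE_INTERNETARCHIVE.
metadata_template = {
    "mediatype": "data",
    "coverage": "South Africa",
    "title": "eTender Portal %(name)s %(time)s %(filetype)s",
}

# Assumption: each value is expanded with %-formatting against the
# query string parameters. Values without placeholders are unchanged.
metadata = {key: value % params for key, value in metadata_template.items()}

print(filename)
print(metadata["title"])
```

Under this assumption, every placeholder used in a metadata value needs a matching query string parameter in the feed URI.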
You probably don't want to put credentials into your project settings module, since they can then easily be discovered if the module is added to source control. Instead, try to set them in the environment where you will run your spider.
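One way to keep the credentials out of source control is to build the FEEDS value from environment variables in the settings module. The sketch below is one possible approach, not part of this package: the variable names IA_S3_ACCESS_KEY and IA_S3_SECRET_KEY and the helper build_feeds are hypothetical, and it assumes the keys contain only URL-safe characters and no % signs.

```python
import os


def build_feeds(access_key, secret_key):
    """Build a FEEDS dict for CSV and JSON lines feeds on archive.org."""
    base = "internetarchive://" + access_key + ":" + secret_key + "@archive.org/"
    feeds = {}
    for filetype in ("csv", "jsonlines"):
        # %(name)s and %(time)s are left in place for Scrapy to fill in.
        uri = (
            base
            + "south-africa-%(name)s-%(time)s." + filetype
            + "?time=%(time)s&name=%(name)s&filetype=" + filetype
        )
        feeds[uri] = {"format": filetype}
    return feeds


# Hypothetical environment variable names. Missing variables fall back to
# empty strings here, so the settings module can still be imported.
FEEDS = build_feeds(
    os.environ.get("IA_S3_ACCESS_KEY", ""),
    os.environ.get("IA_S3_SECRET_KEY", ""),
)
```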
Scrapinghub
You can set the FEEDS key in Scrapinghub by providing the value dictionary as JSON on a single line in your spider's Raw Settings. For the above example, you would add the following line in the Scrapinghub Raw Settings:
FEEDS = {"internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/south-africa-%(name)s-%(time)s.csv?time=%(time)s&name=%(name)s": {"format": "csv"}, "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/south-africa-%(name)s-%(time)s.jsonlines?time=%(time)s&name=%(name)s": { "format": "jsonlines" }}
After saving, you should see it parsed into a key and value on the standard settings pane.
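If you would rather not write the single-line JSON by hand, a short script can generate it from the dictionary. This is only a convenience sketch using the standard library json module; the variable names feeds and raw_setting_line are illustrative.

```python
import json

# The same feeds as in the Raw Settings line above, as a Python dict.
feeds = {
    "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/"
    "south-africa-%(name)s-%(time)s.csv?time=%(time)s&name=%(name)s": {
        "format": "csv",
    },
    "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/"
    "south-africa-%(name)s-%(time)s.jsonlines?time=%(time)s&name=%(name)s": {
        "format": "jsonlines",
    },
}

# json.dumps without an indent argument emits the value on a single line.
raw_setting_line = "FEEDS = " + json.dumps(feeds)
print(raw_setting_line)
```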