
Scrapy Item Feed Storage Backend for archive.org

Project description

scrapy-feed-storage-internetarchive

This is a storage backend for Scrapy item feeds that uploads feed files to archive.org when a scrape job ends.

This was created to make it easy to archive data that you are authorised to distribute at the Internet Archive, e.g. public data.

Usage

Install the custom storage backend

After installing the package (pip install scrapy-feed-storage-internetarchive), register the storage backend in the FEED_STORAGES setting. We recommend the URI scheme internetarchive:

FEED_STORAGES = {
    "internetarchive": "feedstorage_internetarchive.storages.InternetArchiveStorage",
}

Configure the Internet Archive metadata template

Metadata values can be specified using the settings key FEED_STORAGE_INTERNETARCHIVE, e.g.

FEED_STORAGE_INTERNETARCHIVE = {
    "metadata": {
        "mediatype": "data",
        "coverage": "South Africa",
        "title": "eTender Portal %(name)s %(time)s %(filetype)s",
    }
}

Configure the storage for your feeds

Use the feed exporter FEEDS setting with the URI scheme under which you registered the backend.

Each feed URI should use the hostname archive.org, with your Internet Archive S3 API access key and secret in the username and password positions.

Only one level of path is allowed. This path is used as the filename and is transformed into the item identifier, which must be unique across all of the Internet Archive. Including the scrape job timestamp in the path is a useful way to ensure uniqueness.

Extra parameters can be provided as query string parameters, which will then be templated into the metadata values.

e.g.

FEEDS = {
    "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/south-africa-%(name)s-%(time)s.csv?time=%(time)s&name=%(name)s&filetype=csv": {
        "format": "csv",
    },
    "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/south-africa-%(name)s-%(time)s.jsonlines?time=%(time)s&name=%(name)s&filetype=jsonlines": {
        "format": "jsonlines",
    },
}
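Scrapy resolves the %(name)s and %(time)s placeholders in the URI with printf-style string formatting, and the query-string parameters are then templated into the metadata values the same way. A minimal sketch of that substitution, using hypothetical values for a spider named tenders:

# Illustration only: how the query-string parameters fill the metadata template.
template = "eTender Portal %(name)s %(time)s %(filetype)s"
params = {"name": "tenders", "time": "2021-01-01T00-00-00", "filetype": "csv"}
print(template % params)  # eTender Portal tenders 2021-01-01T00-00-00 csv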

You probably don't want to put credentials in your project settings module, since they can easily be discovered if the module is added to source control. Instead, set them in the environment where you will run your spider, for example:
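A minimal sketch of building the feed URI from environment variables in settings.py; the variable names IA_ACCESS_KEY and IA_SECRET_KEY are assumptions, not something this package defines:

import os

# Hypothetical variable names -- export IA_ACCESS_KEY and IA_SECRET_KEY
# in the shell before running the spider.
IA_ACCESS_KEY = os.environ["IA_ACCESS_KEY"]
IA_SECRET_KEY = os.environ["IA_SECRET_KEY"]

FEEDS = {
    (
        f"internetarchive://{IA_ACCESS_KEY}:{IA_SECRET_KEY}@archive.org/"
        "south-africa-%(name)s-%(time)s.csv"
        "?time=%(time)s&name=%(name)s&filetype=csv"
    ): {"format": "csv"},
}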

Scrapinghub

You can set the FEEDS key in Scrapinghub by providing the value dictionary as JSON on a single line in your spider's Raw Settings. For the example above, you would add the following line:

FEEDS = {"internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/south-africa-%(name)s-%(time)s.csv?time=%(time)s&name=%(name)s": {"format": "csv"}, "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/south-africa-%(name)s-%(time)s.jsonlines?time=%(time)s&name=%(name)s": { "format": "jsonlines" }}

After saving, you should see it parsed into a key and value on the standard settings pane.
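If the dictionary becomes unwieldy to edit by hand, one way to produce that single line is to build it in Python and serialise it with json.dumps, a minimal sketch (the credentials are placeholders):

import json

# Placeholder credentials; the %(...)s placeholders are left intact for
# Scrapy to fill at runtime.
feeds = {
    "internetarchive://YourIAS3AccessKey:YourIAS3APISecretKey@archive.org/"
    "south-africa-%(name)s-%(time)s.csv?time=%(time)s&name=%(name)s": {
        "format": "csv",
    },
}
print(json.dumps(feeds))

Paste the output after "FEEDS = " in Raw Settings.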

Download files


File details

Details for the file scrapy-feed-storage-internetarchive-0.0.1.tar.gz.

File metadata

  • File: scrapy-feed-storage-internetarchive-0.0.1.tar.gz
  • Size: 3.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/47.3.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.8.3

File hashes

Hashes for scrapy-feed-storage-internetarchive-0.0.1.tar.gz:

  • SHA256: 3bb241a120259bd85c00f53f2c375a5e2498b948a41e5570d6aa201abb0617d3
  • MD5: 4e1076aa9e76f7877bb7e1d93ba8278c
  • BLAKE2b-256: 86b94799bbbe824f6d3102d9ef12f57ca73d4ee470dacb1eb726c09eadf0329d


File details

Details for the file scrapy_feed_storage_internetarchive-0.0.1-py3-none-any.whl.

File hashes

Hashes for scrapy_feed_storage_internetarchive-0.0.1-py3-none-any.whl:

  • SHA256: b940fea1ff9745c84f957ba5b9413b2f204d02522ae5644320af6da01cbb7897
  • MD5: c9cdc2f0428918bab9e165e8e6ca2518
  • BLAKE2b-256: 2e245a8596cca1829fd702d44ee906fbb1364b88e626775625758d35c86da829

