archivenow

A Python library to push web resources into public web archives

A Tool To Push Web Resources Into Web Archives

Archive Now (archivenow) is currently configured to push resources into six public web archives. You can add more archives by writing a new archive handler (e.g., myarchive_handler.py) and placing it inside the folder “handlers”.

As explained below, this library can be used through:

Command Line Interface (CLI)
A Web Service
A Docker Container
Python

Installing

The latest release of archivenow can be installed using pip:

$ pip install archivenow

The latest development version containing changes not yet released can be installed from source:

$ git clone git@github.com:oduwsdl/archivenow.git
$ cd archivenow
$ pip install -r requirements.txt
$ pip install ./

CLI USAGE

The available options can be listed by passing the -h or --help flag:

$ archivenow -h
usage: archivenow.py [-h] [--mg] [--wc] [--cc] [--cc_api_key [CC_API_KEY]]
                     [--is] [--st] [--ia] [--warc [WARC]] [-v] [--all]
                     [--server] [--host [HOST]] [--agent [AGENT]]
                     [--port [PORT]]
                     [URI]

positional arguments:
  URI                   URI of a web resource

optional arguments:
  -h, --help            show this help message and exit
  --mg                  Use Megalodon.jp
  --wc                  Use The WebCite Archive
  --cc                  Use The Perma.cc Archive
  --cc_api_key [CC_API_KEY]
                        An API KEY is required by The Perma.cc Archive
  --is                  Use The Archive.is
  --st                  Use The Archive.st
  --ia                  Use The Internet Archive
  --warc [WARC]         Generate WARC file
  -v, --version         Report the version of archivenow
  --all                 Use all possible archives
  --server              Run archiveNow as a Web Service
  --host [HOST]         A server address
  --agent [AGENT]       Use "wget" or "squidwarc" for WARC generation
  --port [PORT]         A port number to run a Web Service

Examples

Example 1

To save the web page (www.foxnews.com) in the Internet Archive:

$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com

Example 2

By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:

$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com

Example 3

To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and Archive.is:

$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com
http://archive.is/fPVyc

Example 4

To save the web page (https://nypost.com/) in all configured web archives and also create a WARC file locally:

$ archivenow --all https://nypost.com/ --cc_api_key $Your-Perma-CC-API-Key
http://archive.is/dcnan
https://perma.cc/53CC-5ST8
https://web.archive.org/web/20181002081445/https://nypost.com/
https://megalodon.jp/2018-1002-1714-24/https://nypost.com:443/
http://www.webcitation.org/72ramyxT2
https://Archive.st/archive/2018/10/nypost.com/h5m1/nypost.com/index.html
https_nypost.com__96ec2300.warc

Example 5

To download the web page (https://nypost.com/) and create a WARC file:

$ archivenow --warc=mypage --agent=wget https://nypost.com/
mypage.warc
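The CLI examples above can also be driven from a script. The following is a minimal sketch, assuming the archivenow command is on the PATH. The helper names build_command and run_archivenow are hypothetical and not part of archivenow; the flags are the ones listed in the help output above.

```python
import subprocess

# Archive flags copied from the `archivenow -h` output above.
ARCHIVE_FLAGS = {"ia", "is", "mg", "wc", "cc", "st", "all"}


def build_command(uri, archives=("ia",), cc_api_key=None, warc=None, agent=None):
    """Return the argument list for one archivenow invocation (hypothetical helper)."""
    cmd = ["archivenow"]
    for archive in archives:
        if archive not in ARCHIVE_FLAGS:
            raise ValueError("unknown archive flag: " + archive)
        cmd.append("--" + archive)
    if cc_api_key is not None:
        cmd.extend(["--cc_api_key", cc_api_key])
    if warc is not None:
        cmd.append("--warc=" + warc)
    if agent is not None:
        cmd.append("--agent=" + agent)
    cmd.append(uri)
    return cmd


def run_archivenow(cmd):
    """Run the command and return its non-empty output lines."""
    completed = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    return [line.strip() for line in completed.stdout.splitlines() if line.strip()]
```

For instance, run_archivenow(build_command("www.foxnews.com", archives=("ia", "is"))) corresponds to Example 3 and would return one line per archive.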

Server

You can run archivenow as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 12345):

$ archivenow --server

Running on http://0.0.0.0:12345/ (Press CTRL+C to quit)

Example 6

To save the web page (www.foxnews.com) in The Internet Archive through the web service:

$ curl -i http://0.0.0.0:12345/ia/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 95
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Tue, 02 Oct 2018 08:20:18 GMT

    {
      "results": [
        "https://web.archive.org/web/20181002082007/http://www.foxnews.com"
      ]
    }

Example 7

To save the web page (www.foxnews.com) in all configured archives through the web service:

$ curl -i http://0.0.0.0:12345/all/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 385
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Tue, 02 Oct 2018 08:23:53 GMT

    {
      "results": [
        "Error (The Perma.cc Archive): An API Key is required ",
        "http://archive.is/ukads",
        "https://web.archive.org/web/20181002082007/http://www.foxnews.com",
        "http://Archive.st/ikxq",
        "Error (Megalodon.jp): We can not obtain this page because the time limit has been reached or for technical ... ",
        "http://www.webcitation.org/72rbKsX8B"
      ]
    }

Example 8

Because an API Key is required by Perma.cc, the HTTP request should be as follows:

$ curl -i http://127.0.0.1:12345/all/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key

Or use only Perma.cc:

$ curl -i http://127.0.0.1:12345/cc/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key
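The curl requests above can be reproduced from Python with the standard library. This is a minimal sketch, assuming the web service is already running on the given host and port. The function names service_url, parse_results, and push_via_service are hypothetical; the URL layout and the "results" key are taken from Examples 6 to 8.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def service_url(archive_id, uri, host="127.0.0.1", port=12345, cc_api_key=None):
    """Build a request URL of the form http://host:port/<archive>/<uri>."""
    url = "http://{}:{}/{}/{}".format(host, port, archive_id, uri)
    if cc_api_key is not None:
        # Perma.cc needs an API key, passed as a query parameter.
        url += "?" + urlencode({"cc_api_key": cc_api_key})
    return url


def parse_results(body):
    """Extract the list under "results" from the JSON response body."""
    return json.loads(body)["results"]


def push_via_service(archive_id, uri, **kwargs):
    """Send the request to a running archivenow service and return its results."""
    with urlopen(service_url(archive_id, uri, **kwargs)) as response:
        return parse_results(response.read().decode("utf-8"))
```

Calling push_via_service("ia", "www.foxnews.com") mirrors Example 6; it needs a running server, so only the URL building and JSON parsing can be checked offline.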

Running as a Docker Container

$ docker image pull oduwsdl/archivenow

Different ways to run archivenow

$ docker container run -it --rm oduwsdl/archivenow -h

Accessible at 127.0.0.1:12345:

$ docker container run -p 12345:12345 -it --rm oduwsdl/archivenow --server --host 0.0.0.0

Accessible at 127.0.0.1:22222:

$ docker container run -p 22222:11111 -it --rm oduwsdl/archivenow --server --port 11111 --host 0.0.0.0

http://www.cs.odu.edu/~maturban/archivenow-6-archives.gif

To save the web page (http://www.cnn.com) in The Internet Archive:

$ docker container run -it --rm oduwsdl/archivenow --ia http://www.cnn.com

Python Usage

>>> from archivenow import archivenow

Example 9

To save the web page (www.foxnews.com) in The WebCite Archive:

>>> archivenow.push("www.foxnews.com","wc")
['http://www.webcitation.org/6o9LTiDz3']

Example 10

To save the web page (www.foxnews.com) in all configured archives:

>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']
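Because a push to several archives returns archived URLs and error messages in the same list, it can help to separate them. The following is a small sketch, assuming error entries always begin with "Error (" as in the outputs shown in this document. split_results is a hypothetical helper, not part of archivenow.

```python
def split_results(results):
    """Split a push() result list into (archived URLs, error messages)."""
    archived, errors = [], []
    for item in results:
        # Assumption: failures are reported as strings starting with "Error (".
        if item.startswith("Error ("):
            errors.append(item)
        else:
            archived.append(item)
    return archived, errors


sample = [
    "https://web.archive.org/web/20170209145930/http://www.foxnews.com",
    "http://archive.is/oAjuM",
    "Error (The Perma.cc Archive): An API KEY is required",
]
archived, errors = split_results(sample)
print(len(archived), len(errors))
```

The same helper could be applied directly to the return value of archivenow.push(...).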

Example 11

To save the web page (www.foxnews.com) in Perma.cc:

>>> archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"})
['https://perma.cc/8YYC-C7RM']

Example 12

To start the server from Python, do the following. The host and port number can be passed (e.g., start(port=1111, host='localhost')):

>>> archivenow.start()

    2017-02-09 15:02:37
    Running on http://127.0.0.1:12345
    (Press CTRL+C to quit)

Configuring a new archive or removing an existing one

Additional archives may be added by creating a handler file in the “handlers” directory.

For example, if I want to add a new archive named “My Archive”, I would create a file “ma_handler.py” and store it in the folder “handlers”. The “ma” will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive through the Python code, I should write:

archivenow.push("www.cnn.com","ma")

In the file “ma_handler.py”, the name of the class must be “MA_handler”. This class must have at least one function called “push” which has one argument. See the existing handler files for examples of how to organize a newly configured archive handler.
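The naming rules above can be sketched as follows. This is a hypothetical “ma_handler.py”, not code from the archivenow repository: the endpoint https://myarchive.example/save/, the “name” attribute, and the way the returned URL is built are assumptions made for illustration. Only the class name, the “push” function with its single argument, and the “enabled” variable come from this document. The existing handler files remain the reference for the exact layout.

```python
from urllib.parse import quote


class MA_handler(object):
    """Hypothetical handler for an archive called "My Archive"."""

    def __init__(self):
        # Setting this to False is one of the ways to disable a handler.
        self.enabled = True
        self.name = "My Archive"  # assumed attribute, used in error messages
        # Assumed endpoint; a real archive documents its own submission URL.
        self.save_endpoint = "https://myarchive.example/save/"

    def push(self, uri_org):
        """Return the archived URL, or an error string if something fails."""
        try:
            # A real handler would send an HTTP request here and read the
            # archived copy's URL from the archive's response. This sketch
            # only builds the URL it would request.
            return self.save_endpoint + quote(uri_org, safe=":/")
        except Exception as e:
            return "Error (" + self.name + "): " + str(e)
```

Returning an error string rather than raising follows the “Error (...)” entries seen in the example outputs, so one failing archive would not hide the results of the others.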

Removing an archive can be done by one of the following options:

Removing the archive handler file from the folder “handlers”
Renaming the archive handler file to another name that does not end with “_handler.py”
Setting the variable “enabled” to “False” inside the handler file

Notes

The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the “same” resource.

For example, if you send a request to IA to capture (www.cnn.com) at 10:00pm, IA will create a new copy (C) of this URI. IA will then return C for all requests to the archive for this URI received until 10:02pm. Archive.is behaves the same way, but with a time gap of five minutes.
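A script that submits the same URI repeatedly can track these gaps itself. The following is a minimal sketch, assuming the two-minute and five-minute gaps stated above. The class SubmissionThrottle and its methods are hypothetical and not part of archivenow.

```python
import time

# Minimum gaps in seconds, taken from the notes above.
MIN_GAP_SECONDS = {"ia": 120, "is": 300}


class SubmissionThrottle:
    """Remember when each (archive, URI) pair was last submitted."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the logic can be tested
        self._last = {}

    def record(self, archive_id, uri):
        """Note that a submission was just made."""
        self._last[(archive_id, uri)] = self._clock()

    def seconds_to_wait(self, archive_id, uri):
        """Seconds until a new submission is expected to yield a new copy."""
        last = self._last.get((archive_id, uri))
        if last is None:
            return 0.0
        gap = MIN_GAP_SECONDS.get(archive_id, 0)
        return max(0.0, gap - (self._clock() - last))
```

A caller would check seconds_to_wait("ia", uri) before pushing, sleep for that long if it is non-zero, and call record("ia", uri) after each push.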

Citing Project

@INPROCEEDINGS{archivenow-jcdl2018,
  AUTHOR    = {Mohamed Aturban and
               Mat Kelly and
               Sawood Alam and
               John A. Berlin and
               Michael L. Nelson and
               Michele C. Weigle},
  TITLE     = {{ArchiveNow}: Simplified, Extensible, Multi-Archive Preservation},
  BOOKTITLE = {Proceedings of the 18th {ACM/IEEE-CS} Joint Conference on Digital Libraries},
  SERIES    = {{JCDL} '18},
  PAGES     = {321--322},
  MONTH     = {June},
  YEAR      = {2018},
  ADDRESS   = {Fort Worth, Texas, USA},
  URL       = {https://doi.org/10.1145/3197026.3203880},
  DOI       = {10.1145/3197026.3203880}
}
