Archive web articles

webarchive

Webarchive is a command-line web page extractor that produces readable content from requested web pages. It works with URLs, local file paths, and standard input.

Features

The following commands show how webarchive can be fed web page content:

$ webarchive https://example.com

$ webarchive "$HOME/index.html"

$ webarchive - < "$HOME/index.html"

It then outputs text in various formats:

  • Markdown
  • HTML
  • Plain text
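To archive several pages in one go, the invocations above can be wrapped in a small loop. This is a minimal sketch and not part of webarchive: the function name `archive_urls`, the numbered output file names, and the one-URL-per-line input are assumptions made for illustration, as is the idea that webarchive exits with a non-zero status when a page cannot be processed. Only the `-t md` option is taken from the examples in this README.

```shell
#!/bin/sh
# archive_urls OUTDIR: read one URL per line from standard input and save
# each page as Markdown in OUTDIR/page-N.md (hypothetical naming scheme).
archive_urls() {
  outdir=$1
  n=0
  mkdir -p "$outdir" || return 1
  while IFS= read -r url; do
    # Skip blank lines.
    [ -n "$url" ] || continue
    n=$((n + 1))
    # stdin is redirected so webarchive cannot consume the URL list.
    if ! webarchive "$url" -t md < /dev/null > "$outdir/page-$n.md"; then
      # Assumption: a non-zero exit status means the page was not archived.
      echo "failed: $url" >&2
      rm -f "$outdir/page-$n.md"
    fi
  done
}
```

It could be used as `archive_urls ./archive < urls.txt`, where `urls.txt` is a hypothetical file listing the pages to fetch.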

If the readability algorithms don't work for a particular web page, webarchive can use an external command that produces a textual dump of the page. Examples of such programs are command-line web browsers such as links or w3m.

$ webarchive https://example.com -t dump --dump-cmd "w3m -dump"
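The fallback can also be scripted. The sketch below tries the Markdown output first and switches to the w3m dump only if that attempt fails. The function name `archive_with_fallback` is hypothetical, and the failure test (non-zero exit status or empty output) is an assumption about how webarchive reports a page it cannot parse; check the actual behaviour before relying on it.

```shell
#!/bin/sh
# archive_with_fallback URL: print the page as Markdown; if that attempt
# fails or prints nothing, print a w3m text dump instead.
archive_with_fallback() {
  url=$1
  # Assumption: a failed extraction shows up as a non-zero exit status
  # or as empty output.
  if out=$(webarchive "$url" -t md) && [ -n "$out" ]; then
    printf '%s\n' "$out"
  else
    webarchive "$url" -t dump --dump-cmd "w3m -dump"
  fi
}
```

For example, `archive_with_fallback https://example.com > article.txt` writes whichever variant succeeded to `article.txt`.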

Webarchive automatically detects contextual information such as the page title, which can be prepended as YAML front matter. This is useful if webarchive's output is later processed by other tools that understand YAML front matter, such as pandoc:

$ webarchive https://example.com -t md | \
    pandoc -f markdown --standalone > article.html
$ ebook-convert article.html article.epub  # ebook-convert is part of Calibre

Additionally, a GUI wrapper (webarchive-qt) is provided. It is also script-friendly: it prints the names of all saved files to standard output.

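#!/bin/sh

# Convert every file saved from the GUI to EPUB and mail it.
# Note: the unquoted substitution splits on whitespace, so this assumes
# the saved file names contain no spaces.
for f in $(webarchive-qt); do
  pandoc "$f" --standalone > article.html
  ebook-convert article.html article.epub
  mutt -a "article.epub" -s "Good article I found" -- alice@example.com
  rm -f "article.html" "article.epub" "$f"
done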
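The loop above relies on shell word splitting, so a saved file whose name contains a space would be treated as two files. A variant that reads the names line by line avoids this. It is a sketch that assumes webarchive-qt prints exactly one file name per line, which this README does not state; the helper name `for_each_saved` is hypothetical.

```shell
#!/bin/sh
# for_each_saved CMD [ARGS...]: read file paths from standard input, one
# per line, and run CMD ARGS... PATH for each of them.
for_each_saved() {
  while IFS= read -r f; do
    # Skip blank lines.
    [ -n "$f" ] || continue
    # stdin is redirected so CMD cannot swallow the remaining paths.
    "$@" "$f" < /dev/null
  done
}
```

For example, `webarchive-qt | for_each_saved wc -w` prints a word count for every file saved from the GUI.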

The GUI wrapper is small but quite powerful:

  • allows editing of parsed pages
  • automatically detects URLs in the system clipboard and fills the address bar with them
  • caches the contents of the current URL until the URL changes, so changing the output format doesn't download the page again
  • defines several keyboard shortcuts (Ctrl-S to save, Enter to re-download the page)

Installation

$ pip3 install webarchive

To install the dependencies for the GUI wrapper (webarchive-qt):

$ pip3 install 'webarchive[gui]'

You can use tools such as pipx or pipsi to install webarchive and its dependencies into an isolated environment:

$ pipx install 'webarchive[gui]'
