Archive web articles
Project description
webarchive
Webarchive is a command-line web page extractor that produces readable content from the requested web pages. It works with URLs, local file paths, and standard input.
Features
The following commands show how webarchive can be fed web page content:
$ webarchive https://example.com
$ webarchive "$HOME/index.html"
$ webarchive - < "$HOME/index.html"
It then outputs text in various formats:
- Markdown
- HTML
- Plain text
If readability algorithms don't work for a particular web page, webarchive can use an external command which provides textual dumps of pages. Examples of such programs are command line web browsers like links or w3m.
$ webarchive https://example.com -t dump --dump-cmd "w3m -dump"
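The fallback described above could be automated with a small wrapper. The sketch below is only an illustration: the function name `archive_with_fallback` is hypothetical, and it assumes that a failed extraction shows up as a non-zero exit status or as empty output, which the project description does not specify. Only flags that appear elsewhere in this description (`-t md`, `-t dump`, `--dump-cmd`) are used.

```shell
#!/bin/sh
# Hypothetical wrapper: try the default extraction first, then fall back to
# an external dump command.
# ASSUMPTION: a failed extraction is visible as a non-zero exit status or as
# empty output; check how webarchive actually reports failures before relying
# on this.
archive_with_fallback() {
    url=$1
    if text=$(webarchive "$url" -t md) && [ -n "$text" ]; then
        # Extraction produced something; print it unchanged.
        printf '%s\n' "$text"
    else
        # Fall back to a textual dump produced by w3m.
        webarchive "$url" -t dump --dump-cmd "w3m -dump"
    fi
}
```

It would be called as `archive_with_fallback https://example.com > article.md`.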
Webarchive automatically detects contextual information such as the page title and can prepend it as YAML Front Matter. This is useful when webarchive output is later processed by other tools that understand YAML Front Matter, such as pandoc:
$ webarchive https://example.com -t md | \
pandoc -f markdown --standalone > article.html
$ ebook-convert article.html article.epub # ebook-convert is part of Calibre
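To make the idea of YAML Front Matter concrete, the sketch below builds a small Markdown document with a front matter block and reads one field from it using awk. The key name `title`, the sample text, and the helpers `front_matter_title` and `sample_document` are assumptions made for illustration; the exact keys webarchive emits may differ.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "title" key found in the YAML
# Front Matter block (the lines between the first two '---' markers) on stdin.
front_matter_title() {
    awk '
        # Count the --- markers and stop once the block is closed.
        /^---[[:space:]]*$/ { fences++; if (fences == 2) exit; next }
        # Inside the block, print the text after "title:" and stop.
        fences == 1 && /^title:/ {
            sub(/^title:[[:space:]]*/, "")
            print
            exit
        }
    '
}

# Sample document; its contents are made up for this example.
sample_document() {
    cat <<'EOF'
---
title: Example Domain
---

This domain is for use in illustrative examples.
EOF
}

sample_document | front_matter_title
```

The helper does no real YAML parsing: a quoted or multi-line title is printed as written. Tools such as pandoc interpret the block properly.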
Additionally, a GUI wrapper is provided, which is also script-friendly as it prints the paths of all saved files to standard output.
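#!/bin/sh
# Note: the command substitution is split on whitespace, so this loop
# assumes that saved file paths contain no spaces.
for f in $(webarchive-qt); do
    # Convert the saved file to a standalone HTML page, then to EPUB.
    pandoc "$f" --standalone > article.html
    ebook-convert article.html article.epub
    # Mail the e-book as an attachment.
    mutt -a "article.epub" -s "Good article I found" -- alice@example.com
    # Clean up the intermediate files and the saved file.
    rm -f "article.html" "article.epub" "$f"
done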
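The loop above splits the output of webarchive-qt on whitespace, so a saved path containing spaces would be broken into several words. A line-by-line alternative might look like the sketch below. It assumes webarchive-qt prints one path per line, which the description does not state explicitly, and the helper name `for_each_saved_file` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run the given command once per line read from standard
# input, passing the line (a file path) as the last argument.
for_each_saved_file() {
    while IFS= read -r path; do
        # Skip empty lines.
        [ -n "$path" ] || continue
        # Give the command /dev/null as stdin so that it cannot consume the
        # remaining paths.
        "$@" "$path" < /dev/null
    done
}

# Usage (assumes one path per line):
#   webarchive-qt | for_each_saved_file ls -l
```

Because each command runs with `/dev/null` as its standard input, a program that expects a terminal, such as an interactive mail client, would need a different redirection.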
The GUI wrapper is small, but quite powerful:
- allows editing of parsed pages
- automatically detects URLs in the system clipboard and fills the address bar with them
- caches the contents of the current URL until the URL changes, so changing the output format won't download the whole page again
- defines several keyboard shortcuts (Ctrl-S to save, Enter to re-download the page)
Installation
$ pip3 install webarchive
To install the dependencies for the GUI wrapper (webarchive-qt):
$ pip3 install 'webarchive[gui]'
You can use tools such as pipx or pipsi to install webarchive and its dependencies into an isolated environment automatically:
$ pipx install 'webarchive[gui]'
Download files
File details
Details for the file webarchive-0.4.0.tar.gz.
File metadata
- Download URL: webarchive-0.4.0.tar.gz
- Upload date:
- Size: 20.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9516909de5a30166d5dc4849332fbeeaaf0233ac92893737325ff891d5a3b02a
MD5 | f5b56391a65a791ef82bb6489709c730
BLAKE2b-256 | d19a1758fd66aa7159775d187c47cff2cd2c6da0c8c71da8e54bb565be0b4112
File details
Details for the file webarchive-0.4.0-py3-none-any.whl.
File metadata
- Download URL: webarchive-0.4.0-py3-none-any.whl
- Upload date:
- Size: 19.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6244dfc67bbb9a8fa5fe208022424d0de42a060665eb5ad7f234e388e5209c17
MD5 | 3f2f1b05739693679cc7a871e45c0143
BLAKE2b-256 | c6a9a0a869ac8781035278cd1651b43a624e69e75d6d9a34ebd6e91ac8bafe08
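The digests listed above can be compared against a downloaded file before installing it. The following is a minimal sketch, assuming GNU coreutils `sha256sum` is available (on macOS, `shasum -a 256` is the usual substitute); the helper name `verify_sha256` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the SHA256 digest of file $1 equals
# the expected hex digest $2.
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}

# Example with the wheel digest from the table above:
#   verify_sha256 webarchive-0.4.0-py3-none-any.whl \
#       6244dfc67bbb9a8fa5fe208022424d0de42a060665eb5ad7f234e388e5209c17 \
#       && echo "digest OK"
```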