Offline website cloner, updater, and packager

WebCloner

Clone, update, package & serve websites for offline use – all from one tiny Python script.


Made by Synthfax


Features

Command    What it does
clone      Recursively downloads a live site to a local folder and rewrites internal links.
run        Fires up a lightweight Flask web-server that serves a cloned repo.
update     Refreshes an existing repo safely by cloning into a temp dir and syncing changes.
savewcof   Bundles an entire repo into a single .wcof archive (ZIP under the hood).
runwcof    Serves a .wcof file directly – no manual extraction required.

Additional niceties:

  • Progress bars via tqdm so you’re never in the dark.
  • Domain‑locked crawling – stays on the origin host.
  • Depth limiter so you don’t mirror the whole internet by accident.
  • Pure‑Python – works on Windows, macOS & Linux (incl. WSL & Termux).

Requirements

  • Python ≥ 3.8

  • The following PyPI packages (automatically pulled in by pip install):

    • requests
    • beautifulsoup4
    • tqdm
    • flask

Installation

🔌 One‑liner (recommended)

python -m pip install webcloner

(Replace python with python3 on some systems.)

🛠️ From source (for bleeding‑edge or hacking)

git clone https://github.com/yourname/webcloner.git
cd webcloner
python -m pip install -r requirements.txt
# Make the script globally available
python -m pip install .  # or `python -m pip install -e .` for editable mode

The installer drops a console entry‑point named webcloner into your PATH.


Quick Start

# 1. Mirror the site into ./offline_copy (max 2 levels deep)
webcloner clone https://example.com ./offline_copy --depth 2

# 2. Take a look in your browser
webcloner run ./offline_copy 8000  # -> http://localhost:8000

# 3. Package the repo into a single file you can email or stick on a USB drive
webcloner savewcof mysite.wcof . ./offline_copy

# 4. Hand the .wcof to a friend – they can serve it instantly:
webcloner runwcof mysite.wcof 8080

Detailed Command Guide

clone

webcloner clone <url> <output_dir> [--depth N]
  • url – starting page (must include protocol).
  • output_dir – destination folder (will be created if missing).
  • --depth – maximum crawl depth (default 2). Set to 0 to fetch only the start page.

Behind the scenes the crawler:

  1. Downloads the page.
  2. Parses the HTML with BeautifulSoup.
  3. Rewrites internal links (href, src) to point at local paths.
  4. Enqueues discovered same‑domain assets & pages until the depth limit.
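
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in cloner.py: it uses the standard library's html.parser as a stand-in for BeautifulSoup, and the names crawl, LinkCollector and the fetch callable are hypothetical. Passing fetch in as a parameter keeps the sketch free of network calls.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects href/src values (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl of one host.

    `fetch(url)` is a hypothetical callable that returns the page text,
    or None when the URL cannot be downloaded.
    """
    origin = urlsplit(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, ignore its links
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            absolute = urljoin(url, link).split("#")[0]
            if urlsplit(absolute).netloc != origin:
                continue  # domain lock: never leave the origin host
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

With max_depth=0 the queue never grows past the start page, which matches the --depth 0 behaviour described above.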

run

webcloner run <repo_dir> <port> [--host 0.0.0.0]

Serves static files out of repo_dir using Flask. Perfect for quick checks or sharing over LAN.
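
For a feel of what serving a repo folder involves, here is a small sketch. It deliberately swaps Flask for the standard library's http.server so it runs without extra packages; it is not how WebCloner itself is implemented, and make_server and serve_in_background are hypothetical names.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(repo_dir, port, host="127.0.0.1"):
    """Build (but do not start) a static file server rooted at repo_dir."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=repo_dir)
    return ThreadingHTTPServer((host, port), handler)


def serve_in_background(server):
    """Run the server on a daemon thread so the caller keeps control."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    # Roughly equivalent in spirit to: webcloner run ./offline_copy 8000
    make_server("./offline_copy", 8000).serve_forever()
```

Binding to 127.0.0.1 keeps the site private to your machine; pass host="0.0.0.0" only when you really want to share over LAN.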

update

webcloner update <url> <repo_dir> [--depth N]

Safely refreshes an existing repo:

  • Clones the live site into a temporary directory.
  • Compares modification times and copies newer/added files back.
  • Leaves untouched anything that the live site no longer has (in case you keep local notes).
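
The sync step described above might look something like this. It is a sketch under the assumption that "newer" is decided purely by file modification time; sync_newer is a hypothetical helper, not a function exported by WebCloner.

```python
import os
import shutil


def sync_newer(fresh_dir, repo_dir):
    """Copy files from fresh_dir that are missing or older in repo_dir.

    Nothing in repo_dir is ever deleted, so local-only files survive.
    Returns the repo-relative paths that were copied.
    """
    copied = []
    for root, _dirs, files in os.walk(fresh_dir):
        rel_root = os.path.relpath(root, fresh_dir)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.normpath(os.path.join(repo_dir, rel_root, name))
            is_new = not os.path.exists(dst)
            if is_new or os.path.getmtime(src) > os.path.getmtime(dst):
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)  # copy2 also carries over the mtime
                copied.append(os.path.relpath(dst, repo_dir))
    return sorted(copied)
```

Because copy2 preserves modification times, running the sync twice in a row copies nothing the second time.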

savewcof

webcloner savewcof <filename.wcof> <dest_dir> <repo_dir>

Creates a zip‑compressed Web Cloner Offline File. Think of it as a self‑contained website in a single file.
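
Since a .wcof is a ZIP archive, the packaging step can be approximated with the standard zipfile module. The sketch below mirrors the argument order of the command above; save_wcof is a hypothetical name and the real script may differ in details such as compression level.

```python
import os
import zipfile


def save_wcof(filename, dest_dir, repo_dir):
    """Zip every file under repo_dir into dest_dir/filename.

    Paths inside the archive are stored relative to repo_dir.
    """
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, filename)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(repo_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Skip the archive itself if dest_dir lies inside repo_dir.
                if os.path.abspath(full_path) == os.path.abspath(archive_path):
                    continue
                archive.write(full_path, os.path.relpath(full_path, repo_dir))
    return archive_path
```

You can inspect the result with any ZIP tool, for example python -m zipfile -l mysite.wcof.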

runwcof

webcloner runwcof <file.wcof> <port> [--host 0.0.0.0]

Extracts the archive to a temporary folder and launches the server from there – handy for quick, throwaway demos.
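
The extraction half of this command can be sketched with tempfile and zipfile. This is an illustration only: extract_wcof is a hypothetical helper, and how the real script manages its temporary folder is not documented here.

```python
import tempfile
import zipfile


def extract_wcof(wcof_path):
    """Unpack a .wcof (ZIP) archive into a fresh temporary directory.

    Returns the TemporaryDirectory object; its `.name` is the folder to
    serve, and calling `.cleanup()` deletes the extracted files.
    """
    if not zipfile.is_zipfile(wcof_path):
        raise ValueError(f"{wcof_path} is not a ZIP-based .wcof archive")
    tmp = tempfile.TemporaryDirectory(prefix="wcof_")
    with zipfile.ZipFile(wcof_path) as archive:
        archive.extractall(tmp.name)
    return tmp
```

Holding on to the returned object for the lifetime of the server, then calling cleanup() on shutdown, keeps the demo from leaving files behind.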


Typical Workflows

Archiving a Documentation Site

webcloner clone https://docs.oldsoftware.com ./docs --depth 3
webcloner savewcof docs_2025-06-25.wcof ./dist ./docs

Transfer the .wcof to any air‑gapped machine and serve:

webcloner runwcof docs_2025-06-25.wcof 7000

Keeping a Local Mirror Fresh

# Nightly cron job (Linux/macOS)
0 3 * * * webcloner update https://myblog.com /srv/mirrors/myblog --depth 2 >> /var/log/webcloner.log 2>&1

How It Works

  1. URL Normalisation – Strips query/fragment, treats a bare path as /index.html.
  2. Same‑Domain Filter – No cross‑site requests (stops runaway downloads).
  3. Breadth‑first Crawl – Queue of (url, depth); avoids recursion stack blow‑ups.
  4. HTML Re‑write – Converts each internal link to a relative filesystem path so that the site works off‑disk.
  5. Asset Handling – Non‑HTML responses are stored verbatim (images, CSS, JS, etc.).
  6. Packaging – A .wcof is just a ZIP with your folder structure – the magic is knowing to look for index.html when serving.
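
Steps 1 and 4 can be illustrated with a few lines of urllib.parse and posixpath. This is a sketch, not the script's actual code: local_path and relative_link are hypothetical names, and it assumes that "bare path" means an empty path or one ending in a slash.

```python
import posixpath
from urllib.parse import urlsplit


def local_path(url):
    """Map a URL to a site-relative file path, ignoring query and fragment."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        path += "index.html"  # assumption: directory-style URLs get index.html
    return path.lstrip("/")


def relative_link(from_url, to_url):
    """Relative path from the page saved for from_url to the file for to_url."""
    start = posixpath.dirname(local_path(from_url)) or "."
    return posixpath.relpath(local_path(to_url), start)


print(local_path("https://example.com/docs/?page=2#top"))
# docs/index.html
print(relative_link("https://example.com/docs/intro.html",
                    "https://example.com/css/site.css"))
# ../css/site.css
```

Relative links are what let the clone work both when opened straight from disk and when served from any port or sub-folder.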

FAQ & Troubleshooting

Q: It’s downloading external CDNs!
A: Only same-host links are followed, but CSS/JS may reference offsite assets. Consider using a CSS post-processor or mirror those domains separately.

Q: Pages show garbled characters
A: Force UTF-8 decoding with --encoding utf-8 (coming soon) or file an issue.

Q: Can I clone sites that need login?
A: Currently no – but you can reuse a logged-in session by editing cloner.py to inject cookies into requests.Session().

Q: Is JavaScript executed?
A: No. This is a static grabber. SPA sites that build HTML client-side will download, but you’ll only get the bare JS/JSON, not the rendered pages.

Contributing

Pull requests are welcome! If you spot a bug or have a feature idea:

  1. Open an issue with steps to reproduce.
  2. Fork & create a topic branch.
  3. Run black cloner.py && flake8 before pushing.
  4. Submit a PR – CI will run unit tests automatically.

License

This project is licensed under the Apache License 2.0 – see LICENSE for full terms.
