Offline website cloner, updater, and packager
WebCloner
Clone, update, package & serve websites for offline use – all from one tiny Python script.
Made by Synthfax
Features
| Command | What it does |
|---|---|
| clone | Recursively downloads a live site to a local folder and rewrites internal links. |
| run | Fires up a lightweight Flask web‑server that serves a cloned repo. |
| update | Refreshes an existing repo safely by cloning into a temp dir and syncing changes. |
| savewcof | Bundles an entire repo into a single .wcof archive (ZIP under the hood). |
| runwcof | Serves a .wcof file directly – no manual extraction required. |
Additional niceties:
- Progress bars via tqdm so you’re never in the dark.
- Domain‑locked crawling – stays on the origin host.
- Depth limiter so you don’t mirror the whole internet by accident.
- Pure‑Python – works on Windows, macOS & Linux (incl. WSL & Termux).
Requirements
- Python ≥ 3.8
- The following PyPI packages (automatically pulled in by pip install): requests, beautifulsoup4, tqdm, flask
Installation
🔌 One‑liner (recommended)
python -m pip install webcloner
(Replace python with python3 on some systems.)
🛠️ From source (for bleeding‑edge or hacking)
git clone https://github.com/yourname/webcloner.git
cd webcloner
python -m pip install -r requirements.txt
# Make the script globally available
python -m pip install . # or `python -m pip install -e .` for editable mode
The installer drops a console entry‑point named webcloner into your PATH.
Quick Start
# 1. Mirror the site into ./offline_copy (max 2 levels deep)
webcloner clone https://example.com ./offline_copy --depth 2
# 2. Take a look in your browser
webcloner run ./offline_copy 8000 # -> http://localhost:8000
# 3. Package the repo into a single file you can email or stick on a USB drive
webcloner savewcof mysite.wcof . ./offline_copy
# 4. Hand the .wcof to a friend – they can serve it instantly:
webcloner runwcof mysite.wcof 8080
Detailed Command Guide
clone
webcloner clone <url> <output_dir> [--depth N]
- url – starting page (must include protocol).
- output_dir – destination folder (will be created if missing).
- --depth – recursion limit (default 2). Set to 0 for only the start page.
Behind the scenes the crawler:
- Downloads the page.
- Parses the HTML with BeautifulSoup.
- Rewrites internal links (href, src) to point at local paths.
- Enqueues discovered same‑domain assets & pages until the depth limit.
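The crawl loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not WebCloner's actual code: `crawl`, `LinkCollector`, and the injected `fetch` callable are hypothetical names, and the standard library's `html.parser` stands in for BeautifulSoup so the sketch runs without extra packages.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects href/src attribute values from a page (stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl restricted to the start URL's host.

    `fetch` is a hypothetical callable mapping a URL to HTML text, or None
    when the URL cannot be fetched. Returns the URLs in fetch order.
    """
    origin = urlsplit(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: keep the page, do not follow its links
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]  # drop the fragment
            if urlsplit(absolute).netloc != origin:
                continue  # same-domain filter
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

Passing `fetch` in as a parameter keeps the sketch testable offline; a real crawler would wrap an HTTP client there and also write each response to disk.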
run
webcloner run <repo_dir> <port> [--host 0.0.0.0]
Serves static files out of repo_dir using Flask. Perfect for quick checks or sharing over LAN.
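For a feel of what serving a cloned folder involves, here is a minimal sketch. WebCloner is described as using Flask; this example deliberately swaps in the standard library's `http.server` so it runs with no third-party packages. The helper names `make_server` and `serve_in_background` are hypothetical.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(repo_dir, port, host="127.0.0.1"):
    """Build a static-file server rooted at repo_dir (port 0 picks a free port)."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=str(repo_dir))
    return ThreadingHTTPServer((host, port), handler)


def serve_in_background(server):
    """Run the server in a daemon thread so the caller keeps control."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

Binding to 127.0.0.1 keeps the server private to the local machine; passing `host="0.0.0.0"` would expose it on the LAN, which mirrors the `--host 0.0.0.0` option shown above.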
update
webcloner update <url> <repo_dir> [--depth N]
Safely refreshes an existing repo:
- Clones the live site into a temporary directory.
- Compares modification times and copies newer/added files back.
- Leaves untouched anything that the live site no longer has (in case you keep local notes).
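The sync step could look something like the sketch below. It assumes a plain modification-time comparison as described above; `sync_newer` is a hypothetical helper name, and the real implementation may differ.

```python
import shutil
from pathlib import Path


def sync_newer(fresh_dir, repo_dir):
    """Copy files from fresh_dir into repo_dir when they are added or newer.

    Files that exist only in repo_dir are never touched. Returns the
    relative paths that were copied, sorted.
    """
    fresh_dir, repo_dir = Path(fresh_dir), Path(repo_dir)
    copied = []
    for src in fresh_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(fresh_dir)
        dst = repo_dir / rel
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue  # local copy is at least as recent
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 preserves the modification time
        copied.append(rel.as_posix())
    return sorted(copied)
```

One caveat with this approach: files in a freshly cloned temp directory have just been written, so their modification times are only meaningful if the cloner sets them from something like the server's Last-Modified header. Whether WebCloner does that is not stated here.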
savewcof
webcloner savewcof <filename.wcof> <dest_dir> <repo_dir>
Creates a zip‑compressed Web Cloner Offline File. Think of it as a self‑contained website in a single file.
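Since a .wcof is described as a ZIP under the hood, packaging can be sketched with the standard `zipfile` module. The function name `save_wcof` is hypothetical and only mirrors the argument order of the command above; the archive layout (paths relative to the repo root) is an assumption.

```python
import zipfile
from pathlib import Path


def save_wcof(filename, dest_dir, repo_dir):
    """Zip every file under repo_dir into dest_dir/filename and return its path."""
    repo_dir, dest_dir = Path(repo_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / filename
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(repo_dir.rglob("*")):
            # Skip directories, and the archive itself if dest_dir is inside repo_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                archive.write(path, path.relative_to(repo_dir).as_posix())
    return archive_path
```

Storing paths relative to the repo root means the archive unpacks to the same folder structure wherever it is extracted.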
runwcof
webcloner runwcof <file.wcof> <port> [--host 0.0.0.0]
Extracts the archive to a temporary folder and launches the server – super handy for throw‑and‑go demos.
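The extraction half of this command might look like the sketch below, assuming the archive is an ordinary ZIP. The names `extract_wcof` and `find_index` are hypothetical, and picking the shallowest index.html as the entry page is an assumption for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_wcof(wcof_path):
    """Extract a .wcof archive into a fresh temporary directory.

    Returns (tmp, root). Keep `tmp` alive while serving and call
    tmp.cleanup() afterwards to delete the extracted files.
    """
    tmp = tempfile.TemporaryDirectory(prefix="wcof_")
    root = Path(tmp.name)
    with zipfile.ZipFile(wcof_path) as archive:
        archive.extractall(root)
    return tmp, root


def find_index(root):
    """Return the shallowest index.html under root, or None if there is none."""
    candidates = sorted(root.rglob("index.html"), key=lambda p: len(p.parts))
    return candidates[0] if candidates else None
```

The returned `root` can then be handed to whatever static server is in use, and the temporary directory is removed once the server stops.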
Typical Workflows
Archiving a Documentation Site
webcloner clone https://docs.oldsoftware.com ./docs --depth 3
webcloner savewcof docs_2025-06-25.wcof ./dist ./docs
Transfer the .wcof to any air‑gapped machine and serve:
webcloner runwcof docs_2025-06-25.wcof 7000
Keeping a Local Mirror Fresh
# Nightly cron job (Linux/macOS)
0 3 * * * webcloner update https://myblog.com /srv/mirrors/myblog --depth 2 >> /var/log/webcloner.log 2>&1
How It Works
- URL Normalisation – Strips query/fragment, treats a bare path as /index.html.
- Same‑Domain Filter – No cross‑site requests (stops runaway downloads).
- Breadth‑first Crawl – Queue of (url, depth); avoids recursion stack blow‑ups.
- HTML Re‑write – Converts each internal link to a relative filesystem path so that the site works off‑disk.
- Asset Handling – Non‑HTML responses are stored verbatim (images, CSS, JS, etc.).
- Packaging – A .wcof is just a ZIP with your folder structure – the magic is knowing to look for index.html when serving.
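To make the normalisation and re-write steps concrete, here is a small sketch of mapping URLs to local files and computing the relative link between two of them. `local_path` and `relative_link` are hypothetical helpers; treating only empty or slash-terminated paths as "bare" is an assumption, and WebCloner's own rules may differ.

```python
import posixpath
from urllib.parse import urlsplit


def local_path(url):
    """Map a URL to a site-relative file path, ignoring query and fragment."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        path += "index.html"  # a bare or directory path maps to its index page
    return path.lstrip("/")


def relative_link(from_url, to_url):
    """Relative path from the file saved for from_url to the file for to_url."""
    start = "/" + posixpath.dirname(local_path(from_url))
    return posixpath.relpath("/" + local_path(to_url), start)


print(local_path("https://example.com/docs/?page=2#top"))
# docs/index.html
print(relative_link("https://example.com/docs/guide/",
                    "https://example.com/css/site.css"))
# ../../css/site.css
```

Relative links are what let the mirrored pages work when opened straight from disk, because they no longer depend on the site being served from the root of a host.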
FAQ & Troubleshooting
| Question | Answer |
|---|---|
| It’s downloading external CDNs! | Only same‑host links are followed, but CSS/JS may reference offsite assets. Consider using a CSS post‑processor or mirror those domains separately. |
| Pages show garbled characters | A --encoding utf-8 option to force UTF‑8 decoding is planned (coming soon); until then, please file an issue. |
| Can I clone sites that need login? | Currently no – but you can proxy the session by editing cloner.py to inject cookies into requests.Session(). |
| Is JavaScript executed? | No. This is a static grabber. SPA sites that build HTML client‑side will download, but you’ll only get the bare JS/JSON, not the rendered pages. |
Contributing
Pull requests are welcome! If you spot a bug or have a feature idea:
- Open an issue with steps to reproduce.
- Fork & create a topic branch.
- Run black cloner.py && flake8 before pushing.
- Submit a PR – CI will run unit tests automatically.
License
This project is licensed under the Apache License 2.0 – see LICENSE for full terms.