download-webpage-data

A Python tool to download website contents through Tor with German exit nodes and extract images from downloaded websites.

Prerequisites

  • Python 3.8 or higher
  • Tor service installed on your system
  • torrc configuration file (will be created automatically)
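
The README does not show what the generated torrc contains. As a rough sketch only, a file that pins exit nodes to Germany could be produced as below. It relies on Tor's standard `SocksPort`, `ControlPort`, `ExitNodes`, and `StrictNodes` options; the function names, port numbers, and file location are illustrative assumptions, not the tool's actual code.

```python
from pathlib import Path


def build_torrc(country_code: str = "de", socks_port: int = 9050,
                control_port: int = 9051) -> str:
    """Return torrc text that restricts exit nodes to a single country.

    Hypothetical helper: the real tool may write different options.
    """
    lines = [
        f"SocksPort {socks_port}",        # local SOCKS proxy for client traffic
        f"ControlPort {control_port}",    # used to request new circuits
        f"ExitNodes {{{country_code}}}",  # e.g. "ExitNodes {de}"
        "StrictNodes 1",                  # do not fall back to other countries
    ]
    return "\n".join(lines) + "\n"


def write_torrc(path: Path, country_code: str = "de") -> Path:
    """Write the generated configuration to `path` and return the path."""
    path.write_text(build_torrc(country_code), encoding="utf-8")
    return path
```

A real setup would normally also configure control-port authentication (for example `CookieAuthentication 1`), which this sketch leaves out.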

Installation

  1. Clone this repository
  2. Install the package and its dependencies:
pip install .

Features

Website Downloader

  • Routes all traffic through Tor
  • Uses German exit nodes exclusively
  • Downloads complete website contents
  • Preserves website structure
  • Handles errors gracefully
  • Supports sites with invalid SSL certificates
  • Retries failed downloads with new Tor identity
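
The retry behaviour listed above might look roughly like the following sketch. It assumes Tor listens on its default SOCKS port 9050 and control port 9051, and it asks for new circuits with the control protocol's `SIGNAL NEWNYM` command. The names `TOR_PROXIES`, `request_new_identity`, and `fetch_with_retries` are hypothetical, and the actual download function is passed in by the caller rather than taken from the tool.

```python
import socket
from typing import Callable, Optional

# Proxy mapping in the form accepted by HTTP clients such as requests
# (requests needs the optional SOCKS extra for this). "socks5h" makes
# the proxy resolve hostnames, so DNS lookups also go through Tor.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}


def request_new_identity(host: str = "127.0.0.1", port: int = 9051,
                         password: str = "") -> bool:
    """Ask Tor for fresh circuits; assumes password (or no) authentication."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(f'AUTHENTICATE "{password}"\r\n'.encode())
        if not conn.recv(1024).startswith(b"250"):
            return False
        conn.sendall(b"SIGNAL NEWNYM\r\n")
        return conn.recv(1024).startswith(b"250")


def fetch_with_retries(fetch: Callable[[str], bytes], url: str,
                       renew: Callable[[], object] = request_new_identity,
                       attempts: int = 3) -> bytes:
    """Call `fetch(url)`, requesting a new identity between failed attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # a real tool would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                renew()
    raise RuntimeError(f"giving up on {url}") from last_error
```

Passing `fetch` and `renew` in as callables keeps the retry logic independent of any particular HTTP library and makes it easy to exercise without a running Tor service.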

Image Extractor

  • Extracts all images from downloaded websites
  • Supports multiple image formats (jpg, jpeg, png, gif, webp, svg, ico)
  • Finds both direct image files and HTML-referenced images
  • Preserves original filenames
  • Creates organized output structure
  • Handles duplicate files
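
To illustrate the extraction steps above, here is a minimal sketch using only the standard library. It covers the three ideas in the list: recognising image files by extension, collecting `<img src>` references from HTML, and avoiding filename collisions. All names are hypothetical, and the numeric-suffix scheme for duplicates is an assumption; the README does not say how the tool actually resolves them.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List, Set

# Extensions taken from the feature list above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".ico"}


class ImageSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def html_image_refs(html: str) -> List[str]:
    """Return image references found in an HTML document, in order."""
    parser = ImageSrcParser()
    parser.feed(html)
    return parser.sources


def is_image_file(path: Path) -> bool:
    """True if the file extension is one of the supported image formats."""
    return path.suffix.lower() in IMAGE_EXTENSIONS


def unique_name(name: str, taken: Set[str]) -> str:
    """Keep the original filename, adding _1, _2, ... only on collision."""
    stem, suffix = Path(name).stem, Path(name).suffix
    candidate, counter = name, 1
    while candidate in taken:
        candidate = f"{stem}_{counter}{suffix}"
        counter += 1
    taken.add(candidate)
    return candidate
```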

Usage

Downloading Websites

  1. Ensure the Tor service is running on your system:
# On Manjaro/Arch:
sudo systemctl start tor
  2. Download a website:
# Interactive mode
python -m download_webpage_data

# Direct URL mode
python -m download_webpage_data -u https://example.com

# With SSL verification
python -m download_webpage_data --verify-ssl -u https://example.com

Extracting Images

  1. After downloading one or more websites, run:
python -m download_webpage_data.extract_images
  2. Select the website from the list
  3. Images are extracted to the images/<website>/ directory

Command-line Options

Website Downloader

  • -u, --url: URL to download (if omitted, the tool prompts for one)
  • --verify-ssl: Enable SSL certificate verification (disabled by default)
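
For readers who want to see how these two options fit together, the following `argparse` sketch reproduces the documented flags and the prompt fallback. The flag names and the default come from the list above; `build_parser`, `resolve_url`, and the prompt text are hypothetical and may not match the tool's source.

```python
import argparse
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two documented options."""
    parser = argparse.ArgumentParser(prog="download_webpage_data")
    parser.add_argument("-u", "--url",
                        help="URL to download (prompted for if omitted)")
    parser.add_argument("--verify-ssl", action="store_true",
                        help="enable SSL certificate verification "
                             "(disabled by default)")
    return parser


def resolve_url(args: argparse.Namespace,
                ask: Callable[[str], str] = input) -> str:
    """Use the URL from the command line, or fall back to a prompt."""
    return args.url or ask("URL to download: ").strip()


if __name__ == "__main__":
    options = build_parser().parse_args()
    print(resolve_url(options), options.verify_ssl)
```

Because `--verify-ssl` uses `store_true`, verification stays off unless the flag is given, which matches the documented default.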

Image Extractor

  • Interactive menu to select from downloaded websites
  • Press 'q' to quit at any time
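
An interactive menu of this kind could be sketched as follows. The function name, prompt wording, and numbering scheme are assumptions made for illustration; only the "select from a list, `q` to quit" behaviour comes from the description above.

```python
from typing import Callable, List, Optional


def select_site(sites: List[str],
                ask: Callable[[str], str] = input) -> Optional[str]:
    """Show a numbered menu and return the chosen site, or None on 'q'."""
    for number, site in enumerate(sites, start=1):
        print(f"{number}. {site}")
    while True:
        answer = ask("Select a website (q to quit): ").strip().lower()
        if answer == "q":
            return None
        if answer.isdecimal() and 1 <= int(answer) <= len(sites):
            return sites[int(answer) - 1]
        print("Invalid selection, try again.")
```

In the real tool the list of sites would presumably come from the subdirectories of `downloads/`.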

Directory Structure

.
├── downloads/           # Downloaded websites
│   └── example.com/    # Website content
└── images/             # Extracted images
    └── example.com/    # Images from website
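
The layout above suggests that both directories are keyed by the site's hostname. A small sketch of that mapping, assuming the hostname is taken directly from the URL (the helper name is hypothetical, and the tool may normalise names differently):

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def site_directories(url: str, root: Path = Path(".")) -> Tuple[Path, Path]:
    """Return (downloads/<host>, images/<host>) for the given URL."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"cannot determine host from {url!r}")
    return root / "downloads" / host, root / "images" / host
```

For `https://example.com/page`, this yields `downloads/example.com` and `images/example.com`, matching the tree shown above.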

Security Note

This tool is for legitimate use only. Ensure you have permission to download website contents before using this tool.

License

MIT
