download-webpage-data

A Python tool to download website contents through Tor with German exit nodes and extract images from downloaded websites.

Prerequisites

  • Python 3.8 or higher
  • Tor service installed on your system
  • A torrc configuration file (created automatically, so no manual setup is needed)

Installation

  1. Clone this repository
  2. Install dependencies:
pip install .

Features

Website Downloader

  • Routes all traffic through Tor
  • Uses German exit nodes exclusively
  • Downloads complete website contents
  • Preserves website structure
  • Handles errors gracefully
  • Supports sites with invalid SSL certificates
  • Retries failed downloads with new Tor identity
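
The routing and retry behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function names `build_torrc`, `tor_proxies`, and `fetch_with_retries` are hypothetical, ports 9050 and 9051 are Tor's usual defaults and are assumed here, and the `socks5h://` proxy URL assumes an HTTP client with SOCKS support (for example `requests` with the `socks` extra). `ExitNodes` and `StrictNodes` are standard torrc options.

```python
from typing import Callable, Dict, Optional


def build_torrc(country: str = "de", socks_port: int = 9050,
                control_port: int = 9051) -> str:
    """Return torrc text that restricts exit nodes to one country code."""
    lines = [
        f"SocksPort {socks_port}",
        f"ControlPort {control_port}",
        f"ExitNodes {{{country}}}",  # renders as: ExitNodes {de}
        "StrictNodes 1",             # do not fall back to other countries
    ]
    return "\n".join(lines) + "\n"


def tor_proxies(socks_port: int = 9050) -> Dict[str, str]:
    """Proxy mapping in the style accepted by requests-like HTTP clients."""
    # socks5h lets the proxy (Tor) resolve hostnames instead of local DNS.
    url = f"socks5h://127.0.0.1:{socks_port}"
    return {"http": url, "https": url}


def fetch_with_retries(fetch: Callable[[str], object],
                       renew_identity: Callable[[], None],
                       url: str, attempts: int = 3) -> object:
    """Call fetch(url); after each failure request a new identity and retry."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # a real tool would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                renew_identity()
    raise RuntimeError(f"all {attempts} attempts failed for {url}") from last_error
```

In this sketch `renew_identity` is left as a callback; one plausible implementation would send Tor's `NEWNYM` signal over the control port, but how the tool does this is not stated here.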

Image Extractor

  • Extracts all images from downloaded websites
  • Supports multiple image formats (jpg, jpeg, png, gif, webp, svg, ico)
  • Finds both direct image files and HTML-referenced images
  • Preserves original filenames
  • Creates organized output structure
  • Handles duplicate files
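
As a rough illustration of the extraction steps above, the sketch below finds image files by extension, lists `<img src>` references in an HTML file, and copies files without overwriting duplicates. It uses only the standard library. All names (`find_image_files`, `referenced_image_sources`, `copy_unique`) and the `_1`, `_2` suffix scheme for duplicates are assumptions made for this example, not the tool's documented behaviour.

```python
import shutil
from html.parser import HTMLParser
from pathlib import Path
from typing import List

# Extensions taken from the feature list above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".ico"}


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_image_files(site_dir: Path) -> List[Path]:
    """Return files under site_dir whose extension marks them as images."""
    return sorted(
        p for p in site_dir.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def referenced_image_sources(html_file: Path) -> List[str]:
    """Return the image references found in one HTML file, in order."""
    parser = ImgSrcParser()
    parser.feed(html_file.read_text(encoding="utf-8", errors="ignore"))
    return parser.sources


def copy_unique(src: Path, out_dir: Path) -> Path:
    """Copy src into out_dir, adding _1, _2, ... if the name is taken."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / src.name
    counter = 1
    while target.exists():
        target = out_dir / f"{src.stem}_{counter}{src.suffix}"
        counter += 1
    shutil.copy2(src, target)
    return target
```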

Usage

Downloading Websites

  1. Ensure Tor service is running on your system:
# On Manjaro/Arch:
sudo systemctl start tor
  2. Download a website:
# Interactive mode
python -m download_webpage_data

# Direct URL mode
python -m download_webpage_data -u https://example.com

# With SSL certificate verification enabled
python -m download_webpage_data --verify-ssl -u https://example.com
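
Before running the downloader, it can help to confirm that Tor is accepting connections. The check below is a small sketch and not part of the package; the function name `tor_is_listening` is hypothetical, and port 9050 is assumed because it is the usual default SOCKS port for a system Tor service.

```python
import socket


def tor_is_listening(host: str = "127.0.0.1", port: int = 9050,
                     timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


if __name__ == "__main__":
    if tor_is_listening():
        print("Tor SOCKS port is reachable")
    else:
        print("Tor does not appear to be running")
```

This only confirms that a process is listening on the port. It does not verify that the process is Tor or that a circuit has been established.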

Extracting Images

  1. After downloading one or more websites, run:
python -m download_webpage_data.extract_images
  2. Select the website from the list
  3. Images are extracted to the images/<website>/ directory

Command-line Options

Website Downloader

  • -u, --url: URL to download (if omitted, the tool prompts for one)
  • --verify-ssl: Enable SSL certificate verification (disabled by default)
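
For reference, the two downloader options above could be declared with `argparse` along these lines. This is a sketch that assumes the tool uses `argparse`, which is not stated here; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the downloader's two documented options."""
    parser = argparse.ArgumentParser(
        prog="download_webpage_data",
        description="Download website contents through Tor.",
    )
    parser.add_argument(
        "-u", "--url",
        help="URL to download (if omitted, prompt for one)",
    )
    parser.add_argument(
        "--verify-ssl",
        action="store_true",  # off unless the flag is given
        help="enable SSL certificate verification",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    url = args.url or input("URL to download: ")
    print(f"Would download {url} (verify SSL: {args.verify_ssl})")
```

Using `store_true` matches the documented default: verification stays disabled unless `--verify-ssl` is passed.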

Image Extractor

  • Interactive menu to select from downloaded websites
  • Press 'q' to quit at any time

Directory Structure

.
├── downloads/           # Downloaded websites
│   └── example.com/    # Website content
└── images/             # Extracted images
    └── example.com/    # Images from website
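
The layout above suggests that each site's folders are named after its host. A minimal sketch of that mapping is shown below; it assumes the folder name is exactly the URL's hostname, which matches the `example.com` entries in the tree but is otherwise a guess. `site_dirs` is a hypothetical helper.

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def site_dirs(url: str, root: Path = Path(".")) -> Tuple[Path, Path]:
    """Return (download_dir, image_dir) for the site a URL belongs to."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"URL has no hostname: {url!r}")
    return root / "downloads" / host, root / "images" / host


if __name__ == "__main__":
    download_dir, image_dir = site_dirs("https://example.com/some/page")
    print(download_dir.as_posix())  # downloads/example.com
    print(image_dir.as_posix())     # images/example.com
```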

Security Note

This tool is for legitimate use only. Ensure you have permission to download website contents before using this tool.

License

MIT
