A tool to download website contents through Tor with German exit nodes and extract images

download-webpage-data

A Python tool to download website contents through Tor with German exit nodes and extract images from downloaded websites.

Prerequisites

  • Python 3.8 or higher
  • Tor service installed on your system
  • torrc configuration file (will be created automatically)
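The contents of the automatically created torrc are not shown in this README. As a rough sketch only, assuming the standard Tor options `SocksPort`, `ControlPort`, `ExitNodes`, and `StrictNodes`, a configuration that restricts exits to Germany could be generated as follows. The helper name `build_torrc` and the default port numbers are illustrative assumptions, not the tool's actual code.

```python
def build_torrc(country_code: str = "de",
                socks_port: int = 9050,
                control_port: int = 9051) -> str:
    """Return torrc text that pins exit nodes to one country (hypothetical helper)."""
    lines = [
        f"SocksPort {socks_port}",        # local SOCKS proxy that clients connect to
        f"ControlPort {control_port}",    # control port, used to request a new identity
        f"ExitNodes {{{country_code}}}",  # Tor expects country codes in braces, e.g. {de}
        "StrictNodes 1",                  # do not fall back to exits in other countries
    ]
    return "\n".join(lines) + "\n"
```

Calling `print(build_torrc(), end="")` shows the four generated lines; the file the tool really writes may contain different or additional options.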

Installation

  1. Clone this repository
  2. Install the package and its dependencies from the repository root:
pip install .

Features

Website Downloader

  • Routes all traffic through Tor
  • Uses German exit nodes exclusively
  • Downloads complete website contents
  • Preserves website structure
  • Handles errors gracefully
  • Supports sites with invalid SSL certificates
  • Retries failed downloads with new Tor identity
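The routing and retry behaviour listed above can be sketched roughly as follows. This is not the tool's implementation: `tor_proxies`, `fetch_with_retries`, and the `renew_identity` callback are hypothetical names, and port 9050 is an assumed SOCKS port. The proxy mapping uses the `socks5h://` scheme so that hostnames are resolved by the proxy rather than locally.

```python
import time
from typing import Callable, Dict, Optional


def tor_proxies(host: str = "127.0.0.1", port: int = 9050) -> Dict[str, str]:
    """Build a proxy mapping pointing HTTP and HTTPS traffic at a local Tor SOCKS port."""
    url = f"socks5h://{host}:{port}"  # socks5h: DNS lookups also go through the proxy
    return {"http": url, "https": url}


def fetch_with_retries(fetch: Callable[[], bytes],
                       renew_identity: Callable[[], None],
                       attempts: int = 3,
                       delay: float = 0.0) -> bytes:
    """Call fetch(); after each failure ask for a new identity and try again."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # sketch only: real code should catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                renew_identity()   # e.g. send a NEWNYM signal over the control port
                time.sleep(delay)  # give Tor a moment to build a new circuit
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

A mapping like the one returned by `tor_proxies()` can be passed as the `proxies` argument of an HTTP client such as `requests`, which needs its SOCKS extra installed for this scheme. Passing the fetch and renewal steps in as callables keeps the retry loop independent of any particular HTTP or Tor control library.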

Image Extractor

  • Extracts all images from downloaded websites
  • Supports multiple image formats (jpg, jpeg, png, gif, webp, svg, ico)
  • Finds both direct image files and HTML-referenced images
  • Preserves original filenames
  • Creates organized output structure
  • Handles duplicate files
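One way the extraction steps above could work is sketched below, using only the standard library. The function names are hypothetical, and the duplicate-handling rule (appending `_1`, `_2`, ... to the filename) is an assumption; the README does not say how the tool resolves name clashes.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List

# Extensions taken from the feature list above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".ico"}


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_image_files(root: Path) -> List[Path]:
    """Return all files under root whose extension is a known image type."""
    return sorted(p for p in root.rglob("*")
                  if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS)


def html_image_refs(html: str) -> List[str]:
    """Return the image references found in <img src=...> attributes."""
    parser = ImgSrcParser()
    parser.feed(html)
    return parser.sources


def unique_target(dest_dir: Path, name: str) -> Path:
    """Keep the original filename; add a numeric suffix if it already exists."""
    candidate = dest_dir / name
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = dest_dir / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

Direct image files are found by extension, while `html_image_refs` covers images that are only referenced from HTML; copying each file to `unique_target(...)` would preserve original names except where two files collide.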

Usage

Downloading Websites

  1. Ensure Tor service is running on your system:
# On Manjaro/Arch:
sudo systemctl start tor
  2. Download a website:
# Interactive mode
python -m download_webpage_data

# Direct URL mode
python -m download_webpage_data -u https://example.com

# With SSL verification
python -m download_webpage_data --verify-ssl -u https://example.com

Extracting Images

  1. After downloading one or more websites, run:
python -m download_webpage_data.extract_images
  2. Select the website from the list
  3. Images are extracted to the images/<website>/ directory

Command-line Options

Website Downloader

  • -u, --url: URL to download (if omitted, the tool prompts for one)
  • --verify-ssl: Enable SSL certificate verification (disabled by default)
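For illustration, the two documented options could be declared with `argparse` along these lines. This is a sketch under the assumption that the tool uses a standard argument parser; the function names `build_parser` and `parse_args` are hypothetical, and only the flags `-u`/`--url` and `--verify-ssl` come from this README.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the two options documented above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="download_webpage_data")
    parser.add_argument("-u", "--url",
                        help="URL to download; prompt interactively if omitted")
    parser.add_argument("--verify-ssl", action="store_true",
                        help="enable SSL certificate verification (off by default)")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

With `action="store_true"`, `args.verify_ssl` is `False` unless the flag is given, which matches the documented default of verification being disabled; `args.url` is `None` when no URL is passed, the case in which the tool would prompt.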

Image Extractor

  • Interactive menu to select from downloaded websites
  • Press 'q' to quit at any time

Directory Structure

.
├── downloads/           # Downloaded websites
│   └── example.com/    # Website content
└── images/             # Extracted images
    └── example.com/    # Images from website
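The layout above suggests that output directories are named after the site's hostname. A minimal sketch of that mapping, assuming the hostname is taken directly from the URL (the helper `output_dirs` is hypothetical, and the tool may normalise names differently):

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def output_dirs(url: str, base: Path = Path(".")) -> Tuple[Path, Path]:
    """Return (downloads/<host>, images/<host>) for the given URL."""
    host = urlparse(url).hostname
    if not host:
        # URLs without a scheme, such as "example.com", have no parsed hostname.
        raise ValueError(f"URL has no host: {url!r}")
    return base / "downloads" / host, base / "images" / host
```

For `https://example.com/some/page` this yields `downloads/example.com` and `images/example.com`, matching the tree shown above.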

Security Note

This tool is for legitimate use only. Ensure you have permission to download website contents before using this tool.

License

MIT

Download files

Source Distribution

download_webpage_data-1.0.1.tar.gz (2.2 kB view details)

Uploaded Source

Built Distribution

download_webpage_data-1.0.1-py3-none-any.whl (2.6 kB view details)

Uploaded Python 3

File details

Details for the file download_webpage_data-1.0.1.tar.gz.

File metadata

  • Download URL: download_webpage_data-1.0.1.tar.gz
  • Size: 2.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.1

File hashes

Hashes for download_webpage_data-1.0.1.tar.gz

Algorithm    Hash digest
SHA256       e5e1932feb4406cfc5582b55da0467aaf814ea0608058181d4d2664deb090ed2
MD5          04c7b4cac4058cf654d94d7601c877d8
BLAKE2b-256  236485976c4a6d41584daeb129daf48a9824bb1db376375c461159c6a8688ba9

File details

Details for the file download_webpage_data-1.0.1-py3-none-any.whl.

File hashes

Hashes for download_webpage_data-1.0.1-py3-none-any.whl

Algorithm    Hash digest
SHA256       c64bf3c671849bf55959c9917e4375548d6cc5127980ff2f35b222e181f5674f
MD5          935f18807b3791a9a077faee68fe001a
BLAKE2b-256  a17129568a7f716d096156b8c5a08be46ca505e7ce6c9abeb4ba9449cfbfd09d
