Scrape Simple

A web scraper that uses Tor for anonymity and supports text and media extraction.

Features

  • Tor integration for anonymous web scraping
  • Extract text content from web pages
  • Extract media files (images, videos) above a specified size
  • Optional Russian text simplification using Natasha
  • Optional AI-based image description using BLIP

Installation

pip install scrape-simple

Optional Dependencies

For Russian text simplification:

pip install scrape-simple[russian]

For AI image descriptions:

pip install scrape-simple[ai]

For all features:

pip install scrape-simple[russian,ai]
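If you are unsure which extras are present in an environment, a quick import probe can tell you. This is a generic sketch, not part of the scrape-simple API; the module names below are assumptions about what the extras pull in (Natasha ships as `natasha`, and BLIP models are commonly loaded via `transformers`):

```python
import importlib.util

def extra_available(module_name):
    # True if the given module can be imported in this environment
    return importlib.util.find_spec(module_name) is not None

# Module names are assumptions about the optional dependencies
russian_ok = extra_available("natasha")
ai_ok = extra_available("transformers")
```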

Usage

Command Line

# Basic usage
scrape-simple https://example.com

# Advanced usage
scrape-simple https://example.com --depth 3 --min-media-size 20480 --simplify-ru --ai-describe-media

Python API

from scrape_simple import WebScraper, SiteContent

# Create scraper
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    min_media_size=10240,  # 10KB minimum for media files
    simplify_ru=False,
    ai_describe_media=False
)

# Start scraping
site_content = scraper.start()

# Access results
for page in site_content.TextPages:
    print(f"Page: {page.url}, Content length: {len(page.content)}")

for media in site_content.MediaContentList:
    print(f"Media: {media.url}, Type: {media.media_type}, Description: {media.description}")

# Create scraper with media extraction disabled
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    skip_media=True  # Disable media extraction
)
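The object returned by start() can be serialized for later processing. A minimal sketch, using only the attribute names shown above (TextPages, MediaContentList, url, content, media_type, description); the output format itself is a hypothetical choice, not something scrape-simple defines:

```python
import json

def export_results(site_content, path="results.json"):
    # Summarize pages and media using the attributes shown above
    data = {
        "pages": [
            {"url": p.url, "length": len(p.content)}
            for p in site_content.TextPages
        ],
        "media": [
            {"url": m.url, "type": m.media_type, "description": m.description}
            for m in site_content.MediaContentList
        ],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return data
```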

Requirements

  • Python 3.6+
  • Tor (must be installed separately)

Command Line Arguments

Argument                 Description
url                      The URL of the site to scrape
--depth, -d              The depth level for crawling (default: 2)
--use-existing-tor, -t   Use an existing Tor instance if available
--output, -o             Output JSON file (default: output.json)
--history-file           File storing URLs visited during this run (default: .scrape_history)
--simplify-ru            Simplify Russian text using Natasha
--min-media-size         Minimum media file size in bytes (default: 100 KB)
--ai-describe-media      Use AI to generate descriptions for media files
--skip-media             Disable media extraction completely
--max-retries            Maximum number of retries for failed requests (default: 3)

Anti-Bot Protection Handling

Scrape Simple includes features to bypass common anti-bot protections:

  • Browser-like request headers
  • Random delays between requests
  • Tor IP rotation for 403/429 errors
  • Configurable retry mechanism
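The rotation technique in general can be sketched as follows. This is an illustrative example, not scrape-simple's internal implementation; it assumes a local Tor daemon with its SOCKS port on 9050 and control port on 9051, and uses the third-party requests and stem packages (imported lazily so the sketch loads without them):

```python
import time

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: DNS is resolved via Tor
    "https": "socks5h://127.0.0.1:9050",
}

def should_rotate(status_code):
    # 403/429 usually mean the current exit node is blocked or rate-limited
    return status_code in (403, 429)

def new_tor_identity(control_port=9051):
    # Ask the local Tor daemon for a fresh circuit (usually a new exit IP)
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        controller.authenticate()
        controller.signal(Signal.NEWNYM)

def fetch_with_rotation(url, max_retries=3):
    import requests  # third-party; imported lazily like stem above

    resp = None
    for _ in range(max_retries):
        resp = requests.get(url, proxies=TOR_PROXIES, timeout=30)
        if not should_rotate(resp.status_code):
            break
        new_tor_identity()
        time.sleep(5)  # Tor limits how often NEWNYM takes effect
    return resp
```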
