Scrape Simple
A web scraper that uses Tor for anonymity and supports text and media extraction.
Features
- Tor integration for anonymous web scraping
- Extract text content from web pages
- Extract media files (images, videos) above a specified size
- Optional Russian text simplification using Natasha
- Optional AI-based image description using BLIP
Installation
pip install scrape-simple
Optional Dependencies
For Russian text simplification:
pip install "scrape-simple[russian]"
For AI image descriptions:
pip install "scrape-simple[ai]"
For all features:
pip install "scrape-simple[russian,ai]"
Usage
Command Line
# Basic usage
scrape-simple https://example.com
# Advanced usage
scrape-simple https://example.com --depth 3 --min-media-size 20480 --simplify-ru --ai-describe-media
Python API
from scrape_simple import WebScraper, SiteContent
# Create scraper
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    min_media_size=10240,  # 10 KB minimum for media files
    simplify_ru=False,
    ai_describe_media=False,
)
# Start scraping
site_content = scraper.start()
# Access results
for page in site_content.TextPages:
    print(f"Page: {page.url}, Content length: {len(page.content)}")
for media in site_content.MediaContentList:
    print(f"Media: {media.url}, Type: {media.media_type}, Description: {media.description}")
# Create scraper with media extraction disabled
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    skip_media=True,  # Disable media extraction
)
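The result attributes used above (`TextPages`, `MediaContentList`, and the per-item `url`, `content`, `media_type`, and `description` fields) can be condensed into a small JSON-friendly summary. The following is a minimal sketch, not part of the package: `summarize` is a hypothetical helper, and the stand-in objects built with `SimpleNamespace` (including the `"image"` media type value) are assumptions so the example runs without Tor or network access.

```python
import json
from types import SimpleNamespace


def summarize(site_content):
    """Build a plain dict from the attributes shown in the README example."""
    return {
        "pages": [
            {"url": page.url, "content_length": len(page.content)}
            for page in site_content.TextPages
        ],
        "media": [
            {
                "url": media.url,
                "media_type": media.media_type,
                "description": media.description,
            }
            for media in site_content.MediaContentList
        ],
    }


# Stand-in for the object returned by scraper.start(); the values are made up.
example_content = SimpleNamespace(
    TextPages=[SimpleNamespace(url="https://example.com", content="hello")],
    MediaContentList=[
        SimpleNamespace(
            url="https://example.com/a.png",
            media_type="image",  # assumed value, for illustration only
            description=None,
        )
    ],
)

print(json.dumps(summarize(example_content), indent=2))
```

With a real run, passing the value returned by `scraper.start()` to `summarize` should work as long as the attributes match those in the example above.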
Requirements
- Python 3.6+
- Tor (must be installed separately)
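Because Tor has to be installed and running separately, it can help to check that its SOCKS port is reachable before starting a scrape. The sketch below is an assumption-laden illustration, not part of Scrape Simple: `tor_socks_reachable` is a hypothetical helper, and it assumes the Tor daemon listens on its usual default of `127.0.0.1:9050` (Tor Browser typically uses 9150 instead). It only confirms that something accepts TCP connections on that port, not that the listener is actually Tor.

```python
import socket


def tor_socks_reachable(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `tor_socks_reachable()` before creating a `WebScraper` with `use_existing_tor=True` gives an early, readable failure instead of a connection error mid-crawl.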
Command Line Arguments
| Argument | Description |
|---|---|
| url | The URL of the site to scrape |
| --depth, -d | The depth level for crawling (default: 2) |
| --use-existing-tor, -t | Use existing Tor instance if available |
| --output, -o | Output JSON file (default: output.json) |
| --history-file | File to store visited URLs for this run (default: .scrape_history) |
| --simplify-ru | Simplify Russian text using Natasha |
| --min-media-size | Minimum file size for media in bytes (default: 100KB) |
| --ai-describe-media | Use AI to generate descriptions for media files |
| --skip-media | Disable media extraction completely |
| --max-retries | Maximum number of retries for failed requests (default: 3) |
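The `--min-media-size` value is given in bytes, so 20480 corresponds to 20 KB. How the package measures a file's size is not documented here; one plausible approach, sketched below under that assumption, is to compare the `Content-Length` response header against the threshold. `meets_min_size` is a hypothetical helper, not a function of the package.

```python
def meets_min_size(headers, min_media_size):
    """Return True when Content-Length is present, numeric, and >= the threshold.

    headers is a plain mapping of response header names to string values.
    """
    # HTTP header names are case-insensitive, so normalize the keys first.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("content-length")
    if raw is None:
        return False  # Size unknown: treat as not meeting the threshold.
    try:
        size = int(raw)
    except (TypeError, ValueError):
        return False
    return size >= min_media_size
```

Treating a missing or malformed header as "too small" is a design choice in this sketch; a scraper could equally download the file and measure it.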
Anti-Bot Protection Handling
Scrape Simple includes features to bypass common anti-bot protections:
- Browser-like request headers
- Random delays between requests
- Tor IP rotation for 403/429 errors
- Configurable retry mechanism
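The items above can be combined into a single retry loop. The following is a rough sketch of the general pattern rather than the package's actual implementation: `fetch_with_retries`, `BROWSER_HEADERS`, and the injected `fetch` and `rotate_identity` callables are all hypothetical names, and the header values and delay range are assumptions chosen for illustration.

```python
import random
import time

# Statuses that the list above associates with Tor IP rotation.
RETRYABLE_STATUSES = {403, 429}

# Illustrative browser-like headers; the values are assumptions.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def fetch_with_retries(fetch, url, max_retries=3, rotate_identity=None,
                       delay_range=(1.0, 3.0), sleep=time.sleep):
    """Call fetch(url, headers) until the status is not 403/429.

    fetch must return a (status_code, body) tuple. Up to max_retries retries
    follow the first attempt; the last result is returned either way.
    """
    status, body = None, None
    for attempt in range(max_retries + 1):
        status, body = fetch(url, BROWSER_HEADERS)
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt == max_retries:
            break  # Out of retries: do not rotate or sleep again.
        if rotate_identity is not None:
            rotate_identity()  # e.g. request a new Tor circuit
        sleep(random.uniform(*delay_range))  # random delay between requests
    return status, body
```

In a real setup, `rotate_identity` would typically ask the Tor control port for a new circuit, for example by sending the `NEWNYM` signal through the `stem` library. Passing `fetch` and `sleep` in as parameters keeps the loop testable without network access.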