Scrape Simple
A web scraper that uses Tor for anonymity and supports text and media extraction.
Features
- Tor integration for anonymous web scraping
- Extract text content from web pages
- Extract media files (images, videos) above a specified size
- Optional Russian text simplification using Natasha
- Optional AI-based image description using BLIP
Installation
pip install scrape-simple
Optional Dependencies
For Russian text simplification:
pip install scrape-simple[russian]
For AI image descriptions:
pip install scrape-simple[ai]
For all features:
pip install scrape-simple[russian,ai]
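Before enabling an optional feature, it can be useful to check whether its underlying package is importable in the current environment. The following is a minimal sketch, not part of scrape-simple: `has_module` is a hypothetical helper, and it assumes the `russian` extra installs a package importable as `natasha`.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return find_spec(name) is not None


# Assumption: the `russian` extra provides a package importable as `natasha`.
simplify_ru = has_module("natasha")
print(f"Russian simplification available: {simplify_ru}")
```

The resulting boolean could then be passed as `simplify_ru` when constructing the scraper, so the option is only switched on when the dependency is present.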
Usage
Command Line
# Basic usage
scrape-simple https://example.com
# Advanced usage
scrape-simple https://example.com --depth 3 --min-media-size 20480 --simplify-ru --ai-describe-media
Python API
from scrape_simple import WebScraper, SiteContent

# Create scraper
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    min_media_size=10240,  # 10KB minimum for media files
    simplify_ru=False,
    ai_describe_media=False
)

# Start scraping
site_content = scraper.start()

# Access results
for page in site_content.TextPages:
    print(f"Page: {page.url}, Content length: {len(page.content)}")

for media in site_content.MediaContentList:
    print(f"Media: {media.url}, Type: {media.media_type}, Description: {media.description}")

# Create scraper with media extraction disabled
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    skip_media=True  # Disable media extraction
)
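The result attributes used above (`TextPages`, `MediaContentList`, `media_type`) lend themselves to a quick summary after a run. The sketch below is illustrative only: `summarize` is a hypothetical helper, the stand-in objects replace a real `scraper.start()` result so the snippet runs without Tor, and the `media_type` values `"image"` and `"video"` are assumed rather than confirmed by the library.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(site_content):
    """Count text pages and tally media items by their media_type."""
    media_by_type = Counter(m.media_type for m in site_content.MediaContentList)
    return {
        "pages": len(site_content.TextPages),
        "media_by_type": dict(media_by_type),
    }


# Stand-in data shaped like the attributes used above; a real run would
# pass the object returned by scraper.start() instead.
fake_content = SimpleNamespace(
    TextPages=[SimpleNamespace(url="https://example.com", content="Hello")],
    MediaContentList=[
        SimpleNamespace(url="https://example.com/a.png", media_type="image", description=None),
        SimpleNamespace(url="https://example.com/b.png", media_type="image", description=None),
        SimpleNamespace(url="https://example.com/c.mp4", media_type="video", description=None),
    ],
)

summary = summarize(fake_content)
print(summary)
```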
Requirements
- Python 3.6+
- Tor (must be installed separately)
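Because Tor has to be installed and started separately, a quick check that something is listening on its SOCKS port can save a failed run. This is a minimal sketch under stated assumptions: `tor_socks_available` is a hypothetical helper, not part of scrape-simple, and it assumes Tor's default SOCKS port 9050 on localhost (Tor Browser typically uses 9150). It only confirms that the port accepts TCP connections, not that the service behind it is actually Tor.

```python
import socket


def tor_socks_available(host: str = "127.0.0.1", port: int = 9050, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if tor_socks_available():
    print("A service is listening on the default Tor SOCKS port.")
else:
    print("Nothing is listening on 127.0.0.1:9050; start Tor first.")
```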
Command Line Arguments
| Argument | Description |
|---|---|
| url | The URL of the site to scrape |
| --depth, -d | The depth level for crawling (default: 2) |
| --use-existing-tor, -t | Use existing Tor instance if available |
| --output, -o | Output JSON file (default: output.json) |
| --history-file | File to store visited URLs for this run (default: .scrape_history) |
| --simplify-ru | Simplify Russian text using Natasha |
| --min-media-size | Minimum file size for media in bytes (default: 100KB) |
| --ai-describe-media | Use AI to generate descriptions for media files |
| --skip-media | Disable media extraction completely |
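When the command-line tool is driven from a script, the arguments listed above can be assembled programmatically. The sketch below uses only flags from the table; `build_command` and its parameters are hypothetical names for illustration, and it treats 1 KB as 1024 bytes when converting for `--min-media-size`, which matches the `20480` value in the advanced usage example.

```python
import shlex


def build_command(url, depth=2, min_media_kb=None, skip_media=False, output=None):
    """Assemble a scrape-simple argument list from the documented flags."""
    cmd = ["scrape-simple", url, "--depth", str(depth)]
    if skip_media:
        cmd.append("--skip-media")
    elif min_media_kb is not None:
        # --min-media-size takes bytes; here 1 KB is treated as 1024 bytes.
        cmd += ["--min-media-size", str(min_media_kb * 1024)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


cmd = build_command("https://example.com", depth=3, min_media_kb=20, output="site.json")
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if the URL contains characters such as `&` or `?`.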