Scrape Simple
A web scraper that uses Tor for anonymity and supports text and media extraction.
Features
- Tor integration for anonymous web scraping
- Extract text content from web pages
- Extract media files (images, videos) above a specified size
- Optional Russian text simplification using Natasha
- Optional AI-based image description using BLIP
Installation
pip install scrape-simple
Optional Dependencies
For Russian text simplification:
pip install scrape-simple[russian]
For AI image descriptions:
pip install scrape-simple[ai]
For all features:
pip install scrape-simple[russian,ai]
Usage
Command Line
# Basic usage
scrape-simple https://example.com
# Advanced usage
scrape-simple https://example.com --depth 3 --min-media-size 20480 --simplify-ru --ai-describe-media
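When the scraper is driven from another script, it can help to assemble the command line programmatically. The sketch below builds an argument list using only the flags documented in this README. The helper name `build_command` and its parameters are hypothetical and are not part of scrape-simple.

```python
import shlex
from typing import List, Optional


def build_command(
    url: str,
    depth: Optional[int] = None,
    min_media_size: Optional[int] = None,
    output: Optional[str] = None,
    simplify_ru: bool = False,
    ai_describe_media: bool = False,
    use_existing_tor: bool = False,
) -> List[str]:
    """Assemble an argv list for the scrape-simple CLI (hypothetical helper)."""
    cmd = ["scrape-simple", url]
    # Options that take a value are added only when explicitly set,
    # so the CLI's own defaults apply otherwise.
    if depth is not None:
        cmd += ["--depth", str(depth)]
    if min_media_size is not None:
        cmd += ["--min-media-size", str(min_media_size)]
    if output is not None:
        cmd += ["--output", output]
    # Boolean switches are appended as bare flags.
    if use_existing_tor:
        cmd.append("--use-existing-tor")
    if simplify_ru:
        cmd.append("--simplify-ru")
    if ai_describe_media:
        cmd.append("--ai-describe-media")
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "https://example.com",
        depth=3,
        min_media_size=20480,
        simplify_ru=True,
        ai_describe_media=True,
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list avoids shell quoting problems if the URL contains characters such as `&` or `?`.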
Python API
from scrape_simple import WebScraper, SiteContent
# Create scraper
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    min_media_size=10240,  # 10KB minimum for media files
    simplify_ru=False,
    ai_describe_media=False,
)
# Start scraping
site_content = scraper.start()
# Access results
for page in site_content.TextPages:
    print(f"Page: {page.url}, Content length: {len(page.content)}")
for media in site_content.MediaContentList:
    print(f"Media: {media.url}, Type: {media.media_type}, Description: {media.description}")
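As a follow-up to the result loop above, the sketch below totals pages, characters, and media items per type. It relies only on the attributes already used in this README (`TextPages`, `MediaContentList`, `content`, `media_type`). The `summarize` helper is hypothetical, and the demo objects are stand-ins built with `SimpleNamespace`, not real scraper output. The `"image"` value for `media_type` is an assumption.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(site_content):
    """Return simple totals for a scraped site (hypothetical helper)."""
    pages = list(site_content.TextPages)
    media = list(site_content.MediaContentList)
    return {
        "pages": len(pages),
        "total_chars": sum(len(p.content) for p in pages),
        # Count media items grouped by their media_type attribute.
        "media_by_type": dict(Counter(m.media_type for m in media)),
    }


# Stand-in data shaped like the attributes used above; not real scraper output.
demo = SimpleNamespace(
    TextPages=[
        SimpleNamespace(url="https://example.com/", content="Hello"),
        SimpleNamespace(url="https://example.com/about", content="About us"),
    ],
    MediaContentList=[
        SimpleNamespace(
            url="https://example.com/a.png",
            media_type="image",  # assumed value, for illustration only
            description=None,
        ),
    ],
)

print(summarize(demo))
```

With a real run, `summarize(site_content)` could be called directly on the object returned by `scraper.start()`, assuming it exposes the same attributes.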
Requirements
- Python 3.6+
- Tor (must be installed separately)
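Because Tor is installed separately, it can be useful to confirm that a local Tor instance is accepting connections before starting a crawl. The sketch below tries a plain TCP connection to a port. A standalone Tor daemon listens for SOCKS connections on port 9050 by default, but whether scrape-simple uses that port is an assumption here. The helper `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 9050 is the default SOCKS port of a standalone Tor daemon;
    # adjust it if your Tor configuration differs.
    print("Tor SOCKS port reachable:", is_port_open("127.0.0.1", 9050))
```

A successful connection only shows that something is listening on the port. It does not prove that the listener is Tor or that a circuit has been established.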
Command Line Arguments
| Argument | Description |
|---|---|
| url | The URL of the site to scrape |
| --depth, -d | The depth level for crawling (default: 2) |
| --use-existing-tor, -t | Use existing Tor instance if available |
| --output, -o | Output JSON file (default: output.json) |
| --history-file | File to store visited URLs for this run (default: .scrape_history) |
| --simplify-ru | Simplify Russian text using Natasha |
| --min-media-size | Minimum file size for media in bytes (default: 100KB) |
| --ai-describe-media | Use AI to generate descriptions for media files |
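Since `--min-media-size` takes a value in bytes, a small conversion helper can make thresholds easier to read. The examples in this README pair 10240 bytes with "10KB", which suggests 1 KB is counted as 1024 bytes. Treating the 100KB default the same way is an assumption, and `kb_to_bytes` is a hypothetical helper.

```python
def kb_to_bytes(kb: float) -> int:
    """Convert kilobytes to bytes, counting 1 KB as 1024 bytes."""
    return int(kb * 1024)


for kb in (10, 20, 100):
    print(f"{kb} KB -> {kb_to_bytes(kb)} bytes")
```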