Scrape Simple
A web scraper that uses Tor for anonymity and supports text and media extraction.
Features
- Tor integration for anonymous web scraping
- Extract text content from web pages
- Extract media files (images, videos) above a specified size
- Optional Russian text simplification using Natasha
- Optional AI-based image description using BLIP
Installation
pip install scrape-simple
Optional Dependencies
For Russian text simplification:
pip install scrape-simple[russian]
For AI image descriptions:
pip install scrape-simple[ai]
For all features:
pip install scrape-simple[russian,ai]
Usage
Command Line
# Basic usage
scrape-simple https://example.com
# Advanced usage
scrape-simple https://example.com --depth 3 --min-media-size 20480 --simplify-ru --ai-describe-media
Python API
from scrape_simple import WebScraper, SiteContent

# Create scraper
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    min_media_size=10240,  # 10 KB minimum for media files
    simplify_ru=False,
    ai_describe_media=False,
)

# Start scraping
site_content = scraper.start()

# Access results
for page in site_content.TextPages:
    print(f"Page: {page.url}, Content length: {len(page.content)}")
for media in site_content.MediaContentList:
    print(f"Media: {media.url}, Type: {media.media_type}, Description: {media.description}")

# Create scraper with media extraction disabled
scraper = WebScraper(
    root_url="https://example.com",
    max_depth=2,
    use_existing_tor=True,
    skip_media=True,  # Disable media extraction
)
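As a follow-up to the result loop above, here is a minimal sketch of how the returned object might be summarized. It relies only on the attributes shown in the example (`TextPages`, `MediaContentList`, `url`, `content`, `media_type`). The `summarize` helper is hypothetical and not part of scrape-simple, and the stand-in data (including the `"image"` media type value) is an assumption used so the snippet runs without Tor or network access.

```python
from types import SimpleNamespace


def summarize(site_content):
    """Return simple counts for a scrape result (hypothetical helper).

    Uses only attributes that appear in the usage example:
    TextPages (items with .content) and MediaContentList (items with .media_type).
    """
    media_by_type = {}
    for media in site_content.MediaContentList:
        media_by_type[media.media_type] = media_by_type.get(media.media_type, 0) + 1
    return {
        "pages": len(site_content.TextPages),
        "total_chars": sum(len(page.content) for page in site_content.TextPages),
        "media_by_type": media_by_type,
    }


# Stand-in object with the same attribute names; not real scraper output.
fake = SimpleNamespace(
    TextPages=[SimpleNamespace(url="https://example.com", content="hello")],
    MediaContentList=[
        SimpleNamespace(
            url="https://example.com/a.png",
            media_type="image",  # assumed value, for illustration only
            description="",
        ),
    ],
)
print(summarize(fake))
```

With a real run, `summarize(scraper.start())` should work the same way, provided the result exposes the attributes used above.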
Requirements
- Python 3.6+
- Tor (must be installed separately)
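Because Tor has to be installed and running separately, it can help to check that a local Tor instance is reachable before starting a crawl. The sketch below assumes Tor's standard SOCKS port, 9050 on 127.0.0.1 (Tor Browser typically uses 9150 instead). This README does not say which port scrape-simple expects, so treat the port as an assumption. The `tor_socks_reachable` helper is hypothetical and not part of the package.

```python
import socket


def tor_socks_reachable(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    This only shows that a listener exists on the port; it does not
    prove that the listener is Tor.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Tor SOCKS port reachable:", tor_socks_reachable())
```

If this prints `False`, start the Tor service first, or pass the port your Tor instance is configured to use.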
Command Line Arguments
| Argument | Description |
|---|---|
| `url` | The URL of the site to scrape |
| `--depth`, `-d` | The depth level for crawling (default: 2) |
| `--use-existing-tor`, `-t` | Use existing Tor instance if available |
| `--output`, `-o` | Output JSON file (default: output.json) |
| `--history-file` | File to store visited URLs for this run (default: .scrape_history) |
| `--simplify-ru` | Simplify Russian text using Natasha |
| `--min-media-size` | Minimum file size for media in bytes (default: 100 KB) |
| `--ai-describe-media` | Use AI to generate descriptions for media files |
| `--skip-media` | Disable media extraction completely |
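When driving the command line tool from a script, the documented flags can be assembled into an argument list. The sketch below is one way to do that. `build_command` is a hypothetical helper, not part of scrape-simple, and it uses only flags listed in the table above. Dropping the media-related flags when `skip_media` is set is a choice made in this sketch, not documented behavior of the tool.

```python
def build_command(url, depth=None, output=None, min_media_size=None,
                  simplify_ru=False, ai_describe_media=False, skip_media=False):
    """Build an argv list for the scrape-simple CLI from documented flags."""
    cmd = ["scrape-simple", url]
    if depth is not None:
        cmd += ["--depth", str(depth)]
    if output is not None:
        cmd += ["--output", output]
    if skip_media:
        # Sketch-level choice: media options are omitted when media is skipped.
        cmd.append("--skip-media")
    else:
        if min_media_size is not None:
            cmd += ["--min-media-size", str(min_media_size)]  # size in bytes
        if ai_describe_media:
            cmd.append("--ai-describe-media")
    if simplify_ru:
        cmd.append("--simplify-ru")
    return cmd


cmd = build_command("https://example.com", depth=3, min_media_size=20 * 1024)
print(cmd)
# ['scrape-simple', 'https://example.com', '--depth', '3', '--min-media-size', '20480']
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once scrape-simple and Tor are installed.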