Scrapion - Web Scraping Automation System
A Python library for automated web scraping with intelligent fallback mechanisms and accessibility handling.
Features
- Dual Input Modes: Accept URLs directly or search queries
- Smart URL Management: Automatically split search results into main (1-5) and backup (6-10) lists
- Intelligent Fallback: Retry with backup URLs if primary URLs fail
- Content Extraction: Uses Playwright for robust web content retrieval
- Search Integration: DuckDuckGo search with human-like behavior to evade bot detection
- Structured Reports: JSON-formatted reports with success/failure tracking
- Flexible Output: Write reports to stdout or save them to a file
- Auto Browser Setup: Firefox is installed automatically on first use
Installation
From PyPI (Recommended)
pip install scrapion
# Firefox browser will auto-install on first use
Build from Source
# Clone the repository
git clone https://github.com/aula-id/scrapion
cd scrapion
# Install in editable mode
pip install -e .
# Or install dependencies manually
pip install -r requirements.txt
Usage
As a Library
from scrapion import Client
# Create client (Firefox auto-installs if needed)
client = Client()
# Process single URL - report object contains all data
report = client.run("https://example.com")
# Access report data directly
print(f"Successful scrapes: {report.successful_scrapes}")
print(f"Results: {report.results}")
print(f"Report dict: {report.to_dict()}")
# Or output to stdout/file
client.output_report("stdio")
# Process search query
report = client.run("python async programming")
client.output_report("file", "./report.json")
# Skip browser check (useful in CI or when browser is pre-installed)
client = Client(skip_browser_check=True)
# Or via environment variable
# SCRAPION_SKIP_BROWSER_CHECK=1 python script.py
As a CLI Tool
# Output to stdout (JSON)
scrapion "https://example.com" --report stdio
scrapion "rust tutorial" --report stdio
# Save to file
scrapion "machine learning" --report file --output ./results.json
Architecture
Core Modules
- input_handler.py: Parse and validate user input (URL vs search query)
- list_manager.py: Manage URL lists (main list 1-5, backup list 6-10); see the sketch after this list
- search_engine.py: DuckDuckGo search with Playwright
- web_access.py: Fetch and convert web content to markdown
- report_generator.py: Generate JSON reports with metadata
- orchestrator.py: Main Client class workflow orchestrator (follows CONCEPT.md)
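As a rough illustration of the list_manager.py split, here is a minimal sketch; split_urls is a hypothetical helper written for this README, not Scrapion's actual API:

from typing import List, Tuple

def split_urls(urls: List[str]) -> Tuple[List[str], List[str]]:
    # Mirror the documented behavior: search results 1-5 become the
    # main list, results 6-10 the backup list.
    return urls[:5], urls[5:10]

main_list, backup_list = split_urls([f"https://example.com/{i}" for i in range(1, 11)])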
Workflow (CONCEPT.md)
User Input
  ↓
[Phase 1] Parse Input
  ├→ URL: Single URL mode
  └→ Query: Multi-URL mode
  ↓
[Phase 2] Search (if query)
  ├→ Execute DuckDuckGo search
  ├→ Extract 10 URLs
  └→ Split into main (1-5) and backup (6-10)
  ↓
[Phase 3] Scraping Loop
  ├→ Try main list (1-5)
  │   ├→ Success: Report and exit
  │   └→ Failure: Next from main
  └→ Try backup list (6-10)
      ├→ Success: Report and exit
      └→ Failure: Next from backup
  ↓
[Phase 4] Report Generation
  └→ Compile results and output
Report Object
The client.run() method returns a Report object with the following attributes and methods:
# Directly access report data
report.query # Original input (URL or query)
report.mode # "single_url" or "multi_url"
report.successful_scrapes # Number of successful scrapes
report.failed_scrapes # Number of failed scrapes
report.results # List of ScrapeResult objects
report.failed_urls # List of failed URLs
# Convert to dict or JSON
report.to_dict() # Returns dictionary
report.to_json() # Returns JSON string
# Output methods
report.print_to_stdout() # Print JSON to stdout
report.save_to_file("path") # Save to JSON file
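Putting those attributes together, a typical way to consume a report might look like the sketch below; the per-result field names are assumed to mirror the JSON structure in the next section:

from scrapion import Client

client = Client()
report = client.run("python async programming")

for result in report.results:
    # Field names mirror the JSON structure documented below
    if result.status == "success":
        print(result.url, len(result.content))

for url in report.failed_urls:
    print("failed:", url)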
JSON Structure
{
  "query": "search query or URL",
  "mode": "single_url or multi_url",
  "total_urls_attempted": 10,
  "successful_scrapes": 3,
  "failed_scrapes": 7,
  "results": [
    {
      "url": "https://example.com",
      "status": "success or failed",
      "accessible": true,
      "content": "scraped content...",
      "source": "main_list, backup_list, or single_url",
      "timestamp": "2025-10-31T08:39:07Z"
    }
  ],
  "failed_urls": ["url1", "url2"],
  "generated_at": "2025-10-31T08:39:07Z"
}
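A report saved with save_to_file() (or via the CLI's --output flag) is plain JSON, so it can be post-processed with the standard library alone. For example:

import json

with open("./results.json") as f:
    report = json.load(f)

# Summarize the run using the fields documented above
print(f"{report['successful_scrapes']}/{report['total_urls_attempted']} URLs scraped")
for entry in report["results"]:
    print(entry["url"], entry["status"], entry["source"])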
Examples
See example.py in the source repository for detailed usage examples.
If you've built from source:
python3 example.py
Configuration
Browser Setup
Firefox browser is automatically installed on first use. To skip the browser check:
# Skip via constructor parameter
client = Client(skip_browser_check=True)
# Or via environment variable
export SCRAPION_SKIP_BROWSER_CHECK=1
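Since Scrapion drives Firefox through Playwright, you can also pre-install the browser yourself with Playwright's own CLI (useful when baking CI images) and then skip Scrapion's runtime check:

# Pre-install Firefox once, e.g. in a CI image
python -m playwright install firefox

# Then skip the check at runtime
export SCRAPION_SKIP_BROWSER_CHECK=1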
Module Customization
Edit relevant modules to customize:
- Search engine (DuckDuckGo)
- Request timeouts
- Extraction rules
- Output formats
License
See LICENSE file for details.
Download files
File details
Details for the file scrapion-0.1.1.tar.gz.
File metadata
- Download URL: scrapion-0.1.1.tar.gz
- Upload date:
- Size: 29.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0155c65238b7f6dcc194c30052f4dc7f732da856bb814df35e9339429af71f49 |
| MD5 | 7c455ac187fce339b1b85d52aad72c01 |
| BLAKE2b-256 | dcf6324993f5a563894e85b1e665529e551c66c055240d9fe0c06841ab5ea7e9 |
File details
Details for the file scrapion-0.1.1-py3-none-any.whl.
File metadata
- Download URL: scrapion-0.1.1-py3-none-any.whl
- Upload date:
- Size: 23.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0595f4ad289a30e979ff8cf077123b40a11633c2b7a8a9c0249c639afc3fbced |
| MD5 | 6afddcc09d8e781e4e642109b88d881c |
| BLAKE2b-256 | 64b00a99ca3740d40124d4f2682087fe75667734076a10567a3e2a7f3c657413 |