
Web scraping automation system with intelligent fallback mechanisms

Project description

Scrapion - Web Scraping Automation System

A Python library for automated web scraping with intelligent fallback mechanisms and accessibility handling.

Features

  • Dual Input Modes: Accepts URLs directly or search queries
  • Smart URL Management: Automatically splits search results into a main list (1-5) and a backup list (6-10)
  • Intelligent Fallback: Retries with backup URLs if primary URLs fail
  • Content Extraction: Uses Playwright for robust web content retrieval
  • Search Integration: DuckDuckGo search with human-like behavior to evade bot detection
  • Structured Reports: JSON-formatted reports with success/failure tracking
  • Flexible Output: Writes to stdout or saves to a file
  • Auto Browser Setup: Firefox installs automatically on first use

Installation

From PyPI (Recommended)

pip install scrapion

# Firefox browser will auto-install on first use

Build from Source

# Clone the repository
git clone https://github.com/aula-id/scrapion
cd scrapion

# Install in editable mode
pip install -e .

# Or install dependencies manually
pip install -r requirements.txt

Usage

As a Library

from scrapion import Client

# Create client (Firefox auto-installs if needed)
client = Client()

# Process single URL - report object contains all data
report = client.run("https://example.com")

# Access report data directly
print(f"Successful scrapes: {report.successful_scrapes}")
print(f"Results: {report.results}")
print(f"Report dict: {report.to_dict()}")

# Or output to stdout/file
client.output_report("stdio")

# Process search query
report = client.run("python async programming")
client.output_report("file", "./report.json")

# Skip browser check (useful in CI or when browser is pre-installed)
client = Client(skip_browser_check=True)
# Or via environment variable
# SCRAPION_SKIP_BROWSER_CHECK=1 python script.py

As a CLI Tool

# Output to stdout (JSON)
scrapion "https://example.com" --report stdio
scrapion "rust tutorial" --report stdio

# Save to file
scrapion "machine learning" --report file --output ./results.json

Architecture

Core Modules

  1. input_handler.py: Parse and validate user input (URL vs search query)
  2. list_manager.py: Manage URL lists (main list 1-5, backup list 6-10)
  3. search_engine.py: DuckDuckGo search with Playwright
  4. web_access.py: Fetch and convert web content to markdown
  5. report_generator.py: Generate JSON reports with metadata
  6. orchestrator.py: Main Client class workflow orchestrator (follows CONCEPT.md)
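The main/backup split handled by list_manager.py can be sketched as follows (split_urls is a hypothetical helper written for illustration, not the library's actual API):

```python
def split_urls(urls):
    """Split up to 10 search-result URLs into a main list (results 1-5)
    and a backup list (results 6-10), as described above."""
    return urls[:5], urls[5:10]


# Example: 10 search results -> 5 main, 5 backup
urls = [f"https://example.com/{i}" for i in range(1, 11)]
main_list, backup_list = split_urls(urls)
print(len(main_list), len(backup_list))  # 5 5
```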

Workflow (CONCEPT.md)

User Input
    ↓
[Phase 1] Parse Input
    ├→ URL: Single URL mode
    └→ Query: Multi-URL mode

[Phase 2] Search (if query)
    ├→ Execute DuckDuckGo search
    ├→ Extract 10 URLs
    └→ Split into main (1-5) and backup (6-10)

[Phase 3] Scraping Loop
    ├→ Try main list (1-5)
    │  ├→ Success: Report and exit
    │  └→ Failure: Next from main
    └→ Try backup list (6-10)
       ├→ Success: Report and exit
       └→ Failure: Next from backup

[Phase 4] Report Generation
    └→ Compile results and output
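The Phase 3 scraping loop above can be sketched in a few lines of Python. This is an illustrative model of the fallback logic, not the library's implementation; fetch stands in for whatever function retrieves page content and raises on failure:

```python
def scrape_with_fallback(main_list, backup_list, fetch):
    """Try each URL in the main list, then the backup list.

    Returns (result, failed_urls): the first successful scrape as a
    dict, or None if every URL failed, plus the URLs that failed.
    """
    failed_urls = []
    for source, urls in (("main_list", main_list), ("backup_list", backup_list)):
        for url in urls:
            try:
                content = fetch(url)
            except Exception:
                # Failure: record the URL and move on to the next one
                failed_urls.append(url)
                continue
            # Success: report and exit
            return {"url": url, "content": content, "source": source}, failed_urls
    return None, failed_urls


# Demo with a stub fetcher that only "reaches" one URL
def fetch(url):
    if "good" not in url:
        raise RuntimeError("unreachable")
    return "page content"

result, failed = scrape_with_fallback(
    ["https://bad.example"], ["https://good.example"], fetch
)
print(result["source"])  # backup_list
```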

Report Object

The client.run() method returns a Report object with the following attributes:

# Directly access report data
report.query                  # Original input (URL or query)
report.mode                   # "single_url" or "multi_url"
report.successful_scrapes     # Number of successful scrapes
report.failed_scrapes         # Number of failed scrapes
report.results                # List of ScrapeResult objects
report.failed_urls            # List of failed URLs

# Convert to dict or JSON
report.to_dict()              # Returns dictionary
report.to_json()              # Returns JSON string

# Output methods
report.print_to_stdout()      # Print JSON to stdout
report.save_to_file("path")   # Save to JSON file

JSON Structure

{
  "query": "search query or URL",
  "mode": "single_url or multi_url",
  "total_urls_attempted": 10,
  "successful_scrapes": 3,
  "failed_scrapes": 7,
  "results": [
    {
      "url": "https://example.com",
      "status": "success or failed",
      "accessible": true,
      "content": "scraped content...",
      "source": "main_list, backup_list, or single_url",
      "timestamp": "2025-10-31T08:39:07Z"
    }
  ],
  "failed_urls": ["url1", "url2"],
  "generated_at": "2025-10-31T08:39:07Z"
}
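A saved report in this format can be consumed with the standard json module. For example, collecting the URLs that scraped successfully (the sample payload below is illustrative):

```python
import json

sample = """
{
  "query": "example",
  "mode": "multi_url",
  "total_urls_attempted": 2,
  "successful_scrapes": 1,
  "failed_scrapes": 1,
  "results": [
    {"url": "https://example.com", "status": "success", "accessible": true,
     "content": "scraped content", "source": "main_list",
     "timestamp": "2025-10-31T08:39:07Z"}
  ],
  "failed_urls": ["https://example.org"],
  "generated_at": "2025-10-31T08:39:07Z"
}
"""

report = json.loads(sample)
successful = [r["url"] for r in report["results"] if r["status"] == "success"]
print(successful)  # ['https://example.com']
```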

Examples

See example.py in the source repository for detailed usage examples.

If you've built from source:

python3 example.py

Configuration

Browser Setup

Firefox browser is automatically installed on first use. To skip the browser check:

# Skip via constructor parameter
client = Client(skip_browser_check=True)

# Or via environment variable
export SCRAPION_SKIP_BROWSER_CHECK=1

Module Customization

Edit the relevant modules to customize:

  • Search engine (DuckDuckGo)
  • Request timeouts
  • Extraction rules
  • Output formats

License

See LICENSE file for details.

Download files

Download the file for your platform.

Source Distribution

scrapion-0.1.0.tar.gz (29.8 kB)

Built Distribution


scrapion-0.1.0-py3-none-any.whl (23.9 kB)

File details

Details for the file scrapion-0.1.0.tar.gz.

File metadata

  • Download URL: scrapion-0.1.0.tar.gz
  • Size: 29.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for scrapion-0.1.0.tar.gz:

  • SHA256: 63f7d3a64b60c98b85c16c5bd72693eca4ff94aa66e65188a93f2443a6d02d28
  • MD5: 7c5d17800a04eccfec04bd9dda29c14f
  • BLAKE2b-256: f5384c571bf2f1b78c1500bdbf3412309e666007af7279929df00ea59d3944a8


File details

Details for the file scrapion-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: scrapion-0.1.0-py3-none-any.whl
  • Size: 23.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for scrapion-0.1.0-py3-none-any.whl:

  • SHA256: 524be76662e5d3d4445735691542e5290fcefebdb549f615adc6f2b4cefd3919
  • MD5: 7aa578fda1f49f883e6d152f367b152b
  • BLAKE2b-256: 91dac28e275b60ab444fbae704575f2a24a190d94845d4ab1ace95b621c01185

