
Web scraping automation system with intelligent fallback mechanisms

Project description

Scrapion - Web Scraping Automation System

A Python library for automated web scraping with intelligent fallback mechanisms and accessibility handling.

Features

  • Dual Input Modes: Accepts URLs directly or search queries
  • Smart URL Management: Automatically splits search results into a main list (1-5) and a backup list (6-10)
  • Intelligent Fallback: Retries with backup URLs if primary URLs fail
  • Content Extraction: Uses Playwright for robust web content retrieval
  • Search Integration: DuckDuckGo search with human-like behavior to evade bot detection
  • Structured Reports: JSON-formatted reports with success/failure tracking
  • Flexible Output: Writes to stdout or saves to a file
  • Auto Browser Setup: Firefox installs automatically on first use

Installation

From PyPI (Recommended)

pip install scrapion

# Firefox browser will auto-install on first use

Build from Source

# Clone the repository
git clone https://github.com/aula-id/scrapion
cd scrapion

# Install in editable mode
pip install -e .

# Or install dependencies manually
pip install -r requirements.txt

Usage

As a Library

from scrapion import Client

# Create client (Firefox auto-installs if needed)
client = Client()

# Process single URL - report object contains all data
report = client.run("https://example.com")

# Access report data directly
print(f"Successful scrapes: {report.successful_scrapes}")
print(f"Results: {report.results}")
print(f"Report dict: {report.to_dict()}")

# Or output to stdout/file
client.output_report("stdio")

# Process search query
report = client.run("python async programming")
client.output_report("file", "./report.json")

# Skip browser check (useful in CI or when browser is pre-installed)
client = Client(skip_browser_check=True)
# Or via environment variable
# SCRAPION_SKIP_BROWSER_CHECK=1 python script.py

As a CLI Tool

# Output to stdout (JSON)
scrapion "https://example.com" --report stdio
scrapion "rust tutorial" --report stdio

# Save to file
scrapion "machine learning" --report file --output ./results.json

Architecture

Core Modules

  1. input_handler.py: Parse and validate user input (URL vs search query)
  2. list_manager.py: Manage URL lists (main list 1-5, backup list 6-10)
  3. search_engine.py: DuckDuckGo search with Playwright
  4. web_access.py: Fetch and convert web content to markdown
  5. report_generator.py: Generate JSON reports with metadata
  6. orchestrator.py: Main Client class workflow orchestrator (follows CONCEPT.md)
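The main/backup split handled by list_manager.py can be sketched as follows (split_urls is a hypothetical helper written for illustration, not the library's actual API):

```python
def split_urls(urls):
    """Split up to 10 search-result URLs into a main list (results 1-5)
    and a backup list (results 6-10), as described above."""
    return urls[:5], urls[5:10]


# Example: 10 search results -> 5 main, 5 backup
urls = [f"https://example.com/{i}" for i in range(1, 11)]
main_list, backup_list = split_urls(urls)
print(len(main_list), len(backup_list))  # 5 5
```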

Workflow (CONCEPT.md)

User Input
    ↓
[Phase 1] Parse Input
    ├→ URL: Single URL mode
    └→ Query: Multi-URL mode

[Phase 2] Search (if query)
    ├→ Execute DuckDuckGo search
    ├→ Extract 10 URLs
    └→ Split into main (1-5) and backup (6-10)

[Phase 3] Scraping Loop
    ├→ Try main list (1-5)
    │  ├→ Success: Report and exit
    │  └→ Failure: Next from main
    └→ Try backup list (6-10)
       ├→ Success: Report and exit
       └→ Failure: Next from backup

[Phase 4] Report Generation
    └→ Compile results and output
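The Phase 3 scraping loop above can be sketched in a few lines of Python. This is an illustrative model of the fallback logic, not the library's implementation; fetch stands in for whatever function retrieves page content and raises on failure:

```python
def scrape_with_fallback(main_list, backup_list, fetch):
    """Try each URL in the main list, then the backup list.

    Returns (result, failed_urls): the first successful scrape as a
    dict, or None if every URL failed, plus the URLs that failed.
    """
    failed_urls = []
    for source, urls in (("main_list", main_list), ("backup_list", backup_list)):
        for url in urls:
            try:
                content = fetch(url)
            except Exception:
                # Failure: record the URL and move on to the next one
                failed_urls.append(url)
                continue
            # Success: report and exit
            return {"url": url, "content": content, "source": source}, failed_urls
    return None, failed_urls


# Demo with a stub fetcher that only "reaches" one URL
def fetch(url):
    if "good" not in url:
        raise RuntimeError("unreachable")
    return "page content"

result, failed = scrape_with_fallback(
    ["https://bad.example"], ["https://good.example"], fetch
)
print(result["source"])  # backup_list
```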

Report Object

The client.run() method returns a Report object with the following attributes:

# Directly access report data
report.query                  # Original input (URL or query)
report.mode                   # "single_url" or "multi_url"
report.successful_scrapes     # Number of successful scrapes
report.failed_scrapes         # Number of failed scrapes
report.results                # List of ScrapeResult objects
report.failed_urls            # List of failed URLs

# Convert to dict or JSON
report.to_dict()              # Returns dictionary
report.to_json()              # Returns JSON string

# Output methods
report.print_to_stdout()      # Print JSON to stdout
report.save_to_file("path")   # Save to JSON file

JSON Structure

{
  "query": "search query or URL",
  "mode": "single_url or multi_url",
  "total_urls_attempted": 10,
  "successful_scrapes": 3,
  "failed_scrapes": 7,
  "results": [
    {
      "url": "https://example.com",
      "status": "success or failed",
      "accessible": true,
      "content": "scraped content...",
      "source": "main_list, backup_list, or single_url",
      "timestamp": "2025-10-31T08:39:07Z"
    }
  ],
  "failed_urls": ["url1", "url2"],
  "generated_at": "2025-10-31T08:39:07Z"
}
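A saved report in this format can be consumed with the standard json module. For example, collecting the URLs that scraped successfully (the sample payload below is illustrative):

```python
import json

sample = """
{
  "query": "example",
  "mode": "multi_url",
  "total_urls_attempted": 2,
  "successful_scrapes": 1,
  "failed_scrapes": 1,
  "results": [
    {"url": "https://example.com", "status": "success", "accessible": true,
     "content": "scraped content", "source": "main_list",
     "timestamp": "2025-10-31T08:39:07Z"}
  ],
  "failed_urls": ["https://example.org"],
  "generated_at": "2025-10-31T08:39:07Z"
}
"""

report = json.loads(sample)
successful = [r["url"] for r in report["results"] if r["status"] == "success"]
print(successful)  # ['https://example.com']
```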

Examples

See example.py in the source repository for detailed usage examples.

If you've built from source:

python3 example.py

Configuration

Browser Setup

Firefox browser is automatically installed on first use. To skip the browser check:

# Skip via constructor parameter
client = Client(skip_browser_check=True)

# Or via environment variable
export SCRAPION_SKIP_BROWSER_CHECK=1

Module Customization

Edit the relevant modules to customize:

  • Search engine (DuckDuckGo)
  • Request timeouts
  • Extraction rules
  • Output formats

License

See LICENSE file for details.

Download files

Download the file for your platform.

Source Distribution

scrapion-0.1.0.tar.gz (29.8 kB)

Built Distribution


scrapion-0.1.0-py3-none-any.whl (23.9 kB)

File details

Details for the file scrapion-0.1.0.tar.gz.

File metadata

  • Download URL: scrapion-0.1.0.tar.gz
  • Size: 29.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for scrapion-0.1.0.tar.gz:

  • SHA256: 63f7d3a64b60c98b85c16c5bd72693eca4ff94aa66e65188a93f2443a6d02d28
  • MD5: 7c5d17800a04eccfec04bd9dda29c14f
  • BLAKE2b-256: f5384c571bf2f1b78c1500bdbf3412309e666007af7279929df00ea59d3944a8


File details

Details for the file scrapion-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: scrapion-0.1.0-py3-none-any.whl
  • Size: 23.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for scrapion-0.1.0-py3-none-any.whl:

  • SHA256: 524be76662e5d3d4445735691542e5290fcefebdb549f615adc6f2b4cefd3919
  • MD5: 7aa578fda1f49f883e6d152f367b152b
  • BLAKE2b-256: 91dac28e275b60ab444fbae704575f2a24a190d94845d4ab1ace95b621c01185

