Website Scraper

A robust, multiprocessing-enabled web scraper that can be used both as a module and as a command-line tool. Features include rate limiting, bot detection avoidance, and comprehensive logging.

Features

  • Multiprocessing support for faster scraping
  • Rate limiting and random delays to avoid detection
  • Rotating User-Agents and browser fingerprints
  • Comprehensive logging system with separate debug and info logs
  • Progress tracking with a progress bar
  • Both module and CLI interfaces
  • JSON output format
  • Configurable retry mechanism
  • XML content detection and proper handling
  • SSL verification options
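
To make the "random delays" and "rotating User-Agents" features concrete, here is a minimal sketch of the general technique. It is not the package's actual implementation: the `USER_AGENTS` pool and the `pick_delay` and `build_headers` helpers are hypothetical names introduced only for illustration.

```python
import random

# Hypothetical User-Agent pool; the package's real list is not shown in this README.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_delay(min_delay: float, max_delay: float, rng: random.Random) -> float:
    """Return a random pause, in seconds, within [min_delay, max_delay]."""
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return rng.uniform(min_delay, max_delay)


def build_headers(rng: random.Random) -> dict:
    """Pick a User-Agent at random for the next request."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


rng = random.Random(0)  # seeded only so the sketch is reproducible
delay = pick_delay(2, 5, rng)
headers = build_headers(rng)
print(f"sleep {delay:.2f}s, then send a request as {headers['User-Agent'][:30]}...")
```

A real scraper would call `time.sleep(delay)` before each request and pass `headers` to its HTTP client.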

Installation

From Source

  1. Clone the repository:

    git clone git@github.com:ml-lubich/website-scraper.git
    cd website-scraper
    
  2. Install the package:

    pip install .
    

From PyPI

pip install website-scraper

Usage

As a Command-Line Tool

The package installs a website-scraper command that can be used directly:

Basic usage:

website-scraper https://example.com

With options (long form):

website-scraper https://example.com \
    --min-delay 2 \
    --max-delay 5 \
    --workers 4 \
    --output results.json \
    --log-dir logs \
    --no-verify-ssl

With options (short form):

website-scraper https://example.com \
    -m 2 \
    -M 5 \
    -w 4 \
    -o results.json \
    -l logs \
    -k

Available options:

  • -m, --min-delay: Minimum delay between requests (seconds)
  • -M, --max-delay: Maximum delay between requests (seconds)
  • -r, --retries: Maximum number of retry attempts
  • -w, --workers: Number of worker processes
  • -l, --log-dir: Directory to store log files
  • -o, --output: Output file path for scraped data (JSON)
  • -q, --quiet: Suppress the progress bar
  • -k, --no-verify-ssl: Disable SSL certificate verification (use with caution)
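
For readers who want to see how such an option set fits together, the following is a sketch of an `argparse` parser that mirrors the flags listed above. It is an illustration, not the package's own CLI code: the default values (other than the `logs` log directory mentioned in the Logging section) and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same flags as the option list above."""
    parser = argparse.ArgumentParser(prog="website-scraper")
    parser.add_argument("url", help="Start URL to scrape")
    # Numeric defaults below are placeholders, not the package's documented defaults.
    parser.add_argument("-m", "--min-delay", type=float, default=1.0)
    parser.add_argument("-M", "--max-delay", type=float, default=3.0)
    parser.add_argument("-r", "--retries", type=int, default=3)
    parser.add_argument("-w", "--workers", type=int, default=None)
    parser.add_argument("-l", "--log-dir", default="logs")
    parser.add_argument("-o", "--output", default=None)
    parser.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("-k", "--no-verify-ssl", action="store_true")
    return parser


# Parse the short-form example from this README.
args = build_parser().parse_args(
    ["https://example.com", "-m", "2", "-M", "5", "-w", "4", "-o", "results.json", "-k"]
)
print(args.min_delay, args.max_delay, args.workers, args.output, args.no_verify_ssl)
```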

Output Handling

The scraper can handle output in two ways:

  1. Write to a file (when -o or --output is specified)
  2. Print to stdout (when no output file is specified)

This allows for flexible usage:

# Write to file
website-scraper https://example.com -o results.json

# Pipe to another command
website-scraper https://example.com | jq .

# Save output using shell redirection
website-scraper https://example.com > results.json
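
The file-or-stdout behaviour described above can be sketched as follows. This is an assumed implementation of the pattern, not code taken from the package, and `emit_results` is a hypothetical helper name.

```python
import io
import json
import sys
from typing import Optional, TextIO


def emit_results(result: dict, output_path: Optional[str] = None,
                 stream: TextIO = sys.stdout) -> None:
    """Write result as JSON to output_path if given, otherwise to stream."""
    if output_path:
        with open(output_path, "w", encoding="utf-8") as fh:
            json.dump(result, fh, indent=4, ensure_ascii=False)
    else:
        json.dump(result, stream, indent=4, ensure_ascii=False)
        stream.write("\n")


sample = {"data": {}, "stats": {"total_pages_scraped": 0}}
buffer = io.StringIO()
emit_results(sample, stream=buffer)  # no output path, so the JSON goes to the stream
print(buffer.getvalue())
```

For piping to work with tools such as `jq`, only the JSON document should reach stdout. Sending progress and log messages to stderr or to log files is one common way to keep it parseable.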

As a Python Package

from website_scraper import WebScraper

# Initialize the scraper
scraper = WebScraper(
    base_url="https://example.com",
    delay_range=(2, 5),
    max_retries=3,
    log_dir="logs",
    verify_ssl=True  # Set to False to disable SSL verification
)

# Start scraping
data, stats = scraper.scrape(show_progress=True)

# Process results
print(f"Scraped {stats['total_pages_scraped']} pages")
print(f"Processed {stats['total_urls_processed']} URLs")

Output Format

The scraper outputs JSON data in the following format:

{
    "data": {
        "url1": {
            "title": "Page Title",
            "text": "Page Content",
            "meta_description": "Meta Description"
        }
        // ... more URLs
    },
    "stats": {
        "total_pages_scraped": 10,
        "total_urls_processed": 12,
        "failed_urls": 2,
        "start_url": "https://example.com",
        "duration": "5 minutes",
        "success_rate": "83.3%"
    }
}
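
As a sketch of how a consumer might read this output, the snippet below parses a sample document in the format shown above and recomputes the success rate. The formula (pages scraped divided by URLs processed, rounded to one decimal place) is an assumption that happens to match the sample numbers; the README does not state how the package computes it. The `// ... more URLs` placeholder is omitted because standard JSON has no comments.

```python
import json

# Sample document in the README's output format.
raw = """
{
    "data": {
        "https://example.com/": {
            "title": "Page Title",
            "text": "Page Content",
            "meta_description": "Meta Description"
        }
    },
    "stats": {
        "total_pages_scraped": 10,
        "total_urls_processed": 12,
        "failed_urls": 2
    }
}
"""

result = json.loads(raw)
stats = result["stats"]

# Assumed formula: scraped pages as a percentage of processed URLs.
ratio = stats["total_pages_scraped"] / stats["total_urls_processed"]
success_rate = f"{ratio * 100:.1f}%"

# Map each scraped URL to its page title.
titles = {url: page["title"] for url, page in result["data"].items()}
print(success_rate, titles)
```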

Development

  1. Clone the repository:

    git clone git@github.com:ml-lubich/website-scraper.git
    cd website-scraper
    
  2. Create a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install in development mode:

    pip install -e .
    

Logging

Logs are stored in the specified log directory (default: logs/). Two types of log files are generated:

  • [timestamp].log: Contains INFO level and above messages
  • debug_[timestamp].log: Contains detailed DEBUG level messages

The logs include:

  • Request attempts and responses
  • Pages being processed
  • Successful scrapes
  • Failed attempts
  • Progress updates
  • Error messages
  • Content type detection
  • Parser selection
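
The two-file layout described above can be reproduced with the standard `logging` module. The sketch below is one way to do it, not the package's own setup: the timestamp format, the logger name, and the `setup_logging` helper are assumptions.

```python
import logging
import os
import tempfile
from datetime import datetime


def setup_logging(log_dir: str, name: str = "scraper_sketch") -> logging.Logger:
    """Attach an INFO-level and a DEBUG-level file handler to one logger."""
    os.makedirs(log_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # accept everything; handlers do the filtering
    logger.propagate = False

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for filename, level in ((f"{stamp}.log", logging.INFO),
                            (f"debug_{stamp}.log", logging.DEBUG)):
        handler = logging.FileHandler(os.path.join(log_dir, filename), encoding="utf-8")
        handler.setLevel(level)
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()  # temporary directory, used only for this demo
logger = setup_logging(log_dir)
logger.debug("parser selected: html.parser")  # written to the debug file only
logger.info("scraped https://example.com")    # written to both files
```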

Error Handling

  • Automatic retry mechanism for failed requests
  • Graceful handling of SSL certificate issues
  • Proper handling of XML vs HTML content
  • Rate limiting and timeout handling
  • Comprehensive error logging
  • All errors are logged but don't stop the scraping process
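
The retry-and-continue behaviour listed above can be illustrated with a small sketch. It is not the package's actual mechanism: `fetch_with_retries`, its parameters, and the exponential backoff are assumptions chosen for illustration, and `flaky_fetch` is a stand-in for a real HTTP request.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(fetch: Callable[[str], str], url: str, max_retries: int = 3,
                       backoff: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call fetch(url) up to max_retries + 1 times; return None if all attempts fail."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower exceptions
            print(f"attempt {attempt + 1} for {url} failed: {exc}")
            if attempt < max_retries:
                sleep(backoff * (2 ** attempt))  # exponential backoff (an assumption)
    return None  # give up on this URL, but let the caller carry on with others


calls = {"count": 0}


def flaky_fetch(url: str) -> str:
    """Stand-in for an HTTP request that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


# The sleep function is replaced with a no-op so the demo runs instantly.
page = fetch_with_retries(flaky_fetch, "https://example.com", sleep=lambda s: None)
print(page, calls["count"])
```

Returning `None` instead of raising keeps a single bad URL from aborting the whole crawl, which matches the "logged but don't stop" behaviour described in the list.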

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
