A robust, multiprocessing-enabled web scraper

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

mlubich

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

Website Scraper

A robust, multiprocessing-enabled web scraper that can be used both as a module and as a command-line tool. Features include rate limiting, bot detection avoidance, and comprehensive logging.

Features

- Multiprocessing support for faster scraping
- Rate limiting and random delays to avoid detection
- Rotating User-Agents and browser fingerprints
- Comprehensive logging system with separate debug and info logs
- Progress tracking with a progress bar
- Both module and CLI interfaces
- JSON output format
- Configurable retry mechanism
- XML content detection and proper handling
- SSL verification options
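As a rough illustration of the random-delay and User-Agent rotation features, the following is a minimal sketch using only the standard library. The `USER_AGENTS` pool and the helper names `next_headers` and `polite_pause` are hypothetical and are not taken from the package; its real implementation may differ.

```python
import random
import time

# Hypothetical pool of User-Agent strings; not taken from the package.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def next_headers(rng=random):
    """Pick a User-Agent at random for the next request."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def polite_pause(min_delay, max_delay, sleep=time.sleep, rng=random):
    """Sleep for a random duration between min_delay and max_delay seconds.

    Returns the chosen delay so callers can log it.
    """
    delay = rng.uniform(min_delay, max_delay)
    sleep(delay)
    return delay


headers = next_headers()
waited = polite_pause(0.0, 0.01)  # tiny range so the demo finishes quickly
print(headers["User-Agent"], f"(waited {waited:.3f}s)")
```

Passing `sleep` and `rng` as parameters keeps the helpers easy to test without real waiting.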

Installation

From Source

Clone the repository:

```
git clone git@github.com:ml-lubich/website-scraper.git
cd website-scraper
```

Install the package:
```
pip install .
```

From PyPI

```
pip install website-scraper
```

Usage

As a Command-Line Tool

The package installs a website-scraper command that can be used directly:

Basic usage:

```
website-scraper https://example.com
```

With options (long form):

```
website-scraper https://example.com \
    --min-delay 2 \
    --max-delay 5 \
    --workers 4 \
    --output results.json \
    --log-dir logs \
    --no-verify-ssl
```

With options (short form):

```
website-scraper https://example.com \
    -m 2 \
    -M 5 \
    -w 4 \
    -o results.json \
    -l logs \
    -k
```

Available options:

- `-m`, `--min-delay`: Minimum delay between requests (seconds)
- `-M`, `--max-delay`: Maximum delay between requests (seconds)
- `-r`, `--retries`: Maximum number of retry attempts
- `-w`, `--workers`: Number of worker processes
- `-l`, `--log-dir`: Directory to store log files
- `-o`, `--output`: Output file path for scraped data (JSON)
- `-q`, `--quiet`: Suppress progress bar
- `-k`, `--no-verify-ssl`: Disable SSL certificate verification (use with caution)
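When the CLI is driven from another Python script, the options above can be assembled into an argument list. The sketch below uses only the long flags listed above. The helper names `build_command` and `run_scraper` are hypothetical and not part of the package.

```python
import subprocess


def build_command(url, min_delay=None, max_delay=None, retries=None,
                  workers=None, log_dir=None, output=None,
                  quiet=False, verify_ssl=True):
    """Build an argument list for the website-scraper CLI.

    Options left as None are omitted, so the CLI's own defaults apply.
    """
    cmd = ["website-scraper", url]
    valued_options = [
        ("--min-delay", min_delay),
        ("--max-delay", max_delay),
        ("--retries", retries),
        ("--workers", workers),
        ("--log-dir", log_dir),
        ("--output", output),
    ]
    for flag, value in valued_options:
        if value is not None:
            cmd += [flag, str(value)]
    if quiet:
        cmd.append("--quiet")
    if not verify_ssl:
        cmd.append("--no-verify-ssl")
    return cmd


def run_scraper(url, **options):
    """Run the CLI; requires website-scraper to be installed and on PATH."""
    return subprocess.run(build_command(url, **options), check=True)


cmd = build_command("https://example.com", min_delay=2, max_delay=5,
                    workers=4, output="results.json")
print(" ".join(cmd))
```

Building a list rather than a single shell string avoids quoting problems with unusual URLs or paths.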

Output Handling

The scraper can handle output in two ways:

- Write to a file (when `-o` or `--output` is specified)
- Print to stdout (when no output file is specified)

This allows for flexible usage:

```
# Write to file
website-scraper example.com -o results.json

# Pipe to another command
website-scraper example.com | jq .

# Save output using shell redirection
website-scraper example.com > results.json
```
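The file-or-stdout behaviour described above is a common pattern. The following is a minimal sketch of how it could be implemented, not the package's actual code. The function name `write_results` and its parameters are hypothetical.

```python
import io
import json
import sys


def write_results(results, output_path=None, stream=None):
    """Write results as JSON to output_path if given, otherwise to a stream.

    The stream defaults to sys.stdout, which keeps the output pipeable.
    """
    if output_path:
        with open(output_path, "w", encoding="utf-8") as fh:
            json.dump(results, fh, indent=2, ensure_ascii=False)
    else:
        target = stream if stream is not None else sys.stdout
        json.dump(results, target, indent=2, ensure_ascii=False)
        target.write("\n")


# Demonstration with an in-memory stream standing in for stdout.
buffer = io.StringIO()
write_results({"data": {}, "stats": {"total_pages_scraped": 0}}, stream=buffer)
print(buffer.getvalue())
```

Keeping progress bars and log messages off stdout is what makes piping into tools such as `jq` reliable.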

As a Python Package

```
from website_scraper import WebScraper

# Initialize the scraper
scraper = WebScraper(
    base_url="https://example.com",
    delay_range=(2, 5),
    max_retries=3,
    log_dir="logs",
    verify_ssl=True  # Set to False to disable SSL verification
)

# Start scraping
data, stats = scraper.scrape(show_progress=True)

# Process results
print(f"Scraped {stats['total_pages_scraped']} pages")
print(f"Processed {stats['total_urls_processed']} URLs")
```
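Once `data` is available, it can be post-processed like any dictionary. The sketch below assumes that `data` maps each URL to a dictionary with `title`, `text`, and `meta_description` keys, mirroring the JSON output format. The README does not state that the in-memory shape is identical, so treat this as an assumption. The sample values and the helper `pages_missing_description` are made up for illustration.

```python
def pages_missing_description(data):
    """Return the URLs whose entry has an empty or absent meta_description."""
    return sorted(
        url for url, page in data.items()
        if not page.get("meta_description")
    )


# Hand-written sample standing in for the `data` returned by the scraper.
sample_data = {
    "https://example.com/": {
        "title": "Home",
        "text": "Welcome",
        "meta_description": "Landing page",
    },
    "https://example.com/about": {
        "title": "About",
        "text": "About us",
        "meta_description": "",
    },
}

print(pages_missing_description(sample_data))
```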

Output Format

The scraper outputs JSON data in the following format, with one entry under "data" for each scraped URL:

```
{
    "data": {
        "url1": {
            "title": "Page Title",
            "text": "Page Content",
            "meta_description": "Meta Description"
        }
    },
    "stats": {
        "total_pages_scraped": 10,
        "total_urls_processed": 12,
        "failed_urls": 2,
        "start_url": "https://example.com",
        "duration": "5 minutes",
        "success_rate": "83.3%"
    }
}
```
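The `stats` block can be checked after loading the JSON. The sketch below assumes that `success_rate` is the number of pages scraped divided by the number of URLs processed. That matches the example values above (10 of 12), but the package does not document the formula, so this is an inference. The helper name `success_rate` is hypothetical.

```python
import json


def success_rate(stats):
    """Percentage of processed URLs that were scraped successfully."""
    processed = stats["total_urls_processed"]
    if processed == 0:
        return 0.0  # avoid dividing by zero when nothing was processed
    return 100.0 * stats["total_pages_scraped"] / processed


# Inline JSON standing in for the contents of results.json.
raw = """
{
    "data": {},
    "stats": {
        "total_pages_scraped": 10,
        "total_urls_processed": 12,
        "failed_urls": 2
    }
}
"""
stats = json.loads(raw)["stats"]
print(f"{success_rate(stats):.1f}%")  # prints 83.3%
```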

Development

Clone the repository:

```
git clone git@github.com:ml-lubich/website-scraper.git
cd website-scraper
```

Create a virtual environment:

```
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Install in development mode:
```
pip install -e .
```

Logging

Logs are stored in the specified log directory (default: logs/). Two types of log files are generated:

- `[timestamp].log`: Contains INFO level and above messages
- `debug_[timestamp].log`: Contains detailed DEBUG level messages

The logs include:

- Request attempts and responses
- Pages being processed
- Successful scrapes
- Failed attempts
- Progress updates
- Error messages
- Content type detection
- Parser selection
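A two-file layout like the one described above can be built with the standard `logging` module by attaching two file handlers at different levels. This is a minimal sketch and not the package's actual setup. The function name `setup_logging`, the logger name, and the timestamp format are assumptions.

```python
import logging
import os
import tempfile
from datetime import datetime


def setup_logging(log_dir, name="scraper_sketch"):
    """Create a logger that writes INFO+ to one file and DEBUG+ to another."""
    os.makedirs(log_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed format
    info_path = os.path.join(log_dir, f"{stamp}.log")
    debug_path = os.path.join(log_dir, f"debug_{stamp}.log")

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.propagate = False
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    for path, level in ((info_path, logging.INFO), (debug_path, logging.DEBUG)):
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, info_path, debug_path


# Demonstration in a temporary directory.
logger, info_path, debug_path = setup_logging(tempfile.mkdtemp())
logger.debug("parser selected: html.parser")
logger.info("scraped https://example.com")

with open(info_path, encoding="utf-8") as fh:
    info_text = fh.read()
with open(debug_path, encoding="utf-8") as fh:
    debug_text = fh.read()
```

The DEBUG message ends up only in the `debug_` file, while the INFO message appears in both. Calling `setup_logging` twice with the same logger name would attach duplicate handlers, so a real implementation would guard against that.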

Error Handling

- Automatic retry mechanism for failed requests
- Graceful handling of SSL certificate issues
- Proper handling of XML vs HTML content
- Rate limiting and timeout handling
- Comprehensive error logging
- All errors are logged but don't stop the scraping process
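The retry behaviour listed above can be pictured with a small loop that logs each failure and carries on. This is a minimal sketch and not the package's implementation. `fetch_with_retries` and the `fetch` callable are hypothetical, and the parameter names only echo `max_retries` and `delay_range` from the Python usage example.

```python
import random
import time


def fetch_with_retries(fetch, url, max_retries=3, delay_range=(2, 5),
                       sleep=time.sleep):
    """Call fetch(url), making at most max_retries attempts.

    Failures are reported and retried after a random delay. If every
    attempt fails, None is returned so the caller can move on.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print(f"attempt {attempt}/{max_retries} failed for {url}: {exc}")
            if attempt < max_retries:
                sleep(random.uniform(*delay_range))
    return None


# Demonstration with a fake fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retries(flaky_fetch, "https://example.com",
                            sleep=lambda seconds: None)  # skip real waiting
print(result)
```

Returning `None` instead of raising is what lets a crawl continue past a URL that never succeeds.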

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

Release history

- 0.2.0: Dec 7, 2025
- 0.1.1: Jan 10, 2025 (this version)

Download files

Source Distribution

website_scraper-0.1.1.tar.gz (12.9 kB)

Uploaded Jan 10, 2025 Source

Built Distribution

website_scraper-0.1.1-py3-none-any.whl (12.0 kB)

Uploaded Jan 10, 2025 Python 3

File details

Details for the file website_scraper-0.1.1.tar.gz.

File metadata

Download URL: website_scraper-0.1.1.tar.gz
Upload date: Jan 10, 2025
Size: 12.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for website_scraper-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`15b864a2d6817a9d4b4be413bb64a2618c4a71e90f11742ed6894574bfd6594b`
MD5	`2e8ffedea8f79ad5dd24ea77d807235c`
BLAKE2b-256	`8c774e2708061a538827fcf86aeb15b46c66860517c5dbfb5ceda0bfe2f80e30`

Provenance

The following attestation bundles were made for website_scraper-0.1.1.tar.gz:

Publisher: publish.yml on ml-lubich/website-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: website_scraper-0.1.1.tar.gz
- Subject digest: 15b864a2d6817a9d4b4be413bb64a2618c4a71e90f11742ed6894574bfd6594b
- Sigstore transparency entry: 161341427
- Sigstore integration time: Jan 10, 2025
Source repository:
- Permalink: ml-lubich/website-scraper@5a2ea674170ed3726d821bb6ffad8b5b775ec67c
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/ml-lubich
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5a2ea674170ed3726d821bb6ffad8b5b775ec67c
- Trigger Event: release

File details

Details for the file website_scraper-0.1.1-py3-none-any.whl.

File metadata

Download URL: website_scraper-0.1.1-py3-none-any.whl
Upload date: Jan 10, 2025
Size: 12.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for website_scraper-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`064edb69f33472d123f6c805f4b1fa5416b5d7883f9efe3187068330684f395e`
MD5	`f5cbd1db11c898cdc33ad2fe6f013aee`
BLAKE2b-256	`b482026e9dc6d6d69a151f1fc102b0eca2af075fb5325347dcb41fc9e9a4a8bf`

Provenance

The following attestation bundles were made for website_scraper-0.1.1-py3-none-any.whl:

Publisher: publish.yml on ml-lubich/website-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: website_scraper-0.1.1-py3-none-any.whl
- Subject digest: 064edb69f33472d123f6c805f4b1fa5416b5d7883f9efe3187068330684f395e
- Sigstore transparency entry: 161341429
- Sigstore integration time: Jan 10, 2025
Source repository:
- Permalink: ml-lubich/website-scraper@5a2ea674170ed3726d821bb6ffad8b5b775ec67c
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/ml-lubich
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5a2ea674170ed3726d821bb6ffad8b5b775ec67c
- Trigger Event: release
