Website Scraper
A robust, multiprocessing-enabled web scraper that can be used both as a module and as a command-line tool. Features include rate limiting, bot detection avoidance, and comprehensive logging.
Features
- Multiprocessing support for faster scraping
- Rate limiting and random delays to avoid detection
- Rotating User-Agents and browser fingerprints
- Comprehensive logging system with separate debug and info logs
- Progress tracking with progress bar
- Both module and CLI interfaces
- JSON output format
- Configurable retry mechanism
- XML content detection and proper handling
- SSL verification options
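The random delays and rotating User-Agents listed above can be sketched as follows. This is a minimal illustration, not the package's actual code: the `USER_AGENTS` pool and the helper names `pick_delay`, `build_headers`, and `polite_pause` are hypothetical.

```python
import random
import time

# Hypothetical pool of User-Agent strings; the package's real list is not shown in this README.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_delay(min_delay: float, max_delay: float) -> float:
    """Return a random delay in seconds within [min_delay, max_delay]."""
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return random.uniform(min_delay, max_delay)


def build_headers() -> dict:
    """Return request headers carrying a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_pause(min_delay: float, max_delay: float) -> float:
    """Sleep for a random delay before the next request and return the delay used."""
    delay = pick_delay(min_delay, max_delay)
    time.sleep(delay)
    return delay
```

Drawing the delay from a range rather than using a fixed interval makes the request pattern less regular, which is presumably the intent behind the min/max delay options.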
Installation
From Source
- Clone the repository:
git clone git@github.com:ml-lubich/website-scraper.git
cd website-scraper
- Install the package:
pip install .
From PyPI
pip install website-scraper
Usage
As a Command-Line Tool
The package installs a website-scraper command that can be used directly:
Basic usage:
website-scraper https://example.com
With options (long form):
website-scraper https://example.com \
--min-delay 2 \
--max-delay 5 \
--workers 4 \
--output results.json \
--log-dir logs \
--no-verify-ssl
With options (short form):
website-scraper https://example.com \
-m 2 \
-M 5 \
-w 4 \
-o results.json \
-l logs \
-k
Available options:
- -m, --min-delay: Minimum delay between requests (seconds)
- -M, --max-delay: Maximum delay between requests (seconds)
- -r, --retries: Maximum number of retry attempts
- -w, --workers: Number of worker processes
- -l, --log-dir: Directory to store log files
- -o, --output: Output file path for scraped data (JSON)
- -q, --quiet: Suppress progress bar
- -k, --no-verify-ssl: Disable SSL certificate verification (use with caution)
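For readers who want to see how such a flag set maps onto Python's argparse, here is a minimal sketch that mirrors the options listed above. It is not the package's actual parser: the function name `build_parser`, the value types, and every default except the `logs` directory are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="website-scraper")
    parser.add_argument("url", help="Start URL to scrape")
    parser.add_argument("-m", "--min-delay", type=float, default=1.0,
                        help="Minimum delay between requests (seconds)")
    parser.add_argument("-M", "--max-delay", type=float, default=3.0,
                        help="Maximum delay between requests (seconds)")
    parser.add_argument("-r", "--retries", type=int, default=3,
                        help="Maximum number of retry attempts")
    parser.add_argument("-w", "--workers", type=int, default=4,
                        help="Number of worker processes")
    parser.add_argument("-l", "--log-dir", default="logs",
                        help="Directory to store log files")
    parser.add_argument("-o", "--output", default=None,
                        help="Output file path for scraped data (JSON)")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress progress bar")
    parser.add_argument("-k", "--no-verify-ssl", action="store_true",
                        help="Disable SSL certificate verification")
    return parser


# Parse the short-form example from this README.
args = build_parser().parse_args(
    ["https://example.com", "-m", "2", "-M", "5", "-w", "4", "-o", "results.json", "-l", "logs", "-k"]
)
```

argparse converts `--min-delay` to the attribute `args.min_delay` and `--no-verify-ssl` to `args.no_verify_ssl`.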
Output Handling
The scraper can handle output in two ways:
- Write to a file (when -o or --output is specified)
- Print to stdout (when no output file is specified)
This allows for flexible usage:
# Write to file
website-scraper example.com -o results.json
# Pipe to another command
website-scraper example.com | jq .
# Save output using shell redirection
website-scraper example.com > results.json
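The file-or-stdout behaviour described above can be sketched in a few lines. The helper name `write_results` is hypothetical, and the package's own implementation may differ.

```python
import json
import sys
from typing import Optional


def write_results(payload: dict, output_path: Optional[str] = None) -> None:
    """Write payload as JSON to output_path, or to stdout when no path is given."""
    if output_path:
        with open(output_path, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2, ensure_ascii=False)
    else:
        json.dump(payload, sys.stdout, indent=2, ensure_ascii=False)
        sys.stdout.write("\n")
```

For piping into tools such as jq to work, stdout has to carry nothing but the JSON document, so a tool built this way would normally send its progress bar and log messages to stderr or to files.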
As a Python Package
from website_scraper import WebScraper
# Initialize the scraper
scraper = WebScraper(
base_url="https://example.com",
delay_range=(2, 5),
max_retries=3,
log_dir="logs",
verify_ssl=True # Set to False to disable SSL verification
)
# Start scraping
data, stats = scraper.scrape(show_progress=True)
# Process results
print(f"Scraped {stats['total_pages_scraped']} pages")
print(f"Processed {stats['total_urls_processed']} URLs")
Output Format
The scraper outputs JSON data in the following format:
{
"data": {
"url1": {
"title": "Page Title",
"text": "Page Content",
"meta_description": "Meta Description"
}
// ... more URLs
},
"stats": {
"total_pages_scraped": 10,
"total_urls_processed": 12,
"failed_urls": 2,
"start_url": "https://example.com",
"duration": "5 minutes",
"success_rate": "83.3%"
}
}
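A short sketch of how the output might be consumed downstream. It assumes the structure shown above; the function name `summarize` is hypothetical, and computing the success rate as pages scraped divided by URLs processed is an assumption that happens to match the example figures (10 of 12 is 83.3%).

```python
import json


def summarize(payload: dict) -> dict:
    """Summarize a scraper result: page count, success rate, pages lacking a meta description."""
    data = payload.get("data", {})
    stats = payload.get("stats", {})
    processed = stats.get("total_urls_processed", 0)
    scraped = stats.get("total_pages_scraped", 0)
    # Assumed formula: scraped pages as a percentage of processed URLs.
    rate = (scraped / processed * 100) if processed else 0.0
    missing_meta = [url for url, page in data.items() if not page.get("meta_description")]
    return {
        "pages": len(data),
        "success_rate": round(rate, 1),
        "missing_meta": missing_meta,
    }


# Illustrative payload; in practice this would come from results.json.
sample = json.loads("""
{
  "data": {
    "https://example.com/": {"title": "Home", "text": "Welcome", "meta_description": "Landing page"},
    "https://example.com/about": {"title": "About", "text": "About us", "meta_description": ""}
  },
  "stats": {"total_pages_scraped": 10, "total_urls_processed": 12}
}
""")
summary = summarize(sample)
```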
Development
- Clone the repository:
git clone git@github.com:ml-lubich/website-scraper.git
cd website-scraper
- Create a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install in development mode:
pip install -e .
Logging
Logs are stored in the specified log directory (default: logs/). Two types of log files are generated:
- [timestamp].log: Contains INFO level and above messages
- debug_[timestamp].log: Contains detailed DEBUG level messages
The logs include:
- Request attempts and responses
- Pages being processed
- Successful scrapes
- Failed attempts
- Progress updates
- Error messages
- Content type detection
- Parser selection
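The two-file layout described above can be reproduced with the standard logging module. This is a sketch under assumptions: the function name `setup_logging`, the logger name, the timestamp format, and the message format are all hypothetical choices, not taken from the package.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_logging(log_dir: str = "logs", name: str = "website_scraper_sketch") -> logging.Logger:
    """Create a logger writing INFO+ to [timestamp].log and DEBUG+ to debug_[timestamp].log."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed timestamp format

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let handlers decide what to keep
    # Drop handlers from any earlier call so messages are not duplicated.
    for old in list(logger.handlers):
        old.close()
        logger.removeHandler(old)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    info_handler = logging.FileHandler(directory / f"{timestamp}.log", encoding="utf-8")
    info_handler.setLevel(logging.INFO)

    debug_handler = logging.FileHandler(directory / f"debug_{timestamp}.log", encoding="utf-8")
    debug_handler.setLevel(logging.DEBUG)

    for handler in (info_handler, debug_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Setting the logger itself to DEBUG and filtering at the handler level is what lets one logger feed both files with different verbosity.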
Error Handling
- Automatic retry mechanism for failed requests
- Graceful handling of SSL certificate issues
- Proper handling of XML vs HTML content
- Rate limiting and timeout handling
- Comprehensive error logging
- All errors are logged but don't stop the scraping process
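The "retry, log, and carry on" behaviour listed above might look roughly like the sketch below. The helper `fetch_with_retries` is hypothetical, `max_retries` is treated here as the total number of attempts, and the linear backoff between attempts is an assumption; the README does not say how the package spaces its retries.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("retry_sketch")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() up to max_retries times; log failures and return None instead of raising."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: one bad URL must not stop the crawl
            logger.warning("Attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt < max_retries:
                sleep(backoff * attempt)  # linear backoff (assumed)
    logger.error("Giving up after %d attempts", max_retries)
    return None
```

Returning None rather than re-raising lets the caller record the URL as failed and move on to the next one, which matches the stated goal that errors are logged without stopping the scrape.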
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.