# ScraperLib
On startup, ScraperLib prints an ASCII-art banner followed by a header like:

```
==============================================================
 Starting download of ScraperLib
==============================================================
```
## ✨ Features
- Parallel Downloads: Uses Ray to download multiple files simultaneously, maximizing bandwidth and efficiency.
- 403 Avoidance: Rotates user-agents, sets referer headers, and uses session management to avoid being blocked.
- Incremental Mode: Optionally skip files already downloaded.
- Robust State Management: Tracks completed, failed, and skipped downloads with atomic file operations.
- Progress Visualization: Uses tqdm for beautiful progress bars.
- Comprehensive Reporting: Generates JSON reports and visualizations (if matplotlib is installed) of download delays and errors.
- Colorful Console Output: Uses colorama for clear, color-coded logs.
- Dual Logging: Terminal shows only relevant events (e.g., `[DONE]` for successful downloads), while the log file records all attempts, retries, and errors for full traceability.
- Highly Configurable CLI: All parameters (parallelism, chunk size, retry/backoff, output dirs, etc.) can be set via the command line.
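The "atomic file operations" behind the state management can be illustrated with a write-then-rename sketch (the helper name here is hypothetical, not ScraperLib's actual API):

```python
import json
import os
import tempfile

def save_state_atomically(state: dict, path: str) -> None:
    """Write the download state so a crash never leaves a half-written file.

    The state is written to a temporary file in the same directory, then
    swapped into place with os.replace(), which is atomic on POSIX and Windows.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp_path, path)  # readers see the old or new file, never a partial one
    except BaseException:
        os.remove(tmp_path)
        raise

state = {"completed": ["a.csv"], "failed": [], "skipped": ["b.zip"]}
save_state_atomically(state, "download_state.json")
```

Because the rename is atomic, a concurrent reader (or a crashed run restarted in incremental mode) always sees a complete, valid JSON state file.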
## 📦 Installation
1. Clone the repository:

   ```shell
   git clone https://github.com/yourusername/scraper-lib.git
   cd scraper-lib
   ```

2. Install dependencies:

   ```shell
   pip install -r requirements.txt
   ```

   Or, if you use Poetry:

   ```shell
   poetry install
   ```

   Or, for faster installs (recommended for Linux/Mac):

   ```shell
   pip install uv
   uv pip install -r requirements.txt
   ```
Main dependencies: `ray`, `requests`, `tqdm`, `colorama`, `beautifulsoup4`, `matplotlib`, `numpy`, `portalocker`.
## 🚀 Usage
### CLI
```shell
python -m scraper_lib.cli --url <URL> --patterns .csv .zip --dir data --max-files 10
```
Main CLI options:

- `--url`: Base URL to scrape for files.
- `--patterns`: List of file patterns to match (e.g. `.csv .zip`).
- `--dir`: Download directory.
- `--incremental`: Enable incremental download state.
- `--max-files`: Limit number of files to download.
- `--max-concurrent`: Max parallel downloads.
- `--chunk-size`: Chunk size for downloads (e.g. `1gb`, `10mb`, `8` bytes).
- `--initial-delay`: Initial delay between retries (seconds).
- `--max-delay`: Maximum delay between retries (seconds).
- `--max-retries`: Maximum number of download retries.
- `--state-file`: Path for download state file.
- `--log-file`: Path for main log file.
- `--report-prefix`: Prefix for report files.
- `--headers`: Path to JSON file with custom headers.
- `--user-agents`: Path to text file with custom user agents (one per line).
- `--disable-logging`: Disable all logging for production pipelines.
- `--disable-terminal-logging`: Disable terminal logging.
- `--dataset-name`: Dataset name for banner.
- `--disable-progress-bar`: Disable tqdm progress bar.
- `--output-dir`: Directory for report PNGs and JSON.
- `--max-old-logs`: Max old log files to keep (default: 25; `None` disables rotation).
- `--max-old-runs`: Max old report/PNG runs to keep (default: 25; `None` disables rotation).
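For reference, the human-readable `--chunk-size` values (`1gb`, `10mb`, or a plain byte count) could be handled by a parser like the following sketch (a hypothetical helper, not ScraperLib's actual implementation):

```python
import re

_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_chunk_size(value) -> int:
    """Convert '10mb', '1gb', '8', or a plain int byte count into bytes."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)\s*([kmg]?b)?", value.strip().lower())
    if not match:
        raise ValueError(f"Unrecognized chunk size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or "b"]

print(parse_chunk_size("10mb"))  # 10485760
print(parse_chunk_size("1gb"))   # 1073741824
print(parse_chunk_size(8))       # 8
```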
See all options with:
```shell
python -m scraper_lib --help
```
### Programmatic Usage
```python
from ScraperLib import ScraperLib

scraper = ScraperLib(
    base_url="https://example.com/data",
    file_patterns=[".csv", ".parquet", ".zip"],
    download_dir="data",
    incremental=True,
    max_files=2,
    max_concurrent=16,
    chunk_size="10mb",
    initial_delay=1.0,
    max_delay=60.0,
    max_retries=5,
    dataset_name="MY DATASET",
)
scraper.run()
```
## 🛡️ Anti-Blocking Protocols
- User-Agent Rotation: Randomizes user-agent strings on each request and after 403 errors.
- Referer Header: Sets a realistic referer to mimic browser behavior.
- Session Management: Uses a new HTTP session for each attempt.
- Exponential Backoff: Waits longer between retries to avoid rate-limiting.
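Put together, these protocols amount to roughly the following sketch (the user-agent pool, helper names, and defaults here are illustrative assumptions, not ScraperLib's internals):

```python
import random

USER_AGENTS = [  # illustrative pool; real deployments would use many more
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/126.0",
]

def backoff_delay(attempt: int, initial: float = 1.0, maximum: float = 60.0) -> float:
    """Exponential backoff with jitter, capped at `maximum` seconds."""
    delay = min(initial * (2 ** attempt), maximum)
    return delay * random.uniform(0.5, 1.0)  # jitter avoids synchronized retries

def fresh_headers() -> dict:
    """New randomized headers for each attempt, mimicking a browser."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.google.com/",
    }

# Each retry gets a fresh user-agent and a longer (jittered, capped) wait.
for attempt in range(3):
    headers = fresh_headers()
    delay = backoff_delay(attempt)
    print(f"attempt {attempt}: wait up to {delay:.1f}s")
```

Each real attempt would additionally open a new HTTP session with these headers, so a 403 on one attempt never taints the next.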
## 📊 Reporting
After execution, a summary is printed to the console and a detailed report is saved as a JSON file. If matplotlib is installed, visualizations of download delays are also generated.
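Post-processing such a report might look like the following sketch (the field names `completed`, `failed`, and `delays` are assumptions for illustration; ScraperLib's actual JSON schema may differ):

```python
import json
import statistics

# Hypothetical report shape, written here only so the example is self-contained.
report = {
    "completed": 48,
    "failed": 2,
    "delays": [0.4, 1.1, 0.8, 2.5, 0.6],
}
with open("download_report.json", "w") as f:
    json.dump(report, f)

# Load the report and derive summary statistics from it.
with open("download_report.json") as f:
    data = json.load(f)

success_rate = data["completed"] / (data["completed"] + data["failed"])
print(f"success rate: {success_rate:.1%}")                          # 96.0%
print(f"median delay: {statistics.median(data['delays']):.2f}s")    # 0.80s
```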
## 🧪 Testing
To run all tests:
```shell
pytest tests
```
## 📁 Project Structure
```
.
├── src/
│   ├── __init__.py        # Makes src a package
│   ├── scraper_lib.py     # Main library
│   ├── DownloadState.py   # Download state management
│   └── CustomLogger.py    # Custom logger
├── example.py             # Example usage (runnable from root)
├── requirements.txt       # Dependencies
├── pyproject.toml         # Project metadata
├── output/
│   ├── pngs/              # Download delay analysis PNGs
│   └── reports/           # Download reports (JSON)
├── data/                  # Downloaded files
├── logs/                  # Log files
├── state/                 # Download state (auto-generated)
└── tests/                 # Unit tests
```
## 🤝 Contributing
Pull requests and suggestions are welcome! Please open an issue or submit a PR.
## 📄 License
This project is licensed under the MIT License.
## 📬 Contact
Questions or suggestions? Open an issue or contact rmonteiropereira1@gmail.com.
Happy data hunting with ScraperLib! 🚀
## File details

Details for the file `scraper_lib_rmp-0.2.295.tar.gz`.

### File metadata

- Download URL: scraper_lib_rmp-0.2.295.tar.gz
- Upload date:
- Size: 24.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4687e8b2a83154f52d2a23728eb6b5dc9fa164ce89edfc4c752dbecda970e882` |
| MD5 | `56eb691aaf7e02d8f27c0c8450de5bdd` |
| BLAKE2b-256 | `e6fa86e9e1aeb5195f1b1d1c6c8f4c0f8034341fcaa457cc529d275c969dc22a` |
### Provenance

The following attestation bundle was made for `scraper_lib_rmp-0.2.295.tar.gz`:

Publisher: `ci-cd.yml` on `rmonteiro-pereira/Scraper-Lib`

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scraper_lib_rmp-0.2.295.tar.gz
- Subject digest: 4687e8b2a83154f52d2a23728eb6b5dc9fa164ce89edfc4c752dbecda970e882
- Sigstore transparency entry: 202744139
- Sigstore integration time:
- Permalink: rmonteiro-pereira/Scraper-Lib@f4ddc0735ae83ed1a700a7b40ab35925407e5bbd
- Branch / Tag: refs/heads/master
- Owner: https://github.com/rmonteiro-pereira
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci-cd.yml@f4ddc0735ae83ed1a700a7b40ab35925407e5bbd
- Trigger Event: push
## File details

Details for the file `scraper_lib_rmp-0.2.295-py3-none-any.whl`.

### File metadata

- Download URL: scraper_lib_rmp-0.2.295-py3-none-any.whl
- Upload date:
- Size: 20.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `5474b751a33692cf0d20d1e50e29eaef84305dacc1ed196491197f32afa8309e` |
| MD5 | `3dc67195149628a0ace3d56ac9342064` |
| BLAKE2b-256 | `1a7c87ddc4182f42873491fc28be92c2a54e349528638ddde0f330844a7fe7a1` |
### Provenance

The following attestation bundle was made for `scraper_lib_rmp-0.2.295-py3-none-any.whl`:

Publisher: `ci-cd.yml` on `rmonteiro-pereira/Scraper-Lib`

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scraper_lib_rmp-0.2.295-py3-none-any.whl
- Subject digest: 5474b751a33692cf0d20d1e50e29eaef84305dacc1ed196491197f32afa8309e
- Sigstore transparency entry: 202744143
- Sigstore integration time:
- Permalink: rmonteiro-pereira/Scraper-Lib@f4ddc0735ae83ed1a700a7b40ab35925407e5bbd
- Branch / Tag: refs/heads/master
- Owner: https://github.com/rmonteiro-pereira
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci-cd.yml@f4ddc0735ae83ed1a700a7b40ab35925407e5bbd
- Trigger Event: push