# ScraperLib
On startup, ScraperLib prints an ASCII-art banner followed by a header like:

```
==============================================================
 Starting download of ScraperLib
==============================================================
```
## ✨ Features
- Parallel Downloads: Uses Ray to download multiple files simultaneously, maximizing bandwidth and efficiency.
- 403 Avoidance: Rotates user-agents, sets referer headers, and uses session management to avoid being blocked.
- Incremental Mode: Optionally skip files already downloaded.
- Robust State Management: Tracks completed, failed, and skipped downloads with atomic file operations.
- Progress Visualization: Uses tqdm for beautiful progress bars.
- Comprehensive Reporting: Generates JSON reports and visualizations (if matplotlib is installed) of download delays and errors.
- Colorful Console Output: Uses colorama for clear, color-coded logs.
- Dual Logging: Terminal shows only relevant events (e.g., `[DONE]` for successful downloads), while the log file records all attempts, retries, and errors for full traceability.
- Highly Configurable CLI: All parameters (parallelism, chunk size, retry/backoff, output dirs, etc.) can be set via the command line.
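The "atomic file operations" behind the state management can be illustrated with a write-then-rename sketch (the helper name here is hypothetical, not ScraperLib's actual API):

```python
import json
import os
import tempfile

def save_state_atomically(state: dict, path: str) -> None:
    """Write the download state so a crash never leaves a half-written file.

    The state is written to a temporary file in the same directory, then
    swapped into place with os.replace(), which is atomic on POSIX and Windows.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp_path, path)  # readers see the old or new file, never a partial one
    except BaseException:
        os.remove(tmp_path)
        raise

state = {"completed": ["a.csv"], "failed": [], "skipped": ["b.zip"]}
save_state_atomically(state, "download_state.json")
```

Because the rename is atomic, a concurrent reader (or a crashed run restarted in incremental mode) always sees a complete, valid JSON state file.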
## 📦 Installation
1. Clone the repository:

   ```shell
   git clone https://github.com/yourusername/scraper-lib.git
   cd scraper-lib
   ```

2. Install dependencies:

   ```shell
   pip install -r requirements.txt
   ```

   Or, if you use Poetry:

   ```shell
   poetry install
   ```

   Or, for faster installs (recommended for Linux/Mac):

   ```shell
   pip install uv
   uv pip install -r requirements.txt
   ```
Main dependencies: `ray`, `requests`, `tqdm`, `colorama`, `beautifulsoup4`, `matplotlib`, `numpy`, `portalocker`.
## 🚀 Usage
### CLI
```shell
python -m scraper_lib.cli --url <URL> --patterns .csv .zip --dir data --max-files 10
```
Main CLI options:

- `--url`: Base URL to scrape for files.
- `--patterns`: List of file patterns to match (e.g. `.csv .zip`).
- `--dir`: Download directory.
- `--incremental`: Enable incremental download state.
- `--max-files`: Limit number of files to download.
- `--max-concurrent`: Max parallel downloads.
- `--chunk-size`: Chunk size for downloads (e.g. `1gb`, `10mb`, `8` bytes).
- `--initial-delay`: Initial delay between retries (seconds).
- `--max-delay`: Maximum delay between retries (seconds).
- `--max-retries`: Maximum number of download retries.
- `--state-file`: Path for download state file.
- `--log-file`: Path for main log file.
- `--report-prefix`: Prefix for report files.
- `--headers`: Path to JSON file with custom headers.
- `--user-agents`: Path to text file with custom user agents (one per line).
- `--disable-logging`: Disable all logging for production pipelines.
- `--disable-terminal-logging`: Disable terminal logging.
- `--dataset-name`: Dataset name for banner.
- `--disable-progress-bar`: Disable tqdm progress bar.
- `--output-dir`: Directory for report PNGs and JSON.
- `--max-old-logs`: Max old log files to keep (default: 25; `None` disables rotation).
- `--max-old-runs`: Max old report/PNG runs to keep (default: 25; `None` disables rotation).
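For reference, the human-readable `--chunk-size` values (`1gb`, `10mb`, or a plain byte count) could be handled by a parser like the following sketch (a hypothetical helper, not ScraperLib's actual implementation):

```python
import re

_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_chunk_size(value) -> int:
    """Convert '10mb', '1gb', '8', or a plain int byte count into bytes."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)\s*([kmg]?b)?", value.strip().lower())
    if not match:
        raise ValueError(f"Unrecognized chunk size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or "b"]

print(parse_chunk_size("10mb"))  # 10485760
print(parse_chunk_size("1gb"))   # 1073741824
print(parse_chunk_size(8))       # 8
```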
See all options with:
```shell
python -m scraper_lib --help
```
### Programmatic Usage
```python
from ScraperLib import ScraperLib

scraper = ScraperLib(
    base_url="https://example.com/data",
    file_patterns=[".csv", ".parquet", ".zip"],
    download_dir="data",
    incremental=True,
    max_files=2,
    max_concurrent=16,
    chunk_size="10mb",
    initial_delay=1.0,
    max_delay=60.0,
    max_retries=5,
    dataset_name="MY DATASET",
)
scraper.run()
```
## 🛡️ Anti-Blocking Protocols
- User-Agent Rotation: Randomizes user-agent strings on each request and after 403 errors.
- Referer Header: Sets a realistic referer to mimic browser behavior.
- Session Management: Uses a new HTTP session for each attempt.
- Exponential Backoff: Waits longer between retries to avoid rate-limiting.
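Put together, these protocols amount to roughly the following sketch (the user-agent pool, helper names, and defaults here are illustrative assumptions, not ScraperLib's internals):

```python
import random

USER_AGENTS = [  # illustrative pool; real deployments would use many more
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/126.0",
]

def backoff_delay(attempt: int, initial: float = 1.0, maximum: float = 60.0) -> float:
    """Exponential backoff with jitter, capped at `maximum` seconds."""
    delay = min(initial * (2 ** attempt), maximum)
    return delay * random.uniform(0.5, 1.0)  # jitter avoids synchronized retries

def fresh_headers() -> dict:
    """New randomized headers for each attempt, mimicking a browser."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.google.com/",
    }

# Each retry gets a fresh user-agent and a longer (jittered, capped) wait.
for attempt in range(3):
    headers = fresh_headers()
    delay = backoff_delay(attempt)
    print(f"attempt {attempt}: wait up to {delay:.1f}s")
```

Each real attempt would additionally open a new HTTP session with these headers, so a 403 on one attempt never taints the next.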
## 📊 Reporting
After execution, a summary is printed to the console and a detailed report is saved as a JSON file. If matplotlib is installed, visualizations of download delays are also generated.
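Post-processing such a report might look like the following sketch (the field names `completed`, `failed`, and `delays` are assumptions for illustration; ScraperLib's actual JSON schema may differ):

```python
import json
import statistics

# Hypothetical report shape, written here only so the example is self-contained.
report = {
    "completed": 48,
    "failed": 2,
    "delays": [0.4, 1.1, 0.8, 2.5, 0.6],
}
with open("download_report.json", "w") as f:
    json.dump(report, f)

# Load the report and derive summary statistics from it.
with open("download_report.json") as f:
    data = json.load(f)

success_rate = data["completed"] / (data["completed"] + data["failed"])
print(f"success rate: {success_rate:.1%}")                          # 96.0%
print(f"median delay: {statistics.median(data['delays']):.2f}s")    # 0.80s
```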
## 🧪 Testing
To run all tests:
```shell
pytest tests
```
## 📁 Project Structure
```
.
├── src/
│   ├── __init__.py        # Makes src a package
│   ├── scraper_lib.py     # Main library
│   ├── DownloadState.py   # Download state management
│   └── CustomLogger.py    # Custom logger
├── example.py             # Example usage (runnable from root)
├── requirements.txt       # Dependencies
├── pyproject.toml         # Project metadata
├── output/
│   ├── pngs/              # Download delay analysis PNGs
│   └── reports/           # Download reports (JSON)
├── data/                  # Downloaded files
├── logs/                  # Log files
├── state/                 # Download state (auto-generated)
└── tests/                 # Unit tests
```
## 🤝 Contributing
Pull requests and suggestions are welcome! Please open an issue or submit a PR.
## 📄 License
This project is licensed under the MIT License.
## 📬 Contact
Questions or suggestions? Open an issue or contact rmonteiropereira1@gmail.com.
Happy data hunting with ScraperLib! 🚀
## File details

Details for the file `scraper_lib_rmp-0.2.295.tar.gz`.

### File metadata

- Download URL: scraper_lib_rmp-0.2.295.tar.gz
- Upload date:
- Size: 24.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4687e8b2a83154f52d2a23728eb6b5dc9fa164ce89edfc4c752dbecda970e882` |
| MD5 | `56eb691aaf7e02d8f27c0c8450de5bdd` |
| BLAKE2b-256 | `e6fa86e9e1aeb5195f1b1d1c6c8f4c0f8034341fcaa457cc529d275c969dc22a` |
### Provenance

The following attestation bundle was made for `scraper_lib_rmp-0.2.295.tar.gz`:

Publisher: `ci-cd.yml` on `rmonteiro-pereira/Scraper-Lib`

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scraper_lib_rmp-0.2.295.tar.gz
- Subject digest: 4687e8b2a83154f52d2a23728eb6b5dc9fa164ce89edfc4c752dbecda970e882
- Sigstore transparency entry: 202744139
- Sigstore integration time:
- Permalink: rmonteiro-pereira/Scraper-Lib@f4ddc0735ae83ed1a700a7b40ab35925407e5bbd
- Branch / Tag: refs/heads/master
- Owner: https://github.com/rmonteiro-pereira
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci-cd.yml@f4ddc0735ae83ed1a700a7b40ab35925407e5bbd
- Trigger Event: push
## File details

Details for the file `scraper_lib_rmp-0.2.295-py3-none-any.whl`.

### File metadata

- Download URL: scraper_lib_rmp-0.2.295-py3-none-any.whl
- Upload date:
- Size: 20.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `5474b751a33692cf0d20d1e50e29eaef84305dacc1ed196491197f32afa8309e` |
| MD5 | `3dc67195149628a0ace3d56ac9342064` |
| BLAKE2b-256 | `1a7c87ddc4182f42873491fc28be92c2a54e349528638ddde0f330844a7fe7a1` |
### Provenance

The following attestation bundle was made for `scraper_lib_rmp-0.2.295-py3-none-any.whl`:

Publisher: `ci-cd.yml` on `rmonteiro-pereira/Scraper-Lib`

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scraper_lib_rmp-0.2.295-py3-none-any.whl
- Subject digest: 5474b751a33692cf0d20d1e50e29eaef84305dacc1ed196491197f32afa8309e
- Sigstore transparency entry: 202744143
- Sigstore integration time:
- Permalink: rmonteiro-pereira/Scraper-Lib@f4ddc0735ae83ed1a700a7b40ab35925407e5bbd
- Branch / Tag: refs/heads/master
- Owner: https://github.com/rmonteiro-pereira
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci-cd.yml@f4ddc0735ae83ed1a700a7b40ab35925407e5bbd
- Trigger Event: push