
ScraperLib



📚 Documentation

   _____                                 _      _ _     
  / ____|                               | |    (_) |    
 | (___   ___ _ __ __ _ _ __   ___ _ __ | |     _| |__  
  \___ \ / __| '__/ _` | '_ \ / _ \ '__|| |    | | '_ \ 
  ____) | (__| | | (_| | |_) |  __/ |   | |____| | |_) |
 |_____/ \___|_|  \__,_| .__/ \___|_|   |______|_|_.__/ 
                      | |                               
                      |_|                               

==============================================================                                  
         Starting download of ScraperLib
==============================================================                                  

✨ Features

  • Parallel Downloads: Uses Ray to download multiple files simultaneously, maximizing bandwidth and efficiency.
  • 403 Avoidance: Rotates user-agents, sets referer headers, and uses session management to avoid being blocked.
  • Incremental Mode: Optionally skip files already downloaded.
  • Robust State Management: Tracks completed, failed, and skipped downloads with atomic file operations.
  • Progress Visualization: Uses tqdm for beautiful progress bars.
  • Comprehensive Reporting: Generates JSON reports and visualizations (if matplotlib is installed) of download delays and errors.
  • Colorful Console Output: Uses colorama for clear, color-coded logs.
  • Dual Logging: Terminal shows only relevant events (e.g., [DONE] for successful downloads), while the log file contains all attempts, retries, and errors for full traceability.
  • Highly Configurable CLI: All parameters (parallelism, chunk size, retry/backoff, output dirs, etc.) can be set via command line.
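
The "atomic file operations" behind the state tracking can be illustrated with a stdlib-only sketch: write the JSON to a temporary file, then swap it into place with os.replace(), so a crash mid-write never leaves a corrupt state file. This is an illustration of the technique, not ScraperLib's actual DownloadState code (the library also lists portalocker, which adds cross-process locking on top of this idea).

```python
import json
import os
import tempfile

def save_state_atomic(state, path):
    """Write a state dict as JSON so readers never see a half-written file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp_path, path)  # atomic swap on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # never leave the temp file behind
        raise

# Usage: round-trip a small state dict through a temporary directory.
with tempfile.TemporaryDirectory() as d:
    state_path = os.path.join(d, "download_state.json")
    save_state_atomic({"completed": ["a.csv"], "failed": [], "skipped": []},
                      state_path)
```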

📦 Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/scraper-lib.git
    cd scraper-lib
    
  2. Install dependencies:

    pip install -r requirements.txt
    

    Or, if you use Poetry:

    poetry install
    

    Or, for faster installs (recommended for Linux/Mac):

    pip install uv
    uv pip install -r requirements.txt
    

    Main dependencies:

    • ray
    • requests
    • tqdm
    • colorama
    • beautifulsoup4
    • matplotlib
    • numpy
    • portalocker

🚀 Usage

CLI

python -m scraper_lib.cli --url <URL> --patterns .csv .zip --dir data --max-files 10

Main CLI options:

  • --url: Base URL to scrape for files.
  • --patterns: List of file patterns to match (e.g. .csv .zip).
  • --dir: Download directory.
  • --incremental: Enable incremental download state.
  • --max-files: Limit number of files to download.
  • --max-concurrent: Max parallel downloads.
  • --chunk-size: Chunk size for downloads (e.g. 1gb, 10mb, 8 bytes).
  • --initial-delay: Initial delay between retries (seconds).
  • --max-delay: Maximum delay between retries (seconds).
  • --max-retries: Maximum number of download retries.
  • --state-file: Path for download state file.
  • --log-file: Path for main log file.
  • --report-prefix: Prefix for report files.
  • --headers: Path to JSON file with custom headers.
  • --user-agents: Path to text file with custom user agents (one per line).
  • --disable-logging: Disable all logging for production pipelines.
  • --disable-terminal-logging: Disable terminal logging.
  • --dataset-name: Dataset name for banner.
  • --disable-progress-bar: Disable tqdm progress bar.
  • --output-dir: Directory for report PNGs and JSON.
  • --max-old-logs: Max old log files to keep (default: 25, None disables rotation).
  • --max-old-runs: Max old report/png runs to keep (default: 25, None disables rotation).
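
For intuition about how a human-readable --chunk-size value maps to bytes, here is a plausible parser for the forms shown above (1gb, 10mb, a bare byte count). It is a hypothetical helper, not ScraperLib's actual implementation, which may accept more or fewer unit spellings.

```python
import re

# Unit multipliers; "b"/"bytes" and the bare form all mean raw bytes.
_UNITS = {"": 1, "b": 1, "bytes": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_chunk_size(value):
    """Turn '10mb', '1gb', '8 bytes', or an int into a byte count."""
    if isinstance(value, int):
        return value
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]*)\s*", value.lower())
    if not m or m.group(2) not in _UNITS:
        raise ValueError(f"unrecognized chunk size: {value!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2)])

print(parse_chunk_size("10mb"))  # 10485760
```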

See all options with:

python -m scraper_lib.cli --help
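
The --headers and --user-agents options take files rather than inline values. The snippet below generates example inputs; the header names, user-agent strings, and file names are illustrative only (any JSON object of header names to values, and any one-UA-per-line text file, should do).

```python
import json
from pathlib import Path

# A JSON object of extra request headers for --headers (contents illustrative).
Path("headers.json").write_text(json.dumps({
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}, indent=2))

# One user-agent per line for --user-agents (trimmed example pool).
Path("user_agents.txt").write_text(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)\n"
    "Mozilla/5.0 (X11; Linux x86_64)\n"
)

# Then, for example:
#   python -m scraper_lib.cli --url <URL> --headers headers.json \
#       --user-agents user_agents.txt
```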

Programmatic Usage

from ScraperLib import ScraperLib

scraper = ScraperLib(
    base_url="https://example.com/data",
    file_patterns=[".csv", ".parquet", ".zip"],
    download_dir="data",
    incremental=True,
    max_files=2,
    max_concurrent=16,
    chunk_size="10mb",
    initial_delay=1.0,
    max_delay=60.0,
    max_retries=5,
    dataset_name="MY DATASET",
)
scraper.run()

๐Ÿ›ก๏ธ Anti-Blocking Protocols

  • User-Agent Rotation: Randomizes user-agent strings on each request and after 403 errors.
  • Referer Header: Sets a realistic referer to mimic browser behavior.
  • Session Management: Uses a new HTTP session for each attempt.
  • Exponential Backoff: Waits longer between retries to avoid rate-limiting.
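
The backoff and rotation strategies above can be sketched in a few lines. This is a minimal illustration of the technique using the --initial-delay/--max-delay semantics described earlier, with jitter added to avoid synchronized retry storms; ScraperLib's exact formula and user-agent pool may differ.

```python
import random

# Trimmed example pool; the library can load these from a file via --user-agents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def backoff_delay(attempt, initial_delay=1.0, max_delay=60.0):
    """Exponential backoff capped at max_delay, scaled by random jitter."""
    delay = min(initial_delay * (2 ** attempt), max_delay)
    return delay * random.uniform(0.5, 1.0)

def pick_headers(base_url):
    """A fresh user-agent plus a realistic referer for each attempt."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": base_url,
    }
```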

📊 Reporting

After execution, a summary is printed to the console and a detailed report is saved as a JSON file. If matplotlib is installed, visualizations of download delays are also generated.


🧪 Testing

To run all tests:

pytest tests

๐Ÿ“ Project Structure

.
├── src/
│   ├── __init__.py             # Makes src a package
│   ├── scraper_lib.py          # Main library
│   ├── DownloadState.py        # Download state management
│   └── CustomLogger.py         # Custom logger
├── example.py                  # Example usage (runnable from root)
├── requirements.txt            # Dependencies
├── pyproject.toml              # Project metadata
├── output/
│   ├── pngs/                   # Download delay analysis PNGs
│   └── reports/                # Download reports (JSON)
├── data/                       # Downloaded files
├── logs/                       # Log files
├── state/                      # Download state (auto-generated)
└── tests/                      # Unit tests

๐Ÿค Contributing

Pull requests and suggestions are welcome! Please open an issue or submit a PR.


📄 License

This project is licensed under the MIT License.


📬 Contact

Questions or suggestions? Open an issue or contact rmonteiropereira1@gmail.com.


Happy data hunting with ScraperLib! 🚀
