Rambot: Versatile Web Scraping Framework

Description

Rambot is a versatile and configurable web scraping framework designed to automate data extraction from web pages. It provides an intuitive structure for:

  • Managing different scraping modes.
  • Automating browser navigation.
  • Handling logs and errors.
  • Performing advanced HTTP requests to interact with APIs.

Installation

pip install rambot

ChromeDriver Dependency

Rambot uses ChromeDriver for automated browsing. Install it based on your operating system:

  • Windows: Download ChromeDriver from the official downloads page and add it to your PATH.
  • macOS: Install via Homebrew:
    brew install chromedriver
    
  • Linux: Install via APT:
    sudo apt install chromium-chromedriver
    

Key Features

1. Mode-Based Execution

  • Supports multiple scraping modes via ScraperModeManager.
  • Use the @bind decorator or self.mode_manager.register() to associate functions with specific modes.
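The dispatch pattern can be pictured with a plain-Python sketch. This is an illustration of the idea only; the class, the register/run methods, and the bind decorator below are simplified stand-ins, not rambot's actual ScraperModeManager API:

```python
class ScraperModeManager:
    """Minimal illustration of mode-based dispatch (not the real rambot class)."""

    def __init__(self):
        self._modes = {}

    def register(self, name, func):
        self._modes[name] = func

    def run(self, name):
        # Look up the function registered for this mode and execute it
        return self._modes[name]()


manager = ScraperModeManager()


def bind(mode):
    """Decorator that associates a function with a scraping mode."""
    def wrapper(func):
        manager.register(mode, func)
        return func
    return wrapper


@bind(mode="cities")
def available_cities():
    return ["calgary", "brandon"]


print(manager.run("cities"))  # runs the function registered for "cities"
```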

2. Headless Browser Control

  • Integrates with botasaurus for automation.
  • Advanced proxy management, image blocking, and extension loading.
  • Uses ChromeDriver to navigate and extract content.

3. Optimized Data Handling

  • Saves extracted data in JSON format.
  • Reads and processes existing data files as input.
  • Models structured data using Document.

4. Error Management & Logging

  • Centralized error handling with ErrorConfig.
  • Uses loguru for detailed and structured logging.

5. Scraping Throttling & Delays

  • Introduces randomized delays to mimic human behavior (wait()).
  • Ensures compliance with website rate limits.
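A randomized delay of this kind can be sketched with the standard library. The min_s/max_s parameters below are assumed names for illustration, not rambot's actual wait() signature:

```python
import random
import time


def wait(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Sleep for a random duration between min_s and max_s seconds."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay


d = wait(0.01, 0.02)  # short bounds so the example finishes quickly
```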

6. Useful Decorators

  • @errors: Structured error handling.
  • @no_print: Suppresses unwanted output.
  • @scrape: Enforces function structure in scraping processes.
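To show the pattern behind a decorator like @errors, here is a generic sketch that catches, logs, and swallows exceptions. It is an illustration of the decorator style, not rambot's implementation:

```python
import functools
import logging


def errors(func):
    """Catch exceptions, log them, and return None instead of raising (illustrative)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            logging.error("%s failed: %s", func.__name__, exc)
            return None
    return wrapper


@errors
def parse_price(text):
    return float(text)


print(parse_price("19.99"))          # 19.99
print(parse_price("not a number"))   # None, with the error logged
```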

Basic Usage

1. Create a Scraper

from rambot.scraper import Scraper, bind
from rambot.scraper.models import Document
import typing

class App(Scraper):
    BASE_URL: str = "https://www.skipthedishes.com"

    @bind(mode="cities")
    def available_cities(self) -> typing.List[Document]:
        self.get("https://www.skipthedishes.com/canada-food-delivery")
        elements = self.find_all("h4 div a")
        return [
            Document(link=self.BASE_URL + href)
            for element in elements
            if (href := element.get_attribute("href"))
        ]

2. Run the Scraper

if __name__ == "__main__":
    app = App()
    app.run()  # Executes the mode passed via the --mode argument (set in launch.json)

3. Configure launch.json in VSCode

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "cities",
      "type": "python",
      "request": "launch",
      "program": "main.py",
      "justMyCode": false,
      "args": ["--mode", "cities"]
    }
  ]
}
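Outside VSCode, the same --mode argument can be passed on the command line. Rambot parses it internally, but a minimal equivalent with argparse looks like this:

```python
import argparse

parser = argparse.ArgumentParser(description="Run a scraper in a given mode")
parser.add_argument("--mode", required=True, help="Registered scraping mode to execute")

# Simulate: python main.py --mode cities
args = parser.parse_args(["--mode", "cities"])
print(args.mode)  # cities
```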

4. Retrieve Results

Extracted data is saved in {mode}.json:

{
  "data": [
    {"link": "https://www.skipthedishes.com/cities/calgary"},
    {"link": "https://www.skipthedishes.com/cities/brandon"},
    {"link": "https://www.skipthedishes.com/cities/welland"}
  ],
  "run_stats": {"status": "success", "message": null}
}
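Downstream modes can read such a file back with the standard json module; the data/run_stats layout below matches the sample above, written inline here so the example is self-contained:

```python
import json

# A result document in the {mode}.json format shown above
raw = '''{
  "data": [{"link": "https://www.skipthedishes.com/cities/calgary"}],
  "run_stats": {"status": "success", "message": null}
}'''

result = json.loads(raw)
links = [doc["link"] for doc in result["data"]]
print(links)                           # ['https://www.skipthedishes.com/cities/calgary']
print(result["run_stats"]["status"])   # success
```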

HTTP Request Module

Description

This module sends HTTP requests with automatic error handling, logging, and configurable retries.

Example Usage

from rambot.requests import requests

response = requests.send(
    method="GET",
    url="http://example.com",
    options={"headers": {"User-Agent": "CustomAgent"}, "timeout": 10},
    max_retry=3,
    retry_wait=2
)

Using Proxies and Custom Headers

response = requests.send(
    method="POST",
    url="http://example.com/api",
    options={
        "proxies": {"http": "http://my-proxy.com:{port}", "https": "http://my-proxy.com:{port}"},
        "json": {"key": "value"},
        "headers": {"Authorization": "Bearer TOKEN"}
    },
    max_retry=5,
    retry_wait=3
)
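The retry behaviour can be sketched without any network access. Here request_with_retry is a simplified stand-in for the module's logic, and flaky_call simulates an endpoint that fails twice before succeeding:

```python
import time


def request_with_retry(call, max_retry=3, retry_wait=0.01):
    """Retry `call` up to max_retry times, sleeping retry_wait seconds between attempts."""
    last_error = None
    for attempt in range(max_retry):
        try:
            return call()
        except Exception as exc:
            last_error = exc
            time.sleep(retry_wait)
    raise last_error


attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary failure")
    return "200 OK"


result = request_with_retry(flaky_call)
print(result)  # 200 OK, after two retries
```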

Usage in a Scraper

from rambot.requests import requests
from rambot.scraper import Scraper, bind
from rambot.models import Document
import typing

class App(Scraper):
    BASE_URL: str = "https://www.skipthedishes.com"

    def open(self, wait=True):
        if self.mode in ["cities"]:
            return  # Prevents browser from opening for this mode
        return super().open(wait)

    @bind(mode="cities")
    def cities(self) -> typing.List[Document]:
        response = requests.send(
            method="GET",
            url="https://www.skipthedishes.com/canada-food-delivery",
            options={"timeout": 15},
            max_retry=5,
            retry_wait=1.25
        )
        elements = response.select("h4 div a")
        return [
            Document(link=self.BASE_URL + href)
            for element in elements
            if (href := element.get("href"))
        ]

Advantages

  • Scraping without a browser: Reduces resource consumption.
  • Retry mechanism: Minimizes failures.
  • Fast data extraction: Parses HTML directly with requests.

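Browserless parsing can even be done with the standard library alone. The sketch below extracts href attributes from anchor tags, as a simplified stand-in for the h4 div a selection shown earlier:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


page = '<h4><div><a href="/cities/calgary">Calgary</a></div></h4>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/cities/calgary']
```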
With Rambot, automate and optimize your data extractions efficiently! 🚀
