Rambot: Versatile Web Scraping Framework

Description

Rambot is a versatile and configurable web scraping framework designed to automate data extraction from web pages. It provides an intuitive structure for:

  • Managing different scraping modes.
  • Automating browser navigation.
  • Handling logs and errors.
  • Performing advanced HTTP requests to interact with APIs.

Installation

pip install rambot

ChromeDriver Dependency

Rambot uses ChromeDriver for automated browsing. Install it based on your operating system:

  • Windows: Download ChromeDriver from the official downloads page and add it to your PATH.
  • macOS: Install via Homebrew:
    brew install chromedriver
    
  • Linux: Install via APT:
    sudo apt install chromium-chromedriver
    

Key Features

1. Mode-Based Execution

  • Supports multiple scraping modes via ScraperModeManager.
  • Use the @bind decorator or self.mode_manager.register() to associate functions with specific modes.
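The mode-dispatch idea can be sketched generically. This is an illustrative pattern, not rambot's actual internals; the registry, `bind`, and `run` names below are hypothetical stand-ins:

```python
# Illustrative mode-registry pattern: a decorator records functions
# under a mode name, and a dispatcher looks them up at run time.
_registry = {}

def bind(mode):
    def decorator(func):
        _registry[mode] = func  # associate the function with its mode
        return func
    return decorator

@bind(mode="cities")
def available_cities():
    return ["calgary", "brandon"]

def run(mode):
    if mode not in _registry:
        raise ValueError(f"No function registered for mode {mode!r}")
    return _registry[mode]()
```

With this shape, `run("cities")` dispatches to whichever function was bound to that mode, which is the behaviour the `--mode` argument relies on.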

2. Headless Browser Control

  • Integrates with botasaurus for automation.
  • Advanced proxy management, image blocking, and extension loading.
  • Uses ChromeDriver to navigate and extract content.

3. Optimized Data Handling

  • Saves extracted data in JSON format.
  • Reads and processes existing data files as input.
  • Models structured data using Document.
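As a rough analogy (rambot's actual Document class may differ), a structured record that serializes into the `{"data": [...]}` output shape can be sketched with a dataclass:

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical stand-in for rambot's Document model: a typed record
# that serializes cleanly into the {"data": [...]} output layout.
@dataclass
class Document:
    link: str

docs = [Document(link="https://www.skipthedishes.com/cities/calgary")]
payload = {"data": [asdict(d) for d in docs]}
serialized = json.dumps(payload)
```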

4. Error Management & Logging

  • Centralized error handling with ErrorConfig.
  • Uses loguru for detailed and structured logging.

5. Scraping Throttling & Delays

  • Introduces randomized delays to mimic human behavior (wait()).
  • Ensures compliance with website rate limits.
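The effect of a randomized delay can be sketched in plain Python (rambot's real wait() signature may differ):

```python
import random
import time

# Sleep for a random duration between the given bounds; randomized
# timing looks less mechanical than a fixed interval between requests.
def wait(min_seconds: float = 1.0, max_seconds: float = 3.0) -> float:
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```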

6. Useful Decorators

  • @errors: Structured error handling.
  • @no_print: Suppresses unwanted output.
  • @scrape: Enforces function structure in scraping processes.
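An output-suppressing decorator in the spirit of @no_print can be sketched as follows (an illustrative implementation, not rambot's):

```python
import contextlib
import functools
import io

# Sketch of an output-suppressing decorator: stdout emitted during
# the wrapped call is redirected into a throwaway buffer.
def no_print(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with contextlib.redirect_stdout(io.StringIO()):
            return func(*args, **kwargs)
    return wrapper

@no_print
def noisy():
    print("this never reaches the console")
    return 42
```

Calling `noisy()` returns 42 with nothing printed; `functools.wraps` preserves the wrapped function's name and docstring.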

Basic Usage

1. Create a Scraper

from rambot.scraper import Scraper, bind
from rambot.scraper.models import Document
import typing

class App(Scraper):
    BASE_URL: str = "https://www.skipthedishes.com"

    @bind(mode="cities")
    def available_cities(self) -> typing.List[Document]:
        self.get("https://www.skipthedishes.com/canada-food-delivery")
        elements = self.find_all("h4 div a")
        return [
            Document(link=self.BASE_URL + href)
            for element in elements
            if (href := element.get_attribute("href"))
        ]

2. Run the Scraper

if __name__ == "__main__":
    app = App()
    app.run()  # Executes the mode passed via --mode (e.g. set in launch.json args)

3. Configure launch.json in VSCode

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "cities",
      "type": "python",
      "request": "launch",
      "program": "main.py",
      "justMyCode": false,
      "args": ["--mode", "cities"]
    }
  ]
}

4. Retrieve Results

Extracted data is saved in {mode}.json:

{
  "data": [
    {"link": "https://www.skipthedishes.com/cities/calgary"},
    {"link": "https://www.skipthedishes.com/cities/brandon"},
    {"link": "https://www.skipthedishes.com/cities/welland"}
  ],
  "run_stats": {"status": "success", "message": null}
}
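Loading the results back is a plain JSON read. A literal string stands in for the cities.json file below:

```python
import json

# Parse the {mode}.json layout shown above. In a real run you would
# read it from disk, e.g. json.load(open("cities.json")).
raw = """
{
  "data": [
    {"link": "https://www.skipthedishes.com/cities/calgary"},
    {"link": "https://www.skipthedishes.com/cities/brandon"}
  ],
  "run_stats": {"status": "success", "message": null}
}
"""
payload = json.loads(raw)
links = [doc["link"] for doc in payload["data"]]
```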

HTTP Request Module

Description

This module allows sending HTTP requests with automatic error handling, logging, and retry attempts.

Example Usage

from rambot.requests import requests

response = requests.send(
    method="GET",
    url="http://example.com",
    options={"headers": {"User-Agent": "CustomAgent"}, "timeout": 10},
    max_retry=3,
    retry_wait=2
)
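The max_retry / retry_wait behaviour can be sketched generically (an illustrative retry loop, not rambot's actual code):

```python
import time

# Retry a callable up to max_retry times, sleeping retry_wait seconds
# between attempts; the last error is re-raised if every attempt fails.
def with_retries(fetch, max_retry: int = 3, retry_wait: float = 2.0):
    last_error = None
    for attempt in range(max_retry):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            if attempt < max_retry - 1:
                time.sleep(retry_wait)
    raise last_error
```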

Using Proxies and Custom Headers

response = requests.send(
    method="POST",
    url="http://example.com/api",
    options={
        "proxies": {"http": "http://my-proxy.com:{port}", "https": "http://my-proxy.com:{port}"},
        "json": {"key": "value"},
        "headers": {"Authorization": "Bearer TOKEN"}
    },
    max_retry=5,
    retry_wait=3
)

Usage in a Scraper

from rambot.requests import requests
from rambot.scraper import Scraper, bind
from rambot.models import Document
import typing

class App(Scraper):
    BASE_URL: str = "https://www.skipthedishes.com"
    def open(self, wait=True):
        if self.mode in ["cities"]:
            return  # Prevents browser from opening for this mode
        return super().open(wait)

    @bind(mode="cities")
    def cities(self) -> typing.List[Document]:
        response = requests.send(
            method="GET",
            url="https://www.skipthedishes.com/canada-food-delivery",
            options={"timeout": 15},
            max_retry=5,
            retry_wait=1.25
        )
        elements = response.select("h4 div a")
        return [
            Document(link=self.BASE_URL + href)
            for element in elements
            if (href := element.get("href"))
        ]

Advantages

  • Scraping without a browser: reduces resource consumption for modes that only need HTTP.
  • Retry mechanism: automatic retries reduce transient failures.
  • Fast data extraction: parses HTML responses directly, with no page rendering.

With Rambot, automate and optimize your data extractions efficiently! 🚀
