Rambot: Versatile Web Scraping Framework

Description

Rambot is a versatile and configurable web scraping framework designed to automate data extraction from web pages. It provides an intuitive structure for:

  • Mode Management: Orchestrate complex scraping workflows via a robust mode manager.
  • Browser Automation: High-level control of ChromeDriver via botasaurus.
  • Network Interception: Native integration with mitmproxy to capture and filter background XHR/Fetch requests.
  • Structured Data: Built-in Pydantic-based Document models for reliable data persistence.
  • Advanced HTTP: A standalone request module for high-speed scraping without a browser.

Installation

pip install --upgrade rambot

ChromeDriver Dependency

Rambot requires ChromeDriver. Install it based on your OS:

  • macOS: brew install chromedriver
  • Linux: sudo apt install chromium-chromedriver
  • Windows: Download from the Chrome for Testing page.

Key Features

1. Network Interception & Filtering

Capture real-time network traffic using the integrated mitmproxy backend.

  • Auto-categorization: Requests are typed as fetch, document, script, stylesheet, image, font, or manifest.
  • Dot Notation: Access captured data cleanly, e.g. req.response.status, req.url, req.is_fetch.
  • Zero-Config Export: Directly serializable with json.dump(self.interceptor.requests(), f).

2. Chained Execution Pipeline

Connect different scraping phases (e.g., Search -> Details -> Download) using the @bind decorator. Rambot automatically handles input/output JSON files between modes.

3. Optimized Performance

  • Resource Management: Easily toggle browser usage per mode to save CPU/RAM (see the sketch below).
  • Throttling: Randomized wait() delays to mimic human behavior and avoid detection.
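
A minimal sketch of the per-mode browser toggle (the mode name "fast" is a placeholder; the same pattern appears in the HTTP Request Module example below):

from rambot import Scraper

class LeanScraper(Scraper):
    def open_browser(self):
        # Skip launching the browser entirely for HTTP-only modes
        if self.mode == "fast":
            return
        super().open_browser()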

The @bind Decorator

The @bind decorator supports Automatic Dependency Discovery: it infers connections between modes by inspecting your Python type hints, making manual configuration optional for linear workflows.

Decorator Arguments

| Argument            | Type                  | Description                                                                                                      |
| ------------------- | --------------------- | ---------------------------------------------------------------------------------------------------------------- |
| mode                | str                   | Required. The CLI name (e.g., --mode listing). Also defines the output filename: listing.json.                    |
| input               | Union[str, Callable]  | Optional. Manual override. Can be a filename ("cities.json") or a function that fetches data.                     |
| document_output     | Type[Document]        | Optional. The class used to save results. Automatically detected from return type hints (e.g., -> list[City]).    |
| save                | Callable              | Optional. A custom function to handle data persistence for this specific mode.                                    |
| enable_file_logging | bool                  | If True, creates a dedicated log file for this mode session.                                                      |
| log_directory       | str                   | Directory where mode-specific logs are stored. Defaults to "." (the current directory).                           |
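
For instance, a mode that reads its input from a file and writes its own log could combine several of these arguments (a minimal sketch; the filenames and directory are placeholders):

from rambot import Scraper, bind
from rambot.scraper import Document

class LoggedScraper(Scraper):
    @bind(
        mode="listing",            # run with: --mode listing; writes listing.json
        input="cities.json",       # manual input override
        enable_file_logging=True,  # dedicated log file for this mode session
        log_directory="logs",      # where the log file is stored
    )
    def listing(self, doc: Document) -> list[Document]:
        self.load_page(doc.link)
        return []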

Usage Options

1. Automatic Discovery (The "Magic" Way)

Rambot uses an internal type registry to link modes together. If one mode returns a specific Document subclass and another mode expects it as an argument, Rambot connects them automatically.

from rambot import Scraper, bind
from rambot.scraper import Document

# Define specific subclasses to act as 'type keys'
class City(Document):
    name: str

class BasicScraper(Scraper):
    @bind("cities")
    def get_cities(self) -> list[City]:
        # Registers: City -> 'cities' mode (outputs cities.json)
        return [City(link="...", name="Vancouver")]

    @bind("listing")
    def get_listings(self, city: City):
        # 'listing' needs 'City', finds 'cities' mode, and loads 'cities.json'
        self.load_page(city.link)

2. Manual Override (For Generic Documents)

When multiple modes use the base Document class, you must manually specify the input file to avoid collisions.

    @bind("listing", input="cities.json")
    def listing(self, doc: Document) -> list[Document]:
        # Explicitly read from cities.json even if return hints are generic
        ...

3. Functional Input (Custom Fetching)

Instead of a file, you can pass a function to input to fetch data from a database, API, or external source.

def fetch_from_db(scraper):
    return [{"link": "https://example.com/1"}, {"link": "https://example.com/2"}]

class DatabaseScraper(Scraper):
    @bind("process", input=fetch_from_db)
    def process_data(self, doc: Document):
        self.load_page(doc.link)

Execution Logic & Priority

When a mode is launched via the CLI, Rambot determines the input data using this hierarchy:

  1. CLI Override: --url <link> ignores all other inputs and processes that single URL.
  2. Manual Input: If input is defined in @bind (file or function), it is used next.
  3. Auto-Detection: Rambot searches the Type Registry for a mode producing the class in the method signature (e.g., city: City).
  4. Empty Start: If no input is found, the mode runs once with no positional arguments.
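
For example, launching a mode with an explicit URL bypasses both the input file and the type registry (main.py stands in for your entry script):

python main.py --mode details --url https://example.com/target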

Advanced Usage: Network Interceptor

Capture background API traffic while navigating. This is ideal for sites like SkipTheDishes that load menus via background JSON calls.

from pydantic import Field
from rambot import Scraper, bind
from rambot.scraper import Document

class ProductDoc(Document):
    price: float = Field(0.0)
    api_count: int = Field(0)

class InterceptorScraper(Scraper):
    @bind(mode="details", input="listing", document_output=ProductDoc)
    def details(self, doc) -> ProductDoc:
        self.load_page(doc.link)
        
        # Filter for API/Fetch calls only
        api_calls = self.interceptor.requests(lambda r: r.is_fetch)
        
        # Check for specific API errors
        errors = self.interceptor.requests(lambda r: r.response.is_error)
        
        doc.api_count = len(api_calls)
        return doc

Pro-Tips

  • Filtering: Use lambda r: r.resource_type == "image" to find specific assets.
  • Status Handling: Use req.response.ok to verify capture success.
  • DotDict: All captured requests inherit from dict, allowing json.dump(requests, f) with no extra code.
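
Putting these together, a minimal sketch (inside any mode method) that saves every failed image request to disk; the output filename is a placeholder:

import json

# Keep only image requests whose responses indicate failure
failed_images = self.interceptor.requests(
    lambda r: r.resource_type == "image" and not r.response.ok
)

# Captured requests behave like plain dicts, so they serialize directly
with open("failed_images.json", "w") as f:
    json.dump(failed_images, f, indent=2)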

Configuration

VS Code Launch Setup

Use .vscode/launch.json to debug specific modes and URLs:

{
    "configurations": [
        {
            "name": "Scrape Details",
            "type": "python",
            "request": "launch",
            "program": "main.py",
            "args": [
                "--mode", "details",
                "--url", "https://example.com/target"
            ]
        }
    ]
}

HTTP Request Module

The rambot.http module provides a standalone, high-performance HTTP client for scraping when a full browser is unnecessary. It is built on top of botasaurus and requests, offering automatic retries, advanced header normalization, and seamless integration with browser-like configurations.

Core Features

  • Automated Retries: Built-in exponential backoff and retry logic via max_retry and retry_wait parameters.
  • Browser Impersonation: Easily simulate specific browsers (e.g., Chrome) and operating systems (e.g., Windows).
  • Advanced Header Handling: Automatically normalizes headers to match browser behaviors.
  • Response Parsing: Automatically parses responses into structured ResponseContent or returns raw objects.
  • Error Management: Robust exception handling for network failures, unsupported methods, and invalid configurations.

Usage Example

For rapid data extraction when a full browser session is not required:

from rambot.http import request
from rambot import Scraper, bind
from rambot.scraper import Document

class BasicDoc(Document):
    custom_data: dict

class BasicScraper(Scraper):
    def open_browser(self):
        # Skip opening the browser when running the "basic" mode
        if self.mode == "basic":
            return
        super().open_browser()

    @bind("basic")
    def get_basic(self) -> BasicDoc:
        json_data = request("GET", "https://api.example.com/details", max_retry=3, parsed=True)
        return BasicDoc(link="...", custom_data=json_data)

Function Signature: request()

| Argument   | Type                   | Description                                                      | Default  |
| ---------- | ---------------------- | ---------------------------------------------------------------- | -------- |
| method     | Literal["GET", "POST"] | HTTP verb to use.                                                 | Required |
| url        | HttpUrl                | The target destination URL.                                       | Required |
| options    | Dict[str, Any]         | Dictionary containing headers, proxies, data, or browser settings. | {}       |
| max_retry  | int                    | Maximum number of attempts in case of failure.                    | 5        |
| retry_wait | int                    | Delay in seconds between retry attempts.                          | 5        |
| parsed     | bool                   | If False, returns the raw response instead of a parsed object.    | False    |
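
A sketch of a call that sets custom headers and keeps the raw response (the URL and header values are placeholders):

from rambot.http import request

response = request(
    "GET",
    "https://api.example.com/items",
    options={"headers": {"Accept": "application/json"}},
    max_retry=3,    # up to 3 attempts on failure
    retry_wait=2,   # wait 2 seconds between attempts
    parsed=False,   # return the raw response instead of parsed content
)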

Error Handling

The module raises specific exceptions to help you debug scraping issues:

  • MethodError: Raised if an unsupported HTTP method is provided.
  • RequestFailure: Raised when the request fails due to network issues or status errors.
  • OptionsError: Raised if the provided options dictionary contains invalid types or configurations.
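
A minimal sketch of guarding a call with these exceptions (this assumes the exception classes are importable from rambot.http; adjust the import path to match your installation):

from rambot.http import request, RequestFailure

try:
    data = request("GET", "https://api.example.com/items", max_retry=2, parsed=True)
except RequestFailure as exc:
    # Network failure or error status after all retries were exhausted
    print(f"Request failed: {exc}")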
