Rambot: Versatile Web Scraping Framework
Description
Rambot is a versatile and configurable web scraping framework designed to automate data extraction from web pages. It provides an intuitive structure for:
- Mode Management: Orchestrate complex scraping workflows via a robust mode manager.
- Browser Automation: High-level control of ChromeDriver via
botasaurus. - Network Interception: Native integration with
mitmproxyto capture and filter background XHR/Fetch requests. - Structured Data: Built-in Pydantic-based
Documentmodels for reliable data persistence. - Advanced HTTP: A standalone request module for high-speed scraping without a browser.
Installation
```bash
pip install --upgrade rambot
```
ChromeDriver Dependency
Rambot requires ChromeDriver. Install it based on your OS:
- macOS: `brew install chromedriver`
- Linux: `sudo apt install chromium-chromedriver`
- Windows: Download from the Chrome for Testing page.
Key Features
1. Network Interception & Filtering
Capture real-time network traffic using the integrated mitmproxy backend.
- Auto-categorization: Requests are typed as `fetch`, `document`, `script`, `stylesheet`, `image`, `font`, or `manifest`.
- Dot Notation: Access data cleanly: `req.response.status`, `req.url`, `req.is_fetch`.
- Zero-Config Export: Directly serializable with `json.dump(self.interceptor.requests(), f)` (see the sketch below).
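As a minimal sketch of the zero-config export, assuming a mode that receives documents from a `cities.json` file (the mode name and output filename here are purely illustrative):

```python
import json

from rambot import Scraper, bind
from rambot.scraper import Document

class CaptureScraper(Scraper):
    @bind("capture", input="cities.json")
    def capture(self, doc: Document):
        self.load_page(doc.link)

        # Captured requests serialize directly, so they can be written
        # to disk without any conversion step.
        with open("captured_requests.json", "w") as f:
            json.dump(self.interceptor.requests(), f, indent=2)
```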
2. Chained Execution Pipeline
Connect different scraping phases (e.g., Search -> Details -> Download) using the @bind decorator. Rambot automatically handles input/output JSON files between modes.
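For example, assuming a `main.py` entry point like the one used in the VS Code configuration further below, a two-phase pipeline can be run as:

```bash
# The first mode writes cities.json; the second reads it automatically.
python main.py --mode cities
python main.py --mode listing
```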
3. Optimized Performance
- Resource Management: Easily toggle browser usage per mode to save CPU/RAM.
- Throttling: Randomized `wait()` delays to mimic human behavior and avoid detection (see the sketch below).
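A minimal sketch combining both ideas, assuming `wait()` is exposed on the `Scraper` instance (its exact signature may differ) and reusing the browser-toggle pattern shown in the HTTP example later in this document:

```python
from rambot import Scraper, bind
from rambot.scraper import Document

class LeanScraper(Scraper):
    def open_browser(self):
        # Skip ChromeDriver entirely for a mode that does not need a browser
        # ("export" is an illustrative mode name).
        if self.mode == "export":
            return
        super().open_browser()

    @bind("listing", input="cities.json")
    def listing(self, doc: Document):
        self.load_page(doc.link)
        self.wait()  # randomized, human-like pause before the next page
```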
The @bind Decorator
The @bind decorator supports Automatic Dependency Discovery. It "spots" connections between modes by inspecting your Python type hints, making manual configuration optional for linear workflows.
Decorator Arguments
| Argument | Type | Description |
|---|---|---|
| `mode` | `str` | Required. The CLI name (e.g., `--mode listing`). This also defines the output filename: `listing.json`. |
| `input` | `str \| Callable` | Optional. Manual override. Can be a filename (`"cities.json"`) or a function that fetches data. |
| `document_output` | `Type[Document]` | Optional. The class used to save results. Automatically detected from return type hints (e.g., `-> list[City]`). |
| `save` | `Callable` | Optional. A custom function to handle data persistence for this specific mode. |
| `enable_file_logging` | `bool` | If `True`, creates a dedicated log file for this mode session. |
| `log_directory` | `str` | Directory where mode-specific logs are stored. Defaults to `.`. |
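A sketch combining several of these arguments; the `Store` class and the `save_stores` hook are hypothetical, and the exact signature rambot passes to `save` is an assumption:

```python
from rambot import Scraper, bind
from rambot.scraper import Document

class Store(Document):  # hypothetical Document subclass for illustration
    name: str

def save_stores(documents):
    # Hypothetical persistence hook; rambot may call it with a different
    # signature, so treat this as a placeholder only.
    ...

class StoreScraper(Scraper):
    @bind(
        mode="stores",              # CLI name; output file becomes stores.json
        input="cities.json",        # manual input override
        document_output=Store,      # class used to save results
        save=save_stores,           # custom persistence for this mode
        enable_file_logging=True,   # dedicated log file for this session
        log_directory="logs",       # where mode-specific logs are stored
    )
    def stores(self, city: Document) -> list[Store]:
        ...
```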
Usage Options
1. Automatic Discovery (The "Magic" Way)
Rambot uses an internal type registry to link modes together. If one mode returns a specific Document subclass and another mode expects it as an argument, Rambot connects them automatically.
```python
from rambot import Scraper, bind
from rambot.scraper import Document

# Define specific subclasses to act as 'type keys'
class City(Document):
    name: str

class BasicScraper(Scraper):
    @bind("cities")
    def get_cities(self) -> list[City]:
        # Registers: City -> 'cities' mode (outputs cities.json)
        return [City(link="...", name="Vancouver")]

    @bind("listing")
    def get_listings(self, city: City):
        # 'listing' needs 'City', finds 'cities' mode, and loads 'cities.json'
        self.load_page(city.link)
```
2. Manual Override (For Generic Documents)
When multiple modes use the base Document class, you must manually specify the input file to avoid collisions.
@bind("listing", input="cities.json")
def listing(self, doc: Document) -> list[Document]:
# Explicitly read from cities.json even if return hints are generic
...
3. Functional Input (Custom Fetching)
Instead of a file, you can pass a function as `input` to fetch data from a database, API, or another external source.
```python
def fetch_from_db(scraper):
    return [{"link": "https://example.com/1"}, {"link": "https://example.com/2"}]

class DatabaseScraper(Scraper):
    @bind("process", input=fetch_from_db)
    def process_data(self, doc: Document):
        self.load_page(doc.link)
```
Execution Logic & Priority
When a mode is launched via the CLI, Rambot determines the input data using this hierarchy:
1. CLI Override: `--url <link>` ignores all other inputs and processes that single URL (see the example below).
2. Manual Input: If `input` is defined in `@bind` (file or function), it is used next.
3. Auto-Detection: Rambot searches the Type Registry for a mode producing the class in the method signature (e.g., `city: City`).
4. Empty Start: If no input is found, the mode runs once with no positional arguments.
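For instance, assuming the same `main.py` entry point as in the VS Code configuration below, the CLI override looks like this:

```bash
# Process a single URL with the details mode, ignoring any input file
python main.py --mode details --url https://example.com/target
```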
Advanced Usage: Network Interceptor
Capture background API traffic while navigating. This is ideal for sites like SkipTheDishes that load menus via background JSON calls.
```python
from pydantic import Field
from rambot import Scraper, bind
from rambot.scraper import Document

class ProductDoc(Document):
    price: float = Field(0.0)
    api_count: int = Field(0)

class InterceptorScraper(Scraper):
    @bind(mode="details", input="listing", document_output=ProductDoc)
    def details(self, doc) -> ProductDoc:
        self.load_page(doc.link)

        # Filter for API/Fetch calls only
        api_calls = self.interceptor.requests(lambda r: r.is_fetch)

        # Check for specific API errors
        errors = self.interceptor.requests(lambda r: r.response.is_error)

        doc.api_count = len(api_calls)
        return doc
```
Pro-Tips
- Filtering: Use `lambda r: r.resource_type == "image"` to find specific assets (combined in the sketch below).
- Status Handling: Use `req.response.ok` to verify capture success.
- DotDict: All captured requests inherit from `dict`, allowing `json.dump(requests, f)` with no extra code.
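A small sketch putting these tips together, assuming a mode that navigates to a page first (the mode name and input file are illustrative):

```python
from rambot import Scraper, bind
from rambot.scraper import Document

class AssetScraper(Scraper):
    @bind("assets", input="listing.json")
    def assets(self, doc: Document):
        self.load_page(doc.link)

        # Find specific assets by resource type
        images = self.interceptor.requests(lambda r: r.resource_type == "image")

        # Keep only fetch/XHR calls whose responses came back successfully
        good_api = self.interceptor.requests(lambda r: r.is_fetch and r.response.ok)

        print(f"{len(images)} images, {len(good_api)} successful API calls")
```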
Configuration
VS Code Launch Setup
Use .vscode/launch.json to debug specific modes and URLs:
```json
{
    "configurations": [
        {
            "name": "Scrape Details",
            "type": "python",
            "request": "launch",
            "program": "main.py",
            "args": [
                "--mode", "details",
                "--url", "https://example.com/target"
            ]
        }
    ]
}
```
HTTP Request Module
The rambot.http module provides a high-performance, standalone HTTP client for high-speed scraping without a browser. It is built on top of botasaurus and requests, offering automated retries, advanced header normalization, and seamless integration with browser-like configurations.
Core Features
- Automated Retries: Built-in exponential backoff and retry logic via `max_retry` and `retry_wait` parameters.
- Browser Impersonation: Easily simulate specific browsers (e.g., Chrome) and operating systems (e.g., Windows).
- Advanced Header Handling: Automatically normalizes headers to match browser behavior.
- Response Parsing: Automatically parses responses into structured `ResponseContent` or returns raw objects.
- Error Management: Robust exception handling for network failures, unsupported methods, and invalid configurations.
Usage Example
For rapid data extraction when a full browser session is not required:
```python
from rambot.http import request
from rambot import Scraper, bind
from rambot.scraper import Document

class BasicDoc(Document):
    custom_data: dict

class BasicScraper(Scraper):
    def open_browser(self):
        # Prevent the browser from opening for this specific mode
        if self.mode == "basic":
            return
        super().open_browser()

    @bind("basic")
    def get_basic(self) -> BasicDoc:
        json_data = request(
            "GET",
            "https://api.example.com/details",
            max_retry=3,
            parsed=True,
        )
        return BasicDoc(link="...", custom_data=json_data)
```
Function Signature: request()
| Argument | Type | Description | Default |
|---|---|---|---|
| `method` | `Literal["GET", "POST"]` | HTTP verb to use. | Required |
| `url` | `HttpUrl` | The target destination URL. | Required |
| `options` | `Union[Dict[str, Any], BeautifulSoup, str, Response]` | Dictionary containing headers, proxies, data, or browser settings. | `{}` |
| `max_retry` | `int` | Maximum number of attempts in case of failure. | `5` |
| `retry_wait` | `int` | Delay in seconds between retry attempts. | `5` |
| `parsed` | `bool` | If `False`, returns the raw response instead of a parsed object. | `False` |
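As a sketch of a non-default call, assuming `options` accepts browser-style keys such as `headers` and `data` (the exact schema is not documented here, so the key names are assumptions):

```python
from rambot.http import request

raw = request(
    "POST",
    "https://api.example.com/search",
    options={
        "headers": {"Accept": "application/json"},
        "data": {"query": "pizza"},
    },
    max_retry=3,
    retry_wait=2,
    parsed=False,  # return the raw response instead of parsed content
)
```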
Error Handling
The module raises specific exceptions to help you debug scraping issues:
- `MethodError`: Raised if an unsupported HTTP method is provided.
- `RequestFailure`: Raised when the request fails due to network issues or status errors.
- `OptionsError`: Raised if the provided `options` dictionary contains invalid types or configurations.
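A minimal sketch of handling these exceptions; the import path for the exception classes is an assumption, so adjust it to wherever rambot actually exposes them:

```python
from rambot.http import request, MethodError, RequestFailure, OptionsError

try:
    data = request("GET", "https://api.example.com/details", parsed=True)
except MethodError:
    ...  # an unsupported HTTP verb was passed
except OptionsError:
    ...  # the options payload contained invalid types or settings
except RequestFailure:
    ...  # network failure or error status after all retries
```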