# Rambot: Versatile Web Scraping Framework

## Description

Rambot is a versatile, configurable web scraping framework designed to automate data extraction from web pages. It provides an intuitive structure for:
- Managing different scraping modes.
- Automating browser navigation.
- Handling logs and errors.
- Performing advanced HTTP requests to interact with APIs.
## Installation

```bash
pip install rambot
```
### ChromeDriver Dependency

Rambot uses ChromeDriver for automated browsing. Install it for your operating system:

- **Windows**: Download ChromeDriver and add it to your `PATH`.
- **macOS**: Install via Homebrew:

  ```bash
  brew install chromedriver
  ```

- **Linux**: Install via APT:

  ```bash
  sudo apt install chromium-chromedriver
  ```
## Key Features

### 1. Mode-Based Execution
- Supports multiple scraping modes via `ScraperModeManager`.
- Use the `@bind` decorator or `self.mode_manager.register()` to associate functions with specific modes.

### 2. Headless Browser Control
- Integrates with `botasaurus` for automation.
- Advanced proxy management, image blocking, and extension loading.
- Uses ChromeDriver to navigate and extract content.

### 3. Optimized Data Handling
- Saves extracted data in JSON format.
- Reads and processes existing data files as input.
- Models structured data using `Document`.

### 4. Error Management & Logging
- Centralized error handling with `ErrorConfig`.
- Uses `loguru` for detailed, structured logging.

### 5. Scraping Throttling & Delays
- Introduces randomized delays via `wait()` to mimic human behavior.
- Helps ensure compliance with website rate limits.

### 6. Useful Decorators
- `@errors`: Structured error handling.
- `@no_print`: Suppresses unwanted output.
- `@scrape`: Enforces function structure in scraping processes.
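The mode-based dispatch described above can be illustrated with a minimal stand-in registry. This is a hypothetical sketch in plain Python, not Rambot's actual `ScraperModeManager` implementation, but it shows the register/bind/dispatch pattern the feature list names:

```python
# Minimal stand-in for a mode registry: maps mode names to handler
# functions, mimicking what a ScraperModeManager-style class might do.
class ModeManager:
    def __init__(self):
        self._modes = {}

    def register(self, name, func):
        self._modes[name] = func

    def bind(self, mode):
        # Decorator form, analogous to @bind(mode="cities")
        def decorator(func):
            self.register(mode, func)
            return func
        return decorator

    def run(self, mode):
        if mode not in self._modes:
            raise ValueError(f"Unknown mode: {mode}")
        return self._modes[mode]()


manager = ModeManager()

@manager.bind(mode="cities")
def cities():
    return ["calgary", "brandon"]

print(manager.run("cities"))  # ['calgary', 'brandon']
```

The registry is just a dictionary from mode names to callables; the decorator form and the explicit `register()` call are two routes to the same mapping.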
## Basic Usage

### 1. Create a Scraper

```python
import typing

from rambot.scraper import Scraper, bind
from rambot.scraper.models import Document


class App(Scraper):
    BASE_URL: str = "https://www.skipthedishes.com"

    @bind(mode="cities")
    def available_cities(self) -> typing.List[Document]:
        self.get("https://www.skipthedishes.com/canada-food-delivery")
        elements = self.find_all("h4 div a")
        return [
            Document(link=self.BASE_URL + href)
            for element in elements
            if (href := element.get_attribute("href"))
        ]
```
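`Document` in the example above is Rambot's structured-data model. Its exact definition isn't shown here, but a minimal stand-in that supports the `Document(link=...)` construction used throughout could be sketched as a dataclass (hypothetical; the real class may carry more fields and validation):

```python
from dataclasses import dataclass

# Hypothetical stand-in for rambot's Document model, for illustration only.
@dataclass
class Document:
    link: str

doc = Document(link="https://www.skipthedishes.com/cities/calgary")
print(doc.link)  # https://www.skipthedishes.com/cities/calgary
```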
### 2. Run the Scraper

```python
if __name__ == "__main__":
    app = App()
    app.run()  # Executes the mode registered in launch.json
```
### 3. Configure launch.json in VSCode

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "cities",
            "type": "python",
            "request": "launch",
            "program": "main.py",
            "justMyCode": false,
            "args": ["--mode", "cities"]
        }
    ]
}
```
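Outside VSCode, the same configuration corresponds to running `python main.py --mode cities` from a shell. How Rambot parses the flag internally isn't documented here, but a minimal equivalent with the standard library's `argparse` looks like this:

```python
import argparse

# Minimal sketch of parsing the --mode flag, as passed via launch.json args.
parser = argparse.ArgumentParser()
parser.add_argument("--mode", required=True, help="Scraping mode to execute")

# Equivalent to: python main.py --mode cities
args = parser.parse_args(["--mode", "cities"])
print(args.mode)  # cities
```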
### 4. Retrieve Results

Extracted data is saved in `{mode}.json`:

```json
{
    "data": [
        {"link": "https://www.skipthedishes.com/cities/calgary"},
        {"link": "https://www.skipthedishes.com/cities/brandon"},
        {"link": "https://www.skipthedishes.com/cities/welland"}
    ],
    "run_stats": {"status": "success", "message": null}
}
```
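Because the output is plain JSON, downstream code can consume it with the standard library alone. The sketch below uses an inline string in the shape shown above; in practice you would `open("cities.json")` following the `{mode}.json` naming convention:

```python
import json

# Sample payload mirroring the {mode}.json structure shown above.
raw = """
{
  "data": [{"link": "https://www.skipthedishes.com/cities/calgary"}],
  "run_stats": {"status": "success", "message": null}
}
"""

result = json.loads(raw)
links = [item["link"] for item in result["data"]]
print(result["run_stats"]["status"], links)
```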
## HTTP Request Module

This module sends HTTP requests with automatic error handling, logging, and retry attempts.

### Example Usage

```python
from module_name import request

response = request(
    method="GET",
    url="http://example.com",
    options={"headers": {"User-Agent": "CustomAgent"}, "timeout": 10},
    max_retry=3,
    retry_wait=2
)
```
### Using Proxies and Custom Headers

```python
response = request(
    method="POST",
    url="http://example.com/api",
    options={
        "proxies": {"http": "http://my-proxy.com:{port}", "https": "http://my-proxy.com:{port}"},
        "json": {"key": "value"},
        "headers": {"Authorization": "Bearer TOKEN"}
    },
    max_retry=5,
    retry_wait=3
)
```
### Usage in a Scraper

```python
import typing

from rambot.requests import requests
from rambot.scraper import Scraper, bind
from rambot.models import Document


class App(Scraper):
    BASE_URL: str = "https://www.skipthedishes.com"

    def open(self, wait=True):
        if self.mode in ["cities"]:
            return  # Prevents the browser from opening for this mode
        return super().open(wait)

    @bind(mode="cities")
    def cities(self) -> typing.List[Document]:
        response = requests.send(
            method="GET",
            url="https://www.skipthedishes.com/canada-food-delivery",
            options={"timeout": 15},
            max_retry=5,
            retry_wait=1.25
        )
        elements = response.select("h4 div a")
        return [
            Document(link=self.BASE_URL + href)
            for element in elements
            if (href := element.get("href"))
        ]
```
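Browserless scraping ultimately comes down to parsing the response HTML. The `response.select(...)` call above implies a CSS-selector API; with nothing but the standard library, the core idea of pulling anchor hrefs out of markup can be sketched with `html.parser` (a simplified stand-in, not Rambot's parsing machinery):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = '<h4><div><a href="/cities/calgary">Calgary</a></div></h4>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/cities/calgary']
```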
## Advantages

- **Scraping without a browser**: Reduces resource consumption.
- **Retry mechanism**: Minimizes transient failures.
- **Fast data extraction**: Parses HTML directly with `requests`.

With Rambot, automate and optimize your data extractions efficiently! 🚀
## File Details

### rambot-0.1.1.tar.gz

- Size: 23.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | `da936df4d474bffcff92aea6095457d5df85c4ab2589e6c4a4eeaee0353beeee` |
| MD5 | `fea1b9d8d351f99a4592c73e7eb044f1` |
| BLAKE2b-256 | `fe35d9f896d106a28bf5ba2bb52d3facffe9239600e545f85e33ce24ba38bfc7` |
### Provenance

The following attestation bundle was made for rambot-0.1.1.tar.gz:

- Publisher: python-publish.yml on AlexVachon/rambot
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rambot-0.1.1.tar.gz
- Subject digest: `da936df4d474bffcff92aea6095457d5df85c4ab2589e6c4a4eeaee0353beeee`
- Sigstore transparency entry: 178898107
- Permalink: AlexVachon/rambot@1a29077ec7c24159a8127aad04799039f213bbba
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/AlexVachon
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@1a29077ec7c24159a8127aad04799039f213bbba
- Trigger Event: release
### rambot-0.1.1-py3-none-any.whl

- Size: 26.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4485be242548b6f7c93b8db5caa6b3335580c9b7ce21205609b098da28f08b36` |
| MD5 | `12ecce58ded20a1f216512a1029af5a3` |
| BLAKE2b-256 | `34106848114cc1dc23275be7895c5c82731d8b26e400af3dac9b89a0b9501954` |
### Provenance

The following attestation bundle was made for rambot-0.1.1-py3-none-any.whl:

- Publisher: python-publish.yml on AlexVachon/rambot
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rambot-0.1.1-py3-none-any.whl
- Subject digest: `4485be242548b6f7c93b8db5caa6b3335580c9b7ce21205609b098da28f08b36`
- Sigstore transparency entry: 178898115
- Permalink: AlexVachon/rambot@1a29077ec7c24159a8127aad04799039f213bbba
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/AlexVachon
- Access: private
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@1a29077ec7c24159a8127aad04799039f213bbba
- Trigger Event: release