Scraperator
A flexible web scraping toolkit with intelligent caching capabilities, supporting different fetching methods (Requests and Playwright) with automatic fallbacks, persistent caching, and Markdown conversion.
Features
- Multiple Scraping Methods: Choose between standard HTTP requests or browser automation via Playwright
- Smart Caching: Persistent cache for scraped content with TTL support
- Automatic Retries: Built-in retry mechanism with exponential backoff
- Concurrent Scraping: Asynchronous scraping with a simple API
- Content Processing: Convert HTML to clean Markdown for easier content extraction
- Flexible Configuration: Extensive customization options for each scraping method
Installation
pip install scraperator
Scraperator will automatically install the required browser binaries when they're first needed. No additional installation steps are required.
Note: When the browser is first used, there may be a brief delay while the appropriate binaries are downloaded and installed. If you encounter permission issues during automatic installation, you may need to install the browsers manually by running `playwright install chromium` (or the browser of your choice) with administrator/sudo privileges.
Quick Start
from scraperator import Scraper
# Basic usage with Requests (default)
scraper = Scraper(url="https://example.com")
html = scraper.scrape()
print(scraper.markdown) # Get content as Markdown
# Using Playwright for JavaScript-heavy sites
pw_scraper = Scraper(
url="https://example.com/spa",
method="playwright",
headless=True
)
pw_scraper.scrape()
print(pw_scraper.get_status_code()) # Check status code
API Reference
Scraper Class
The main entry point for all scraping operations.
Constructor
Scraper(
url: str,
method: str = "requests",
cache_ttl: int = 1,
cache_directory: Optional[str] = None,
cache_id: Optional[str] = None,
browser_type: Literal["chromium", "firefox", "webkit"] = "chromium",
headless: bool = True,
max_retries: int = 3,
backoff_factor: float = 2.0,
markdown_options: Optional[Dict[str, Any]] = None,
**kwargs: Any
)
Parameters:
- `url`: The URL to scrape. This is the only required parameter.
- `method`: Scraping method to use. Options are:
  - `"requests"`: Uses the requests library for simple HTTP requests (default)
  - `"playwright"`: Uses browser automation for JavaScript-heavy sites
- `cache_ttl`: Time-to-live for the cache in days (default: 1). Sets how long scraped content remains valid in the cache before refreshing.
- `cache_directory`: Custom directory for cache files. If not specified, defaults to `"cache/scraper/{method}"`.
- `cache_id`: Custom identifier for the cache entry. If not specified, an ID is generated from the URL.
- `browser_type`: Browser engine to use with Playwright. Options are `"chromium"` (default), `"firefox"`, or `"webkit"`.
- `headless`: Whether to run the browser in headless mode (default: True). Set to False to see the browser while scraping.
- `max_retries`: Maximum number of retry attempts for failed requests (default: 3).
- `backoff_factor`: Factor for exponential backoff between retries (default: 2.0). Wait time is calculated as `backoff_factor ** attempt`.
- `markdown_options`: Dictionary of options for Markdown conversion. See the Markdown Conversion Options section.
- `**kwargs`: Additional options passed to the underlying scraper implementation.
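To make the options concrete, here is a sketch combining several of the parameters above (the URL and values are arbitrary illustrations; every parameter comes from the constructor signature):

```python
from scraperator import Scraper

# Playwright-backed scraper with a week-long cache and patient retries;
# all parameters shown are documented in the constructor above
scraper = Scraper(
    url="https://example.com/articles",
    method="playwright",
    browser_type="firefox",
    headless=True,
    cache_ttl=7,
    max_retries=5,
    backoff_factor=2.0,
)
html = scraper.scrape()
```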
Core Scraping Methods
scrape(async_mode: bool = False, force_refresh: bool = False) -> Optional[str]
Primary method to scrape the URL and return the HTML content.
Parameters:
- `async_mode`: If True, scraping happens in the background and the method returns immediately (default: False).
- `force_refresh`: If True, ignores the cache and forces a new fetch from the website (default: False).
Returns:
- HTML content as a string if `async_mode` is False
- None if `async_mode` is True (scraping happens in the background)
Example:
# Synchronous scraping
html = scraper.scrape()
# Asynchronous scraping
scraper.scrape(async_mode=True)
# ... do other work ...
if scraper.is_complete():
html = scraper.get_html()
get_html(force_refresh: bool = False) -> str
Get the HTML content from the scraped URL. Will trigger scraping if not already done.
Parameters:
force_refresh: If True, ignores cache and forces a new fetch (default: False).
Returns:
- HTML content as string.
Example:
html = scraper.get_html()
# or force a refresh
fresh_html = scraper.get_html(force_refresh=True)
get_status_code() -> Optional[int]
Get the HTTP status code from the last scrape operation.
Returns:
- HTTP status code as integer (e.g., 200, 404, 500), or None if no scrape has been performed.
Example:
status = scraper.get_status_code()
if status == 200:
print("Scraping successful")
elif status >= 400:
print(f"Error during scraping: HTTP {status}")
Content Processing Methods
get_markdown() -> str
Get the scraped content converted to Markdown format. Applies the markdown_options specified during initialization.
Returns:
- Markdown content as string.
Example:
markdown_content = scraper.get_markdown()
with open("scraped_content.md", "w") as f:
f.write(markdown_content)
Asynchronous Operation Methods
is_complete() -> bool
Check if an asynchronous scraping operation is complete.
Returns:
- True if scraping is complete or hasn't started, False otherwise.
Example:
scraper.scrape(async_mode=True)
while not scraper.is_complete():
print("Still scraping...")
time.sleep(1)
print("Scraping complete!")
wait(timeout: Optional[float] = None) -> bool
Wait for an asynchronous scraping operation to complete.
Parameters:
timeout: Maximum time to wait in seconds, or None to wait indefinitely.
Returns:
- True if scraping completed successfully, False if timeout was reached.
Example:
scraper.scrape(async_mode=True)
if scraper.wait(timeout=10):
print("Scraping completed within timeout")
else:
print("Scraping timed out")
get_result(timeout: Optional[float] = None) -> Tuple[str, int]
Get the result of an asynchronous scraping operation. Blocks until complete or timeout reached.
Parameters:
timeout: Maximum time to wait in seconds, or None to wait indefinitely.
Returns:
- Tuple of (HTML content, HTTP status code).
Example:
scraper.scrape(async_mode=True)
try:
html, status_code = scraper.get_result(timeout=30)
print(f"Got result with status {status_code}")
except ValueError:
print("Scraping hasn't been started")
cancel() -> bool
Cancel an ongoing asynchronous scraping operation.
Returns:
- True if operation was canceled successfully, False if no operation was in progress.
Example:
scraper.scrape(async_mode=True)
time.sleep(2)
if scraper.cancel():
print("Scraping canceled")
Resource Management Methods
shutdown(wait: bool = True) -> None
Clean up resources used by the scraper. Call this method when you're finished to prevent resource leaks.
Parameters:
wait: If True, wait for any pending operations to complete before shutting down (default: True).
Example:
scraper.scrape()
# Process the results
# ...
scraper.shutdown()
Properties
soup
BeautifulSoup object for the scraped HTML. Allows direct access to BeautifulSoup methods for content parsing.
Returns:
- BeautifulSoup object initialized with the scraped HTML.
Example:
scraper.scrape()
title = scraper.soup.title.string
all_links = [a['href'] for a in scraper.soup.find_all('a', href=True)]
markdown
Scraped content converted to Markdown. This is a convenience property that returns the same as get_markdown().
Returns:
- Markdown content as string.
Example:
scraper.scrape()
print(scraper.markdown)
Configuration Options
Scraping Methods
Requests (Default)
Good for simple websites without heavy JavaScript.
scraper = Scraper(
url="https://example.com",
method="requests",
headers={"User-Agent": "Custom User Agent"},
timeout=60
)
Additional options:
- `headers`: Custom HTTP headers
- `timeout`: Request timeout in seconds
Playwright
Recommended for JavaScript-heavy websites, SPAs, and sites with dynamic content.
scraper = Scraper(
url="https://example.com",
method="playwright",
browser_type="chromium", # or "firefox", "webkit"
headless=True,
wait_for_selectors=[".content", "#main-article"],
networkidle_timeout=15000,
load_timeout=45000
)
Additional options:
- `browser_type`: Browser engine to use (`"chromium"`, `"firefox"`, or `"webkit"`)
- `headless`: Whether to run the browser in headless mode
- `wait_for_selectors`: CSS selectors to wait for before considering the page loaded
- `networkidle_timeout`: Time to wait for the network to be idle (ms)
- `load_timeout`: Time to wait for the page to load (ms)
- `browser_args`: Additional arguments for browser launch
- `context_args`: Additional arguments for the browser context
Caching Options
scraper = Scraper(
url="https://example.com",
cache_ttl=7, # Cache for 7 days
cache_directory="custom/cache/dir",
cache_id="custom_identifier"
)
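With these settings, repeat scrapes inside the TTL are served from the cache, while `force_refresh` bypasses it. A minimal sketch using the methods documented above:

```python
from scraperator import Scraper

scraper = Scraper(url="https://example.com", cache_ttl=7)

html = scraper.scrape()                       # first call fetches and caches
cached = scraper.get_html()                   # served from cache within the TTL
fresh = scraper.get_html(force_refresh=True)  # ignore the cache and refetch
```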
Markdown Conversion Options
scraper = Scraper(
url="https://example.com",
markdown_options={
"strip_tags": ["script", "style", "nav", "footer"],
"content_selectors": ["article", ".post-content"],
"preserve_images": True,
"include_title": True,
"compact_output": False
}
)
Advanced Usage
Asynchronous Scraping
from scraperator import Scraper
import time
scraper = Scraper(url="https://example.com")
# Start scraping in background
scraper.scrape(async_mode=True)
# Do other work
print("Doing other work while scraping...")
time.sleep(1)
# Check if scraping is finished
if scraper.is_complete():
html = scraper.get_html()
print("Scraping finished!")
else:
# Wait for scraping to complete with timeout
success = scraper.wait(timeout=10)
if success:
html = scraper.get_html()
else:
print("Scraping timed out")
Using as Context Manager
from scraperator import Scraper
with Scraper(url="https://example.com") as scraper:
html = scraper.scrape()
markdown = scraper.markdown
# Resources automatically cleaned up after block
Combining with BeautifulSoup
from scraperator import Scraper
scraper = Scraper(url="https://example.com")
scraper.scrape()
# Access the BeautifulSoup object
soup = scraper.soup
# Use BeautifulSoup methods
title = soup.title.string
links = [a['href'] for a in soup.find_all('a', href=True)]
Best Practices
- Choose the right scraping method:
  - Use `requests` for simple static websites
  - Use `playwright` for JavaScript-heavy sites or SPAs
- Set appropriate cache TTL:
  - Shorter TTL for frequently changing content
  - Longer TTL for static or archival content
- Handle resources properly:
  - Use the context manager pattern with the `with` statement
  - Or explicitly call `shutdown()` when done
- Respect website terms of service:
  - Add delays between requests
  - Consider implementing rate limiting (a sketch follows this list)
  - Add proper user agent information
- Optimize Playwright usage:
  - Specify `wait_for_selectors` for faster completion
  - Use headless mode unless debugging is needed
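As a sketch of the delay and user-agent advice above (the header value and delay are arbitrary illustrations; `headers` is a documented option for the requests method):

```python
import time

from scraperator import Scraper

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

for url in urls:
    scraper = Scraper(
        url=url,
        headers={"User-Agent": "MyProject/1.0 (+https://example.com/bot)"},
    )
    try:
        html = scraper.scrape()
        # ... process html ...
    finally:
        scraper.shutdown()  # release resources for each page
    time.sleep(2)           # simple fixed delay between requests
```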
Common Issues and Solutions
Connection Errors
If you're experiencing connection errors:
- Increase the `max_retries` parameter
- Adjust the `backoff_factor` for longer waits between retries
- Check network connectivity and website availability
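For example, a more patient retry configuration using the documented constructor parameters:

```python
from scraperator import Scraper

# Per the documented formula, waits backoff_factor ** attempt seconds
# between retries: 3, 9, 27, ... seconds for successive failures
scraper = Scraper(
    url="https://example.com",
    max_retries=5,
    backoff_factor=3.0,
)
html = scraper.scrape()
```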
Incomplete Content
If scraped content seems incomplete:
- Switch from the `requests` method to `playwright`
- Specify `wait_for_selectors` for dynamic content
- Increase the `networkidle_timeout` and `load_timeout` values
High Memory Usage
If memory usage is a concern:
- Call `shutdown()` after scraping to release resources
- Use the context manager pattern (`with` statement)
- Process and discard data in batches
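A sketch of the batch pattern, leaning on the context manager support shown earlier so each page's resources are released before the next is fetched (`process` is a hypothetical placeholder for your own handling):

```python
from scraperator import Scraper

def process(markdown: str) -> None:
    # hypothetical placeholder: persist or analyse one page, then discard it
    print(len(markdown))

for url in ["https://example.com/page1", "https://example.com/page2"]:
    with Scraper(url=url) as scraper:
        scraper.scrape()
        process(scraper.markdown)
    # resources are released when the with-block exits, keeping memory flat
```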
License
MIT License