# Scraperator

A flexible web scraping toolkit that supports multiple fetching methods (Requests and Playwright) with intelligent fallbacks, persistent caching, and Markdown conversion.
## Features

- **Multiple Scraping Methods**: Choose between standard HTTP requests or browser automation via Playwright
- **Smart Caching**: Persistent cache for scraped content, with TTL support
- **Automatic Retries**: Built-in retry mechanism with exponential backoff
- **Concurrent Scraping**: Asynchronous scraping with a simple API
- **Content Processing**: Convert HTML to clean Markdown for easier content extraction
- **Flexible Configuration**: Extensive customization options for each scraping method
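The retry feature can be pictured with a minimal stdlib sketch of exponential backoff (the function name `fetch_with_retries` and its parameters are illustrative, not part of the scraperator API):

```python
import time

def fetch_with_retries(fetch, max_retries=3, base_delay=1.0):
    """Call `fetch()` and retry on failure, doubling the delay each attempt."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: propagate the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```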
## Installation

```bash
pip install scraperator
```
## Quick Start

```python
from scraperator import Scraper

# Basic usage with Requests (default)
scraper = Scraper(url="https://example.com")
html = scraper.scrape()
print(scraper.markdown)  # Get content as Markdown

# Using Playwright for JavaScript-heavy sites
pw_scraper = Scraper(
    url="https://example.com/spa",
    method="playwright",
    headless=True
)
pw_scraper.scrape()
print(pw_scraper.get_status_code())  # Check status code
```
## Advanced Usage

### Configuring the Cache

```python
scraper = Scraper(
    url="https://example.com",
    cache_ttl=7,  # Cache for 7 days
    cache_directory="custom/cache/dir"
)
```
### Playwright Options

```python
scraper = Scraper(
    url="https://example.com/complex-page",
    method="playwright",
    browser_type="firefox",  # Use Firefox browser
    headless=False,  # Show browser window
    wait_for_selectors=[".content", "#main-article"]  # Wait for these elements
)
```
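Waiting for selectors boils down to polling until a condition holds or a timeout expires. The generic pattern can be sketched with the stdlib (`wait_until` is an illustrative helper, not part of the scraperator API):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll `predicate()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False  # timed out without the condition becoming true
```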
### Async Scraping

```python
scraper = Scraper(url="https://example.com")

# Start scraping in the background
scraper.scrape(async_mode=True)

# Do other work...
print("Doing other work while scraping...")

# Check whether scraping has finished
if scraper.is_complete():
    print("Scraping finished!")
else:
    # Wait for scraping to complete, with a timeout
    scraper.wait(timeout=10)

html = scraper.get_html()
```
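The pattern above resembles running the fetch on a worker thread while the caller polls or joins. A minimal stdlib sketch of the same start/is_complete/wait shape (`BackgroundTask` is illustrative, not scraperator's implementation):

```python
import threading

class BackgroundTask:
    """Run `work()` on a worker thread; mirrors the is_complete/wait pattern."""

    def __init__(self, work):
        self._result = None
        self._thread = threading.Thread(target=self._run, args=(work,))
        self._thread.start()

    def _run(self, work):
        self._result = work()

    def is_complete(self):
        return not self._thread.is_alive()

    def wait(self, timeout=None):
        self._thread.join(timeout)  # block until done, or timeout elapses
        return self._result
```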
### Markdown Conversion Options

```python
scraper = Scraper(
    url="https://example.com/blog",
    markdown_options={
        "strip_tags": ["script", "style", "nav"],
        "content_selectors": ["article", ".post-content"],
        "preserve_images": True,
        "compact_output": True
    }
)
scraper.scrape()
markdown = scraper.get_markdown()
```
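The `strip_tags` option drops unwanted elements before conversion. The core idea can be sketched with the stdlib HTML parser (an illustration of tag stripping only, not scraperator's converter):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect text content while skipping everything inside unwanted tags."""

    def __init__(self, strip_tags=("script", "style", "nav")):
        super().__init__()
        self.strip_tags = set(strip_tags)
        self._depth = 0  # > 0 while inside a stripped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.strip_tags:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.strip_tags and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_tags(html, tags=("script", "style", "nav")):
    parser = TagStripper(tags)
    parser.feed(html)
    return " ".join(parser.chunks)
```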
## License

MIT License
## Project details

### Download files

- Source distribution: `scraperator-0.0.2.tar.gz` (10.8 kB)
- Built distribution: `scraperator-0.0.2-py3-none-any.whl` (10.7 kB)
### File details: scraperator-0.0.2.tar.gz

- Size: 10.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9e1cb6d92e31ce19f74587d14d0118f893f0436699be4e5d4605c0519156e102` |
| MD5 | `e3a3acf7b9ce14f89a41759bda290914` |
| BLAKE2b-256 | `fd5ea5fd16b9dd2b5a1de4a627f9034374719a34849a0df2650beca7ebc97be7` |
### File details: scraperator-0.0.2-py3-none-any.whl

- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.2

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3c47088dd8fe96e7058c99d37299c7ec1a43d3c9debb75cbc062fc34a319d2cb` |
| MD5 | `a33c464fe934c46658331f3d18555602` |
| BLAKE2b-256 | `4876b9b4019471651ee823a6d96ee9f20f05931171493fe62ab1de2b3b71f2f2` |