# Scraperator
A flexible web scraping toolkit that supports multiple fetching methods (Requests and Playwright) with intelligent fallbacks, persistent caching, and HTML-to-Markdown conversion.
## Features

- **Multiple Scraping Methods**: Choose between standard HTTP requests or browser automation via Playwright
- **Smart Caching**: Persistent cache for scraped content with TTL support
- **Automatic Retries**: Built-in retry mechanism with exponential backoff (see the sketch after this list)
- **Concurrent Scraping**: Asynchronous scraping with a simple API
- **Content Processing**: Convert HTML to clean Markdown for easier content extraction
- **Flexible Configuration**: Extensive customization options for each scraping method
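Retry behavior is configured on the scraper itself. Below is a minimal sketch; `retry_count` and `backoff_factor` are hypothetical parameter names chosen for illustration, so check the actual `Scraper` signature before relying on them:

```python
from scraperator import Scraper

# NOTE: retry_count and backoff_factor are assumed parameter names,
# not confirmed against the scraperator API.
scraper = Scraper(
    url="https://example.com/flaky-endpoint",
    retry_count=5,       # give up after 5 failed attempts (assumed name)
    backoff_factor=2.0   # double the delay between attempts (assumed name)
)
html = scraper.scrape()
```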
## Installation

```bash
pip install scraperator
```
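If you plan to use the Playwright method, the browser binaries must also be installed via Playwright's own CLI:

```bash
playwright install
```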
## Quick Start

```python
from scraperator import Scraper

# Basic usage with Requests (default)
scraper = Scraper(url="https://example.com")
html = scraper.scrape()
print(scraper.markdown)  # Get content as Markdown

# Using Playwright for JavaScript-heavy sites
pw_scraper = Scraper(
    url="https://example.com/spa",
    method="playwright",
    headless=True
)
pw_scraper.scrape()
print(pw_scraper.get_status_code())  # Check status code
```
## Advanced Usage

### Configuring the Cache

```python
scraper = Scraper(
    url="https://example.com",
    cache_ttl=7,                        # Cache for 7 days
    cache_directory="custom/cache/dir"  # Custom cache location
)
```
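Within the TTL window, repeat calls should be served from the persistent cache instead of hitting the network. A quick sketch of the expected behavior, assuming the URL acts as the cache key:

```python
scraper = Scraper(url="https://example.com", cache_ttl=7)
scraper.scrape()  # first call: fetched over the network, then cached
scraper.scrape()  # within 7 days: served from the cache (assumed behavior)
```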
### Playwright Options

```python
scraper = Scraper(
    url="https://example.com/complex-page",
    method="playwright",
    browser_type="firefox",                           # Use Firefox instead of the default browser
    headless=False,                                   # Show the browser window
    wait_for_selectors=[".content", "#main-article"]  # Wait for these elements
)
```
### Async Scraping

```python
scraper = Scraper(url="https://example.com")

# Start scraping in the background
scraper.scrape(async_mode=True)

# Do other work...
print("Doing other work while scraping...")

# Check whether scraping has finished
if scraper.is_complete():
    print("Scraping finished!")
else:
    # Wait for scraping to complete, with a timeout
    scraper.wait(timeout=10)

html = scraper.get_html()
```
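Async mode also makes it easy to scrape several pages concurrently. The sketch below uses only the calls documented above; the URLs are placeholders:

```python
from scraperator import Scraper

urls = ["https://example.com/page1", "https://example.com/page2"]
scrapers = [Scraper(url=u) for u in urls]

# Kick off every scrape in the background
for s in scrapers:
    s.scrape(async_mode=True)

# Block on each scraper in turn and collect the results
for s in scrapers:
    s.wait(timeout=30)
    print(len(s.get_html()))
```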
### Markdown Conversion Options

```python
scraper = Scraper(
    url="https://example.com/blog",
    markdown_options={
        "strip_tags": ["script", "style", "nav"],           # Drop these tags entirely
        "content_selectors": ["article", ".post-content"],  # Extract only matching regions
        "preserve_images": True,
        "compact_output": True
    }
)
scraper.scrape()
markdown = scraper.get_markdown()
```
## License

MIT License