A mini web scraping utility package
pyminiscraper
Introduction
pyminiscraper is a lightweight Python library for web scraping. It provides a simple interface for extracting data from web pages with minimal setup: you describe the start URLs, limits, and storage in a ScraperConfig, pass it to a Scraper, and run it. The defaults cover simple crawls, and the configuration options listed below let you tune concurrency, rate limits, depth, and storage for larger jobs.
Simplest Use Case
Here is a basic example of how to use pyminiscraper to scrape data from a web page:
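import asyncio
from pathlib import Path

import click

from pyminiscraper.scraper import Scraper
from pyminiscraper.config import ScraperConfig
# ScraperUrl and FileStoreFactory must be imported as well. Their module
# paths are not shown in this README, so check the package source for them.


async def main() -> None:
    # storage_dir was not defined in the original snippet;
    # "./storage" is a placeholder path chosen for illustration.
    storage_dir = Path("./storage")
    storage_dir.mkdir(parents=True, exist_ok=True)
    click.echo(f"Storage directory set to: {storage_dir}")

    scraper = Scraper(
        ScraperConfig(
            scraper_urls=[
                ScraperUrl("https://www.anthropic.com/news", max_depth=2),
            ],
            max_parallel_requests=16,
            use_headless_browser=False,
            timeout_seconds=30,
            max_requests_per_hour=6 * 60,  # 360 per hour, one every 10 s on average
            only_sitemaps=False,
            scraper_store_factory=FileStoreFactory(storage_dir.absolute().as_posix()),
        ),
    )
    # run() is a coroutine, so it has to be awaited inside an async function.
    await scraper.run()


asyncio.run(main())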
Advanced Configuration Options
pyminiscraper also provides advanced configuration options to handle more complex scraping scenarios. Below are some of the options you can configure:
Scraper URLs
- Parameter: scraper_urls: list[ScraperUrl]
- Description: A list of ScraperUrl objects that define the URLs to be scraped and their respective configurations.
- Example: scraper_urls=[ScraperUrl("https://www.example.com", max_depth=2)]
Max Parallel Requests
- Parameter: max_parallel_requests: int = 16
- Description: The maximum number of parallel requests that the scraper can make.
- Example: max_parallel_requests=16
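To make the idea of a parallel-request cap concrete, here is a minimal sketch using asyncio.Semaphore. It is not pyminiscraper's implementation; fetch, crawl, and the stats dictionary are hypothetical names, and the network call is simulated with a short sleep.

```python
import asyncio

MAX_PARALLEL_REQUESTS = 16  # mirrors the documented default


async def fetch(url: str, limiter: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for a real HTTP request."""
    async with limiter:  # at most MAX_PARALLEL_REQUESTS bodies run at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["active"] -= 1
        return f"fetched {url}"


async def crawl(urls: list[str]) -> tuple[list[str], int]:
    limiter = asyncio.Semaphore(MAX_PARALLEL_REQUESTS)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(u, limiter, stats) for u in urls))
    return list(results), stats["peak"]


results, peak = asyncio.run(
    crawl([f"https://www.example.com/{i}" for i in range(50)])
)
```

All 50 simulated requests complete, but the recorded peak of simultaneously active requests never exceeds the cap.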
Use Headless Browser
- Parameter: use_headless_browser: bool = False
- Description: Whether to use a headless browser for scraping.
- Example: use_headless_browser=False
Timeout Seconds
- Parameter: timeout_seconds: int = 30
- Description: The timeout duration in seconds for each request.
- Example: timeout_seconds=30
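A per-request timeout can be sketched with asyncio.wait_for. This is only an illustration of the concept, not how pyminiscraper applies timeout_seconds internally; slow_request and fetch_with_timeout are hypothetical helpers, and the timeout is shortened so the demo finishes quickly.

```python
import asyncio

TIMEOUT_SECONDS = 0.05  # shortened for the demo; the documented default is 30


async def slow_request(delay: float) -> str:
    """Hypothetical request that takes `delay` seconds to answer."""
    await asyncio.sleep(delay)
    return "ok"


async def fetch_with_timeout(delay: float, timeout: float) -> str:
    try:
        # wait_for cancels the inner coroutine once the timeout elapses.
        return await asyncio.wait_for(slow_request(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return "timed out"


fast = asyncio.run(fetch_with_timeout(0.0, TIMEOUT_SECONDS))
slow = asyncio.run(fetch_with_timeout(1.0, TIMEOUT_SECONDS))
```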
Only Sitemaps
- Parameter: only_sitemaps: bool = True
- Description: Whether to scrape only sitemap URLs.
- Example: only_sitemaps=True
Max Requested URLs
- Parameter: max_requested_urls: int = 64 * 1024
- Description: The maximum number of URLs that can be requested.
- Example: max_requested_urls=64 * 1024
Max Back-to-Back Errors
- Parameter: max_back_to_back_errors: int = 128
- Description: The maximum number of consecutive errors allowed before stopping the scraper.
- Example: max_back_to_back_errors=128
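The "consecutive errors" rule can be illustrated with a small counter that resets on every success. This sketch assumes the scraper stops as soon as the streak reaches the limit; whether pyminiscraper uses that exact comparison is not stated here, and run_until_error_limit is a hypothetical helper.

```python
MAX_BACK_TO_BACK_ERRORS = 3  # small for the demo; the documented default is 128


def run_until_error_limit(outcomes: list[bool], limit: int) -> int:
    """Return how many requests were processed before stopping.

    `outcomes` holds True for a successful request and False for a failed one.
    """
    consecutive_errors = 0
    processed = 0
    for ok in outcomes:
        processed += 1
        if ok:
            consecutive_errors = 0  # any success resets the streak
        else:
            consecutive_errors += 1
            if consecutive_errors >= limit:  # assumed stop condition
                break
    return processed


mixed = run_until_error_limit(
    [False, False, True, False, False, False, True], MAX_BACK_TO_BACK_ERRORS
)
healthy = run_until_error_limit([True] * 5, MAX_BACK_TO_BACK_ERRORS)
```

In the mixed run, the two early failures do not stop the crawl because a success follows them; only the later streak of three does.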
Scraper Store Factory
- Parameter: scraper_store_factory: ScraperStoreFactory
- Description: The factory used to create the storage for scraped data.
- Example: scraper_store_factory=FileStoreFactory("/path/to/storage")
Allow L2 Domains
- Parameter: allow_l2_domains: bool = True
- Description: Whether to allow scraping of second-level domains.
- Example: allow_l2_domains=True
Scraper Callback
- Parameter: scraper_callback: ScraperCallback | None = None
- Description: A callback function that is called after each scraping operation.
- Example: scraper_callback=my_callback_function
Max Depth
- Parameter: max_depth: int = 16
- Description: The maximum depth to follow links from the initial URL.
- Example: max_depth=16
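A depth limit is easiest to see on a small link graph. The sketch below is a generic breadth-first crawl, not pyminiscraper's crawler; LINKS is a made-up site map, and it assumes the initial URL counts as depth 0.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/news", "/about"],
    "/news": ["/news/1", "/news/2"],
    "/news/1": ["/news/1/comments"],
    "/news/2": [],
    "/about": [],
    "/news/1/comments": [],
}


def crawl(start: str, max_depth: int) -> list[str]:
    """Breadth-first crawl that does not follow links past max_depth."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])  # assumption: the start page is depth 0
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # pages at the limit are visited but not expanded
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


depth_one = crawl("/", max_depth=1)
depth_two = crawl("/", max_depth=2)
```

With max_depth=2 the crawl reaches the two news articles but not the comments page, which sits three links away from the start.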
Max Requests Per Hour
- Parameter: max_requests_per_hour: float = 60*60*10
- Description: The maximum number of requests allowed per hour.
- Example: max_requests_per_hour=60*60*10
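It can help to translate an hourly cap into the average gap between requests. The helper below is a hypothetical illustration of that arithmetic only; it says nothing about how pyminiscraper schedules requests internally.

```python
def seconds_between_requests(max_requests_per_hour: float) -> float:
    """Average spacing that keeps a steady stream under the hourly cap."""
    if max_requests_per_hour <= 0:
        raise ValueError("max_requests_per_hour must be positive")
    return 3600.0 / max_requests_per_hour


default_gap = seconds_between_requests(60 * 60 * 10)  # 36,000 requests per hour
polite_gap = seconds_between_requests(6 * 60)         # 360 requests per hour
```

The default of 60*60*10 works out to about ten requests per second, while the 6*60 value used in the basic example averages one request every ten seconds.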
Rerequest After Hours
- Parameter: rerequest_after_hours: int = 24
- Description: The number of hours to wait before re-requesting a URL.
- Example: rerequest_after_hours=24
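The re-request window amounts to a simple age check on the last fetch time. This is a sketch under the assumption that a URL becomes eligible once at least rerequest_after_hours have passed; should_rerequest is a hypothetical helper, not part of the library.

```python
from datetime import datetime, timedelta, timezone


def should_rerequest(
    last_requested_at: datetime,
    now: datetime,
    rerequest_after_hours: int = 24,
) -> bool:
    """True once the stored copy is at least `rerequest_after_hours` old."""
    return now - last_requested_at >= timedelta(hours=rerequest_after_hours)


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
fetched_recently = datetime(2025, 1, 2, 6, 0, tzinfo=timezone.utc)   # 6 hours ago
fetched_long_ago = datetime(2025, 1, 1, 11, 0, tzinfo=timezone.utc)  # 25 hours ago

recent_is_due = should_rerequest(fetched_recently, now)
old_is_due = should_rerequest(fetched_long_ago, now)
```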
No Page Store
- Parameter: no_page_store: bool = False
- Description: Whether to disable storing the scraped pages.
- Example: no_page_store=False
User Agent
- Parameter: user_agent: str = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
- Description: The user agent string to use for requests.
- Example: user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
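For readers unfamiliar with user agent strings, the sketch below shows where such a value ends up in an HTTP request, using the standard library's urllib.request rather than pyminiscraper itself. No request is sent; the code only builds the request object and reads the header back.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
)

# Build (but do not send) a request that carries the User-Agent header.
request = urllib.request.Request(
    "https://www.example.com",
    headers={"User-Agent": USER_AGENT},
)

# urllib stores header names capitalized as "User-agent".
sent_user_agent = request.get_header("User-agent")
```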
File details
Details for the file pyminiscraper-1.0.0.tar.gz.
File metadata
- Download URL: pyminiscraper-1.0.0.tar.gz
- Upload date:
- Size: 27.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e3d4c7f7a1d38cb1cc0b8653af17f357eb536be099e2721638c8c010e690885f |
| MD5 | 3766b301ce382b3993337a56e4c77c64 |
| BLAKE2b-256 | a6e1e0c6e1935ff90d3233ea53355838dfb09088295ad69a7ada7d5baab16325 |
Provenance
The following attestation bundles were made for pyminiscraper-1.0.0.tar.gz:
Publisher: python-publish.yml on timurua/pyminiscraper
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyminiscraper-1.0.0.tar.gz
- Subject digest: e3d4c7f7a1d38cb1cc0b8653af17f357eb536be099e2721638c8c010e690885f
- Sigstore transparency entry: 164609004
- Sigstore integration time:
- Permalink: timurua/pyminiscraper@026c611d5eb3de22a0e2741e2c54e8181809744f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/timurua
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@026c611d5eb3de22a0e2741e2c54e8181809744f
- Trigger Event: push
File details
Details for the file pyminiscraper-1.0.0-py3-none-any.whl.
File metadata
- Download URL: pyminiscraper-1.0.0-py3-none-any.whl
- Upload date:
- Size: 31.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 71d8fb7b6d6af62706bf8150236108bd1c90956b4416a57168136473324063e1 |
| MD5 | 964332382207d52ef0e1fdea284e0fa4 |
| BLAKE2b-256 | 52f69a25897e90c74c21057c9d6955e7157e009dd3579b2836f303287d126a0d |
Provenance
The following attestation bundles were made for pyminiscraper-1.0.0-py3-none-any.whl:
Publisher: python-publish.yml on timurua/pyminiscraper
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyminiscraper-1.0.0-py3-none-any.whl
- Subject digest: 71d8fb7b6d6af62706bf8150236108bd1c90956b4416a57168136473324063e1
- Sigstore transparency entry: 164609006
- Sigstore integration time:
- Permalink: timurua/pyminiscraper@026c611d5eb3de22a0e2741e2c54e8181809744f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/timurua
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@026c611d5eb3de22a0e2741e2c54e8181809744f
- Trigger Event: push