A mini web scraping utility package
pyminiscraper
Introduction
pyminiscraper is a lightweight, asynchronous Python library for web scraping. It provides a simple interface for downloading web pages, following page links, sitemaps, and RSS/Atom feeds, and processing the results through callbacks, all with minimal setup.
Features
| Feature | Description |
|---|---|
| Basic Web Page scraping | Scrape HTML content from web pages |
| Async scraping | Asynchronous scraping driven by async/await |
| Web Page spidering | Follow and scrape links from web pages |
| Parallel requests | Configure number of concurrent requests |
| Headless browser support | JavaScript rendering support |
| Robots.txt parsing | Respect robots.txt rules |
| Sitemap parsing | Parse and follow sitemap.xml |
| RSS/Atom parsing | Parse and follow RSS/Atom feeds |
| Open Graph parsing | Extract Open Graph metadata |
| Rate limiting | Configurable per-domain rate limiting |
| Error handling | Robust error handling with retry logic |
| Depth control | Control recursion depth for link following |
| Custom user agent | Set custom User-Agent strings |
| File storage | Built-in file storage system |
| Custom callbacks | Define custom processing logic |
| Domain restrictions | Control allowed/blocked domains |
| Request timeout | Configurable request timeouts |
| Page caching | Cache and reuse downloaded pages |
Installation
```bash
pip install pyminiscraper
```
How does it work
```
┌───────────────────┐
│                   │
│   Initializing    │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Download      │
│    Robots.txt     │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Queue for     │
│   Configurable    │◀────┐
│     Parallel      │     │
│    Processing     │     │
└─────────┬─────────┘     │
          │               │
          ▼               │
┌───────────────────┐     │     ┌───────────────────┐
│      Scrape       │     │     │                   │
│    Web Pages,     │     │     │      Loading      │
│    RSS & Atom     │─────│─────│      Saving       │
│                   │     │     │     Web Pages     │
└─────────┬─────────┘     │     └───────────────────┘
          │               │
          ▼               │
┌───────────────────┐     │
│     Discover      │     │
│     Outgoing      │     │
│     Web Page      │─────┘
│   RSS/Atom feed   │
│       links       │
└───────────────────┘
```
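The queue, parallel processing, and link discovery loop in the diagram can be illustrated with a small standard-library sketch. This is not pyminiscraper's implementation: `FAKE_SITE`, `fetch_links`, and `crawl` are hypothetical names, and the "downloads" are lookups in an in-memory dictionary rather than HTTP requests.

```python
import asyncio

# Hypothetical in-memory "site": URL -> outgoing links. Stands in for real HTTP.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/"],
}


async def fetch_links(url: str) -> list[str]:
    """Pretend to download a page and return its outgoing links."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_SITE.get(url, [])


async def crawl(seed: str, max_parallel: int = 2, max_depth: int = 2) -> set[str]:
    """Visit pages reachable from seed, using a fixed number of workers."""
    queue = asyncio.Queue()
    seen = {seed}
    queue.put_nowait((seed, 0))

    async def worker() -> None:
        while True:
            url, depth = await queue.get()
            try:
                for link in await fetch_links(url):
                    # Queue each newly discovered link once, within the depth limit.
                    if link not in seen and depth + 1 <= max_depth:
                        seen.add(link)
                        queue.put_nowait((link, depth + 1))
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(max_parallel)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


if __name__ == "__main__":
    print(sorted(asyncio.run(crawl("/"))))
```

The number of worker tasks plays the role of a parallel-request limit, and the depth stored with each queued URL plays the role of a depth limit.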
Basic Usage
Downloading Sitemap-Referenced Pages
This example shows how to scrape only pages referenced in a sitemap:
```python
import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

storage_dir = "./scraped"  # example output directory; adjust as needed


async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl("https://www.example.com/", max_depth=2)
            ],
            # Follow only URLs discovered through the sitemap
            follow_sitemap_links=True,
            follow_web_page_links=False,
            follow_feed_links=False,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()


asyncio.run(main())
```
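For readers unfamiliar with sitemaps, the following standard-library sketch shows what "pages referenced in a sitemap" means in practice: a sitemap is an XML file whose `<loc>` elements list page URLs. The function name `sitemap_urls` and the sample document are illustrative and are not part of pyminiscraper.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content used only for this example.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE):
        print(url)
```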
Scraping RSS/Atom Feeds
Example of scraping content from RSS/Atom feeds:
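```python
import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, ScraperUrlType, FileStore

storage_dir = "./scraped"  # example output directory; adjust as needed


async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                # Mark the seed as a feed rather than a regular web page
                ScraperUrl(
                    "https://feeds.feedburner.com/PythonInsider",
                    type=ScraperUrlType.FEED
                )
            ],
            follow_sitemap_links=False,
            follow_web_page_links=False,
            follow_feed_links=True,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()


asyncio.run(main())
```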
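As background on what following feed links involves, here is a minimal standard-library sketch that pulls item titles and links out of an RSS 2.0 document. It handles only plain RSS `<item>` elements (not Atom), and `rss_item_links` and `SAMPLE_FEED` are hypothetical names unrelated to pyminiscraper's own feed parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for this example.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
    <item><title>No link here</title></item>
  </channel>
</rss>"""


def rss_item_links(xml_text: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for every RSS item that has a link."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if link:  # items without a link cannot be followed
            pairs.append((title, link))
    return pairs


if __name__ == "__main__":
    for title, link in rss_item_links(SAMPLE_FEED):
        print(title, link)
```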
Full Website Crawling
Example of comprehensive website crawling using all available sources:
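```python
import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

storage_dir = "./scraped"  # example output directory; adjust as needed

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl("https://www.example.com/")
            ],
            # Discover URLs from sitemaps, page links, and feeds
            follow_sitemap_links=True,
            follow_web_page_links=True,
            follow_feed_links=True,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())
```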
Custom Processing with Callbacks
Example of custom processing using callbacks:
```python
import asyncio

from pyminiscraper import Scraper, ScraperConfig
from pyminiscraper.config import ScraperCallback, ScraperContext
from pyminiscraper.model import ScraperWebPage, ScraperUrl


class CustomCallback(ScraperCallback):
    async def on_web_page(
        self, context: ScraperContext, request: ScraperUrl, response: ScraperWebPage
    ) -> None:
        # Called for each downloaded web page
        print(f"Processing {response.url}")
        print(f"Title: {response.metadata_title}")

    # "Feed" is quoted because its import path is not shown in this README
    async def on_feed(self, context: ScraperContext, feed: "Feed") -> None:
        # Called for each parsed RSS/Atom feed
        for item in feed.items:
            print(f"Feed item: {item.title}")


async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[ScraperUrl("https://example.com")],
            callback=CustomCallback(),
        )
    )
    await scraper.run()


asyncio.run(main())
```
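To give a sense of the processing logic a callback might apply to page HTML, here is a library-independent sketch that collects Open Graph `<meta>` properties using only the standard library. `OpenGraphParser` and `extract_open_graph` are hypothetical helpers; how a callback would obtain the raw HTML from `ScraperWebPage` is not covered by this README and is left out.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.properties: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            self.properties[prop] = content


def extract_open_graph(html: str) -> dict[str, str]:
    """Return a mapping of Open Graph property names to their content."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<meta property="og:title" content="Hello">'
        '<meta property="og:type" content="article" />'
        '</head></html>'
    )
    print(extract_open_graph(page))
```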
Configuration Options
ScraperConfig controls scraping behavior and accepts the following parameters:
- seed_urls (list[ScraperUrl]): Initial URLs to start scraping from
- callback (ScraperCallback): Callback for processing scraped content
- include_path_patterns (list[str]): URL paths to include (default: [])
- exclude_path_patterns (list[str]): URL paths to exclude (default: [])
- max_parallel_requests (int): Maximum concurrent requests (default: 16)
- use_headless_browser (bool): Use headless browser for JavaScript (default: False)
- request_timeout_seconds (int): Request timeout in seconds (default: 30)
- follow_web_page_links (bool): Follow links in web pages (default: False)
- follow_sitemap_links (bool): Follow sitemap.xml links (default: True)
- follow_feed_links (bool): Follow RSS/Atom feed links (default: True)
- prevent_default_queuing (bool): Disable automatic URL queuing (default: False)
- max_requested_urls (int): Maximum total URLs to request (default: 65536)
- max_back_to_back_errors (int): Consecutive errors before stopping (default: 128)
- on_response_callback (ScraperResponseCallback): Optional response callback
- max_depth (int): Maximum recursion depth for links (default: 16)
- crawl_delay_seconds (int): Delay between requests per domain (default: 1)
- domain_config (ScraperDomainConfig): Allowed/blocked domains configuration
- user_agent (str): User agent string (default: 'pyminiscraper')
- referer (str): Referer header (default: "https://www.google.com")
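The README does not state whether `include_path_patterns` and `exclude_path_patterns` are prefixes, globs, or regular expressions. The sketch below assumes regular expressions matched against the URL path, with exclusions taking priority; it is an illustration of the idea rather than the library's actual matching logic, and `path_allowed` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse


def path_allowed(url: str, include_patterns: list[str], exclude_patterns: list[str]) -> bool:
    """Decide whether a URL's path passes include/exclude filters.

    Assumption: patterns are regular expressions searched within the path.
    """
    path = urlparse(url).path or "/"
    # Any matching exclude pattern rejects the URL outright.
    if any(re.search(pattern, path) for pattern in exclude_patterns):
        return False
    # An empty include list means "include everything".
    if not include_patterns:
        return True
    return any(re.search(pattern, path) for pattern in include_patterns)


if __name__ == "__main__":
    include = [r"^/blog/"]
    exclude = [r"/draft"]
    for candidate in (
        "https://example.com/blog/post-1",
        "https://example.com/blog/draft-2",
        "https://example.com/about",
    ):
        print(candidate, path_allowed(candidate, include, exclude))
```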
Domain Configuration
Control which domains are allowed or blocked:
```python
# Assumption: ScraperAllowedDomains is importable from pyminiscraper.config
from pyminiscraper.config import (
    ScraperAllowedDomains,
    ScraperDomainConfig,
    ScraperDomainConfigMode,
)

# Allow only specific domains
config = ScraperDomainConfig(
    allowance=ScraperAllowedDomains(domains=["example.com", "api.example.com"]),
    forbidden_domains=["ads.example.com"],
)

# Allow all domains
config = ScraperDomainConfig(
    allowance=ScraperDomainConfigMode.ALLOW_ALL
)

# Allow only domains derived from seed URLs
config = ScraperDomainConfig(
    allowance=ScraperDomainConfigMode.DERIVE_FROM_SEED_URLS
)
```
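Conceptually, an allow/forbid configuration reduces to a check on each URL's host before it is queued. The following standard-library sketch assumes exact host matching (no subdomain wildcards) and lets the forbidden list override the allowed list; pyminiscraper's real rules may differ, and `domain_allowed` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlparse


def domain_allowed(url: str, allowed: Optional[set[str]], forbidden: set[str]) -> bool:
    """Return True if the URL's host may be scraped.

    allowed=None means "allow every domain that is not forbidden".
    Assumption: hosts are compared exactly, case-insensitively.
    """
    host = (urlparse(url).hostname or "").lower()
    if host in forbidden:
        return False
    return allowed is None or host in allowed


if __name__ == "__main__":
    allowed = {"example.com", "api.example.com"}
    forbidden = {"ads.example.com"}
    print(domain_allowed("https://example.com/page", allowed, forbidden))
    print(domain_allowed("https://ads.example.com/banner", allowed, forbidden))
    print(domain_allowed("https://other.org/", None, forbidden))
```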
Error Handling
The scraper includes built-in error handling:
- Respects `max_back_to_back_errors` to stop after consecutive failures
- Retries failed requests with exponential backoff
- Logs errors for debugging
- Continues operation after non-fatal errors
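Retrying with exponential backoff means doubling the wait after each failed attempt. The sketch below shows the general technique with the standard library only; the function name, default attempt count, and base delay are assumptions for illustration and do not reflect pyminiscraper's internal values.

```python
import asyncio


async def fetch_with_retry(fetch, url, max_attempts=4, base_delay=0.5, sleep=asyncio.sleep):
    """Call an async fetch function, retrying failures with exponential backoff.

    The delay before retry n (0-based) is base_delay * 2**n. The `sleep`
    parameter exists so that waiting can be replaced in tests.
    """
    for attempt in range(max_attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            await sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    attempts = 0

    async def flaky(url):
        # Hypothetical fetcher that fails twice before succeeding.
        global attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("temporary failure")
        return f"content of {url}"

    print(asyncio.run(fetch_with_retry(flaky, "https://example.com", base_delay=0.01)))
```

A crawler would typically pair this with a counter of consecutive failures and stop once that counter reaches a configured limit.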
Performance Tips
- Adjust `max_parallel_requests` based on your needs and server capacity
- Use `crawl_delay_seconds` to control request rate
- Enable `use_headless_browser` only when JavaScript rendering is required
- Implement caching in your callback to avoid re-downloading pages
- Use path patterns to filter URLs before downloading
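To make the effect of a per-domain crawl delay concrete, here is a small sketch of a scheduler that spaces out requests to the same host while leaving other hosts unaffected. `DomainRateLimiter` is a hypothetical class written for this illustration and is not how pyminiscraper necessarily implements `crawl_delay_seconds`.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Track, per host, the earliest time the next request may start."""

    def __init__(self, delay_seconds: float, clock=time.monotonic) -> None:
        self.delay = delay_seconds
        self.clock = clock  # injectable so the example can be tested deterministically
        self.next_allowed: dict[str, float] = {}

    def wait_time(self, url: str) -> float:
        """Reserve the next slot for this URL's host and return seconds to wait."""
        host = urlparse(url).hostname or ""
        now = self.clock()
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.delay
        return start - now


if __name__ == "__main__":
    limiter = DomainRateLimiter(delay_seconds=1.0)
    for url in ("https://a.example/1", "https://a.example/2", "https://b.example/1"):
        print(url, round(limiter.wait_time(url), 2))
```

A caller would sleep for the returned duration before issuing the request, so a larger delay lowers the load on each server at the cost of total crawl time.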
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.