A mini web scraping utility package
pyminiscraper
Introduction
pyminiscraper is a lightweight Python library for web scraping. It provides a simple async interface for extracting data from web pages with minimal setup, and its configuration options cover tasks ranging from downloading a handful of pages to spidering an entire site.
Features
| Feature | Implemented |
|---|---|
| Basic Web Page scraping | ✅ |
| Extremely scalable async scraping | ✅ |
| Web Page spidering | ✅ |
| Parallel requests | ✅ |
| Headless browser support | ✅ |
| Robots parsing | ✅ |
| Sitemap parsing | ✅ |
| RSS parsing | ✅ |
| Atom parsing | ✅ |
| Open Graph parsing | ✅ |
| Rate limiting | ✅ |
| Error handling | ✅ |
| Depth control | ✅ |
| Custom user agent | ✅ |
| File storage | ✅ |
| Custom callbacks | ✅ |
| Domain restrictions | ✅ |
| Request timeout | ✅ |
| Page caching | ✅ |
How it works
┌───────────────────┐
│                   │
│   Initializing    │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Download      │
│    Robots.txt     │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Queue for     │
│   Configurable    │◀────┐
│     Parallel      │     │
│    Processing     │     │
└─────────┬─────────┘     │
          │               │
          ▼               │
┌───────────────────┐     │     ┌───────────────────┐
│      Scrape       │     │     │                   │
│    Web Pages,     │     │     │      Loading      │
│    RSS & Atom     │─────│─────│      Saving       │
│                   │     │     │     Web Pages     │
└─────────┬─────────┘     │     └───────────────────┘
          │               │
          ▼               │
┌───────────────────┐     │
│     Discover      │     │
│     Outgoing      │     │
│     Web Page      │─────┘
│   RSS/Atom feed   │
│       links       │
└───────────────────┘
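The queue-and-discover loop in the diagram can be illustrated with a small standard-library sketch. This is not pyminiscraper's implementation: `FAKE_LINKS`, `fake_fetch`, and `crawl` are hypothetical names, and the "web" is an in-memory dictionary so the example runs without network access.

```python
import asyncio

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}


async def fake_fetch(url: str) -> list[str]:
    """Stand-in for downloading a page and discovering its outgoing links."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_LINKS.get(url, [])


async def crawl(seed: str, max_depth: int, workers: int = 3) -> set[str]:
    """Process a shared queue with a fixed number of parallel workers."""
    queue: asyncio.Queue = asyncio.Queue()
    seen = {seed}
    queue.put_nowait((seed, 0))

    async def worker() -> None:
        while True:
            url, depth = await queue.get()
            try:
                for link in await fake_fetch(url):
                    # Newly discovered links go back into the queue.
                    if depth < max_depth and link not in seen:
                        seen.add(link)
                        queue.put_nowait((link, depth + 1))
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return seen


visited = asyncio.run(crawl("https://example.com/", max_depth=1))
```

The number of workers plays the role of a parallelism limit, and the depth stored alongside each URL is what lets the loop stop following links past `max_depth`.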
Use Cases
Downloading only sitemap referenced web pages
Here is a basic example that scrapes only the pages referenced by a site's sitemap:
scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            ScraperUrl(
                "https://www.anthropic.com/", max_depth=2, type=ScraperUrlType.HTML)
        ],
        # Follow sitemap links only; ignore links on web pages and in feeds.
        follow_sitemap_links=True,
        follow_web_page_links=False,
        follow_feed_links=False,
        # storage_dir is the directory where scraped pages are stored.
        scraper_store_factory=FileStoreFactory(storage_dir),
    ),
)
await scraper.run()
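To make "sitemap referenced" concrete, the sketch below extracts page URLs from a sitemap document using only the standard library. It illustrates the sitemap format and is not how pyminiscraper parses sitemaps internally; `SAMPLE_SITEMAP` and `sitemap_urls` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content used only for this example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = sitemap_urls(SAMPLE_SITEMAP)
```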
Scraping pages referenced in Atom/RSS Feeds
Here is a basic example that scrapes the pages referenced in an RSS or Atom feed:
scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            ScraperUrl(
                "https://feeds.feedburner.com/PythonInsider", type=ScraperUrlType.FEED)
        ],
        # Follow feed links only; ignore sitemaps and links on web pages.
        follow_sitemap_links=False,
        follow_web_page_links=False,
        follow_feed_links=True,
        # storage_dir is the directory where scraped pages are stored.
        scraper_store_factory=FileStoreFactory(storage_dir),
    ),
)
await scraper.run()
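For readers unfamiliar with the two feed formats, here is a minimal sketch of where the page links live in each. It uses the standard library and made-up feed content; `feed_links`, `RSS_SAMPLE`, and `ATOM_SAMPLE` are hypothetical names, and pyminiscraper's own feed parsing may differ.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed documents used only for this example.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <title>Example</title><link>https://example.com/</link>
  <item><title>One</title><link>https://example.com/posts/1</link></item>
  <item><title>Two</title><link>https://example.com/posts/2</link></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>One</title><link href="https://example.com/entries/1"/></entry>
</feed>"""


def feed_links(xml_text: str) -> list[str]:
    """Collect item/entry links from an RSS 2.0 or Atom document."""
    root = ET.fromstring(xml_text)
    links = []
    # RSS 2.0 stores the URL as the text of <item><link>.
    for link in root.findall("./channel/item/link"):
        if link.text:
            links.append(link.text.strip())
    # Atom stores the URL in the href attribute of <entry><link>.
    for link in root.findall(f"./{ATOM_NS}entry/{ATOM_NS}link"):
        href = link.get("href")
        if href:
            links.append(href)
    return links


rss_links = feed_links(RSS_SAMPLE)
atom_links = feed_links(ATOM_SAMPLE)
```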
Full website capture (spidering) using every link source: sitemaps, Atom/RSS feeds, and links on web pages
Here is a basic example that spiders a whole site by following all three kinds of links:
scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            ScraperUrl(
                "https://www.anthropic.com/", type=ScraperUrlType.FEED)
        ],
        # Follow every supported link source.
        follow_sitemap_links=True,
        follow_web_page_links=True,
        follow_feed_links=True,
        # storage_dir is the directory where scraped pages are stored.
        scraper_store_factory=FileStoreFactory(storage_dir),
    ),
)
await scraper.run()
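Following links on web pages amounts to collecting `href` values, resolving them against the page URL, and usually keeping only those on the same site. The sketch below shows that idea with the standard library. `LinkCollector` and `same_domain_links` are hypothetical names, and the same-host filter is just one possible domain restriction, not necessarily the one pyminiscraper applies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def same_domain_links(html: str, base_url: str) -> list[str]:
    """Return links that stay on the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [url for url in collector.links if urlparse(url).netloc == host]


page = (
    '<a href="/about">About</a>'
    '<a href="https://other.example.org/x">Elsewhere</a>'
    '<a href="news.html">News</a>'
)
links = same_domain_links(page, "https://example.com/index.html")
```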
High volume scraping
Here is a basic example that scrapes several sites concurrently:
import asyncio

async def scrape_site(url: str):
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl(
                    url, type=ScraperUrlType.FEED)
            ],
            follow_sitemap_links=True,
            follow_web_page_links=True,
            follow_feed_links=True,
            # storage_dir is the directory where scraped pages are stored.
            scraper_store_factory=FileStoreFactory(storage_dir),
        ),
    )
    await scraper.run()

async def main():
    sites = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ]
    # Run one scraper per site and wait for all of them to finish.
    tasks = [scrape_site(url) for url in sites]
    await asyncio.gather(*tasks)

asyncio.run(main())
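With a long list of sites, starting every scraper at once may use more connections and memory than intended. One common pattern, sketched here under assumptions, caps how many sites run at a time with `asyncio.Semaphore`. `fake_scrape_site` is a hypothetical stand-in for a coroutine that runs a scraper, and `stats` exists only to show the cap is honored.

```python
import asyncio

# Tracks how many stand-in scrapers run at the same time.
stats = {"running": 0, "peak": 0}


async def fake_scrape_site(url: str) -> str:
    """Stand-in for a coroutine that scrapes one site."""
    stats["running"] += 1
    stats["peak"] = max(stats["peak"], stats["running"])
    await asyncio.sleep(0.01)  # simulate work
    stats["running"] -= 1
    return url


async def scrape_all(sites: list[str], max_sites_at_once: int) -> list[str]:
    semaphore = asyncio.Semaphore(max_sites_at_once)

    async def bounded(url: str) -> str:
        # At most max_sites_at_once coroutines pass this point together.
        async with semaphore:
            return await fake_scrape_site(url)

    return await asyncio.gather(*(bounded(url) for url in sites))


sites = [f"https://example{i}.com" for i in range(1, 6)]
results = asyncio.run(scrape_all(sites, max_sites_at_once=2))
```

`asyncio.gather` returns results in the order the coroutines were passed in, so `results` lines up with `sites` regardless of which site finishes first.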
Advanced Configuration Options
Configuration for web scraping behavior.
Parameters:
- max_parallel_requests (int): Maximum number of concurrent scraping requests
- max_requested_urls (int): Maximum total number of URLs to request before stopping
- max_depth (int): Maximum depth for recursively following links (0 means only scrape seed URLs)
- max_back_to_back_errors (int): Number of consecutive errors before terminating scraper
- crawl_delay_seconds (float): Minimum delay between requests to same domain
- request_timeout_seconds (float): Request timeout in seconds
- user_agent (str): User agent string to use in requests
- store_factory: Factory for creating storage backend
- seed_urls (List[ScraperUrl]): Initial URLs to start scraping from
- use_headless_browser (bool): Whether to use headless browser for JavaScript rendering
- follow_web_page_links (bool): Whether to follow links found in web pages
- follow_sitemap_links (bool): Whether to follow links found in sitemaps
- follow_feed_links (bool): Whether to follow links found in RSS/Atom feeds
- domain_config (DomainConfig): Configuration for allowed/blocked domains
- log (Callable): Logging function to use
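As an illustration of what a per-domain delay such as `crawl_delay_seconds` implies, the sketch below computes how long a request should wait so that requests to the same domain are spaced apart. `DomainRateLimiter` and `wait_time` are hypothetical names, not part of pyminiscraper, and the injectable `clock` is there only to keep the example deterministic.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Space out requests to the same domain by a fixed delay."""

    def __init__(self, crawl_delay_seconds: float, clock=time.monotonic) -> None:
        self.delay = crawl_delay_seconds
        self.clock = clock
        self.next_allowed: dict[str, float] = {}

    def wait_time(self, url: str) -> float:
        """Return seconds to wait before requesting url, reserving its slot."""
        domain = urlparse(url).netloc
        now = self.clock()
        ready_at = max(now, self.next_allowed.get(domain, now))
        # The next request to this domain must wait one more delay period.
        self.next_allowed[domain] = ready_at + self.delay
        return ready_at - now


# A frozen clock makes the spacing easy to see.
limiter = DomainRateLimiter(2.0, clock=lambda: 100.0)
waits = [
    limiter.wait_time("https://a.example.com/1"),
    limiter.wait_time("https://a.example.com/2"),
    limiter.wait_time("https://b.example.com/"),
    limiter.wait_time("https://a.example.com/3"),
]
```

In an async scraper the returned value would typically be passed to `asyncio.sleep` before issuing the request.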
The scraper will:
- Start with seed URLs and scrape them according to configuration
- Follow links up to max_depth if follow_web_page_links is True
- Follow sitemap.xml links if follow_sitemap_links is True
- Follow RSS/Atom feed links if follow_feed_links is True
- Respect robots.txt and crawl delay settings
- Store results using provided store_factory
- Stop when max_requested_urls is reached or when max_back_to_back_errors consecutive errors occur
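Respecting robots.txt means checking each URL against the site's rules before requesting it. The standard library's `urllib.robotparser` shows the idea; this is an illustration only and not necessarily how pyminiscraper parses robots.txt. The `ROBOTS_TXT` content and the `my-scraper` user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A URL outside the disallowed path may be fetched.
allowed = parser.can_fetch("my-scraper", "https://example.com/index.html")
# A URL under /private/ is disallowed for every user agent.
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
# The Crawl-delay directive, if present, is reported in seconds.
delay = parser.crawl_delay("my-scraper")
```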