A mini web scraping utility package

pyminiscraper

Introduction

pyminiscraper is a lightweight Python library for web scraping. It provides a small asynchronous interface for downloading web pages, sitemaps, and RSS/Atom feeds with minimal setup, and passes the results to a callback for storage or custom processing.

Features

Feature                     Description
Basic Web Page scraping     Scrape HTML content from web pages
Async scraping              Extremely scalable asynchronous scraping
Web Page spidering          Follow and scrape links from web pages
Parallel requests           Configure number of concurrent requests
Headless browser support    JavaScript rendering support
Robots.txt parsing          Respect robots.txt rules
Sitemap parsing             Parse and follow sitemap.xml
RSS/Atom parsing            Parse and follow RSS/Atom feeds
Open Graph parsing          Extract Open Graph metadata
Rate limiting               Configurable per-domain rate limiting
Error handling              Robust error handling with retry logic
Depth control               Control recursion depth for link following
Custom user agent           Set custom User-Agent strings
File storage                Built-in file storage system
Custom callbacks            Define custom processing logic
Domain restrictions         Control allowed/blocked domains
Request timeout             Configurable request timeouts
Page caching                Cache and reuse downloaded pages
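
To make the "Open Graph parsing" feature concrete, here is a minimal standalone sketch of how Open Graph metadata can be pulled out of a page using only the Python standard library. It is not pyminiscraper's implementation; the names OpenGraphParser and extract_open_graph are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dict (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.properties: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def extract_open_graph(html: str) -> dict[str, str]:
    """Return all og:* properties found in the given HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


sample = """
<html><head>
<meta property="og:title" content="Example page">
<meta property="og:type" content="article">
<meta name="description" content="ignored: not Open Graph">
</head><body></body></html>
"""
print(extract_open_graph(sample))
```

In the library itself, the parsed title is exposed on the page object (the callback example uses response.metadata_title), so a sketch like this is only needed when working outside pyminiscraper.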

Installation

pip install pyminiscraper

How it works

┌───────────────────┐
│                   │
│   Initializing    │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Download      │
│    Robots.txt     │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Queue for     │
│   Configurable    │◀────┐
│     Parallel      │     │
│    Processing     │     │
└─────────┬─────────┘     │
          │               │
          ▼               │
┌───────────────────┐     │    ┌───────────────────┐
│      Scrape       │     │    │                   │
│    Web Pages,     │     │    │      Loading      │
│    RSS & Atom     │──── │ ───│      Saving       │
│                   │     │    │     Web Pages     │
└─────────┬─────────┘     │    └───────────────────┘
          │               │
          ▼               │
┌───────────────────┐     │
│     Discover      │     │
│     Outgoing      │     │
│     Web Page      │─────┘
│   RSS/Atom feed   │
│       links       │
└───────────────────┘
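
The queue, scrape, and discover loop in the diagram can be sketched with asyncio alone. The following is a simplified illustration, not the library's code: FAKE_SITE, fake_fetch, and crawl are hypothetical names, and an in-memory dictionary stands in for real HTTP requests. It shows a fixed number of workers pulling URLs from a queue, recording each page, and queuing newly discovered links until a depth limit is reached.

```python
import asyncio

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/d"],
    "/d": [],
}


async def fake_fetch(url: str) -> list[str]:
    """Stand-in for downloading a page and extracting its outgoing links."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_SITE.get(url, [])


async def crawl(seed: str, max_depth: int, max_parallel: int) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    seen = {seed}
    visited: list[str] = []
    queue.put_nowait((seed, 0))

    async def worker() -> None:
        while True:
            url, depth = await queue.get()
            try:
                links = await fake_fetch(url)
                visited.append(url)
                if depth < max_depth:
                    for link in links:
                        if link not in seen:
                            # Discovered links go back into the queue.
                            seen.add(link)
                            queue.put_nowait((link, depth + 1))
            finally:
                queue.task_done()

    # max_parallel workers bound the number of concurrent requests.
    workers = [asyncio.create_task(worker()) for _ in range(max_parallel)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(visited)


print(asyncio.run(crawl("/", max_depth=2, max_parallel=3)))
```

With max_depth=2 the page "/d" is never requested, because it is only reachable from a page that already sits at the depth limit.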

Basic Usage

Downloading Sitemap-Referenced Pages

This example shows how to scrape only pages referenced in a sitemap:

import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

# Placeholder: directory where FileStore saves downloaded pages
storage_dir = "./scraped"

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl("https://www.example.com/", max_depth=2)
            ],
            # Follow only the links listed in the sitemap
            follow_sitemap_links=True,
            follow_web_page_links=False,
            follow_feed_links=False,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())

Scraping RSS/Atom Feeds

Example of scraping content from RSS/Atom feeds:

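import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, ScraperUrlType, FileStore

# Placeholder: directory where FileStore saves downloaded pages
storage_dir = "./scraped"

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl(
                    "https://feeds.feedburner.com/PythonInsider",
                    type=ScraperUrlType.FEED
                )
            ],
            # Follow only the links found in the feed
            follow_sitemap_links=False,
            follow_web_page_links=False,
            follow_feed_links=True,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())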

Full Website Crawling

Example of comprehensive website crawling using all available sources:

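import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

# Placeholder: directory where FileStore saves downloaded pages
storage_dir = "./scraped"

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl("https://www.example.com/")
            ],
            # Follow sitemap, in-page, and feed links
            follow_sitemap_links=True,
            follow_web_page_links=True,
            follow_feed_links=True,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())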

Custom Processing with Callbacks

Example of custom processing using callbacks:

import asyncio

from pyminiscraper import Scraper, ScraperConfig
from pyminiscraper.config import ScraperCallback, ScraperContext
from pyminiscraper.model import ScraperWebPage, ScraperUrl

class CustomCallback(ScraperCallback):
    async def on_web_page(self, context: ScraperContext, request: ScraperUrl, response: ScraperWebPage) -> None:
        # Called for each downloaded web page
        print(f"Processing {response.url}")
        print(f"Title: {response.metadata_title}")

    # "Feed" is quoted because this example does not show where the Feed
    # type is imported from; import it and drop the quotes once you know.
    async def on_feed(self, context: ScraperContext, feed: "Feed") -> None:
        # Called for each parsed RSS/Atom feed
        for item in feed.items:
            print(f"Feed item: {item.title}")

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[ScraperUrl("https://example.com")],
            callback=CustomCallback(),
        )
    )
    await scraper.run()

asyncio.run(main())

Configuration Options

The ScraperConfig object controls scraping behavior.

Parameters:

  • seed_urls (list[ScraperUrl]): Initial URLs to start scraping from
  • callback (ScraperCallback): Callback for processing scraped content
  • include_path_patterns (list[str]): URL paths to include (default: [])
  • exclude_path_patterns (list[str]): URL paths to exclude (default: [])
  • max_parallel_requests (int): Maximum concurrent requests (default: 16)
  • use_headless_browser (bool): Use headless browser for JavaScript (default: False)
  • request_timeout_seconds (int): Request timeout in seconds (default: 30)
  • follow_web_page_links (bool): Follow links in web pages (default: False)
  • follow_sitemap_links (bool): Follow sitemap.xml links (default: True)
  • follow_feed_links (bool): Follow RSS/Atom feed links (default: True)
  • prevent_default_queuing (bool): Disable automatic URL queuing (default: False)
  • max_requested_urls (int): Maximum total URLs to request (default: 65536)
  • max_back_to_back_errors (int): Consecutive errors before stopping (default: 128)
  • on_response_callback (ScraperResponseCallback): Optional response callback
  • max_depth (int): Maximum recursion depth for links (default: 16)
  • crawl_delay_seconds (int): Delay between requests per domain (default: 1)
  • domain_config (ScraperDomainConfig): Allowed/blocked domains configuration
  • user_agent (str): User agent string (default: 'pyminiscraper')
  • referer (str): Referer header (default: "https://www.google.com")
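
To illustrate what a per-domain delay such as crawl_delay_seconds implies, here is a small standalone sketch of per-domain request spacing. It is an assumption-laden illustration and not pyminiscraper's internal logic: DomainRateLimiter and reserve are hypothetical names, and the current time is passed in explicitly so the behavior is easy to follow.

```python
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Track when each domain may next be requested (illustrative only)."""

    def __init__(self, crawl_delay_seconds: float) -> None:
        self.crawl_delay_seconds = crawl_delay_seconds
        self._next_allowed: dict[str, float] = {}

    def reserve(self, url: str, now: float) -> float:
        """Return how long to wait before requesting url, and book the slot."""
        domain = urlsplit(url).netloc.lower()
        # The request starts now, or later if the domain is still cooling down.
        start = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = start + self.crawl_delay_seconds
        return start - now


limiter = DomainRateLimiter(crawl_delay_seconds=1.0)
print(limiter.reserve("https://example.com/a", now=0.0))   # first request to this domain
print(limiter.reserve("https://example.com/b", now=0.25))  # same domain, must wait
print(limiter.reserve("https://example.org/", now=0.25))   # different domain, independent
```

Because the delay is tracked per domain, requests to different domains do not slow each other down, which is why a per-domain delay can coexist with a high max_parallel_requests value.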

Domain Configuration

Control which domains are allowed or blocked:

# ScraperAllowedDomains is assumed to be importable from pyminiscraper.config
# like the other domain classes; adjust the import if your version differs.
from pyminiscraper.config import (
    ScraperAllowedDomains,
    ScraperDomainConfig,
    ScraperDomainConfigMode,
)

# Allow only specific domains
config = ScraperDomainConfig(
    allowance=ScraperAllowedDomains(domains=["example.com", "api.example.com"]),
    forbidden_domains=["ads.example.com"]
)

# Allow all domains
config = ScraperDomainConfig(
    allowance=ScraperDomainConfigMode.ALLOW_ALL
)

# Allow only domains derived from seed URLs
config = ScraperDomainConfig(
    allowance=ScraperDomainConfigMode.DERIVE_FROM_SEED_URLS
)
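
As a rough mental model of how an allow list and a forbidden list might combine, consider the standalone sketch below. It is not taken from the library: is_domain_allowed is a hypothetical helper, and both exact host matching and "forbidden wins over allowed" are assumptions made for this illustration.

```python
from urllib.parse import urlsplit


def is_domain_allowed(url, allowed_domains, forbidden_domains):
    """Hypothetical check: forbidden wins; allowed_domains=None means allow all."""
    host = (urlsplit(url).hostname or "").lower()
    if host in forbidden_domains:
        return False
    if allowed_domains is None:
        return True
    # Assumption: hosts must match exactly, so subdomains are listed explicitly.
    return host in allowed_domains


allowed = {"example.com", "api.example.com"}
forbidden = {"ads.example.com"}
for candidate in (
    "https://example.com/page",
    "https://ads.example.com/banner",
    "https://other.org/",
):
    print(candidate, is_domain_allowed(candidate, allowed, forbidden))
```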

Error Handling

The scraper includes built-in error handling:

  • Respects max_back_to_back_errors to stop after consecutive failures
  • Retries failed requests with exponential backoff
  • Logs errors for debugging
  • Continues operation after non-fatal errors
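
Retrying with exponential backoff, as listed above, generally follows the pattern sketched here. This is a generic illustration rather than the library's retry code: fetch_with_retry, its parameters, and the delay values are hypothetical, and the sleep function is injectable so the demo can record delays instead of waiting.

```python
import asyncio


async def fetch_with_retry(fetch, url, max_attempts=3, base_delay=0.5, sleep=asyncio.sleep):
    """Call fetch(url), retrying with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller count the failure
            # Delay doubles after every failed attempt.
            await sleep(base_delay * 2 ** attempt)


delays = []
calls = {"count": 0}


async def record_sleep(seconds):
    delays.append(seconds)  # record the delay instead of actually sleeping


async def flaky_fetch(url):
    """Fail twice, then succeed, to exercise the retry path."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


result = asyncio.run(
    fetch_with_retry(flaky_fetch, "https://example.com", sleep=record_sleep)
)
print(result, delays)
```

A limit like max_back_to_back_errors would sit one level above this: the caller counts requests that still fail after all retries and stops once too many fail in a row.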

Performance Tips

  1. Adjust max_parallel_requests based on your needs and server capacity
  2. Use crawl_delay_seconds to control request rate
  3. Enable use_headless_browser only when JavaScript rendering is required
  4. Implement caching in your callback to avoid re-downloading pages
  5. Use path patterns to filter URLs before downloading
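
Tip 5 relies on include_path_patterns and exclude_path_patterns. This page does not say whether those patterns are globs, regular expressions, or prefixes, so the sketch below simply assumes shell-style globs to show the idea of filtering by path before any download happens. The helper should_download is hypothetical and is not part of pyminiscraper.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_download(url, include_patterns, exclude_patterns):
    """Decide from the URL path alone whether a page is worth requesting."""
    path = urlsplit(url).path or "/"
    # Exclusions are checked first so they override any include pattern.
    if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
        return False
    # Assumption: an empty include list means "no restriction".
    return not include_patterns or any(
        fnmatchcase(path, pattern) for pattern in include_patterns
    )


include = ["/blog/*"]
exclude = ["/blog/drafts/*"]
for candidate in (
    "https://example.com/blog/post-1",
    "https://example.com/blog/drafts/wip",
    "https://example.com/shop/item",
):
    print(candidate, should_download(candidate, include, exclude))
```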

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

pyminiscraper-2.0.7.tar.gz (32.7 kB view details)

Uploaded Source

Built Distribution

pyminiscraper-2.0.7-py3-none-any.whl (37.1 kB view details)

Uploaded Python 3

File details

Details for the file pyminiscraper-2.0.7.tar.gz.

File metadata

  • Download URL: pyminiscraper-2.0.7.tar.gz
  • Size: 32.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pyminiscraper-2.0.7.tar.gz
Algorithm Hash digest
SHA256 61116232be1450c2d1d14fcc1c67674259b4509dedaa6cdb39f280e1252b7434
MD5 0344e96a7ec3f036b4fd428876cf0e92
BLAKE2b-256 b0ec9eb7072ac797280c16dd649af3f123546a208c2f8d3b45ea18eb9f882d94

Provenance

The following attestation bundles were made for pyminiscraper-2.0.7.tar.gz:

Publisher: python-publish.yml on timurua/pyminiscraper

File details

Details for the file pyminiscraper-2.0.7-py3-none-any.whl.

File metadata

  • Download URL: pyminiscraper-2.0.7-py3-none-any.whl
  • Size: 37.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pyminiscraper-2.0.7-py3-none-any.whl
Algorithm Hash digest
SHA256 9d2a950cd59c14c275278c71d2feabd2a2acd9449349b5cbcec3fd335aec7b59
MD5 39af1c732578e8214f8913e0046ffbc6
BLAKE2b-256 5f08ed840cbf8564f0b383411d2d99d1f6e9666edb83ee43646c62dd4bf5187b

Provenance

The following attestation bundles were made for pyminiscraper-2.0.7-py3-none-any.whl:

Publisher: python-publish.yml on timurua/pyminiscraper
