A mini web scraping utility package

pyminiscraper

Introduction

pyminiscraper is a lightweight Python library for web scraping. It provides a simple asynchronous interface for extracting data from web pages with minimal setup, and it covers tasks ranging from downloading a handful of pages to spidering a whole site through page links, sitemaps, and RSS/Atom feeds.

Features

Feature Implemented
Basic Web Page scraping
Extremely scalable async scraping
Web Page spidering
Parallel requests
Headless browser support
Robots parsing
Sitemap parsing
RSS parsing
Atom parsing
Open Graph parsing
Rate limiting
Error handling
Depth control
Custom user agent
File storage
Custom callbacks
Domain restrictions
Request timeout
Page caching

How it works

┌───────────────────┐
│                   │
│   Initializing    │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Download      │
│    Robots.txt     │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Queue for     │
│   Configurable    │◀────┐
│     Parallel      │     │
│    Processing     │     │
└─────────┬─────────┘     │
          │               │
          ▼               │
┌───────────────────┐     │    ┌───────────────────┐
│      Scrape       │     │    │                   │
│    Web Pages,     │     │    │      Loading      │
│    RSS & Atom     │──── │ ───│      Saving       │
│                   │     │    │     Web Pages     │
└─────────┬─────────┘     │    └───────────────────┘
          │               │
          ▼               │
┌───────────────────┐     │
│     Discover      │     │
│     Outgoing      │     │
│     Web Page      │─────┘
│   RSS/Atom feed   │
│       links       │
└───────────────────┘
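
The loop in the diagram (queue, parallel processing, link discovery feeding back into the queue) can be approximated with a small pool of asyncio workers. This is a minimal sketch and not pyminiscraper's actual implementation: `crawl`, `fetch_links`, and the in-memory `SITE` graph are hypothetical stand-ins for real downloading and link discovery.

```python
import asyncio

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/"],
    "/c": ["/d"],
    "/d": [],
}

async def fetch_links(url: str) -> list[str]:
    """Stand-in for downloading a page and discovering its outgoing links."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return SITE.get(url, [])

async def crawl(seed: str, max_depth: int, max_parallel: int) -> set[str]:
    queue = asyncio.Queue()  # holds (url, depth) pairs
    seen = {seed}
    queue.put_nowait((seed, 0))

    async def worker() -> None:
        while True:
            url, depth = await queue.get()
            try:
                links = await fetch_links(url)
                if depth < max_depth:
                    for link in links:
                        if link not in seen:  # feed new links back into the queue
                            seen.add(link)
                            queue.put_nowait((link, depth + 1))
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(max_parallel)]
    await queue.join()  # wait until every queued URL has been processed
    for w in workers:
        w.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return seen

visited = asyncio.run(crawl("/", max_depth=2, max_parallel=3))
```

With `max_depth=2` the page `/d` is never queued, because it is only reachable three links away from the seed.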

Use Cases

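Downloading only sitemap-referenced web pages

The following example scrapes only the pages listed in a site's sitemaps; links found on web pages and in feeds are not followed:

import asyncio

# Scraper, ScraperConfig, ScraperUrl, ScraperUrlType and FileStoreFactory are
# provided by pyminiscraper; storage_dir is an output directory of your choice.

async def main():
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl(
                    "https://www.anthropic.com/", max_depth=2, type=ScraperUrlType.HTML)
            ],
            follow_sitemap_links=True,    # follow URLs listed in sitemaps
            follow_web_page_links=False,  # ignore links found in page bodies
            follow_feed_links=False,      # ignore links found in RSS/Atom feeds
            scraper_store_factory=FileStoreFactory(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())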
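
To make "sitemap-referenced" concrete: a sitemap is an XML document in the sitemaps.org namespace, and every `<loc>` element holds one URL. The sketch below uses only the standard library and is an illustration, not pyminiscraper's own parser; `extract_sitemap_urls` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""

def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value; works for <urlset> and <sitemapindex> alike."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]

urls = extract_sitemap_urls(SAMPLE_SITEMAP)
```

Because the function collects all `<loc>` elements, the same code also returns the child sitemap URLs of a sitemap index file, which a crawler would then fetch in turn.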

Scraping pages referenced in Atom/RSS Feeds

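The following example starts from a feed URL and scrapes only the pages that the feed references:

import asyncio

# Scraper, ScraperConfig, ScraperUrl, ScraperUrlType and FileStoreFactory are
# provided by pyminiscraper; storage_dir is an output directory of your choice.

async def main():
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl(
                    "https://feeds.feedburner.com/PythonInsider", type=ScraperUrlType.FEED)
            ],
            follow_sitemap_links=False,
            follow_web_page_links=False,
            follow_feed_links=True,  # follow only the entries listed in the feed
            scraper_store_factory=FileStoreFactory(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())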
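
For orientation, this is roughly what extracting entry links from a feed involves. RSS 2.0 stores the URL as the text of `<item><link>`, while Atom stores it in the `href` attribute of `<entry><link>`. The sketch is standard-library only and does not reflect pyminiscraper's internals; `extract_feed_links` and both sample feeds are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def extract_feed_links(xml_text: str) -> list[str]:
    root = ET.fromstring(xml_text)
    links = []
    # RSS 2.0: <item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    # Atom: <entry><link href="URL"/></entry>; a missing rel means "alternate"
    for entry in root.iter(f"{ATOM_NS}entry"):
        for link in entry.findall(f"{ATOM_NS}link"):
            href = link.get("href")
            if href and link.get("rel", "alternate") == "alternate":
                links.append(href)
    return links

RSS_SAMPLE = """<rss version="2.0"><channel><title>Demo</title>
<item><title>One</title><link>https://example.com/one</link></item>
<item><title>Two</title><link>https://example.com/two</link></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Demo</title>
<entry><title>One</title><link href="https://example.com/atom-one"/></entry>
</feed>"""

rss_links = extract_feed_links(RSS_SAMPLE)
atom_links = extract_feed_links(ATOM_SAMPLE)
```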

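Full website capture (spidering) using all link sources: sitemaps, Atom/RSS feeds, and links on web pages

The following example crawls an entire site, following links from sitemaps, feeds, and web pages:

import asyncio

# Scraper, ScraperConfig, ScraperUrl, ScraperUrlType and FileStoreFactory are
# provided by pyminiscraper; storage_dir is an output directory of your choice.

async def main():
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                # The seed is a regular web page, so it is typed as HTML.
                ScraperUrl(
                    "https://www.anthropic.com/", type=ScraperUrlType.HTML)
            ],
            follow_sitemap_links=True,
            follow_web_page_links=True,
            follow_feed_links=True,
            scraper_store_factory=FileStoreFactory(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())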
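
When web page links are followed, a spider typically resolves relative links against the page URL and drops links that leave the allowed domains. The following standard-library sketch shows that step in isolation; `LinkCollector`, `discover_links`, and `allowed_domains` are hypothetical names and do not describe pyminiscraper's `DomainConfig`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))

def discover_links(html: str, base_url: str, allowed_domains: set[str]) -> list[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    # Keep only links whose host is in the allow-list.
    return [u for u in collector.links if urlparse(u).netloc in allowed_domains]

PAGE = (
    '<a href="/docs">Docs</a> '
    '<a href="https://other.example/x">External</a> '
    '<a href="blog/post">Post</a>'
)
links = discover_links(PAGE, "https://example.com/home/", {"example.com"})
```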

High volume scraping

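The following example scrapes several sites concurrently by running one scraper per site with asyncio.gather:

import asyncio

# Scraper, ScraperConfig, ScraperUrl, ScraperUrlType and FileStoreFactory are
# provided by pyminiscraper; storage_dir is an output directory of your choice.

async def scrape_site(url: str):
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                # Each seed is a site home page, so it is typed as HTML.
                ScraperUrl(
                    url, type=ScraperUrlType.HTML)
            ],
            follow_sitemap_links=True,
            follow_web_page_links=True,
            follow_feed_links=True,
            scraper_store_factory=FileStoreFactory(storage_dir),
        ),
    )
    await scraper.run()

async def main():
    sites = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com",
    ]
    # Run one scraper per site concurrently on the same event loop.
    tasks = [scrape_site(url) for url in sites]
    await asyncio.gather(*tasks)

asyncio.run(main())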
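
With a long site list, starting every scraper at once may open more connections than intended. One common pattern is to cap the number of sites in flight with `asyncio.Semaphore`. In this sketch `scrape_site` is a stub that sleeps instead of running a real `Scraper`, and `max_concurrent_sites` is a hypothetical knob, not a pyminiscraper setting.

```python
import asyncio

active = 0  # sites currently being scraped
peak = 0    # highest concurrency observed

async def scrape_site(url: str) -> str:
    """Stub for a real per-site scraper; it only tracks concurrency."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)  # placeholder for the actual scraping work
    active -= 1
    return url

async def scrape_all(sites: list[str], max_concurrent_sites: int) -> list:
    semaphore = asyncio.Semaphore(max_concurrent_sites)

    async def bounded(url: str) -> str:
        async with semaphore:  # at most max_concurrent_sites run at once
            return await scrape_site(url)

    # return_exceptions=True collects failures as results instead of
    # raising on the first one, so every site gets its turn.
    return await asyncio.gather(*(bounded(u) for u in sites), return_exceptions=True)

sites = [f"https://example{i}.com" for i in range(1, 7)]
results = asyncio.run(scrape_all(sites, max_concurrent_sites=2))
```

`asyncio.gather` returns results in the order of its arguments, so `results` lines up with `sites` even though the sites finish at different times.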

Advanced Configuration Options

Configuration for web scraping behavior.

Parameters:

  • max_parallel_requests (int): Maximum number of concurrent scraping requests
  • max_requested_urls (int): Maximum total number of URLs to request before stopping
  • max_depth (int): Maximum depth for recursively following links (0 means only scrape seed URLs)
  • max_back_to_back_errors (int): Number of consecutive errors before terminating scraper
  • crawl_delay_seconds (float): Minimum delay between requests to same domain
  • request_timeout_seconds (float): Request timeout in seconds
  • user_agent (str): User agent string to use in requests
  • store_factory: Factory for creating storage backend
  • seed_urls (List[ScraperUrl]): Initial URLs to start scraping from
  • use_headless_browser (bool): Whether to use headless browser for JavaScript rendering
  • follow_web_page_links (bool): Whether to follow links found in web pages
  • follow_sitemap_links (bool): Whether to follow links found in sitemaps
  • follow_feed_links (bool): Whether to follow links found in RSS/Atom feeds
  • domain_config (DomainConfig): Configuration for allowed/blocked domains
  • log (Callable): Logging function to use
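
As an illustration of what a per-domain minimum delay such as crawl_delay_seconds implies, the sketch below spaces out requests to the same host while leaving other hosts unaffected. `DomainRateLimiter` is a hypothetical helper written for this example; it is not part of pyminiscraper and may differ from how the library enforces the delay.

```python
import asyncio
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Hypothetical helper: minimum delay between requests to one domain."""

    def __init__(self, crawl_delay_seconds: float) -> None:
        self.delay = crawl_delay_seconds
        self._next_allowed = {}  # domain -> earliest time of the next request

    async def wait(self, url: str) -> None:
        domain = urlparse(url).netloc
        now = time.monotonic()
        # Reserve the next free slot before sleeping, so concurrent callers
        # for the same domain queue up behind each other.
        slot = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = slot + self.delay
        await asyncio.sleep(slot - now)

async def demo() -> list:
    limiter = DomainRateLimiter(crawl_delay_seconds=0.05)
    start = time.monotonic()
    log = []

    async def request(url: str) -> None:
        await limiter.wait(url)
        log.append((url, time.monotonic() - start))

    await asyncio.gather(
        request("https://a.example/1"),
        request("https://a.example/2"),  # same domain: delayed
        request("https://b.example/1"),  # other domain: not delayed
    )
    return log

timings = dict(asyncio.run(demo()))
```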

The scraper will:

  • Start with seed URLs and scrape them according to configuration
  • Follow links up to max_depth if follow_web_page_links is True
  • Follow sitemap.xml links if follow_sitemap_links is True
  • Follow RSS/Atom feed links if follow_feed_links is True
  • Respect robots.txt and crawl delay settings
  • Store results using provided store_factory
  • Stop when max_requested_urls is reached or when max_back_to_back_errors consecutive errors occur
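
To show what respecting robots.txt and its crawl delay means in practice, here is a small sketch built on Python's standard `urllib.robotparser`. Whether pyminiscraper uses this module internally is not stated here; the robots.txt content and the `my-bot` user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

user_agent = "my-bot"  # hypothetical user agent string
allowed = parser.can_fetch(user_agent, "https://example.com/public/page")
blocked = parser.can_fetch(user_agent, "https://example.com/private/page")
delay = parser.crawl_delay(user_agent)  # seconds requested by the site, or None
```

A crawler would skip any URL for which `can_fetch` returns False and wait at least `delay` seconds between requests to that site.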

Download files

Source Distribution

pyminiscraper-2.0.2.tar.gz (31.2 kB)

Uploaded Source

Built Distribution

pyminiscraper-2.0.2-py3-none-any.whl (36.4 kB)

Uploaded Python 3

File details

Details for the file pyminiscraper-2.0.2.tar.gz.

File metadata

  • Download URL: pyminiscraper-2.0.2.tar.gz
  • Upload date:
  • Size: 31.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pyminiscraper-2.0.2.tar.gz
Algorithm Hash digest
SHA256 d68182e9829b97cf4b56b71efd0fd1d29e1799faab9f5cfed0ee3dac9d837e71
MD5 57a64616491fd8e8b262b20bb7a1b88c
BLAKE2b-256 23dfb7426dd3d4e681c63aedd5e744584caf6db37f7a35453163913984f53810

Provenance

The following attestation bundles were made for pyminiscraper-2.0.2.tar.gz:

Publisher: python-publish.yml on timurua/pyminiscraper

File details

Details for the file pyminiscraper-2.0.2-py3-none-any.whl.

File metadata

  • Download URL: pyminiscraper-2.0.2-py3-none-any.whl
  • Upload date:
  • Size: 36.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pyminiscraper-2.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 1431d5ed0d9ca6fdf70eefd3a94157a0cac23a3d7e5435c43c35a2d3b61d306a
MD5 7a451a23d942283df1649a50f18be859
BLAKE2b-256 e4caaaefc0f2359992397264eb9172f2959a04f8b81126b124f080ee553aba8f

Provenance

The following attestation bundles were made for pyminiscraper-2.0.2-py3-none-any.whl:

Publisher: python-publish.yml on timurua/pyminiscraper
