A mini web scraping utility package

pyminiscraper

Introduction

pyminiscraper is a lightweight Python library for web scraping. You describe a crawl with a single ScraperConfig object (start URLs, request limits, and a storage backend) and run it with Scraper.run(). The defaults cover simple crawls, and the options described below tune parallelism, rate limits, crawl depth, and error handling.

Simplest Use Case

Here is a basic example that starts at https://www.anthropic.com/news, follows links to a depth of 2, and stores the results under storage_dir:

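import asyncio
from pathlib import Path

from pyminiscraper.scraper import Scraper
from pyminiscraper.config import ScraperConfig
# The original snippet uses ScraperUrl and FileStoreFactory without importing them.
# The module paths on the next two lines are assumptions; check the package source.
from pyminiscraper.config import ScraperUrl
from pyminiscraper.store_file import FileStoreFactory

# Directory for scraped pages; this path is only an example.
storage_dir = Path("scraped_pages")
storage_dir.mkdir(parents=True, exist_ok=True)
print(f"Storage directory set to: {storage_dir}")


async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            scraper_urls=[
                ScraperUrl(
                    "https://www.anthropic.com/news", max_depth=2)
            ],
            max_parallel_requests=16,
            use_headless_browser=False,
            timeout_seconds=30,
            max_requests_per_hour=6*60,  # 360 requests per hour
            only_sitemaps=False,
            scraper_store_factory=FileStoreFactory(storage_dir.absolute().as_posix()),
        ),
    )
    # The original snippet awaits scraper.run(), so it must run inside an event loop.
    await scraper.run()


asyncio.run(main())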

Advanced Configuration Options

pyminiscraper also provides configuration options for more complex scraping scenarios. Each entry below gives the parameter's type and default (where it has one), a short description, and an example value:

Scraper URLs

  • Parameter: scraper_urls: list[ScraperUrl]
  • Description: A list of ScraperUrl objects that define the URLs to be scraped and their respective configurations.
  • Example:
    scraper_urls=[
            ScraperUrl("https://www.example.com", max_depth=2)
    ]
    

Max Parallel Requests

  • Parameter: max_parallel_requests: int = 16
  • Description: The maximum number of parallel requests that the scraper can make.
  • Example:
    max_parallel_requests=16
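
To make the idea of a parallel-request cap concrete, here is a minimal sketch using asyncio.Semaphore from the standard library. It is not pyminiscraper's implementation: fake_fetch, crawl, and the stats dictionary are hypothetical names, and the network call is simulated with a short sleep.

```python
import asyncio

MAX_PARALLEL_REQUESTS = 4  # stand-in for max_parallel_requests


async def fake_fetch(url: str, gate: asyncio.Semaphore, stats: dict) -> str:
    """Simulate one request while tracking how many run at the same time."""
    async with gate:  # at most MAX_PARALLEL_REQUESTS tasks pass this point at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for real network I/O
        stats["active"] -= 1
    return url


async def crawl(urls: list[str]) -> dict:
    """Start one task per URL and let the semaphore limit concurrency."""
    gate = asyncio.Semaphore(MAX_PARALLEL_REQUESTS)
    stats = {"active": 0, "peak": 0}
    await asyncio.gather(*(fake_fetch(u, gate, stats) for u in urls))
    return stats


stats = asyncio.run(crawl([f"https://www.example.com/page/{i}" for i in range(20)]))
print(stats["peak"])  # never exceeds MAX_PARALLEL_REQUESTS
```

Raising the cap shortens a crawl only while the target site and the rate limit allow it, so this setting is best tuned together with max_requests_per_hour.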
    

Use Headless Browser

  • Parameter: use_headless_browser: bool = False
  • Description: Whether to use a headless browser for scraping.
  • Example:
    use_headless_browser=False
    

Timeout Seconds

  • Parameter: timeout_seconds: int = 30
  • Description: The timeout duration in seconds for each request.
  • Example:
    timeout_seconds=30
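
A per-request timeout can be sketched with asyncio.wait_for, as below. This only illustrates the concept and assumes nothing about how pyminiscraper enforces timeout_seconds internally; slow_request and fetch_with_timeout are hypothetical helpers, and the timeout is shortened so the demo finishes quickly.

```python
import asyncio

TIMEOUT_SECONDS = 0.05  # stand-in for timeout_seconds, shortened for the demo


async def slow_request(delay: float) -> str:
    """Pretend to be a request that takes `delay` seconds to answer."""
    await asyncio.sleep(delay)
    return "ok"


async def fetch_with_timeout(delay: float) -> str:
    """Give up on the request if it takes longer than TIMEOUT_SECONDS."""
    try:
        return await asyncio.wait_for(slow_request(delay), timeout=TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        return "timed out"


fast = asyncio.run(fetch_with_timeout(0.0))
slow = asyncio.run(fetch_with_timeout(1.0))
print(fast, slow)
```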
    

Only Sitemaps

  • Parameter: only_sitemaps: bool = True
  • Description: Whether to scrape only sitemap URLs.
  • Example:
    only_sitemaps=True
    

Max Requested URLs

  • Parameter: max_requested_urls: int = 64 * 1024
  • Description: The maximum number of URLs that can be requested.
  • Example:
    max_requested_urls=64 * 1024
    

Max Back-to-Back Errors

  • Parameter: max_back_to_back_errors: int = 128
  • Description: The maximum number of consecutive errors allowed before stopping the scraper.
  • Example:
    max_back_to_back_errors=128
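
The sketch below shows one way a consecutive-error limit can work. It rests on two assumptions that the description does not confirm: a successful request resets the counter, and the run stops as soon as the counter reaches the limit. The function name and the list of boolean outcomes are hypothetical.

```python
def run_until_too_many_errors(outcomes: list[bool], max_back_to_back_errors: int) -> int:
    """Return how many requests were processed before the run stopped.

    Each item in `outcomes` is True for a successful request and False for an error.
    """
    consecutive_errors = 0
    processed = 0
    for ok in outcomes:
        processed += 1
        if ok:
            consecutive_errors = 0  # assumption: any success resets the streak
        else:
            consecutive_errors += 1
            if consecutive_errors >= max_back_to_back_errors:
                break  # assumption: stop once the limit is reached
    return processed


outcomes = [False, False, True, False, False, False, True]
print(run_until_too_many_errors(outcomes, max_back_to_back_errors=3))
```

With a limit of 3, the two early errors do not stop the run because the success after them resets the streak; the run stops at the third error in a row.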
    

Scraper Store Factory

  • Parameter: scraper_store_factory: ScraperStoreFactory
  • Description: The factory used to create the storage for scraped data.
  • Example:
    scraper_store_factory=FileStoreFactory("/path/to/storage")
    

Allow L2 Domains

  • Parameter: allow_l2_domains: bool = True
  • Description: Whether to allow scraping of second-level domains.
  • Example:
    allow_l2_domains=True
    

Scraper Callback

  • Parameter: scraper_callback: ScraperCallback | None = None
  • Description: A callback function that is called after each scraping operation.
  • Example:
    scraper_callback=my_callback_function
    

Max Depth

  • Parameter: max_depth: int = 16
  • Description: The maximum depth to follow links from the initial URL.
  • Example:
    max_depth=16
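
As an illustration of a depth limit, the following sketch runs a breadth-first crawl over a small in-memory link graph. It assumes the initial URL counts as depth 0 and that pages at max_depth are fetched but their links are not followed; pyminiscraper may count depth differently. LINKS and crawl are hypothetical.

```python
from collections import deque

# Hypothetical site structure: page -> links found on that page.
LINKS = {
    "/": ["/news", "/about"],
    "/news": ["/news/1", "/news/2"],
    "/news/1": ["/news/1/comments"],
    "/news/2": [],
    "/about": ["/"],
    "/news/1/comments": [],
}


def crawl(start: str, max_depth: int) -> dict[str, int]:
    """Return every reachable page within max_depth links of start, with its depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depths[page] >= max_depth:
            continue  # page is at the limit: keep it, but do not follow its links
        for link in LINKS.get(page, []):
            if link not in depths:  # skip pages that were already discovered
                depths[link] = depths[page] + 1
                queue.append(link)
    return depths


print(crawl("/", max_depth=2))
```

With max_depth=2 the page /news/1/comments is never reached, because it sits three links away from the start page.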
    

Max Requests Per Hour

  • Parameter: max_requests_per_hour: float = 60*60*10
  • Description: The maximum number of requests allowed per hour.
  • Example:
    max_requests_per_hour=60*60*10
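
To get a feel for the numbers, an hourly budget can be converted into an average gap between requests. The helper below is a hypothetical illustration that assumes requests are spaced evenly; the library may pace requests differently.

```python
def min_interval_seconds(max_requests_per_hour: float) -> float:
    """Average gap between requests when the hourly budget is spread evenly."""
    if max_requests_per_hour <= 0:
        raise ValueError("max_requests_per_hour must be positive")
    return 3600.0 / max_requests_per_hour


# Default from this section: 60*60*10 = 36000 requests per hour.
default_gap = min_interval_seconds(60 * 60 * 10)
# Value from the basic example: 6*60 = 360 requests per hour.
example_gap = min_interval_seconds(6 * 60)
print(default_gap, example_gap)
```

Under this assumption the default of 60*60*10 corresponds to roughly ten requests per second, while the 6*60 used in the basic example corresponds to one request every ten seconds.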
    

Rerequest After Hours

  • Parameter: rerequest_after_hours: int = 24
  • Description: The number of hours to wait before re-requesting a URL.
  • Example:
    rerequest_after_hours=24
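
A re-request window like this can be pictured as a simple age check on the time a URL was last requested. The sketch below is a hypothetical helper, not the library's code, and it assumes a URL becomes eligible again once exactly rerequest_after_hours have elapsed.

```python
from datetime import datetime, timedelta, timezone


def should_rerequest(
    last_requested_at: datetime,
    now: datetime,
    rerequest_after_hours: int = 24,
) -> bool:
    """Return True when the stored copy is old enough to be requested again."""
    return now - last_requested_at >= timedelta(hours=rerequest_after_hours)


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
fetched_an_hour_ago = now - timedelta(hours=1)
fetched_two_days_ago = now - timedelta(days=2)

print(should_rerequest(fetched_an_hour_ago, now))   # still fresh
print(should_rerequest(fetched_two_days_ago, now))  # old enough to request again
```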
    

No Page Store

  • Parameter: no_page_store: bool = False
  • Description: Whether to disable storing the scraped pages.
  • Example:
    no_page_store=False
    

User Agent

  • Parameter: user_agent: str = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
  • Description: The user agent string to use for requests.
  • Example:
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
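
For context, a user agent string is sent as the User-Agent HTTP header on each request. The sketch below uses the standard library's urllib.request.Request to show where the value ends up; it does not send anything and says nothing about which HTTP client pyminiscraper uses. The USER_AGENT value and build_request are hypothetical.

```python
from urllib.request import Request

# Hypothetical user agent that identifies the crawler and a contact page.
USER_AGENT = "example-crawler/0.1 (+https://www.example.com/bot-info)"


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Build (but do not send) a request that carries the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://www.example.com")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```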
    

Download files

  • Source distribution: pyminiscraper-0.1.1.tar.gz (24.1 kB)
    SHA256: 802916def3fea98173c745772e0cbb148daf21e624c16c40f63d2cf871eff39e
    MD5: 855e21f0813ff079fa98341bef5c669a
    BLAKE2b-256: 79caa6732e4367634d052b8b30890d243ef16aa04e36b2de2044288fa0396268
  • Built distribution (Python 3): pyminiscraper-0.1.1-py3-none-any.whl (26.0 kB)
    SHA256: 37800e59ee2a16151a18e226006779f74275d045d3633aad511efcf10d75e089
    MD5: 25f3c452407002d7b43136133086b551
    BLAKE2b-256: 01b8910e39eddcab7bd699e91182d043f6e66b47a10b5ef8028dacf043bfd07b

Both files were uploaded via twine/6.0.1 CPython/3.12.8 using Trusted Publishing. Publisher: python-publish.yml on timurua/pyminiscraper.
