A mini web scraping utility package

pyminiscraper

Introduction

pyminiscraper is a lightweight Python library for web scraping that needs minimal setup. You describe the start URLs, crawl limits, and storage in a ScraperConfig, pass it to a Scraper, and run it.

Simplest Use Case

Here is a basic example of how to use pyminiscraper to scrape data from a web page:

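import asyncio
from pathlib import Path

from pyminiscraper.scraper import Scraper
from pyminiscraper.config import ScraperConfig

# The original example does not show where ScraperUrl and FileStoreFactory are
# imported from. The two module paths below are assumptions; check the
# installed package if they do not resolve.
from pyminiscraper.config import ScraperUrl
from pyminiscraper.store_file import FileStoreFactory


async def main() -> None:
    # Directory for scraped pages (example path, not part of the original).
    storage_dir = Path("./scraped")
    storage_dir.mkdir(parents=True, exist_ok=True)
    print(f"Storage directory set to: {storage_dir}")

    scraper = Scraper(
        ScraperConfig(
            scraper_urls=[
                ScraperUrl("https://www.anthropic.com/news", max_depth=2)
            ],
            max_parallel_requests=16,
            use_headless_browser=False,
            timeout_seconds=30,
            max_requests_per_hour=6 * 60,  # 360 requests per hour
            only_sitemaps=False,
            scraper_store_factory=FileStoreFactory(storage_dir.absolute().as_posix()),
        ),
    )
    await scraper.run()


# The example awaits scraper.run(), so it has to execute inside an event loop.
asyncio.run(main())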

Advanced Configuration Options

For more complex scraping scenarios, pyminiscraper exposes the configuration options below. Each entry lists the parameter with its type and, where one exists, its default value.

Scraper URLs

  • Parameter: scraper_urls: list[ScraperUrl]
  • Description: A list of ScraperUrl objects that define the URLs to be scraped and their respective configurations.
  • Example:
    scraper_urls=[
        ScraperUrl("https://www.example.com", max_depth=2)
    ]
    

Max Parallel Requests

  • Parameter: max_parallel_requests: int = 16
  • Description: The maximum number of parallel requests that the scraper can make.
  • Example:
    max_parallel_requests=16
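
To illustrate what a cap on parallel requests means in practice, here is a minimal sketch that bounds concurrency with asyncio.Semaphore. It is not pyminiscraper's implementation; run_bounded and fake_request are hypothetical names, and the network call is simulated with a short sleep.

```python
import asyncio

MAX_PARALLEL_REQUESTS = 4  # illustrative cap; the documented default is 16


async def run_bounded(num_tasks: int, limit: int) -> int:
    """Run num_tasks simulated requests and return the peak number in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits while `limit` requests are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network I/O
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(num_tasks)))
    return peak


peak_in_flight = asyncio.run(run_bounded(num_tasks=20, limit=MAX_PARALLEL_REQUESTS))
print(f"peak requests in flight: {peak_in_flight}")
```

Even with 20 tasks queued, the number of simulated requests running at once never exceeds the cap.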
    

Use Headless Browser

  • Parameter: use_headless_browser: bool = False
  • Description: Whether to use a headless browser for scraping.
  • Example:
    use_headless_browser=False
    

Timeout Seconds

  • Parameter: timeout_seconds: int = 30
  • Description: The timeout duration in seconds for each request.
  • Example:
    timeout_seconds=30
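
A per-request timeout can be pictured with asyncio.wait_for, which cancels an awaitable that takes too long. This is only a sketch of the general idea, not how pyminiscraper enforces timeout_seconds; slow_fetch and fetch_with_timeout are hypothetical helpers, and the timeout is shortened so the demo finishes quickly.

```python
import asyncio
from typing import Optional

TIMEOUT_SECONDS = 0.05  # shortened for the demo; the documented default is 30


async def slow_fetch(delay: float) -> str:
    """Simulated request that takes `delay` seconds to complete."""
    await asyncio.sleep(delay)
    return "body"


async def fetch_with_timeout(delay: float, timeout: float) -> Optional[str]:
    """Return the response body, or None if the request exceeds the timeout."""
    try:
        return await asyncio.wait_for(slow_fetch(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return None


fast_result = asyncio.run(fetch_with_timeout(delay=0.0, timeout=TIMEOUT_SECONDS))
slow_result = asyncio.run(fetch_with_timeout(delay=1.0, timeout=TIMEOUT_SECONDS))
print(fast_result, slow_result)
```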
    

Only Sitemaps

  • Parameter: only_sitemaps: bool = True
  • Description: Whether to scrape only sitemap URLs.
  • Example:
    only_sitemaps=True
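
For context on what a sitemap contributes to a crawl, the sketch below extracts page URLs from a sitemap document using the standard library. It assumes the standard sitemaps.org XML format and an inline sample; sitemap_urls is a hypothetical helper, and the library's own sitemap handling may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/news</loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value listed under <url> entries in a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(sitemap_urls(SAMPLE_SITEMAP))
```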
    

Max Requested URLs

  • Parameter: max_requested_urls: int = 64 * 1024
  • Description: The maximum number of URLs that can be requested.
  • Example:
    max_requested_urls=64 * 1024
    

Max Back-to-Back Errors

  • Parameter: max_back_to_back_errors: int = 128
  • Description: The maximum number of consecutive errors allowed before stopping the scraper.
  • Example:
    max_back_to_back_errors=128
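
The idea of stopping after a run of consecutive failures can be sketched with a counter that resets on every success. BackToBackErrorTracker is a hypothetical class written for this illustration, and it assumes the crawl stops as soon as the count reaches the limit; the library may count differently.

```python
class BackToBackErrorTracker:
    """Counts consecutive failures and signals when a limit is reached."""

    def __init__(self, max_back_to_back_errors: int) -> None:
        self.max_back_to_back_errors = max_back_to_back_errors
        self.consecutive_errors = 0

    def record(self, success: bool) -> bool:
        """Record one request outcome; return True if the crawl should stop."""
        if success:
            self.consecutive_errors = 0  # any success breaks the streak
        else:
            self.consecutive_errors += 1
        return self.consecutive_errors >= self.max_back_to_back_errors


tracker = BackToBackErrorTracker(max_back_to_back_errors=3)
outcomes = [False, False, True, False, False, False]
stop_signals = [tracker.record(ok) for ok in outcomes]
print(stop_signals)  # [False, False, False, False, False, True]
```

The success in the third position resets the streak, so the stop signal only fires after the final three failures in a row.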
    

Scraper Store Factory

  • Parameter: scraper_store_factory: ScraperStoreFactory
  • Description: The factory used to create the storage for scraped data.
  • Example:
    scraper_store_factory=FileStoreFactory("/path/to/storage")
    

Allow L2 Domains

  • Parameter: allow_l2_domains: bool = True
  • Description: Whether to allow scraping of second-level domains.
  • Example:
    allow_l2_domains=True
    

Scraper Callback

  • Parameter: scraper_callback: ScraperCallback | None = None
  • Description: A callback function that is called after each scraping operation.
  • Example:
    scraper_callback=my_callback_function
    

Max Depth

  • Parameter: max_depth: int = 16
  • Description: The maximum depth to follow links from the initial URL.
  • Example:
    max_depth=16
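
A depth limit is easiest to see on a small example. The sketch below runs a breadth-first crawl over an in-memory link graph and stops following links once max_depth is reached. LINKS and crawl are hypothetical, and the sketch assumes the start URL sits at depth 0; pyminiscraper may count depth differently.

```python
from collections import deque

# Toy link graph standing in for real pages: each URL maps to its outgoing links.
LINKS = {
    "/": ["/news", "/about"],
    "/news": ["/news/1", "/news/2"],
    "/news/1": ["/news/1/comments"],
    "/about": [],
}


def crawl(start: str, max_depth: int) -> list[str]:
    """Return URLs reachable from start within max_depth link hops, in BFS order."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("/", max_depth=1))  # ['/', '/news', '/about']
print(crawl("/", max_depth=2))  # ['/', '/news', '/about', '/news/1', '/news/2']
```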
    

Max Requests Per Hour

  • Parameter: max_requests_per_hour: float = 60*60*10
  • Description: The maximum number of requests allowed per hour.
  • Example:
    max_requests_per_hour=60*60*10
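
To get a feel for the values, it helps to convert an hourly budget into an average gap between requests. The helper below is hypothetical and assumes requests are spread evenly over the hour; the library may schedule them differently.

```python
def seconds_between_requests(max_requests_per_hour: float) -> float:
    """Average spacing, in seconds, that keeps a crawl within the hourly budget."""
    return 3600.0 / max_requests_per_hour


# Default from the parameter list: 60*60*10 = 36,000 requests per hour.
print(seconds_between_requests(60 * 60 * 10))  # 0.1

# Value used in the basic example: 6*60 = 360 requests per hour.
print(seconds_between_requests(6 * 60))  # 10.0
```

In other words, the default allows roughly ten requests per second, while the basic example throttles the crawl to about one request every ten seconds.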
    

Rerequest After Hours

  • Parameter: rerequest_after_hours: int = 24
  • Description: The number of hours to wait before re-requesting a URL.
  • Example:
    rerequest_after_hours=24
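
The re-request window amounts to a freshness check on the time a URL was last fetched. The sketch below shows one way to express it; should_rerequest is a hypothetical helper and assumes a URL becomes eligible once the full window has elapsed.

```python
from datetime import datetime, timedelta, timezone


def should_rerequest(
    last_requested_at: datetime,
    now: datetime,
    rerequest_after_hours: int = 24,
) -> bool:
    """Return True once at least rerequest_after_hours have passed since the last fetch."""
    return now - last_requested_at >= timedelta(hours=rerequest_after_hours)


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
fetched_recently = now - timedelta(hours=3)
fetched_long_ago = now - timedelta(hours=30)

print(should_rerequest(fetched_recently, now))  # False
print(should_rerequest(fetched_long_ago, now))  # True
```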
    

No Page Store

  • Parameter: no_page_store: bool = False
  • Description: Whether to disable storing the scraped pages.
  • Example:
    no_page_store=False
    

User Agent

  • Parameter: user_agent: str = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
  • Description: The user agent string to use for requests.
  • Example:
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
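
For readers unfamiliar with the header, the sketch below shows how a user agent string ends up on an outgoing request, using the standard library's urllib rather than pyminiscraper itself. The request is only constructed, never sent.

```python
from urllib.request import Request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
)

# Build the request object without opening a connection.
request = Request("https://www.example.com", headers={"User-Agent": USER_AGENT})

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```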
    
