A mini web scraping utility package

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

timurua

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

pyminiscraper

Introduction

pyminiscraper is a lightweight Python library designed for easy web scraping. It provides a simple interface to extract data from web pages with minimal setup. Whether you are a beginner or an advanced user, pyminiscraper offers the flexibility to handle various scraping tasks efficiently.

Simplest Use Case

Here is a basic example of how to use pyminiscraper to scrape data from a web page:

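import asyncio
from pathlib import Path

from pyminiscraper.scraper import Scraper
from pyminiscraper.config import ScraperConfig

# ScraperUrl and FileStoreFactory must be imported too; this page does not
# show their module paths, so check the package source for them.

# Assumed location for scraped output; any writable directory works.
storage_dir = Path("./scraper-storage")


async def main() -> None:
    storage_dir.mkdir(parents=True, exist_ok=True)
    print(f"Storage directory set to: {storage_dir}")

    scraper = Scraper(
        ScraperConfig(
            scraper_urls=[
                # Follow links up to two levels from the start page.
                ScraperUrl("https://www.anthropic.com/news", max_depth=2)
            ],
            max_parallel_requests=16,
            use_headless_browser=False,
            timeout_seconds=30,
            max_requests_per_hour=6 * 60,  # 360 requests per hour
            only_sitemaps=False,
            scraper_store_factory=FileStoreFactory(storage_dir.absolute().as_posix()),
        ),
    )
    # run() is awaited, so it has to be called from inside an event loop.
    await scraper.run()


asyncio.run(main())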

Advanced Configuration Options

pyminiscraper also provides advanced configuration options to handle more complex scraping scenarios. Below are some of the options you can configure:

Scraper URLs

Parameter: scraper_urls: list[ScraperUrl]
Description: A list of ScraperUrl objects that define the URLs to be scraped and their respective configurations.

Example:

scraper_urls=[
    ScraperUrl("https://www.example.com", max_depth=2)
]

Max Parallel Requests

Parameter: max_parallel_requests: int = 16
Description: The maximum number of parallel requests that the scraper can make.
Example:
```
max_parallel_requests=16
```
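
A cap on parallel requests is commonly enforced with a semaphore. The sketch below shows that general technique with `asyncio`; it is not taken from pyminiscraper's source, and `fetch_all`, `fake_fetch`, and `demo` are hypothetical names used only for illustration.

```python
import asyncio


async def fetch_all(urls, fetch, max_parallel_requests=16):
    """Call fetch(url) for every URL with a bounded number in flight."""
    semaphore = asyncio.Semaphore(max_parallel_requests)

    async def bounded(url):
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


async def demo():
    """Run ten fake fetches with a limit of three and record the peak."""
    in_flight = 0
    peak = 0

    async def fake_fetch(url):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for network latency
        in_flight -= 1
        return url.upper()

    urls = [f"u{i}" for i in range(10)]
    results = await fetch_all(urls, fake_fetch, max_parallel_requests=3)
    return results, peak
```

Calling `asyncio.run(demo())` returns the uppercased names together with the highest number of fetches that overlapped, which never exceeds the limit.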

Use Headless Browser

Parameter: use_headless_browser: bool = False
Description: Whether to use a headless browser for scraping.
Example:
```
use_headless_browser=False
```

Timeout Seconds

Parameter: timeout_seconds: int = 30
Description: The timeout duration in seconds for each request.
Example:
```
timeout_seconds=30
```
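
A per-request timeout can be sketched with `asyncio.wait_for`. This illustrates the general idea only; how pyminiscraper applies `timeout_seconds` internally is not described on this page, and `fetch_with_timeout` is a hypothetical helper.

```python
import asyncio


async def fetch_with_timeout(fetch, url, timeout_seconds=30):
    """Await fetch(url), giving up after timeout_seconds.

    Returns None on timeout (an assumption made for this sketch; a real
    scraper might record the failure or retry instead).
    """
    try:
        return await asyncio.wait_for(fetch(url), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return None
```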

Only Sitemaps

Parameter: only_sitemaps: bool = True
Description: Whether to scrape only sitemap URLs.
Example:
```
only_sitemaps=True
```

Max Requested URLs

Parameter: max_requested_urls: int = 64 * 1024
Description: The maximum number of URLs that can be requested.
Example:
```
max_requested_urls=64 * 1024
```

Max Back-to-Back Errors

Parameter: max_back_to_back_errors: int = 128
Description: The maximum number of consecutive errors allowed before stopping the scraper.
Example:
```
max_back_to_back_errors=128
```
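
One way to picture a back-to-back error limit is a counter that resets on every success. The class below is a hypothetical sketch, not the library's implementation, and it assumes the run stops once the streak reaches the limit.

```python
class BackToBackErrorCounter:
    """Track consecutive failures and signal when a limit is reached."""

    def __init__(self, max_back_to_back_errors=128):
        self.limit = max_back_to_back_errors
        self.streak = 0

    def record(self, succeeded):
        """Record one request outcome; return True if the run should stop."""
        # A success resets the streak, a failure extends it.
        self.streak = 0 if succeeded else self.streak + 1
        return self.streak >= self.limit
```

With a limit of 3, two failures followed by a success do not stop the run; only three failures in a row do.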

Scraper Store Factory

Parameter: scraper_store_factory: ScraperStoreFactory
Description: The factory used to create the storage for scraped data.

Example:

scraper_store_factory=FileStoreFactory("/path/to/storage")

Allow L2 Domains

Parameter: allow_l2_domains: bool = True
Description: Whether to allow scraping of second-level domains.
Example:
```
allow_l2_domains=True
```

Scraper Callback

Parameter: scraper_callback: ScraperCallback | None = None
Description: A callback function that is called after each scraping operation.
Example:
```
scraper_callback=my_callback_function
```

Max Depth

Parameter: max_depth: int = 16
Description: The maximum depth to follow links from the initial URL.
Example:
```
max_depth=16
```
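
A depth limit can be illustrated with a breadth-first walk over an in-memory link graph. This is a sketch under stated assumptions: the start URL counts as depth 0, and `crawl_order` and the `links` mapping are hypothetical stand-ins for real page fetching.

```python
from collections import deque


def crawl_order(links, start_url, max_depth=16):
    """Return (url, depth) pairs reachable within max_depth link hops.

    links maps each URL to the list of URLs it links to.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


links = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
print(crawl_order(links, "a", max_depth=2))
# [('a', 0), ('b', 1), ('c', 1), ('d', 2)]
```

Page `e` sits three hops from the start, so it is left out when `max_depth=2`.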

Max Requests Per Hour

Parameter: max_requests_per_hour: float = 60*60*10
Description: The maximum number of requests allowed per hour. The default, 60*60*10 = 36,000, averages out to 10 requests per second.
Example:
```
max_requests_per_hour=60*60*10
```
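
To reason about an hourly budget, it can help to convert it into an average gap between requests. The helper below is plain arithmetic and is hypothetical; it says nothing about how the library actually schedules requests.

```python
def min_interval_seconds(max_requests_per_hour):
    """Average spacing between requests that uses up the hourly budget."""
    return 3600.0 / max_requests_per_hour


print(min_interval_seconds(60 * 60 * 10))  # default budget: 0.1 seconds apart
print(min_interval_seconds(6 * 60))        # 360 per hour: 10.0 seconds apart
```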

Rerequest After Hours

Parameter: rerequest_after_hours: int = 24
Description: The number of hours to wait before re-requesting a URL.
Example:
```
rerequest_after_hours=24
```
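
The re-request window amounts to a freshness check on the time a URL was last fetched. A minimal sketch follows, assuming timestamps are stored as timezone-aware `datetime` values; `should_rerequest` is a hypothetical helper, not part of the package.

```python
from datetime import datetime, timedelta, timezone


def should_rerequest(last_requested_at, now, rerequest_after_hours=24):
    """Return True if the URL was never fetched or its copy is stale."""
    if last_requested_at is None:
        return True
    return now - last_requested_at >= timedelta(hours=rerequest_after_hours)


now = datetime(2025, 1, 6, 12, 0, tzinfo=timezone.utc)
print(should_rerequest(now - timedelta(hours=2), now))   # False
print(should_rerequest(now - timedelta(hours=30), now))  # True
```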

No Page Store

Parameter: no_page_store: bool = False
Description: Whether to disable storing the scraped pages.
Example:
```
no_page_store=False
```

User Agent

Parameter: user_agent: str = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'
Description: The user agent string to use for requests.

Example:

user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'

Release history

- 2.0.7: Feb 12, 2025
- 2.0.6: Feb 10, 2025
- 2.0.5: Feb 10, 2025
- 2.0.4: Feb 10, 2025
- 2.0.3: Feb 10, 2025
- 2.0.2: Feb 5, 2025
- 2.0.1: Feb 4, 2025
- 2.0.0: Feb 4, 2025
- 1.0.16: Feb 1, 2025
- 1.0.15: Feb 1, 2025
- 1.0.14: Jan 30, 2025
- 1.0.13: Jan 27, 2025
- 1.0.12: Jan 27, 2025
- 1.0.11: Jan 25, 2025
- 1.0.10: Jan 25, 2025
- 1.0.9: Jan 25, 2025
- 1.0.8: Jan 24, 2025
- 1.0.7: Jan 24, 2025
- 1.0.6: Jan 24, 2025
- 1.0.5: Jan 24, 2025
- 1.0.4: Jan 24, 2025
- 1.0.3: Jan 23, 2025
- 1.0.2: Jan 23, 2025
- 1.0.1: Jan 23, 2025
- 1.0.0: Jan 22, 2025
- 0.1.1: Jan 6, 2025 (this version)

Download files

Download the file for your platform.

Source Distribution

pyminiscraper-0.1.1.tar.gz (24.1 kB)

Uploaded Jan 6, 2025 Source

Built Distribution

pyminiscraper-0.1.1-py3-none-any.whl (26.0 kB)

Uploaded Jan 6, 2025 Python 3

File details

Details for the file pyminiscraper-0.1.1.tar.gz.

File metadata

Download URL: pyminiscraper-0.1.1.tar.gz
Upload date: Jan 6, 2025
Size: 24.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for pyminiscraper-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`802916def3fea98173c745772e0cbb148daf21e624c16c40f63d2cf871eff39e`
MD5	`855e21f0813ff079fa98341bef5c669a`
BLAKE2b-256	`79caa6732e4367634d052b8b30890d243ef16aa04e36b2de2044288fa0396268`

Provenance

The following attestation bundles were made for pyminiscraper-0.1.1.tar.gz:

Publisher: python-publish.yml on timurua/pyminiscraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyminiscraper-0.1.1.tar.gz
- Subject digest: 802916def3fea98173c745772e0cbb148daf21e624c16c40f63d2cf871eff39e
- Sigstore transparency entry: 160132657
- Sigstore integration time: Jan 6, 2025
Source repository:
- Permalink: timurua/pyminiscraper@c7d1baba81f7f5a17ec8cfb1bbb7afdf3fdd73af
- Branch / Tag: refs/heads/main
- Owner: https://github.com/timurua
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c7d1baba81f7f5a17ec8cfb1bbb7afdf3fdd73af
- Trigger Event: push

File details

Details for the file pyminiscraper-0.1.1-py3-none-any.whl.

File metadata

Download URL: pyminiscraper-0.1.1-py3-none-any.whl
Upload date: Jan 6, 2025
Size: 26.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for pyminiscraper-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`37800e59ee2a16151a18e226006779f74275d045d3633aad511efcf10d75e089`
MD5	`25f3c452407002d7b43136133086b551`
BLAKE2b-256	`01b8910e39eddcab7bd699e91182d043f6e66b47a10b5ef8028dacf043bfd07b`

Provenance

The following attestation bundles were made for pyminiscraper-0.1.1-py3-none-any.whl:

Publisher: python-publish.yml on timurua/pyminiscraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyminiscraper-0.1.1-py3-none-any.whl
- Subject digest: 37800e59ee2a16151a18e226006779f74275d045d3633aad511efcf10d75e089
- Sigstore transparency entry: 160132658
- Sigstore integration time: Jan 6, 2025
Source repository:
- Permalink: timurua/pyminiscraper@c7d1baba81f7f5a17ec8cfb1bbb7afdf3fdd73af
- Branch / Tag: refs/heads/main
- Owner: https://github.com/timurua
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@c7d1baba81f7f5a17ec8cfb1bbb7afdf3fdd73af
- Trigger Event: push
