A mini web scraping utility package

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

timurua

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

pyminiscraper

Introduction

pyminiscraper is a lightweight Python library for web scraping. It offers a small asynchronous API for downloading pages, following links found in sitemaps, RSS/Atom feeds, and web pages, and passing the results to a callback or to the built-in file storage.

Features

Feature	Description
Basic Web Page scraping	Scrape HTML content from web pages
Async scraping	Asynchronous scraping through an awaitable API
Web Page spidering	Follow and scrape links from web pages
Parallel requests	Configure number of concurrent requests
Headless browser support	JavaScript rendering support
Robots.txt parsing	Respect robots.txt rules
Sitemap parsing	Parse and follow sitemap.xml
RSS/Atom parsing	Parse and follow RSS/Atom feeds
Open Graph parsing	Extract Open Graph metadata
Rate limiting	Configurable per-domain rate limiting
Error handling	Robust error handling with retry logic
Depth control	Control recursion depth for link following
Custom user agent	Set custom User-Agent strings
File storage	Built-in file storage system
Custom callbacks	Define custom processing logic
Domain restrictions	Control allowed/blocked domains
Request timeout	Configurable request timeouts
Page caching	Cache and reuse downloaded pages

Installation

pip install pyminiscraper

How it works

┌───────────────────┐
│                   │
│   Initializing    │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Download      │
│    Robots.txt     │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Queue for     │
│    Configurable   │◀────┐
│      Parallel     │     │
│     Processing    │     │
└─────────┬─────────┘     │
          │               │
          ▼               │
┌───────────────────┐     │    ┌───────────────────┐
│       Scrape      │     │    │                   │
│     Web Pages,    │     │    │      Loading      │
│     RSS & Atom    │──── │ ───│      Saving       │
│                   │     │    │     Web Pages     │
└─────────┬─────────┘     │    └───────────────────┘
          │               │
          ▼               │
┌───────────────────┐     │
│      Discover     │     │
│      Outgoing     │     │
│      Web Page     │─────┘
│   RSS/Atom feed   │
│       links       │
└───────────────────┘
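
The loop in the diagram (queue, parallel processing, link discovery, back to the queue) can be approximated with the standard library alone. The sketch below is not pyminiscraper's implementation: FAKE_SITE, crawl, and its parameters are hypothetical names, and an in-memory dictionary stands in for real downloads.

```python
import asyncio

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": ["/d"],
    "/d": [],
}


async def crawl(seed: str, max_parallel: int = 2, max_depth: int = 2) -> list[str]:
    """Visit pages with a fixed pool of workers, stopping at max_depth."""
    queue: asyncio.Queue = asyncio.Queue()
    seen = {seed}
    visited: list[str] = []
    queue.put_nowait((seed, 0))

    async def worker() -> None:
        while True:
            url, depth = await queue.get()
            try:
                await asyncio.sleep(0)  # stands in for the network download
                visited.append(url)
                if depth < max_depth:
                    for link in FAKE_SITE.get(url, []):
                        if link not in seen:  # queue each URL at most once
                            seen.add(link)
                            queue.put_nowait((link, depth + 1))
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(max_parallel)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return visited


if __name__ == "__main__":
    print(sorted(asyncio.run(crawl("/"))))
```

New links are queued before task_done() is called, so queue.join() cannot return while discovered pages are still waiting.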

Basic Usage

Downloading Sitemap-Referenced Pages

This example shows how to scrape only pages referenced in a sitemap:

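import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

# Directory for FileStore output (assumed here to be given as a path string)
storage_dir = "./scraped"

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl("https://www.example.com/", max_depth=2)
            ],
            # Only sitemap entries are followed; page and feed links are ignored
            follow_sitemap_links=True,
            follow_web_page_links=False,
            follow_feed_links=False,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()

# scraper.run() is a coroutine, so it needs a running event loop
asyncio.run(main())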
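
For readers unfamiliar with the sitemap format, the following sketch shows what "pages referenced in a sitemap" means in practice. It uses only xml.etree from the standard library and does not reflect how pyminiscraper parses sitemaps internally; sitemap_urls and SAMPLE_SITEMAP are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc> https://www.example.com/about </loc></url>
</urlset>
"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```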

Scraping RSS/Atom Feeds

Example of scraping content from RSS/Atom feeds:

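import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, ScraperUrlType, FileStore

# Directory for FileStore output (assumed here to be given as a path string)
storage_dir = "./scraped"

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                # Mark the seed as a feed so it is parsed as RSS/Atom
                ScraperUrl(
                    "https://feeds.feedburner.com/PythonInsider",
                    type=ScraperUrlType.FEED
                )
            ],
            follow_sitemap_links=False,
            follow_web_page_links=False,
            follow_feed_links=True,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())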
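
Following feed links amounts to reading the item URLs out of an RSS or Atom document. The sketch below illustrates that step with the standard library only; it is an assumption about the general technique, not pyminiscraper's parser, and feed_links is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom elements are namespaced


def feed_links(xml_text: str) -> list[str]:
    """Return the article URLs listed in an RSS 2.0 or Atom feed."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{ATOM}feed":
        links = []
        for entry in root.iter(f"{ATOM}entry"):
            for link in entry.findall(f"{ATOM}link"):
                # A missing rel attribute means rel="alternate" in Atom
                if link.get("rel", "alternate") == "alternate" and link.get("href"):
                    links.append(link.get("href"))
        return links
    # RSS 2.0: each <item> carries its URL as the text of a <link> child
    return [
        link.text.strip()
        for item in root.iter("item")
        for link in item.findall("link")
        if link.text
    ]


RSS_SAMPLE = """\
<rss version="2.0"><channel>
  <title>Sample</title><link>https://example.com/</link>
  <item><title>A</title><link>https://example.com/a</link></item>
  <item><title>B</title><link>https://example.com/b</link></item>
</channel></rss>
"""

if __name__ == "__main__":
    print(feed_links(RSS_SAMPLE))
```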

Full Website Crawling

Example of comprehensive website crawling using all available sources:

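import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

# Directory for FileStore output (assumed here to be given as a path string)
storage_dir = "./scraped"

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                ScraperUrl("https://www.example.com/")
            ],
            # Enable every link source: sitemaps, page links, and feeds
            follow_sitemap_links=True,
            follow_web_page_links=True,
            follow_feed_links=True,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()

asyncio.run(main())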

Custom Processing with Callbacks

Example of custom processing using callbacks:

import asyncio

from pyminiscraper import Scraper, ScraperConfig
from pyminiscraper.config import ScraperCallback, ScraperContext
from pyminiscraper.model import ScraperWebPage, ScraperUrl

class CustomCallback(ScraperCallback):
    async def on_web_page(self, context: ScraperContext, request: ScraperUrl, response: ScraperWebPage) -> None:
        # Called for each downloaded web page; custom processing goes here
        print(f"Processing {response.url}")
        print(f"Title: {response.metadata_title}")

    # "Feed" is quoted because the module that exports it is not shown here;
    # import it from the package to use it as a real annotation
    async def on_feed(self, context: ScraperContext, feed: "Feed") -> None:
        # Called for each parsed RSS/Atom feed
        for item in feed.items:
            print(f"Feed item: {item.title}")

async def main() -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[ScraperUrl("https://example.com")],
            callback=CustomCallback(),
        )
    )
    await scraper.run()

asyncio.run(main())
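
A callback is also the natural place for HTML post-processing of your own. As an illustration of the kind of work involved, the sketch below extracts Open Graph properties from an HTML string with the standard library. It is not the library's extractor, and it assumes the callback has the raw HTML available as a string (the attribute carrying it is not documented in this README); OpenGraphParser and open_graph are hypothetical names.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.properties: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value if a property is repeated
            self.properties.setdefault(prop, content)


def open_graph(html: str) -> dict[str, str]:
    """Return the Open Graph properties found in an HTML document."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties


if __name__ == "__main__":
    page = '<head><meta property="og:title" content="Hello"></head>'
    print(open_graph(page))
```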

Configuration Options

ScraperConfig controls scraping behavior. It accepts the following parameters:

- seed_urls (list[ScraperUrl]): Initial URLs to start scraping from
- callback (ScraperCallback): Callback for processing scraped content
- include_path_patterns (list[str]): URL paths to include (default: [])
- exclude_path_patterns (list[str]): URL paths to exclude (default: [])
- max_parallel_requests (int): Maximum number of concurrent requests (default: 16)
- use_headless_browser (bool): Render pages in a headless browser so JavaScript runs (default: False)
- request_timeout_seconds (int): Request timeout in seconds (default: 30)
- follow_web_page_links (bool): Follow links in web pages (default: False)
- follow_sitemap_links (bool): Follow sitemap.xml links (default: True)
- follow_feed_links (bool): Follow RSS/Atom feed links (default: True)
- prevent_default_queuing (bool): Disable automatic URL queuing (default: False)
- max_requested_urls (int): Maximum total number of URLs to request (default: 65536)
- max_back_to_back_errors (int): Number of consecutive errors before stopping (default: 128)
- on_response_callback (ScraperResponseCallback): Optional response callback
- max_depth (int): Maximum recursion depth for links (default: 16)
- crawl_delay_seconds (int): Delay between requests to the same domain, in seconds (default: 1)
- domain_config (ScraperDomainConfig): Allowed/blocked domains configuration
- user_agent (str): User-Agent string (default: 'pyminiscraper')
- referer (str): Referer header (default: "https://www.google.com")
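
Two of these settings, user_agent and crawl_delay_seconds, interact with robots.txt rules. The sketch below shows that interaction using urllib.robotparser from the standard library. How pyminiscraper itself combines a configured delay with a Crawl-delay directive is not documented here, so taking the larger of the two is an assumption; build_rules and effective_delay are hypothetical helpers.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def effective_delay(rules: RobotFileParser, user_agent: str, configured_delay: float) -> float:
    """Use the larger of the configured delay and the site's Crawl-delay (assumed policy)."""
    robots_delay = rules.crawl_delay(user_agent)  # None when no directive applies
    return max(configured_delay, robots_delay or 0)


if __name__ == "__main__":
    rules = build_rules(ROBOTS_TXT)
    agent = "pyminiscraper"
    print(rules.can_fetch(agent, "https://www.example.com/blog/"))
    print(rules.can_fetch(agent, "https://www.example.com/private/report.html"))
    print(effective_delay(rules, agent, configured_delay=1))
```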

Domain Configuration

Control which domains are allowed or blocked:

# ScraperAllowedDomains is assumed to be exported by pyminiscraper.config
# alongside the other domain types; adjust the import if it lives elsewhere
from pyminiscraper.config import (
    ScraperAllowedDomains,
    ScraperDomainConfig,
    ScraperDomainConfigMode,
)

# Allow only specific domains, and block one explicitly
config = ScraperDomainConfig(
    allowance=ScraperAllowedDomains(domains=["example.com", "api.example.com"]),
    forbidden_domains=["ads.example.com"]
)

# Allow all domains
config = ScraperDomainConfig(
    allowance=ScraperDomainConfigMode.ALLOW_ALL
)

# Allow only domains derived from seed URLs
config = ScraperDomainConfig(
    allowance=ScraperDomainConfigMode.DERIVE_FROM_SEED_URLS
)
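
The effect of an allow list combined with a block list can be illustrated independently of the library. The sketch below is an assumption about the general idea, not pyminiscraper's matching logic: it compares exact hostnames only, and is_allowed is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_allowed(url: str, allowed: Optional[set], forbidden: set) -> bool:
    """Decide whether a URL's host passes the allow and block lists.

    allowed=None means every host is allowed unless it is forbidden.
    Hosts are compared exactly; subdomains are not matched implicitly.
    """
    host = (urlsplit(url).hostname or "").lower()
    if host in forbidden:
        return False  # the block list always wins
    return allowed is None or host in allowed


if __name__ == "__main__":
    allowed = {"example.com", "api.example.com"}
    forbidden = {"ads.example.com"}
    for candidate in (
        "https://example.com/page",
        "https://ads.example.com/banner",
        "https://other.org/",
    ):
        print(candidate, is_allowed(candidate, allowed, forbidden))
```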

Error Handling

The scraper includes built-in error handling:

- Stops the crawl after max_back_to_back_errors consecutive failures
- Retries failed requests with exponential backoff
- Logs errors for debugging
- Continues after non-fatal errors
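
The first two behaviors can be sketched in a few lines. The code below is a generic illustration of retrying with exponential backoff and of a consecutive-error limit; it is not taken from pyminiscraper, and fetch_with_retry, ConsecutiveErrorLimit, and the delay values are hypothetical.

```python
import asyncio


async def fetch_with_retry(fetch, url, max_attempts=4, base_delay=0.5):
    """Call fetch(url), retrying on OSError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return await fetch(url)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller count this as a failure
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            await asyncio.sleep(base_delay * 2 ** attempt)


class ConsecutiveErrorLimit:
    """Signals a stop once too many requests fail back to back."""

    def __init__(self, max_back_to_back_errors: int = 128) -> None:
        self.limit = max_back_to_back_errors
        self.streak = 0

    def record(self, succeeded: bool) -> bool:
        """Update the failure streak; return True while the crawl may continue."""
        self.streak = 0 if succeeded else self.streak + 1
        return self.streak < self.limit
```

Any success resets the streak, so only an unbroken run of failures stops the crawl.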

Performance Tips

- Tune max_parallel_requests (default: 16) to what the target server can handle
- Raise crawl_delay_seconds to lower the request rate per domain
- Enable use_headless_browser only when JavaScript rendering is required
- Implement caching in your callback to avoid re-downloading pages
- Use include_path_patterns and exclude_path_patterns to filter URLs before downloading
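
As an illustration of the caching tip, the sketch below keeps downloaded page bodies in memory so that a repeated URL triggers no second download. It is a generic pattern, not part of pyminiscraper; PageCache, fake_fetch, and demo are hypothetical names, and a real cache would also need an eviction or expiry policy.

```python
import asyncio


class PageCache:
    """Remembers downloaded page bodies by URL for the lifetime of the process."""

    def __init__(self) -> None:
        self._pages: dict[str, str] = {}

    async def get_or_fetch(self, url: str, fetch) -> str:
        if url not in self._pages:
            # Only the first request for a URL reaches the network
            self._pages[url] = await fetch(url)
        return self._pages[url]


async def demo() -> int:
    """Request the same URL twice and report how many downloads happened."""
    downloads = 0

    async def fake_fetch(url: str) -> str:
        nonlocal downloads
        downloads += 1
        return f"<html>{url}</html>"

    cache = PageCache()
    await cache.get_or_fetch("https://example.com/", fake_fetch)
    await cache.get_or_fetch("https://example.com/", fake_fetch)
    return downloads


if __name__ == "__main__":
    print(asyncio.run(demo()))
```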

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

Version	Release date
2.0.7 (this version)	Feb 12, 2025
2.0.6	Feb 10, 2025
2.0.5	Feb 10, 2025
2.0.4	Feb 10, 2025
2.0.3	Feb 10, 2025
2.0.2	Feb 5, 2025
2.0.1	Feb 4, 2025
2.0.0	Feb 4, 2025
1.0.16	Feb 1, 2025
1.0.15	Feb 1, 2025
1.0.14	Jan 30, 2025
1.0.13	Jan 27, 2025
1.0.12	Jan 27, 2025
1.0.11	Jan 25, 2025
1.0.10	Jan 25, 2025
1.0.9	Jan 25, 2025
1.0.8	Jan 24, 2025
1.0.7	Jan 24, 2025
1.0.6	Jan 24, 2025
1.0.5	Jan 24, 2025
1.0.4	Jan 24, 2025
1.0.3	Jan 23, 2025
1.0.2	Jan 23, 2025
1.0.1	Jan 23, 2025
1.0.0	Jan 22, 2025
0.1.1	Jan 6, 2025

Download files

Download the file for your platform.

Source Distribution

pyminiscraper-2.0.7.tar.gz (32.7 kB)

Uploaded Feb 12, 2025 Source

Built Distribution

pyminiscraper-2.0.7-py3-none-any.whl (37.1 kB)

Uploaded Feb 12, 2025 Python 3

File details

Details for the file pyminiscraper-2.0.7.tar.gz.

File metadata

Download URL: pyminiscraper-2.0.7.tar.gz
Upload date: Feb 12, 2025
Size: 32.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pyminiscraper-2.0.7.tar.gz
Algorithm	Hash digest
SHA256	`61116232be1450c2d1d14fcc1c67674259b4509dedaa6cdb39f280e1252b7434`
MD5	`0344e96a7ec3f036b4fd428876cf0e92`
BLAKE2b-256	`b0ec9eb7072ac797280c16dd649af3f123546a208c2f8d3b45ea18eb9f882d94`

Provenance

The following attestation bundles were made for pyminiscraper-2.0.7.tar.gz:

Publisher: python-publish.yml on timurua/pyminiscraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyminiscraper-2.0.7.tar.gz
- Subject digest: 61116232be1450c2d1d14fcc1c67674259b4509dedaa6cdb39f280e1252b7434
- Sigstore transparency entry: 170524054
- Sigstore integration time: Feb 12, 2025
Source repository:
- Permalink: timurua/pyminiscraper@b231f7dd69615379082db48a9c59ca88309d679f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/timurua
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@b231f7dd69615379082db48a9c59ca88309d679f
- Trigger Event: push

File details

Details for the file pyminiscraper-2.0.7-py3-none-any.whl.

File metadata

Download URL: pyminiscraper-2.0.7-py3-none-any.whl
Upload date: Feb 12, 2025
Size: 37.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for pyminiscraper-2.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9d2a950cd59c14c275278c71d2feabd2a2acd9449349b5cbcec3fd335aec7b59`
MD5	`39af1c732578e8214f8913e0046ffbc6`
BLAKE2b-256	`5f08ed840cbf8564f0b383411d2d99d1f6e9666edb83ee43646c62dd4bf5187b`

Provenance

The following attestation bundles were made for pyminiscraper-2.0.7-py3-none-any.whl:

Publisher: python-publish.yml on timurua/pyminiscraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyminiscraper-2.0.7-py3-none-any.whl
- Subject digest: 9d2a950cd59c14c275278c71d2feabd2a2acd9449349b5cbcec3fd335aec7b59
- Sigstore transparency entry: 170524056
- Sigstore integration time: Feb 12, 2025
Source repository:
- Permalink: timurua/pyminiscraper@b231f7dd69615379082db48a9c59ca88309d679f
- Branch / Tag: refs/heads/main
- Owner: https://github.com/timurua
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@b231f7dd69615379082db48a9c59ca88309d679f
- Trigger Event: push
