A mini web scraping utility package

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

timurua

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

pyminiscraper

Introduction

pyminiscraper is a lightweight Python library for web scraping. It provides a simple asynchronous interface for downloading web pages and for following the links found in sitemaps, RSS/Atom feeds, and the pages themselves, with minimal setup.

Features

Feature	Implemented
Basic Web Page scraping	✅
Extremely scalable async scraping	✅
Web Page spidering	✅
Parallel requests	✅
Headless browser support	✅
Robots parsing	✅
Sitemap parsing	✅
RSS parsing	✅
Atom parsing	✅
Open Graph parsing	✅
Rate limiting	✅
Error handling	✅
Depth control	✅
Custom user agent	✅
File storage	✅
Custom callbacks	✅
Domain restrictions	✅
Request timeout	✅
Page caching	✅

How does it work

┌───────────────────┐
│                   │
│   Initializing    │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Download      │
│    Robots.txt     │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Queue for     │
│   Configurable    │◀────┐
│     Parallel      │     │
│    Processing     │     │
└─────────┬─────────┘     │
          │               │
          ▼               │
┌───────────────────┐     │    ┌───────────────────┐
│      Scrape       │     │    │                   │
│    Web Pages,     │     │    │      Loading      │
│    RSS & Atom     │─────┼────│      Saving       │
│                   │     │    │     Web Pages     │
└─────────┬─────────┘     │    └───────────────────┘
          │               │
          ▼               │
┌───────────────────┐     │
│     Discover      │     │
│     Outgoing      │     │
│     Web Pages     │─────┘
│  RSS/Atom feeds   │
│                   │
└───────────────────┘
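The queue, scrape, and discover loop in the diagram can be sketched with plain asyncio. This is an illustration only and not pyminiscraper's implementation: the in-memory LINKS graph, fetch_links, and crawl are hypothetical names that stand in for real downloads.

```python
import asyncio

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}


async def fetch_links(url: str) -> list[str]:
    """Stand-in for downloading a page and discovering its outgoing links."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return LINKS.get(url, [])


async def crawl(seed: str, max_depth: int, max_parallel: int) -> list[str]:
    """Breadth-first crawl: queue URLs, scrape in parallel, queue discoveries."""
    limit = asyncio.Semaphore(max_parallel)
    seen = {seed}
    frontier = [seed]
    visited: list[str] = []

    async def scrape(url: str) -> list[str]:
        async with limit:  # at most max_parallel pages in flight
            return await fetch_links(url)

    for _ in range(max_depth + 1):
        if not frontier:
            break
        results = await asyncio.gather(*(scrape(u) for u in frontier))
        visited.extend(frontier)
        next_frontier = []
        for links in results:
            for link in links:
                if link not in seen:  # never queue the same URL twice
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return visited


pages = asyncio.run(crawl("https://example.com/", max_depth=1, max_parallel=2))
```

With max_depth=1 the sketch visits the seed and the pages it links to directly; pages discovered at depth 2 are queued but never scraped.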

Use Cases

Downloading only sitemap referenced web pages

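Here is a basic example of how to use pyminiscraper to download only the pages referenced by a site's sitemap:

# Assumes the pyminiscraper classes used below are already imported and that
# storage_dir names a writable directory. Run this inside an async function.
scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            # The URL type is passed by keyword: a positional argument
            # cannot follow max_depth=2.
            ScraperUrl(
                "https://www.anthropic.com/", type=ScraperUrlType.HTML, max_depth=2)
        ],
        follow_sitemap_links=True,
        follow_web_page_links=False,
        follow_feed_links=False,
        scraper_store_factory=FileStoreFactory(storage_dir),
    ),
)
await scraper.run()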
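To make "sitemap referenced web pages" concrete, the sketch below extracts page URLs from a sitemap document with the standard library. It does not use pyminiscraper and is not how the library parses sitemaps; SAMPLE_SITEMAP and sitemap_urls are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real one would be downloaded from the site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = sitemap_urls(SAMPLE_SITEMAP)
```

With follow_web_page_links and follow_feed_links disabled, URLs of this kind are the only ones the scraper is asked to follow.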

Scraping pages referenced in Atom/RSS Feeds

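Here is a basic example of how to use pyminiscraper to scrape the pages referenced in an Atom or RSS feed:

# Assumes the pyminiscraper classes used below are already imported and that
# storage_dir names a writable directory. Run this inside an async function.
scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            ScraperUrl(
                "https://feeds.feedburner.com/PythonInsider", type=ScraperUrlType.FEED)
        ],
        follow_sitemap_links=False,
        follow_web_page_links=False,
        follow_feed_links=True,
        scraper_store_factory=FileStoreFactory(storage_dir),
    ),
)
await scraper.run()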
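The links a feed contributes can be illustrated with a standard-library sketch that handles both RSS 2.0 and Atom. This is not pyminiscraper's parser; the sample feeds and feed_entry_links are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed documents used only for this example.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example</title>
  <link>https://example.com/</link>
  <item><title>One</title><link>https://example.com/one</link></item>
  <item><title>Two</title><link>https://example.com/two</link></item>
</channel></rss>"""

SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example</title>
  <entry><title>One</title><link href="https://example.com/one"/></entry>
  <entry><title>Two</title><link rel="alternate" href="https://example.com/two"/></entry>
</feed>"""


def feed_entry_links(xml_text: str) -> list[str]:
    """Return the page URL of every item (RSS) or entry (Atom) in a feed."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        # RSS 2.0: each <item> carries its URL as the text of <link>.
        return [el.text.strip() for el in root.findall("channel/item/link") if el.text]
    # Atom: each <entry> carries its URL in the href attribute of <link>;
    # a missing rel attribute means rel="alternate".
    links = []
    for el in root.findall("atom:entry/atom:link", ATOM_NS):
        if el.get("rel", "alternate") == "alternate" and el.get("href"):
            links.append(el.get("href"))
    return links


rss_links = feed_entry_links(SAMPLE_RSS)
atom_links = feed_entry_links(SAMPLE_ATOM)
```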

Full website capture/spidering using all link sources: sitemaps, Atom/RSS feeds, and links on web pages

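Here is a basic example of how to use pyminiscraper to capture a whole site by following every kind of link:

# Assumes the pyminiscraper classes used below are already imported and that
# storage_dir names a writable directory. Run this inside an async function.
scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            # The seed is a site home page, so it is typed as HTML rather than FEED.
            ScraperUrl(
                "https://www.anthropic.com/", type=ScraperUrlType.HTML)
        ],
        follow_sitemap_links=True,
        follow_web_page_links=True,
        follow_feed_links=True,
        scraper_store_factory=FileStoreFactory(storage_dir),
    ),
)
await scraper.run()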

High volume scraping

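Here is a basic example of how to use pyminiscraper to scrape several sites concurrently:

import asyncio

# Assumes the pyminiscraper classes used below are already imported and that
# storage_dir names a writable directory.
async def scrape_site(url: str) -> None:
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[
                # Each seed is a site home page, so it is typed as HTML rather than FEED.
                ScraperUrl(
                    url, type=ScraperUrlType.HTML)
            ],
            follow_sitemap_links=True,
            follow_web_page_links=True,
            follow_feed_links=True,
            scraper_store_factory=FileStoreFactory(storage_dir),
        ),
    )
    await scraper.run()

async def main() -> None:
    sites = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com",
    ]
    # Run one scraper per site concurrently.
    tasks = [scrape_site(url) for url in sites]
    await asyncio.gather(*tasks)

asyncio.run(main())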
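When many sites are scraped at once, a single failure should not discard the other results, and it can help to cap how many sites run simultaneously. The sketch below shows one way to do both with asyncio alone. scrape_site_stub is a hypothetical stand-in for scrape_site that fails on purpose for one URL, and max_sites_at_once is an illustrative parameter, not a pyminiscraper option.

```python
import asyncio


async def scrape_site_stub(url: str) -> str:
    """Hypothetical stand-in for scrape_site; fails for one URL on purpose."""
    await asyncio.sleep(0)
    if "example2" in url:
        raise RuntimeError(f"could not scrape {url}")
    return url


async def scrape_all(urls: list[str], max_sites_at_once: int) -> tuple[list[str], list[str]]:
    """Scrape all URLs with bounded concurrency; return (succeeded, failed)."""
    limit = asyncio.Semaphore(max_sites_at_once)

    async def bounded(url: str) -> str:
        async with limit:  # wait for a free slot before starting this site
            return await scrape_site_stub(url)

    # return_exceptions=True collects failures instead of raising the first one.
    results = await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
    succeeded = [u for u, r in zip(urls, results) if not isinstance(r, Exception)]
    failed = [u for u, r in zip(urls, results) if isinstance(r, Exception)]
    return succeeded, failed


succeeded, failed = asyncio.run(
    scrape_all(
        ["https://example1.com", "https://example2.com", "https://example3.com"],
        max_sites_at_once=2,
    )
)
```

asyncio.gather returns results in the order of its arguments, which is why zipping them back against urls is safe.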

Advanced Configuration Options

Configuration for web scraping behavior.

Parameters:

max_parallel_requests (int): Maximum number of concurrent scraping requests
max_requested_urls (int): Maximum total number of URLs to request before stopping
max_depth (int): Maximum depth for recursively following links (0 means only scrape seed URLs)
max_back_to_back_errors (int): Number of consecutive errors before terminating scraper
crawl_delay_seconds (float): Minimum delay between requests to same domain
request_timeout_seconds (float): Request timeout in seconds
user_agent (str): User agent string to use in requests
store_factory: Factory for creating storage backend
seed_urls (List[ScraperUrl]): Initial URLs to start scraping from
use_headless_browser (bool): Whether to use headless browser for JavaScript rendering
follow_web_page_links (bool): Whether to follow links found in web pages
follow_sitemap_links (bool): Whether to follow links found in sitemaps
follow_feed_links (bool): Whether to follow links found in RSS/Atom feeds
domain_config (DomainConfig): Configuration for allowed/blocked domains
log (Callable): Logging function to use
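As an illustration of what a per-domain minimum delay such as crawl_delay_seconds implies, the sketch below reserves request slots per domain. It is an assumption about the general technique, not pyminiscraper's code; DomainRateLimiter and wait_time are hypothetical names, and time is passed in explicitly so the example stays deterministic.

```python
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Tracks, per domain, the earliest time the next request may start."""

    def __init__(self, crawl_delay_seconds: float) -> None:
        self.crawl_delay_seconds = crawl_delay_seconds
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, url: str, now: float) -> float:
        """Reserve a slot for url and return how long the caller should wait."""
        domain = urlsplit(url).netloc
        start = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = start + self.crawl_delay_seconds
        return start - now


limiter = DomainRateLimiter(crawl_delay_seconds=2.0)
first = limiter.wait_time("https://example.com/a", now=100.0)   # no wait
second = limiter.wait_time("https://example.com/b", now=100.5)  # same domain: wait
other = limiter.wait_time("https://example.org/", now=100.5)    # other domain: no wait
```

In an async scraper the returned value would typically be passed to asyncio.sleep before the request is sent.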

The scraper will:

Start with seed URLs and scrape them according to configuration
Follow links up to max_depth if follow_web_page_links is True
Follow sitemap.xml links if follow_sitemap_links is True
Follow RSS/Atom feed links if follow_feed_links is True
Respect robots.txt and crawl delay settings
Store results using provided store_factory
Stop when max_requested_urls is reached or when max_back_to_back_errors consecutive errors occur
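Two of the behaviors listed above can be sketched with the standard library: honoring robots.txt, and deciding when to stop. The robots.txt handling uses urllib.robotparser, which may differ from what pyminiscraper does internally. ROBOTS_TXT, the "my-scraper" user agent, and should_stop (including its exact comparison rules) are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is downloaded from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check individual URLs against the rules for a given user agent.
allowed = parser.can_fetch("my-scraper", "https://example.com/blog/post")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")


def should_stop(
    requested_urls: int,
    consecutive_errors: int,
    max_requested_urls: int,
    max_back_to_back_errors: int,
) -> bool:
    """Assumed stop rule: either limit being reached ends the crawl."""
    return (
        requested_urls >= max_requested_urls
        or consecutive_errors >= max_back_to_back_errors
    )
```

A successful request would reset the consecutive error count to zero, so only an unbroken run of failures triggers the second condition.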

Release history

Version	Release date
2.0.7	Feb 12, 2025
2.0.6	Feb 10, 2025
2.0.5	Feb 10, 2025
2.0.4	Feb 10, 2025
2.0.3	Feb 10, 2025
2.0.2	Feb 5, 2025
2.0.1	Feb 4, 2025
2.0.0	Feb 4, 2025
1.0.16	Feb 1, 2025
1.0.15	Feb 1, 2025
1.0.14	Jan 30, 2025
1.0.13	Jan 27, 2025
1.0.12	Jan 27, 2025
1.0.11	Jan 25, 2025
1.0.10	Jan 25, 2025
1.0.9	Jan 25, 2025
1.0.8	Jan 24, 2025
1.0.7	Jan 24, 2025
1.0.6	Jan 24, 2025
1.0.5	Jan 24, 2025
1.0.4	Jan 24, 2025
1.0.3	Jan 23, 2025
1.0.2 (this version)	Jan 23, 2025
1.0.1	Jan 23, 2025
1.0.0	Jan 22, 2025
0.1.1	Jan 6, 2025

Download files

Source Distribution

pyminiscraper-1.0.2.tar.gz (28.5 kB)

Uploaded Jan 23, 2025 Source

Built Distribution

pyminiscraper-1.0.2-py3-none-any.whl (32.3 kB)

Uploaded Jan 23, 2025 Python 3

File details

Details for the file pyminiscraper-1.0.2.tar.gz.

File metadata

Download URL: pyminiscraper-1.0.2.tar.gz
Upload date: Jan 23, 2025
Size: 28.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for pyminiscraper-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`da821f32af3469d2a414eaa1b2aeac265dd51c7eca052daea65633950647931b`
MD5	`f0064792aa83ba53dd5509f7e49a9055`
BLAKE2b-256	`471d8e15aedd64687b0460c9d4f54fd7fec077c4db0c275e9f75ec36882f9e46`
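A downloaded file can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: sha256_of and matches_expected are hypothetical helper names, and the file path in the usage note is assumed to be wherever the archive was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for pyminiscraper-1.0.2.tar.gz.
EXPECTED_SHA256 = "da821f32af3469d2a414eaa1b2aeac265dd51c7eca052daea65633950647931b"


def sha256_of(path: Path) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring letter case."""
    return sha256_of(path) == expected_hex.lower()
```

For example, matches_expected(Path("pyminiscraper-1.0.2.tar.gz"), EXPECTED_SHA256) should return True for an intact download of the source distribution.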

Provenance

The following attestation bundles were made for pyminiscraper-1.0.2.tar.gz:

Publisher: python-publish.yml on timurua/pyminiscraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyminiscraper-1.0.2.tar.gz
- Subject digest: da821f32af3469d2a414eaa1b2aeac265dd51c7eca052daea65633950647931b
- Sigstore transparency entry: 164948064
- Sigstore integration time: Jan 23, 2025
Source repository:
- Permalink: timurua/pyminiscraper@e2079d2b1d47f994093ae01e59921871ca695145
- Branch / Tag: refs/heads/main
- Owner: https://github.com/timurua
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@e2079d2b1d47f994093ae01e59921871ca695145
- Trigger Event: push

File details

Details for the file pyminiscraper-1.0.2-py3-none-any.whl.

File metadata

Download URL: pyminiscraper-1.0.2-py3-none-any.whl
Upload date: Jan 23, 2025
Size: 32.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for pyminiscraper-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2228e34dfb9e43b0556655f006b1f7533ed4c28895a5f31d47fd923831a52f48`
MD5	`dd9e57e5226847f396110ce84d5a281e`
BLAKE2b-256	`28743ac808951687738e66bdcd269a6bfdd7d30440f6381605719f395cfd1602`

Provenance

The following attestation bundles were made for pyminiscraper-1.0.2-py3-none-any.whl:

Publisher: python-publish.yml on timurua/pyminiscraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pyminiscraper-1.0.2-py3-none-any.whl
- Subject digest: 2228e34dfb9e43b0556655f006b1f7533ed4c28895a5f31d47fd923831a52f48
- Sigstore transparency entry: 164948065
- Sigstore integration time: Jan 23, 2025
Source repository:
- Permalink: timurua/pyminiscraper@e2079d2b1d47f994093ae01e59921871ca695145
- Branch / Tag: refs/heads/main
- Owner: https://github.com/timurua
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@e2079d2b1d47f994093ae01e59921871ca695145
- Trigger Event: push

pyminiscraper 1.0.2
