A collection of tools for asynchronous web scraping, crawling, and parsing

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Akram71

These details have not been verified by PyPI

Development Status
- 4 - Beta
Framework
- AsyncIO
Programming Language

Project description

aiofetch

A Python toolkit for asynchronous web scraping with built-in error tracking and metadata management.

Features

Web Processing

Asynchronous file downloading with progress tracking
Rate limiting with configurable delays
Smart retry logic with timeout handling
Domain-aware crawling with URL validation
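
The rate-limiting and retry ideas listed above can be sketched with the standard library alone. This is a minimal illustration, not aiofetch's implementation: `DelayLimiter` and `with_retries` are hypothetical names, and the delay, timeout, and backoff values are assumptions chosen for the example.

```python
import asyncio
import time


class DelayLimiter:
    """Hypothetical helper: enforce a minimum delay between operations."""

    def __init__(self, delay: float):
        self.delay = delay
        self._lock = asyncio.Lock()
        self._last = None  # monotonic time at which the previous operation started

    async def __aenter__(self):
        async with self._lock:
            if self._last is not None:
                wait = self.delay - (time.monotonic() - self._last)
                if wait > 0:
                    await asyncio.sleep(wait)
            self._last = time.monotonic()
        return self

    async def __aexit__(self, *exc_info):
        return False  # never swallow exceptions


async def with_retries(make_call, retries=3, timeout=5.0, backoff=0.1):
    """Await make_call() under a timeout, retrying with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout)
        except (asyncio.TimeoutError, OSError):
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            await asyncio.sleep(backoff * 2 ** (attempt - 1))
```

Create the limiter inside a running coroutine (for example at the top of `main()`), then wrap each request in `async with limiter:` and pass a zero-argument coroutine factory to `with_retries`.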

Content Processing

Flexible HTML content parsing
Custom selector-based metadata extraction
Automated link and image extraction
URL normalization and path handling
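
As a rough illustration of URL normalization and link/image extraction, the sketch below uses only `urllib.parse` and `html.parser` from the standard library. aiofetch itself depends on BeautifulSoup4, so this is a stand-in technique rather than the library's code; `normalize_url`, `same_domain`, and `LinkCollector` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


def normalize_url(base: str, href: str) -> str:
    """Resolve href against base and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base, href))
    return absolute


def same_domain(url: str, base: str) -> bool:
    """True when both URLs share the same host (case-insensitive)."""
    return urlsplit(url).netloc.lower() == urlsplit(base).netloc.lower()


class LinkCollector(HTMLParser):
    """Collect absolute link and image URLs from an HTML document."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(normalize_url(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(normalize_url(self.base_url, attrs["src"]))
```

Feed markup with `collector.feed(html)` and read `collector.links` and `collector.images` afterwards; `same_domain` can then filter out off-site links before crawling further.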

File & Data Management

Asynchronous file operations
Concurrent chunk-based downloads
Smart path handling and file naming
JSON data management with validation
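
To make "asynchronous file operations" and "JSON data management with validation" concrete, here is a small sketch. It is not aiofetch's `FileIO`: it runs blocking file I/O in a worker thread with `asyncio.to_thread` instead of using aiofiles, and the function names and the required fields (`title`, `url`) are assumptions made for the example.

```python
import asyncio
import json
from pathlib import Path


def validate_records(records, required=("title", "url")):
    """Keep only records in which every required field is present and non-empty."""
    return [r for r in records if all(r.get(key) for key in required)]


def _write_json(path: Path, data) -> None:
    # Create parent directories so a path like "data/books.json" works on first run.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)


def _read_json(path: Path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


async def write_json_async(path, data) -> None:
    """Write JSON without blocking the event loop."""
    await asyncio.to_thread(_write_json, Path(path), data)


async def read_json_async(path):
    """Read JSON without blocking the event loop."""
    return await asyncio.to_thread(_read_json, Path(path))
```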

Error Handling & Progress Tracking

Comprehensive error tracking and reporting
Progress monitoring for long operations
Detailed logging with configurable outputs
Operation statistics and summaries
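
One way to picture error tracking with an end-of-run summary is a small collector like the one below. This is an illustrative sketch only; `ErrorTracker` and its methods are hypothetical and do not describe aiofetch's own classes.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ErrorTracker:
    """Hypothetical collector for per-item outcomes of a long-running job."""

    total: int = 0
    failures: list = field(default_factory=list)

    def record_success(self) -> None:
        self.total += 1

    def record_failure(self, item, exc: Exception) -> None:
        self.total += 1
        # Store the item, the exception class name, and its message.
        self.failures.append((item, type(exc).__name__, str(exc)))

    def summary(self) -> dict:
        by_type = Counter(name for _item, name, _msg in self.failures)
        failed = len(self.failures)
        return {
            "total": self.total,
            "succeeded": self.total - failed,
            "failed": failed,
            "by_type": dict(by_type),
        }
```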

Metadata Management

Efficient in-memory caching
Field-based search functionality
Automatic metadata indexing
Structured data validation
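
In-memory caching with field-based search can be approximated by a dictionary of records plus one inverted index per searchable field. The sketch below shows that general technique under assumed names (`MetadataIndex`, `add`, `search`); it is not taken from aiofetch.

```python
from collections import defaultdict


class MetadataIndex:
    """Hypothetical in-memory store with exact-match lookup on chosen fields."""

    def __init__(self, indexed_fields):
        self.indexed_fields = tuple(indexed_fields)
        self._records = {}
        # One mapping of value -> set of record keys for each indexed field.
        self._index = {name: defaultdict(set) for name in self.indexed_fields}

    def add(self, key, record: dict) -> None:
        self.remove(key)  # replacing a record must not leave stale index entries
        self._records[key] = record
        for name in self.indexed_fields:
            if name in record:
                self._index[name][record[name]].add(key)

    def remove(self, key) -> None:
        old = self._records.pop(key, None)
        if old is None:
            return
        for name in self.indexed_fields:
            if name in old:
                self._index[name][old[name]].discard(key)

    def get(self, key):
        return self._records.get(key)

    def search(self, name, value) -> list:
        """Return records whose indexed field equals value, ordered by key."""
        keys = self._index[name].get(value, set())
        return [self._records[k] for k in sorted(keys)]
```

Searching on a field that was not passed to the constructor raises `KeyError` in this sketch; indexed values must be hashable.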

Installation

pip install aiofetch

Key Components

AsyncDownloader: Parallel file downloading with progress tracking
BatchProcessor: Process items in configurable batches
RateLimiter: Control request frequency
MetadataExtractor: HTML metadata extraction with custom selectors
PathHandler: Path and filename utilities
FileIO: Async/sync file operations
BaseCrawler: Extensible crawler base class with domain validation
LoggerFactory: Enhanced logging with file and console outputs
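
The batching idea behind a component such as BatchProcessor can be sketched in plain asyncio: split the items into fixed-size batches and cap how many run at once with a semaphore. The function names and parameters below are assumptions for illustration and are not the library's actual signatures.

```python
import asyncio


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def process_in_batches(items, worker, batch_size=5, concurrent_limit=3):
    """Run worker(item) for every item, one batch at a time.

    Results keep the input order; an exception raised by a worker is
    returned in place of its result instead of aborting the run.
    """
    semaphore = asyncio.Semaphore(concurrent_limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    results = []
    for batch in batched(list(items), batch_size):
        results.extend(
            await asyncio.gather(*(guarded(i) for i in batch), return_exceptions=True)
        )
    return results
```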

Requirements

Python 3.9+
aiofiles
aiohttp
BeautifulSoup4

Quick start

This example scrapes book data and cover images from the demo site books.toscrape.com.

import os
import asyncio
from urllib.parse import urljoin
from aiofetch.crawler import BaseCrawler, RateLimiter
from aiofetch.utils import MetadataExtractor, FileIO
from aiofetch.downloader import AsyncDownloader


class BookScraper(BaseCrawler):
    def __init__(self, base_url: str):
        super().__init__(base_url)
        self.extractor = MetadataExtractor()
        self.rate_limiter = RateLimiter()

    async def scrape_page(self, url: str) -> list:
        async with self.rate_limiter:
            content = await self.fetch_page(url)
            if not content:
                return []
            self.logger.debug(f"Parsing HTML content from {url}")
            soup = await self.parse_html(content)
            books = []
            selectors = {
                'title': ('h3 a', 'title'),
                'relative_link': ('h3 a', 'href'),
                'price': 'p.price_color',
                'availability': 'p.instock.availability',
                'rating': ('p.star-rating', 'class', 1),
                'image': ('div.image_container img', 'src')
            }
            for article in soup.select('article.product_pod'):
                data = self.extractor.extract_from_html(article, selectors)
                if rel := data.pop('relative_link', None):
                    data['url'] = urljoin(url, rel)
                else:
                    data['url'] = url
                if img := data.get('image'):
                    data['image'] = urljoin(url, img)
                books.append(data)
        return books

    async def scrape(self, start_url: str) -> list:
        return await self.scrape_page(start_url)


async def main():
    # Scrape book data
    async with BookScraper("http://books.toscrape.com") as scraper:
        books = await scraper.scrape("http://books.toscrape.com/catalogue/page-1.html")
    
    # Save scraped data as JSON
    file_io = FileIO()
    json_path = "data/books.json"
    await file_io.write_json(books, json_path)
    print(f"Saved {len(books)} books to {json_path}")
    
    # Prepare and download images
    download_tasks = []
    for book in books:
        if image_url := book.get('image'):
            filename = os.path.basename(image_url)
            local_path = os.path.join("images", filename)
            download_tasks.append((image_url, local_path))
    
    if download_tasks:
        downloader = AsyncDownloader(concurrent_limit=10)
        results = await downloader.download_batch(download_tasks)
        print(f"Downloaded {sum(results)} images out of {len(download_tasks)}")
        downloader.save_failed_downloads()


if __name__ == "__main__":
    asyncio.run(main())

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or Issue.

Author

Akram Rakhmetulla (akram042006@gmail.com)

Release history

- 0.0.6: Feb 7, 2025 (this version)
- 0.0.5: Feb 7, 2025
- 0.0.4: Feb 7, 2025
- 0.0.3: Feb 7, 2025
- 0.0.2: Feb 5, 2025
- 0.0.1: Feb 5, 2025

Download files

Source Distribution

aiofetch-0.0.6.tar.gz (14.9 kB)

Uploaded Feb 7, 2025 Source

Built Distribution

aiofetch-0.0.6-py3-none-any.whl (13.7 kB)

Uploaded Feb 7, 2025 Python 3

File details

Details for the file aiofetch-0.0.6.tar.gz.

File metadata

Download URL: aiofetch-0.0.6.tar.gz
Upload date: Feb 7, 2025
Size: 14.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for aiofetch-0.0.6.tar.gz
Algorithm	Hash digest
SHA256	`61a87c574df618590a495bee2279716e3e2425f1e030222e55ea90508d65f06c`
MD5	`133e809c2cf3ae3c1a680628e65bdea8`
BLAKE2b-256	`32873aefec6445dda8efd984300f8f10417b1fdce224c81248d6a12ef2f34522`
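
A downloaded archive can be checked against the published SHA256 digest with `hashlib`. The sketch below is a generic verification routine; the helper names are hypothetical, and the local file path passed to `verify` is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest listed above for aiofetch-0.0.6.tar.gz
EXPECTED_SHA256 = "61a87c574df618590a495bee2279716e3e2425f1e030222e55ea90508d65f06c"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large downloads are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest matches the expected value."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("aiofetch-0.0.6.tar.gz")` should return `True` for an intact copy of the source distribution saved under that name in the current directory.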

Provenance

The following attestation bundles were made for aiofetch-0.0.6.tar.gz:

Publisher: workflow.yml on spike1236/aiofetch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aiofetch-0.0.6.tar.gz
- Subject digest: 61a87c574df618590a495bee2279716e3e2425f1e030222e55ea90508d65f06c
- Sigstore transparency entry: 169629045
- Sigstore integration time: Feb 7, 2025
Source repository:
- Permalink: spike1236/aiofetch@608a3cc60653112c1a970c1ea0a83a78ea346014
- Branch / Tag: refs/tags/v0.0.6
- Owner: https://github.com/spike1236
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@608a3cc60653112c1a970c1ea0a83a78ea346014
- Trigger Event: push

File details

Details for the file aiofetch-0.0.6-py3-none-any.whl.

File metadata

Download URL: aiofetch-0.0.6-py3-none-any.whl
Upload date: Feb 7, 2025
Size: 13.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for aiofetch-0.0.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6c8d8ee01227519b57aa5a850e7131cdd7a80a1ad0302acb42ef50dda29f7d5b`
MD5	`183bed802d63e9ba22cdc67bfa22e203`
BLAKE2b-256	`79e96cd7427fd0bcd1069944f2726edfff9cb96bc238723609d5510b31fda4ce`

Provenance

The following attestation bundles were made for aiofetch-0.0.6-py3-none-any.whl:

Publisher: workflow.yml on spike1236/aiofetch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aiofetch-0.0.6-py3-none-any.whl
- Subject digest: 6c8d8ee01227519b57aa5a850e7131cdd7a80a1ad0302acb42ef50dda29f7d5b
- Sigstore transparency entry: 169629048
- Sigstore integration time: Feb 7, 2025
Source repository:
- Permalink: spike1236/aiofetch@608a3cc60653112c1a970c1ea0a83a78ea346014
- Branch / Tag: refs/tags/v0.0.6
- Owner: https://github.com/spike1236
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@608a3cc60653112c1a970c1ea0a83a78ea346014
- Trigger Event: push
