A collection of tools for asynchronous web scraping, crawling, and parsing
aiofetch
A Python toolkit for asynchronous web scraping with built-in error tracking and metadata management.
Features
Web Processing
- Asynchronous file downloading with progress tracking
- Rate limiting with configurable delays
- Smart retry logic with timeout handling
- Domain-aware crawling with URL validation
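To make the rate limiting and retry bullets concrete, here is a minimal standard-library sketch of the two techniques. It is not aiofetch's implementation: `SimpleRateLimiter`, `fetch_with_retry`, and all parameter names are hypothetical and chosen only for illustration.

```python
import asyncio
import time


class SimpleRateLimiter:
    """Hypothetical helper: enforce a minimum delay between calls."""

    def __init__(self, min_delay: float) -> None:
        self.min_delay = min_delay
        self._lock = asyncio.Lock()
        self._last_call = None  # monotonic timestamp of the previous call

    async def wait(self) -> None:
        # Serialize callers so the delay also holds under concurrency.
        async with self._lock:
            if self._last_call is not None:
                remaining = self.min_delay - (time.monotonic() - self._last_call)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last_call = time.monotonic()


async def fetch_with_retry(make_request, *, retries=3, timeout=5.0, backoff=0.5):
    """Await make_request() with a per-attempt timeout and exponential backoff.

    make_request is a zero-argument callable returning a fresh coroutine,
    because a coroutine object cannot be awaited twice.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return await asyncio.wait_for(make_request(), timeout=timeout)
        except (asyncio.TimeoutError, OSError) as exc:
            last_error = exc
            if attempt < retries - 1:
                # Wait 0.5s, 1s, 2s, ... (scaled by backoff) before retrying.
                await asyncio.sleep(backoff * 2 ** attempt)
    raise last_error
```

Create the limiter inside a running coroutine and call `await limiter.wait()` before each request; wrapping the request itself in `fetch_with_retry` keeps the two concerns separate.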
Content Processing
- Flexible HTML content parsing
- Custom selector-based metadata extraction
- Automated link and image extraction
- URL normalization and path handling
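The URL normalization and link/image extraction bullets can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the library's parser: `normalize_url` and `LinkImageCollector` are hypothetical names, and the normalization rules (lowercase host, drop fragment, default path `/`) are one reasonable choice among several.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_url(url: str, base: str = "") -> str:
    """Resolve url against base, lowercase scheme and host, drop the fragment."""
    parts = urlsplit(urljoin(base, url))
    path = parts.path or "/"
    # Note: lowercasing netloc also lowercases any userinfo it contains.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class LinkImageCollector(HTMLParser):
    """Collect normalized <a href> and <img src> targets from HTML."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(normalize_url(attr_map["href"], self.base_url))
        elif tag == "img" and attr_map.get("src"):
            self.images.append(normalize_url(attr_map["src"], self.base_url))


collector = LinkImageCollector("https://example.com/blog/")
collector.feed('<a href="about.html">About</a><img src="image.jpg" alt="Test">')
print(collector.links)   # ['https://example.com/blog/about.html']
print(collector.images)  # ['https://example.com/blog/image.jpg']
```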
File & Data Management
- Asynchronous file operations
- Concurrent chunk-based downloads
- Smart path handling and file naming
- JSON data management with validation
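As a rough illustration of concurrent chunk-based downloading, the sketch below splits a file into inclusive byte ranges and fetches them in parallel under a concurrency cap. It does not reflect aiofetch internals: `split_ranges`, `download_in_chunks`, and the `fetch_range` callback are hypothetical, and the callback stands in for whatever issues an HTTP `Range` request.

```python
import asyncio


def split_ranges(total_size: int, chunk_size: int):
    """Return inclusive (start, end) byte ranges covering total_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


async def download_in_chunks(fetch_range, total_size, chunk_size, concurrent_limit=4):
    """Fetch all ranges concurrently and join them in order.

    fetch_range(start, end) is a hypothetical coroutine returning the bytes
    for that inclusive range.
    """
    semaphore = asyncio.Semaphore(concurrent_limit)

    async def fetch_one(start, end):
        # The semaphore caps how many range requests run at once.
        async with semaphore:
            return await fetch_range(start, end)

    # gather() returns results in argument order, so chunks stay in sequence.
    chunks = await asyncio.gather(
        *(fetch_one(start, end) for start, end in split_ranges(total_size, chunk_size))
    )
    return b"".join(chunks)
```

Because `asyncio.gather` preserves argument order, the chunks can be joined directly even when they finish out of order.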
Error Handling & Progress Tracking
- Comprehensive error tracking and reporting
- Progress monitoring for long operations
- Detailed logging with configurable outputs
- Operation statistics and summaries
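One simple way to picture error tracking with an end-of-run summary is a small counter object that records each outcome. The `OperationStats` class below is a hypothetical sketch for illustration only; the package's own tracker may be structured quite differently.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class OperationStats:
    """Hypothetical tracker: counts successes and keeps failures for a report."""

    succeeded: int = 0
    errors: List[Tuple[Any, BaseException]] = field(default_factory=list)

    def record_success(self) -> None:
        self.succeeded += 1

    def record_error(self, item: Any, exc: BaseException) -> None:
        # Keep the failing item next to its exception for later reporting.
        self.errors.append((item, exc))

    def summary(self) -> Dict[str, Any]:
        failed = len(self.errors)
        by_type = Counter(type(exc).__name__ for _, exc in self.errors)
        return {
            "total": self.succeeded + failed,
            "succeeded": self.succeeded,
            "failed": failed,
            "errors_by_type": dict(by_type),
        }


stats = OperationStats()
for raw in ["1", "2", "x"]:
    try:
        int(raw)
        stats.record_success()
    except ValueError as exc:
        stats.record_error(raw, exc)
print(stats.summary())
```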
Metadata Management
- Efficient in-memory caching
- Field-based search functionality
- Automatic metadata indexing
- Structured data validation
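Field-based search over cached metadata is commonly built on an inverted index that maps each (field, value) pair to the records containing it. The following is a minimal sketch of that idea, assuming exact-match lookups on scalar values; `MetadataIndex` and its methods are hypothetical and are not taken from aiofetch.

```python
from collections import defaultdict

_INDEXABLE = (str, int, float, bool)


class MetadataIndex:
    """Hypothetical in-memory cache with an exact-match index per field."""

    def __init__(self) -> None:
        self._records = {}
        self._index = defaultdict(set)  # (field, value) -> set of record keys

    def add(self, key: str, record: dict) -> None:
        if not isinstance(record, dict):
            raise TypeError("record must be a dict")
        self.remove(key)  # re-adding a key replaces its old index entries
        self._records[key] = dict(record)
        for field_name, value in record.items():
            if isinstance(value, _INDEXABLE):
                self._index[(field_name, value)].add(key)

    def remove(self, key: str) -> None:
        record = self._records.pop(key, None)
        if record is None:
            return
        for field_name, value in record.items():
            if isinstance(value, _INDEXABLE):
                self._index[(field_name, value)].discard(key)

    def search(self, field_name: str, value) -> list:
        """Return the sorted keys of records whose field equals value."""
        return sorted(self._index.get((field_name, value), set()))
```

A lookup such as `index.search("lang", "en")` then costs one dictionary access instead of a scan over every cached record.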
Installation
pip install aiofetch
Key Components
- AsyncDownloader: Parallel file downloading with progress tracking
- BatchProcessor: Process items in configurable batches
- RateLimiter: Control request frequency
- MetadataExtractor: HTML metadata extraction with custom selectors
- PathHandler: Path and filename utilities
- FileIO: Async/sync file operations
- BaseCrawler: Extensible crawler base class with domain validation
- LoggerFactory: Enhanced logging with file and console outputs
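The component list names the classes but not their signatures, so the sketch below shows only the general pattern behind batch processing: run a worker over items in fixed-size batches, awaiting each batch before starting the next. `process_in_batches` and its parameters are hypothetical and should not be read as the `BatchProcessor` API.

```python
import asyncio


async def process_in_batches(items, worker, batch_size=10):
    """Run worker(item) concurrently within each batch, batches in sequence.

    Exceptions are returned in place of results so that one failing item
    does not cancel the rest of its batch.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(
            await asyncio.gather(*(worker(item) for item in batch), return_exceptions=True)
        )
    return results


async def double(value):
    # Stand-in for real work such as a download or a parse step.
    if value == 3:
        raise ValueError("cannot process 3")
    return value * 2


results = asyncio.run(process_in_batches([1, 2, 3, 4, 5], double, batch_size=2))
print(results)
```

Passing `return_exceptions=True` keeps the output aligned with the input order, which makes it straightforward to pair each failure with the item that caused it.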
Requirements
- Python 3.9+
- aiofiles
- aiohttp
- BeautifulSoup4
Quick start
import asyncio
from aiofetch import (
    AsyncDownloader,
    MetadataExtractor,
    ContentParser,
    FileIO
)

async def main():
    # Initialize components
    downloader = AsyncDownloader(concurrent_limit=20)
    parser = ContentParser()
    file_io = FileIO()

    # Download files
    urls = [
        ("https://example.com/file1.pdf", "downloads/file1.pdf"),
        ("https://example.com/file2.pdf", "downloads/file2.pdf")
    ]
    await downloader.download_batch(urls)

    # Parse HTML content
    html = """<html><body>
    <h1>Title</h1>
    <img src="image.jpg" alt="Test">
    </body></html>"""

    # Extract metadata
    extractor = MetadataExtractor()
    metadata = extractor.extract_from_html(html, {
        'title': 'h1',
        'images': ('img', 'src')
    })

    # Save results
    await file_io.save_json(metadata, 'output/metadata.json')

if __name__ == "__main__":
    asyncio.run(main())
License
MIT License - see LICENSE file for details
Contributing
Contributions are welcome! Please feel free to submit a Pull Request or Issue.
Author
Akram Rakhmetulla (akram042006@gmail.com)
File details
Details for the file aiofetch-0.0.2.tar.gz.
File metadata
- Download URL: aiofetch-0.0.2.tar.gz
- Upload date:
- Size: 13.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cc7f7698d74cae4893fe2aa042ff5a9b5f28c6dfcf617d209327bdf3947d4d85 |
| MD5 | 15a0c6151578b8127a07a35fb7e8a3b5 |
| BLAKE2b-256 | b422144a2862316947cf2263116f12d14b7d5d5c8faa8820f5f82c5786f2d385 |
Provenance
The following attestation bundles were made for aiofetch-0.0.2.tar.gz:
Publisher: workflow.yml on spike1236/aiofetch
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aiofetch-0.0.2.tar.gz
- Subject digest: cc7f7698d74cae4893fe2aa042ff5a9b5f28c6dfcf617d209327bdf3947d4d85
- Sigstore transparency entry: 169003283
- Sigstore integration time:
- Permalink: spike1236/aiofetch@327175afd9c59b1ef39a268c540a5af5b545470f
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/spike1236
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@327175afd9c59b1ef39a268c540a5af5b545470f
- Trigger Event: push
File details
Details for the file aiofetch-0.0.2-py3-none-any.whl.
File metadata
- Download URL: aiofetch-0.0.2-py3-none-any.whl
- Upload date:
- Size: 12.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d142fa3a150bfc45ba6c3d32521e4f1fa0017c9212cb2a9049f65d9c7587a70f |
| MD5 | 12a9716d4fc68c3a3424cd6f3601b506 |
| BLAKE2b-256 | 444d433ab18e0c289accd88b7a30efabadf7a68bd4f3aa16ae89fee7aff04cb2 |
Provenance
The following attestation bundles were made for aiofetch-0.0.2-py3-none-any.whl:
Publisher: workflow.yml on spike1236/aiofetch
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: aiofetch-0.0.2-py3-none-any.whl
- Subject digest: d142fa3a150bfc45ba6c3d32521e4f1fa0017c9212cb2a9049f65d9c7587a70f
- Sigstore transparency entry: 169003287
- Sigstore integration time:
- Permalink: spike1236/aiofetch@327175afd9c59b1ef39a268c540a5af5b545470f
- Branch / Tag: refs/tags/v0.0.2
- Owner: https://github.com/spike1236
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yml@327175afd9c59b1ef39a268c540a5af5b545470f
- Trigger Event: push