A simple manager for async requests
WebRiderAsync
WebRiderAsync is an asynchronous web scraping utility designed to handle large volumes of web requests efficiently. It uses Python's aiohttp for asynchronous HTTP requests, so many requests can be in flight concurrently instead of being processed one at a time.
Features
- Asynchronous requests for high performance
- Customizable user agents and proxies
- Retry policy for handling failed requests
- Logging support with customizable log levels and file output
- Configurable concurrency and delay settings
- Statistics tracking and reporting
Installation
To use WebRiderAsync, you need to have Python 3.8 or higher installed. Install the required packages using pip:
pip install webrider_async
Usage
Here's a basic example of how to use the WebRiderAsync class:
Initialization
from webrider_async import WebRiderAsync
# Create an instance of WebRiderAsync
webrider = WebRiderAsync(
    log_level='debug',          # Logging level: 'debug', 'info', 'warning', 'error'
    file_output=True,           # Save logs to a file
    random_user_agents=True,    # Use random user agents
    concurrent_requests=20,     # Number of concurrent requests
    max_retries=3,              # Maximum number of retries per request
    delay_before_retry=2        # Delay before retrying a request (in seconds)
)
Making Requests
urls = ['https://example.com/page1', 'https://example.com/page2']
# Perform requests
responses = webrider.request(urls)
# Process responses
for response in responses:
    print(response.url, response.status_code)
    print(response.html[:100])  # Print the first 100 characters of the HTML
Updating Settings
webrider.update_settings(
    log_level='info',
    file_output=False,
    random_user_agents=False,
    custom_user_agent='MyCustomUserAgent',
    concurrent_requests=10,
    max_retries=5
)
A full working example is available in the examples folder.
Tracking Statistics
# Print current statistics
webrider.stats()
# Reset statistics
webrider.reset_stats()
Parameters
__init__
Parameters
- log_level: Specifies the log level. Options: 'debug', 'info', 'warning', 'error'.
- file_output: If True, logs will be saved to a file.
- random_user_agents: If True, a random user agent will be used for each request.
- custom_user_agent: A custom user agent string.
- custom_headers: A dictionary of custom headers.
- custom_proxies: A single proxy string or a list of proxies to be used.
- concurrent_requests: Number of concurrent requests allowed.
- delay_per_chunk: Delay between chunks of requests (in seconds).
- max_retries: Maximum number of retries per request.
- delay_before_retry: Delay before retrying a failed request (in seconds).
- max_wait_for_resp: Maximum time to wait for a response (in seconds).
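As a rough illustration of how concurrent_requests and delay_per_chunk could interact, the sketch below runs requests in chunks and pauses between chunks. It is not WebRiderAsync's actual implementation: fake_fetch and fetch_in_chunks are hypothetical names, and the network call is replaced by a short sleep so the example runs offline.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request; sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def fetch_in_chunks(urls, concurrent_requests=2, delay_per_chunk=0.05):
    """Run at most `concurrent_requests` fetches at once, pausing between chunks."""
    results = []
    for start in range(0, len(urls), concurrent_requests):
        chunk = urls[start:start + concurrent_requests]
        # All requests in a chunk run concurrently; gather preserves input order.
        results.extend(await asyncio.gather(*(fake_fetch(u) for u in chunk)))
        # Pause only if another chunk follows (assumed behaviour, not confirmed).
        if start + concurrent_requests < len(urls):
            await asyncio.sleep(delay_per_chunk)
    return results


urls = [f"https://example.com/page{i}" for i in range(1, 6)]
results = asyncio.run(fetch_in_chunks(urls))
print(results)
```

With five URLs and a chunk size of two, this sketch issues three chunks (2, 2, and 1 requests) with two pauses in between.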
Methods
- request(urls, headers=None, user_agent=None, proxies=None): Perform asynchronous requests to the specified URLs.
- update_settings(...): Update settings for the WebRiderAsync instance.
- stats(): Print current scraping statistics.
- reset_stats(): Reset statistics to zero.
- chunkify(initial_list, chunk_size=10): Split a list into chunks of the specified size.
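To make the chunking behaviour concrete, here is a minimal standalone sketch of splitting a list into fixed-size chunks. The function name chunkify_sketch is hypothetical, and whether the library's own chunkify returns a list or a generator is not stated here, so treat the return type as an assumption.

```python
def chunkify_sketch(initial_list, chunk_size=10):
    """Split a list into consecutive chunks of at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        initial_list[i:i + chunk_size]
        for i in range(0, len(initial_list), chunk_size)
    ]


chunks = chunkify_sketch(list(range(7)), chunk_size=3)
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The last chunk is shorter when the list length is not a multiple of chunk_size.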
Logging
Logging can be configured to print to the console or save to a file. The log file is saved in a logs directory under the current working directory, with a timestamp in the filename.
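The setup described above might look roughly like the following sketch, built on Python's standard logging module. The function build_logger, the logger name, the base_dir parameter, and the filename pattern are all assumptions made for illustration; the library's actual file naming may differ.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path


def build_logger(log_level="debug", file_output=False, base_dir=None):
    """Console logger that optionally also writes to logs/<timestamped file>."""
    base = Path(base_dir) if base_dir else Path.cwd()
    logger = logging.getLogger("webrider_sketch")  # hypothetical logger name
    logger.setLevel(getattr(logging, log_level.upper()))
    logger.handlers.clear()
    logger.propagate = False
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    log_path = None
    if file_output:
        log_dir = base / "logs"
        log_dir.mkdir(parents=True, exist_ok=True)
        # Assumed filename pattern; only "timestamp in the filename" is documented.
        stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        log_path = log_dir / f"webrider_{stamp}.log"
        file_handler = logging.FileHandler(log_path, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger, log_path


# A temporary directory stands in for the current working directory here.
logger, log_path = build_logger("info", file_output=True, base_dir=tempfile.mkdtemp())
logger.info("Request finished")
print(log_path)
```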
Error Handling
If a request fails after the maximum number of retries, it is logged as a failure. Errors during request processing are logged with traceback information.
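The retry behaviour can be sketched as a loop around a single request. This is an illustration under assumptions, not the library's code: request_with_retries and flaky_fetch are hypothetical names, and the sketch assumes max_retries counts retries after the first attempt, which the documentation does not confirm.

```python
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("retry_sketch")


async def request_with_retries(fetch, url, max_retries=3, delay_before_retry=0.01):
    """Call `fetch(url)`, retrying up to `max_retries` times after the first failure."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("Attempt %d for %s failed", attempt + 1, url)
            if attempt < max_retries:
                await asyncio.sleep(delay_before_retry)
    logger.error("Giving up on %s after %d retries", url, max_retries)
    return None


attempts = {"count": 0}


async def flaky_fetch(url):
    """Hypothetical fetcher that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return f"ok: {url}"


result = asyncio.run(request_with_retries(flaky_fetch, "https://example.com/page1"))
print(result, attempts["count"])
```

Here the first two attempts fail and are logged with tracebacks, and the third succeeds. A fetcher that always fails would return None after the final attempt.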
License
This project is licensed under the MIT License - see the LICENSE file for details.