
A simple manager for async requests

Project description

WebRiderAsync

WebRiderAsync is an asynchronous web scraping utility designed to handle large volumes of web requests efficiently. It uses Python's aiohttp for asynchronous HTTP requests, processing many requests concurrently to achieve high throughput.

Features

  • Asynchronous requests for high performance
  • Customizable user agents and proxies
  • Retry policy for handling failed requests
  • Logging support with customizable log levels and file output
  • Configurable concurrency and delay settings
  • Statistics tracking and reporting

Installation

WebRiderAsync requires Python 3.8 or higher. Install the package with pip:

pip install webrider_async==0.0.4

Usage

A full working example can be found in the examples folder.

Here's a basic example of how to use the WebRiderAsync class:

Initialization

from webrider_async import WebRiderAsync

# Create an instance of WebRiderAsync
webrider = WebRiderAsync(
    log_level='debug',                  # Logging level: 'debug', 'info', 'warning', 'error'
    file_output=True,                   # Save logs to a file
    random_user_agents=True,            # Use random user agents
    concurrent_requests=20,             # Number of concurrent requests
    max_retries=3,                      # Maximum number of retries per request
    delay_before_retry=2                # Delay before retrying a request (in seconds)
)

Making Requests

urls = ['https://example.com/page1', 'https://example.com/page2']

# Perform requests
responses = webrider.request(urls)

# Process responses
for response in responses:
    print(response.url, response.status_code)
    print(response.html[:100])  # Print the first 100 characters of the HTML
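
Responses expose url, status_code, and html, so successes and failures can be separated after a batch run. A minimal sketch; treating anything other than HTTP 200 as a failure is an assumption, not documented behavior:

# Split responses into successes and failures.
# Assumption: any status code other than 200 counts as a failure.
ok = [r for r in responses if r.status_code == 200]
failed_urls = [r.url for r in responses if r.status_code != 200]

if failed_urls:
    # Re-request failed URLs in a second pass
    retry_responses = webrider.request(failed_urls)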

Updating Settings

webrider.update_settings(
    log_level='info',
    file_output=False,
    random_user_agents=False,
    custom_user_agent='MyCustomUserAgent',
    concurrent_requests=10,
    max_retries=5
)


Tracking Statistics

# Print current statistics
webrider.stats()

# Reset statistics
webrider.reset_stats()

Parameters

__init__ Parameters

  • log_level: Specifies the log level. Options: 'debug', 'info', 'warning', 'error'.
  • file_output: If True, logs will be saved to a file.
  • random_user_agents: If True, a random user agent will be used for each request.
  • custom_user_agent: A custom user agent string.
  • custom_headers: A dictionary of custom headers.
  • custom_proxies: A single proxy string or a list of proxies to use (see the sketch after this list).
  • concurrent_requests: Number of concurrent requests allowed.
  • delay_per_chunk: Delay between chunks of requests (in seconds).
  • max_retries: Maximum number of retries per request.
  • delay_before_retry: Delay before retrying a failed request (in seconds).
  • max_wait_for_resp: Maximum time to wait for a response (in seconds).
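
For example, custom headers and proxies can be supplied at initialization. A minimal sketch using only the parameters listed above; the proxy URL format shown is an assumption, as it is not specified here (check the examples folder for the exact format the library expects):

from webrider_async import WebRiderAsync

# Sketch: initialization with custom headers and proxies.
webrider = WebRiderAsync(
    custom_headers={'Accept-Language': 'en-US'},
    custom_proxies=[
        'http://user:pass@proxy1.example.com:8080',   # assumed proxy URL format
        'http://user:pass@proxy2.example.com:8080',
    ],
    concurrent_requests=10,
    delay_per_chunk=1,       # pause between chunks of requests (seconds)
    max_wait_for_resp=30     # give up on a response after 30 seconds
)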

Methods

  • request(urls, headers=None, user_agent=None, proxies=None): Perform asynchronous requests to the specified URLs.
  • update_settings(...): Update settings for the WebRiderAsync instance.
  • stats(): Print current scraping statistics.
  • reset_stats(): Reset statistics to zero.
  • chunkify(initial_list, chunk_size=10): Split a list into chunks of the specified size (combined with request() in the sketch below).
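
The per-request overrides and chunkify can be combined when working through a large URL list. A minimal sketch built only from the signatures above:

# Process a large URL list in chunks of 50, overriding headers per call.
all_urls = [f'https://example.com/page{i}' for i in range(500)]

for chunk in webrider.chunkify(all_urls, chunk_size=50):
    responses = webrider.request(chunk, headers={'Accept': 'text/html'})
    for response in responses:
        print(response.url, response.status_code)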

Logging

Logging can be configured to print to the console or save to a file. The log file is saved in a logs directory under the current working directory, with a timestamp in the filename.

Error Handling

If a request fails after the maximum number of retries, it is logged as a failure. Errors during request processing are logged with traceback information.
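
Failures are reported through the logs; to detect them programmatically as well, one approach is to diff the requested and returned URL sets. This assumes a request that exhausts its retries yields no response object, which this description does not state explicitly:

# Sketch: find URLs that never produced a response.
requested = set(urls)
received = {response.url for response in responses}
failed = requested - received
print(f'{len(failed)} request(s) failed after all retries: {failed}')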

License

This project is licensed under the MIT License - see the LICENSE file for details.

Thanks!


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

webrider_async-0.0.4.tar.gz (30.2 kB)

Uploaded: Source

Built Distribution

webrider_async-0.0.4-py3-none-any.whl (29.3 kB)

Uploaded: Python 3

File details

Details for the file webrider_async-0.0.4.tar.gz.

File metadata

  • Download URL: webrider_async-0.0.4.tar.gz
  • Upload date:
  • Size: 30.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for webrider_async-0.0.4.tar.gz

  • SHA256: 1e2138970c607168a19c85be233b558a99e4edcaeb991ad9b72c36ad6d5369f3
  • MD5: e0b3fc656c27a6a4f70852c2c7437c6c
  • BLAKE2b-256: d2428d3e644cc373bb9e9cc4395528a2ce4018f72c6be6c6ab5fa8c86a672f2c

See more details on using hashes here.

File details

Details for the file webrider_async-0.0.4-py3-none-any.whl.

File hashes

Hashes for webrider_async-0.0.4-py3-none-any.whl

  • SHA256: 13d6c738f28a8fde79b3d5339b2edc86414c9411a8c4b154d54066d34f1eb644
  • MD5: d919ac13371176a0e51a97041182bbc0
  • BLAKE2b-256: b00b2406d051f8565c322d7e1d076d1ff5c92fd2ee38ae55a5166e046f8973d5

See more details on using hashes here.
