A package designed to scrape webpages using aiohttp and asyncio. It includes error handling to overcome common issues, such as sites blocking you after n requests over a short period.

Project description

Async-scrape

Perform web scraping asynchronously

Async-scrape is a package that uses asyncio and aiohttp to scrape websites, with resilience features such as automatic breaks and rate limiting built in.

Features

  • Breaks - pauses scraping when a website consistently blocks your requests
  • Rate limit - slows down scraping to prevent you being blocked (see the sketch below)
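
Both features are driven by constructor arguments (the same ones shown in the full examples below). A minimal sketch, assuming the remaining arguments can be left at their defaults; note that the exact unit of call_rate_limit is not stated here:

from async_scrape import AsyncScrape

scraper = AsyncScrape(
    post_process_func=lambda html, resp, **kwargs: html,  # pass responses straight through
    rest_between_attempts=True,  # take a break when requests keep failing
    rest_wait=60,                # how long to pause between attempt loops
    call_rate_limit=60,          # slow the request rate to avoid being blocked
)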

Installation

Async-scrape requires C++ Build tools v15+ to run.

pip install async-scrape

How to use it

Key input parameters:

  • post_process_func - the callable used to process each returned response
  • post_process_kwargs - any kwargs to be passed to that callable
  • use_proxy - whether a proxy should be used (if True, provide either a proxy or a pac_url)
  • attempt_limit - how many attempts each request is given before it is marked as failed
  • rest_wait - how long the program pauses between attempt loops
  • call_rate_limit - limits the rate of requests (useful to avoid being blocked by websites)
  • randomise_headers - if True, a new set of headers is generated for each request

Get requests

# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

resps = async_Scrape.scrape_all(urls)
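
Note that scrape_all is called without await: the asyncio event loop is managed internally, so it can be used from ordinary synchronous code.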

Post requests

# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://eos1jv6curljagq.m.pipedream.net",
    "https://eos1jv6curljagq.m.pipedream.net",
]
payloads = [
    {"value": 0},
    {"value": 1}
]

resps = async_Scrape.scrape_all(urls, payloads=payloads)
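
Each payload is paired with the URL at the same index, so the two lists should be the same length.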

Response

The response object is a list of dicts in the following format:

{
    "url": url,               # url of the request
    "req": req,               # combination of url and params
    "func_resp": func_resp,   # return value of the post-processing function
    "status": resp.status,    # http status code
    "error": None             # any error encountered, otherwise None
}
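
Since each dict carries the status and any error alongside the post-processed result, the results can be split after scraping. A minimal sketch, continuing from the resps list returned by scrape_all above:

succeeded = [r for r in resps if r["status"] == 200 and r["error"] is None]
failed = [r for r in resps if r["error"] is not None]

for r in succeeded:
    print(r["url"], "->", r["func_resp"])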

License

MIT

Free Software, Hell Yeah!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

async_scrape-0.1.19.tar.gz (12.8 kB)

Built Distribution

async_scrape-0.1.19-py3-none-any.whl (16.6 kB)

File details

Details for the file async_scrape-0.1.19.tar.gz.

File metadata

  • Download URL: async_scrape-0.1.19.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.11.1 Windows/10

File hashes

Hashes for async_scrape-0.1.19.tar.gz:

  • SHA256: 8bb35bd8cc19763d2eaba8585505716eb71dcb5c991bc6d4421a0b9c9b85d24f
  • MD5: f0a4acb7063108f0fd4e47f322db7a98
  • BLAKE2b-256: 6bb0ba7f36e47e1f9e15ab0f3b99d2b6224d0d8c6187287e10cad04b39a5988d

See the PyPI documentation for more details on using hashes.
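
For example, a downloaded file can be checked against the SHA256 digest above using Python's standard hashlib (the local filename here assumes the sdist has been downloaded to the working directory):

import hashlib

EXPECTED_SHA256 = "8bb35bd8cc19763d2eaba8585505716eb71dcb5c991bc6d4421a0b9c9b85d24f"

# Read the downloaded archive and compare its digest to the published one
with open("async_scrape-0.1.19.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == EXPECTED_SHA256 else "hash mismatch")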

File details

Details for the file async_scrape-0.1.19-py3-none-any.whl.

File metadata

  • Download URL: async_scrape-0.1.19-py3-none-any.whl
  • Upload date:
  • Size: 16.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.11.1 Windows/10

File hashes

Hashes for async_scrape-0.1.19-py3-none-any.whl:

  • SHA256: 6168221e308c13ca9f72876c0550f08d8e2bb2ec1e5cb1bcce0eb14873049ea3
  • MD5: 9815e165ebf360c00d731c751f9fea54
  • BLAKE2b-256: a76bb51bbe309a35ef4ff2f5cf27889c250eba9ebb824c0ef423509ba6c5a61f

See the PyPI documentation for more details on using hashes.
