A package designed to scrape webpages using aiohttp and asyncio. It includes error handling for common issues, such as sites blocking you after n requests in a short period.
Async-scrape
Perform web scrapes asynchronously
Async-scrape is a package which uses asyncio and aiohttp to scrape websites and has useful features built in.
Features
- Breaks - pause scraping when a website blocks your requests consistently
- Rate limit - slow down scraping to prevent being blocked
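The two features above can be illustrated independently of the package. The following is a minimal sketch of the general idea, not Async-scrape's actual implementation: `RateLimiter`, `run_with_breaks`, and all of their parameters are hypothetical names, and `fake_fetch` is a stand-in for a real HTTP request. For clarity the sketch issues requests one at a time.

```python
import asyncio
import time


class RateLimiter:
    """Space out calls so that at most `calls_per_second` start each second."""

    def __init__(self, calls_per_second):
        self._interval = 1.0 / calls_per_second
        self._next_slot = 0.0

    async def wait(self):
        now = time.monotonic()
        delay = max(0.0, self._next_slot - now)
        # Reserve the next slot before sleeping. No await happens between
        # the read and the write, so concurrent tasks cannot collide.
        self._next_slot = max(now, self._next_slot) + self._interval
        if delay:
            await asyncio.sleep(delay)


async def run_with_breaks(urls, fetch, limiter, failure_limit=3, rest_seconds=0.05):
    """Fetch urls in order, resting after `failure_limit` consecutive failures."""
    results, consecutive_failures, rests = [], 0, 0
    for url in urls:
        await limiter.wait()  # rate limit: slow down before every request
        try:
            results.append((url, await fetch(url), None))
            consecutive_failures = 0
        except Exception as exc:
            results.append((url, None, exc))
            consecutive_failures += 1
        if consecutive_failures >= failure_limit:
            # Break: the site appears to be blocking us, so pause for a while.
            await asyncio.sleep(rest_seconds)
            consecutive_failures = 0
            rests += 1
    return results, rests


async def fake_fetch(url):
    """Stand-in for a real request: URLs containing 'blocked' fail."""
    if "blocked" in url:
        raise ConnectionError(f"blocked: {url}")
    return "ok"


def demo():
    urls = ["https://example.com/a"] + ["https://example.com/blocked"] * 3
    limiter = RateLimiter(calls_per_second=100)
    return asyncio.run(run_with_breaks(urls, fake_fetch, limiter))
```

Calling `demo()` returns the per-URL results together with the number of rests taken. In real use the rest period would be far longer than the 0.05 seconds chosen here to keep the demo quick.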
Installation
Async-scrape requires C++ Build tools v15+ to run.
pip install async-scrape
How to use it
Key input parameters:
- post_process_func - the callable used to process the returned response
- post_process_kwargs - any kwargs to be passed to the callable
- use_proxy - whether a proxy should be used (if this is True, then provide either a proxy or a pac_url variable)
- attempt_limit - how many attempts each request is given before it is marked as failed
- rest_wait - how long the programme should pause between loops
- call_rate_limit - limits the rate of requests (useful to avoid being blocked by websites)
- randomise_headers - if set to True, a new set of headers is generated between each request
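As an illustration of how post_process_func and post_process_kwargs fit together, here is a sketch of a callable with the `(html, resp, **kwargs)` signature used in this README's examples. The `extract_title` name and the `max_chars` keyword are hypothetical, and the sketch assumes that the entries of post_process_kwargs reach the callable as keyword arguments. A `SimpleNamespace` stands in for the real response object so the function can be tried without making a request.

```python
import re
from types import SimpleNamespace

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html, resp, **kwargs):
    """Return the page title, or None if the request failed or no title exists."""
    if resp.status != 200:
        return None
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # `max_chars` is a hypothetical option supplied via post_process_kwargs.
    max_chars = kwargs.get("max_chars", 80)
    # Collapse runs of whitespace, then truncate.
    return " ".join(match.group(1).split())[:max_chars]


# Try the callable locally with a stand-in response object.
fake_resp = SimpleNamespace(status=200)
sample_html = "<html><head><title>  Example   Domain </title></head></html>"
title = extract_title(sample_html, fake_resp, max_chars=7)
```

Under that assumption, the function would be supplied as `post_process_func=extract_title` with `post_process_kwargs={"max_chars": 7}`.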
Get requests
# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

resps = async_Scrape.scrape_all(urls)
Post requests
# Create an instance
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

urls = [
    "https://eos1jv6curljagq.m.pipedream.net",
    "https://eos1jv6curljagq.m.pipedream.net",
]

payloads = [
    {"value": 0},
    {"value": 1}
]

resps = async_Scrape.scrape_all(urls, payloads=payloads)
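In the example above, urls and payloads have the same length, which suggests that each payload is matched to the URL at the same position. That pairing is an assumption here, not something this README states. Under that assumption, a small helper (the hypothetical `build_post_batch`) can build both lists from one set of records so they cannot drift out of step:

```python
def build_post_batch(endpoint, records):
    """Return parallel (urls, payloads) lists, one entry per record."""
    urls = [endpoint for _ in records]
    # Copy each record so later edits to the input do not alter the payloads.
    payloads = [dict(record) for record in records]
    return urls, payloads


urls, payloads = build_post_batch(
    "https://eos1jv6curljagq.m.pipedream.net",
    [{"value": 0}, {"value": 1}],
)
```

The resulting lists have the same shape as the hand-written ones in the example and could be passed to `scrape_all` in the same way.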
Response
The response object is a list of dicts, each in the format:
{
    "url":url, # url of request
    "req":req, # combination of url and params
    "func_resp":func_resp, # response from post processing function
    "status":resp.status, # http status
    "error":None # any error encountered
}
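Given that format, results can be split into successes and failures with ordinary list handling. The sketch below uses hand-written sample entries rather than real output, and `split_results` is a hypothetical helper name. It treats an entry as failed when `error` is set or the status is not 200, which is one reasonable convention rather than the package's own definition. The string used for `error` in the sample is purely illustrative, since the type of that field is not documented here.

```python
def split_results(resps):
    """Split response dicts into (succeeded, failed) lists."""
    succeeded, failed = [], []
    for entry in resps:
        if entry["error"] is None and entry["status"] == 200:
            succeeded.append(entry)
        else:
            failed.append(entry)
    return succeeded, failed


# Hand-written sample entries in the documented format (not real output).
sample = [
    {"url": "https://www.google.com", "req": "https://www.google.com",
     "func_resp": "Request worked", "status": 200, "error": None},
    {"url": "https://www.bing.com", "req": "https://www.bing.com",
     "func_resp": None, "status": None, "error": "TimeoutError"},
]

succeeded, failed = split_results(sample)
# URLs that could be passed to another scrape_all call as a retry.
retry_urls = [entry["url"] for entry in failed]
```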
License
MIT
Free Software, Hell Yeah!