A package for scraping web pages with aiohttp and asyncio. It includes error handling for common issues, such as sites blocking you after a number of requests in a short period.
Async-scrape
Perform web scraping asynchronously
Async-scrape is a package which uses asyncio and aiohttp to scrape websites and has useful features built in.
Features
- Breaks - pause scraping when a website blocks your requests consistently
- Rate limit - slow down scraping to prevent being blocked
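To make the two features concrete, here is a minimal standard-library sketch of the general idea. It is not async-scrape's actual implementation: the names `fetch_with_breaks`, `fake_fetch`, `min_interval`, `block_threshold` and `pause` are all hypothetical, `fake_fetch` stands in for a real HTTP call, and URLs are handled one at a time for simplicity.

```python
import asyncio
import time


async def fetch_with_breaks(fetch, urls, min_interval=0.01, block_threshold=3, pause=0.05):
    """Call `fetch` for each URL, spacing calls out and pausing after repeated failures.

    Illustrative only: every name and default here is an assumption.
    """
    results = []
    consecutive_failures = 0
    last_call = None
    for url in urls:
        # Rate limit: leave at least `min_interval` seconds between calls.
        if last_call is not None:
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                await asyncio.sleep(wait)
        last_call = time.monotonic()
        try:
            results.append((url, await fetch(url), None))
            consecutive_failures = 0
        except Exception as exc:
            results.append((url, None, exc))
            consecutive_failures += 1
        # Break: after several failures in a row, rest before carrying on.
        if consecutive_failures >= block_threshold:
            await asyncio.sleep(pause)
            consecutive_failures = 0
    return results


async def fake_fetch(url):
    """Stand-in for a real request; fails for URLs containing 'blocked'."""
    if "blocked" in url:
        raise ConnectionError(f"blocked: {url}")
    return f"body of {url}"


demo_urls = [
    "https://example.com/a",
    "https://example.com/blocked",
    "https://example.com/b",
]
demo_results = asyncio.run(fetch_with_breaks(fake_fetch, demo_urls))
```

Each entry in `demo_results` is a `(url, body, error)` tuple, so a failed request is recorded rather than stopping the run.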
Installation
Async-scrape requires C++ Build tools v15+ to run.
pip install async-scrape
How to use it
Key input parameters:
- post_process_func - the callable used to process the returned response
- post_process_kwargs - any kwargs to be passed to the callable
- use_proxy - whether a proxy should be used (if this is true, provide either a proxy or pac_url variable)
- attempt_limit - how many attempts each request is given before it is marked as failed
- rest_wait - how long the programme should pause between loops
- call_rate_limit - limits the rate of requests (useful to avoid being blocked by websites)
- randomise_headers - if set to True, a new set of headers is generated between each request
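As an illustration of how `post_process_func` and `post_process_kwargs` might work together, the sketch below assumes the kwargs dict is unpacked into the callable as keyword arguments and that `html` arrives as a string. Neither is confirmed here. The function `extract_title` and its `default` argument are hypothetical, and the response is a stand-in object exposing only `.status`.

```python
import re
from types import SimpleNamespace


def extract_title(html, resp, default="(no title)", **kwargs):
    """Return the page <title> for a 200 response, otherwise `default`."""
    if resp.status != 200:
        return default
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else default


# Stand-in for a real response object (assumption: only `.status` is needed).
fake_resp = SimpleNamespace(status=200)

# Assumed to be passed through to the callable as keyword arguments.
post_process_kwargs = {"default": "untitled"}

title = extract_title("<html><title> Hello </title></html>", fake_resp, **post_process_kwargs)
missing = extract_title("<html></html>", fake_resp, **post_process_kwargs)
```

Keeping `**kwargs` in the signature means extra keys in `post_process_kwargs` are ignored rather than raising a `TypeError`.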
Get requests
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

# Create an instance
async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

# URLs to request
urls = [
    "https://www.google.com",
    "https://www.bing.com",
]

resps = async_Scrape.scrape_all(urls)
Post requests
from async_scrape import AsyncScrape

def post_process(html, resp, **kwargs):
    """Function to process the gathered response from the request"""
    if resp.status == 200:
        return "Request worked"
    else:
        return "Request failed"

# Create an instance
async_Scrape = AsyncScrape(
    post_process_func=post_process,
    post_process_kwargs={},
    fetch_error_handler=None,
    use_proxy=False,
    proxy=None,
    pac_url=None,
    acceptable_error_limit=100,
    attempt_limit=5,
    rest_between_attempts=True,
    rest_wait=60,
    call_rate_limit=None,
    randomise_headers=True
)

# One URL per payload
urls = [
    "https://eos1jv6curljagq.m.pipedream.net",
    "https://eos1jv6curljagq.m.pipedream.net",
]
payloads = [
    {"value": 0},
    {"value": 1}
]

resps = async_Scrape.scrape_all(urls, payloads=payloads)
Response
The response object is a list of dicts, one per request, in the format:
{
    "url": url,              # url of request
    "req": req,              # combination of url and params
    "func_resp": func_resp,  # response from post processing function
    "status": resp.status,   # http status
    "error": None            # any error encountered
}
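Given that format, one way to separate successes from failures is sketched below. The helper `split_results` is hypothetical, and the sample data is hand-written in the documented shape rather than real output. In particular, what `status` and `error` contain for a failed request is an assumption.

```python
def split_results(resps):
    """Separate entries with no error and a 200 status from everything else."""
    succeeded, failed = [], []
    for item in resps:
        ok = item.get("error") is None and item.get("status") == 200
        (succeeded if ok else failed).append(item)
    return succeeded, failed


# Hand-written sample in the documented shape (not real output).
sample = [
    {
        "url": "https://www.google.com",
        "req": "https://www.google.com",
        "func_resp": "Request worked",
        "status": 200,
        "error": None,
    },
    {
        "url": "https://www.bing.com",
        "req": "https://www.bing.com",
        "func_resp": None,
        "status": None,
        "error": "timeout",  # assumed representation of a failure
    },
]

succeeded, failed = split_results(sample)

# URLs that could be passed to another scrape run.
retry_urls = [item["url"] for item in failed]
```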
License
MIT
Free Software, Hell Yeah!
Download files
File details
Details for the file async_scrape-0.1.20.tar.gz.
File metadata
- Download URL: async_scrape-0.1.20.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f46a478983a7edc6e49259336366721dc1516266d692e66ff8d7c387d19e16c9 |
| MD5 | 999b95590b23419646f672cb53dd6fc7 |
| BLAKE2b-256 | 4396b29c75a6b5d0367a232a71661191aa8a93d086355ceec4778f85c864b599 |
File details
Details for the file async_scrape-0.1.20-py3-none-any.whl.
File metadata
- Download URL: async_scrape-0.1.20-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 36321206ce61656b0ee3d678b084d68b50040de4e04d8d34acc7b1d17271628c |
| MD5 | 85ce80487a95338332c45c7759c73456 |
| BLAKE2b-256 | bbda3d93a7c1fd5211495dc05edc750d04e97bb8c21fc80a645fbb7f594825c4 |
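If you download one of these files manually, its SHA256 digest can be checked against the value listed for it. The following is a minimal sketch using only the standard library. The helper names `sha256_of` and `matches_published_hash` are hypothetical, and the demo runs on a throwaway file rather than the real distribution.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a temporary file, so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    demo_path.write_bytes(b"async-scrape demo")
    demo_digest = sha256_of(demo_path)
    demo_ok = matches_published_hash(demo_path, demo_digest.upper())
```

For a real check, pass the path of the downloaded file and the SHA256 value from the table for that file.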