A package designed to scrape webpages using aiohttp and asyncio. Has some error handling to overcome common issues such as sites blocking you after n requests over a short period.
Project description
Async-scrape
Perform webscrape asyncronously
Async-scrape is a package which uses asyncio and aiohttp to scrape websites and has useful features built in.
Features
- Breaks - pause scraping when a website blocks your requests consistently
- Rate limit - slow down scraping to prevent being blocked
Installation
Async-scrape requires C++ Build tools v15+ to run.
pip install async-scrape
How to use it
#Create an instance
from async_scrape import AsyncScrape
def post_process(html, resp, **kwargs):
"""Function to process the gathered response from the request"""
if resp.status == 200:
return "Request worked"
else:
return "Request failed"
async_Scrape = AsyncScrape(
post_process_func=post_process,
post_process_kwargs={},
fetch_error_handler=None,
use_proxy=False,
proxy=None,
pac_url=None,
acceptable_error_limit=100,
attempt_limit=5,
rest_between_attempts=True,
rest_wait=60
)
urls = [
"https://www.google.com",
"https://www.bing.com",
]
resps = async_Scrape.scrape_all(urls)
Response object is a list of dicts in the format:
{
"url":url, #url of request
"func_resp":func_resp, #response from post processing function
"status":resp.status, #http status
"error":None #any error encountered
}
License
MIT
Free Software, Hell Yeah!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
async-scrape-0.1.9.tar.gz
(7.9 kB
view hashes)
Built Distribution
Close
Hashes for async_scrape-0.1.9-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5d1ed2ece51deec8e380bf04b758d7af925adf5d798b1237ca460a89aea31117 |
|
MD5 | ee11d62a0dbb5fd5ce631d8c3cca6dc2 |
|
BLAKE2b-256 | a6d5a815fb1ab18721568853421b912b0d8374d3c299b89e7c7d7a96fb3685a3 |