AIOCrawler
Asynchronous web crawler built on asyncio
Installation
pip install pyaiocrawler
Usage
Generating sitemap
import asyncio
from aiocrawler import SitemapCrawler
crawler = SitemapCrawler('https://www.google.com', depth=3)
sitemap = asyncio.run(crawler.get_results())
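Note that asyncio.run requires Python 3.7 or newer. On older versions you can drive the coroutine with an explicit event loop instead; a minimal sketch:

import asyncio
from aiocrawler import SitemapCrawler

crawler = SitemapCrawler('https://www.google.com', depth=3)
# Equivalent to asyncio.run(...) on Python < 3.7
loop = asyncio.get_event_loop()
sitemap = loop.run_until_complete(crawler.get_results())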
Configuring the crawler
from aiocrawler import SitemapCrawler
crawler = SitemapCrawler(
    init_url='https://www.google.com',  # The base URL to start crawling from
    depth=3,                            # The maximum depth to crawl to
    concurrency=300,                    # Maximum number of concurrent requests to make
    max_retries=3,                      # Maximum number of times the crawler will retry a URL
    user_agent='My Crawler',            # Use a custom user agent for requests
)
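Since get_results is a coroutine, you can also await it from inside an already-running event loop instead of calling asyncio.run at the top level; a minimal sketch:

import asyncio
from aiocrawler import SitemapCrawler

async def main():
    crawler = SitemapCrawler(init_url='https://www.google.com', depth=3)
    # Await the coroutine directly when already inside async code
    return await crawler.get_results()

sitemap = asyncio.run(main())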
Extending the crawler
To create your own crawler, simply subclass AIOCrawler and implement the parse method. For every page crawled, the parse method is called with the URL of the page, the links found on that page, and the HTML of the page. The return value of parse is appended to a list that is available when the get_results method is called. The example crawler below extracts the title from each page.
import asyncio
from aiocrawler import AIOCrawler
from bs4 import BeautifulSoup # We will use beautifulsoup to extract the title from the html
from typing import Set, Tuple
class TitleScraper(AIOCrawler):
    '''
    Subclasses AIOCrawler to extract titles for the pages on the given domain.
    '''
    timeout = 10
    max_redirects = 2

    def parse(self, url: str, links: Set[str], html: bytes) -> Tuple[str, str]:
        '''
        Returns the url and the title of the page.
        '''
        soup = BeautifulSoup(html, 'html.parser')
        # Guard against pages without a <title> tag
        title_tag = soup.find('title')
        title = title_tag.string if title_tag else ''
        return url, title
crawler = TitleScraper('https://www.google.com', 3)
titles = asyncio.run(crawler.get_results())
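Since parse returns a (url, title) tuple and get_results collects those return values into a list, the results can be iterated directly:

for url, title in titles:
    print(f'{url}: {title}')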
Contributing
Installing dependencies
pipenv install --dev
Running tests
pytest --cov=aiocrawler
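Because parse is a pure function of its arguments, subclassed crawlers can be unit tested without touching the network. A hypothetical test for the TitleScraper example above (the module path, test name, and sample HTML are illustrative, not part of the project):

from title_scraper import TitleScraper  # hypothetical module holding the example subclass

def test_parse_extracts_title():
    crawler = TitleScraper('https://example.com', 1)
    html = b'<html><head><title>Example</title></head></html>'
    url, title = crawler.parse('https://example.com', set(), html)
    assert (url, title) == ('https://example.com', 'Example')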
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
pyaiocrawler-0.5.0.tar.gz (4.8 kB)
Built Distribution
pyaiocrawler-0.5.0-py3-none-any.whl (6.0 kB)
File details
Details for the file pyaiocrawler-0.5.0.tar.gz.
File metadata
- Download URL: pyaiocrawler-0.5.0.tar.gz
- Upload date:
- Size: 4.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | a07db1e07f8987c2bac5abfac36babfe3104e39d1b4481cea5bff9ccb0733678
MD5 | e39d8b7a589329b4a85f43703e136d61
BLAKE2b-256 | 6dc2a448e9b227bc4388431aa300f262657dd8d0b399bc2e0540c7df07544fdd
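To verify a downloaded file against the digests above, you can compute its SHA256 hash locally with the standard library; a minimal sketch:

import hashlib

with open('pyaiocrawler-0.5.0.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()

expected = 'a07db1e07f8987c2bac5abfac36babfe3104e39d1b4481cea5bff9ccb0733678'
print('OK' if digest == expected else 'hash mismatch!')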
File details
Details for the file pyaiocrawler-0.5.0-py3-none-any.whl.
File metadata
- Download URL: pyaiocrawler-0.5.0-py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | c2dde87c72ee1bf6ab807905cf1a38fe9d9c1483f09a24893975e250469a044d
MD5 | 9109abd3523f1c4f79129032777616aa
BLAKE2b-256 | 2c6d6a0bc696a2b16a2a5dfe17a80875a33c58f8b2d2ba91a7fc93c3eee71a76