Asynchronous web crawler built on asyncio


pip install pyaiocrawler


Generating sitemap

import asyncio
from aiocrawler import SitemapCrawler

crawler = SitemapCrawler('', depth=3)
sitemap = asyncio.run(crawler.get_results())  # assumes get_results() is a coroutine
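
To illustrate what a depth-limited crawl produces, here is a minimal, self-contained sketch that does not use the library. The link graph `LINKS` and the helpers `fetch_links` and `build_sitemap` are hypothetical names invented for this example. Reading `depth` as the number of link hops from the start URL is an assumption, not a statement about how SitemapCrawler counts depth.

```python
import asyncio
from typing import Dict, Set

# Hypothetical in-memory link graph standing in for real pages.
LINKS: Dict[str, Set[str]] = {
    "/": {"/a", "/b"},
    "/a": {"/c"},
    "/b": set(),
    "/c": {"/d"},
    "/d": set(),
}


async def fetch_links(url: str) -> Set[str]:
    """Pretend to download a page and return the links found on it."""
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return LINKS.get(url, set())


async def build_sitemap(start: str, depth: int) -> Dict[str, Set[str]]:
    """Crawl breadth-first from `start`, following links at most `depth` hops."""
    sitemap: Dict[str, Set[str]] = {}
    frontier = {start}
    for _ in range(depth + 1):
        if not frontier:
            break
        urls = sorted(frontier)
        # Fetch every page at the current level concurrently.
        results = await asyncio.gather(*(fetch_links(u) for u in urls))
        sitemap.update(zip(urls, results))
        # The next level is every discovered link that has not been crawled yet.
        frontier = set().union(*results) - set(sitemap)
    return sitemap


sitemap = asyncio.run(build_sitemap("/", depth=2))
```

With `depth=2` the sketch crawls `/`, `/a`, `/b` and `/c`. The page `/d` is discovered as a link but never fetched, because it sits three hops from the start.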

Configuring the crawler

from aiocrawler import SitemapCrawler

crawler = SitemapCrawler(
    init_url='',              # The base URL to start crawling from
    depth=3,                  # The maximum depth to crawl to
    concurrency=300,          # Maximum number of concurrent requests
    max_retries=3,            # Maximum number of retries per URL
    user_agent='My Crawler',  # Custom user agent for requests
)
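
The concurrency and max_retries options describe a common asyncio pattern: a semaphore caps the number of requests in flight, and each failed request is attempted again a bounded number of times. The sketch below shows that pattern in isolation. It is not the library's implementation. `fetch_with_retries`, `crawl` and `flaky_fetch` are hypothetical names, and treating `max_retries` as "extra attempts after the first" is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

Fetcher = Callable[[str], Awaitable[str]]


async def fetch_with_retries(
    fetch: Fetcher, url: str, semaphore: asyncio.Semaphore, max_retries: int
) -> Optional[str]:
    """Try `fetch(url)` up to max_retries + 1 times; return None if all fail."""
    for _ in range(max_retries + 1):
        async with semaphore:  # only `concurrency` fetches may run at once
            try:
                return await fetch(url)
            except ConnectionError:
                continue  # release the slot, then try again
    return None


async def crawl(
    urls: List[str], fetch: Fetcher, concurrency: int, max_retries: int
) -> List[Optional[str]]:
    """Fetch all URLs concurrently, bounded by `concurrency`."""
    semaphore = asyncio.Semaphore(concurrency)
    tasks = (fetch_with_retries(fetch, u, semaphore, max_retries) for u in urls)
    return await asyncio.gather(*tasks)


attempts: Dict[str, int] = {}


async def flaky_fetch(url: str) -> str:
    """Hypothetical fetcher that fails twice per URL before succeeding."""
    attempts[url] = attempts.get(url, 0) + 1
    await asyncio.sleep(0)
    if attempts[url] <= 2:
        raise ConnectionError(url)
    return f"<html>{url}</html>"


pages = asyncio.run(crawl(["/a", "/b"], flaky_fetch, concurrency=2, max_retries=3))
```

Here each URL succeeds on its third attempt, so `max_retries=3` is enough. With `max_retries=1` the same fetcher would exhaust its attempts and the result for that URL would be `None`.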

Extending the crawler

To create your own crawler, subclass AIOCrawler and implement the parse method. For every page crawled, parse is called with the page's URL, the set of links found on that page, and the page's HTML. Each value returned by parse is appended to a list, which is available by calling the get_results method. The example below extracts the title from each page.

import asyncio
from aiocrawler import AIOCrawler
from bs4 import BeautifulSoup          # We will use beautifulsoup to extract the title from the html
from typing import Optional, Set, Tuple

class TitleScraper(AIOCrawler):
    """Subclasses AIOCrawler to extract titles for the pages on the given domain."""
    timeout = 10
    max_redirects = 2

    def parse(self, url: str, links: Set[str], html: bytes) -> Tuple[str, Optional[str]]:
        """Returns the url and the title of the page, or None if the page has no title."""
        soup = BeautifulSoup(html, 'html.parser')
        title_tag = soup.find('title')
        title = title_tag.string if title_tag else None
        return url, title

crawler = TitleScraper('', 3)
titles = asyncio.run(crawler.get_results())  # assumes get_results() is a coroutine
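
The contract described above (parse is called once per page and its return values are collected for get_results) can be modelled without any network access. The following is a toy stand-in, not AIOCrawler itself. `MiniCrawler`, `LinkCounter` and the `pages` dictionary are hypothetical, and making `get_results` a coroutine is an assumption.

```python
import asyncio
from typing import Any, Dict, List, Set, Tuple

Page = Tuple[Set[str], bytes]  # (links found on the page, raw html)


class MiniCrawler:
    """Hypothetical stand-in for a crawler base class; not the library's code."""

    def __init__(self, pages: Dict[str, Page]) -> None:
        self.pages = pages
        self.results: List[Any] = []

    def parse(self, url: str, links: Set[str], html: bytes) -> Any:
        """Subclasses override this to decide what is kept for each page."""
        raise NotImplementedError

    async def get_results(self) -> List[Any]:
        """Visit every page, call parse on it, and collect the return values."""
        for url, (links, html) in self.pages.items():
            await asyncio.sleep(0)  # stands in for the network request
            self.results.append(self.parse(url, links, html))
        return self.results


class LinkCounter(MiniCrawler):
    """Example subclass: record how many links each page contains."""

    def parse(self, url: str, links: Set[str], html: bytes) -> Tuple[str, int]:
        return url, len(links)


pages: Dict[str, Page] = {
    "/": ({"/a", "/b"}, b"<html></html>"),
    "/a": (set(), b"<html></html>"),
}
counts = asyncio.run(LinkCounter(pages).get_results())
```

The subclass only decides what to return for a single page. Iteration and result collection stay in the base class, which is the same division of labour the TitleScraper example relies on.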


Installing dependencies

pipenv install --dev

Running tests

pytest --cov=aiocrawler
