
Project description

AIOCrawler


Asynchronous web crawler built on asyncio

Installation

pip install pyaiocrawler

Usage

Generating sitemap

import asyncio
from aiocrawler import SitemapCrawler

crawler = SitemapCrawler('https://www.google.com', depth=3)
sitemap = asyncio.run(crawler.get_results())
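Since get_results() is passed to asyncio.run above, it is a coroutine; in environments where an event loop is already running (for example a Jupyter notebook), asyncio.run will raise an error and the coroutine should be awaited directly instead. A minimal sketch of an explicit async entry point:

import asyncio
from aiocrawler import SitemapCrawler

async def main():
    crawler = SitemapCrawler('https://www.google.com', depth=3)
    # Await the coroutine directly; in a notebook you would simply
    # run `await main()` instead of calling asyncio.run
    sitemap = await crawler.get_results()
    print(sitemap)

asyncio.run(main())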

Configuring the crawler

from aiocrawler import SitemapCrawler

crawler = SitemapCrawler(
    init_url='https://www.google.com', # Base URL to start crawling from
    depth=3,                           # Maximum depth to crawl
    concurrency=300,                   # Maximum number of concurrent requests
    max_retries=3,                     # Maximum retries per URL before giving up
    user_agent='My Crawler',           # Custom user agent for requests
)
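The configured crawler is then run exactly as in the sitemap example above:

import asyncio

sitemap = asyncio.run(crawler.get_results())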

Extending the crawler

To create your own crawler, simply subclass AIOCrawler and implement the parse method. For every page crawled, the parse method is called with the URL of the page, the links found on that page, and the HTML of the page. The return value of each parse call is appended to a list, which is available when the get_results method is called. Below is an example crawler that extracts the title from each page.

import asyncio
from aiocrawler import AIOCrawler
from bs4 import BeautifulSoup          # We will use BeautifulSoup to extract the title from the HTML
from typing import Set, Tuple


class TitleScraper(AIOCrawler):
    '''
    Subclasses AIOCrawler to extract titles for the pages on the given domain
    '''
    timeout = 10
    max_redirects = 2

    def parse(self, url: str, links: Set[str], html: bytes) -> Tuple[str, str]:
        '''
        Returns the url and the title of the url
        '''
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.find('title').string
        return url, title


crawler = TitleScraper('https://www.google.com', 3)
titles = asyncio.run(crawler.get_results())
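Note that soup.find('title') returns None for pages without a <title> element, so the parse method above would raise an AttributeError on such pages. A slightly more defensive variant (a sketch, not part of the package) could look like this:

    def parse(self, url: str, links: Set[str], html: bytes) -> Tuple[str, str]:
        '''
        Returns the url and the title, falling back to an empty string
        when the page has no <title> element
        '''
        soup = BeautifulSoup(html, 'html.parser')
        title_tag = soup.find('title')
        # Guard against missing or empty <title> tags
        title = title_tag.string if title_tag and title_tag.string else ''
        return url, title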

Contributing

Installing dependencies

pipenv install --dev

Running tests

pytest --cov=aiocrawler

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyaiocrawler-0.5.0.tar.gz (4.8 kB)

Uploaded Source

Built Distribution

pyaiocrawler-0.5.0-py3-none-any.whl (6.0 kB)

Uploaded Python 3

File details

Details for the file pyaiocrawler-0.5.0.tar.gz.

File metadata

  • Download URL: pyaiocrawler-0.5.0.tar.gz
  • Upload date:
  • Size: 4.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.7.1

File hashes

Hashes for pyaiocrawler-0.5.0.tar.gz
Algorithm Hash digest
SHA256 a07db1e07f8987c2bac5abfac36babfe3104e39d1b4481cea5bff9ccb0733678
MD5 e39d8b7a589329b4a85f43703e136d61
BLAKE2b-256 6dc2a448e9b227bc4388431aa300f262657dd8d0b399bc2e0540c7df07544fdd

See more details on using hashes here.
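As an illustration (not part of the package), the SHA256 digest of a downloaded archive can be checked against the value listed above with Python's standard hashlib; the file path here is a hypothetical local download location:

import hashlib

# Path of the downloaded archive (adjust to where the file was saved)
path = 'pyaiocrawler-0.5.0.tar.gz'
expected = 'a07db1e07f8987c2bac5abfac36babfe3104e39d1b4481cea5bff9ccb0733678'

with open(path, 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print('OK' if digest == expected else 'MISMATCH')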

File details

Details for the file pyaiocrawler-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: pyaiocrawler-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 6.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.34.0 CPython/3.7.1

File hashes

Hashes for pyaiocrawler-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c2dde87c72ee1bf6ab807905cf1a38fe9d9c1483f09a24893975e250469a044d
MD5 9109abd3523f1c4f79129032777616aa
BLAKE2b-256 2c6d6a0bc696a2b16a2a5dfe17a80875a33c58f8b2d2ba91a7fc93c3eee71a76

See more details on using hashes here.
