
A simple and clear web crawler framework built on Python 3.6+ with asyncio

Project description


Overview

AntNest is a simple, clear, and fast web crawler framework built on Python 3.6+ and powered by asyncio. Its core is only 600+ lines of code, thanks to powerful libraries like aiohttp and lxml.

Features

  • Useful HTTP client out of the box

  • Things (requests, responses, and items) can pass through pipelines, async or not (see the sketch after this list)

  • Item extractor: it's easy to define and extract (by xpath, jpath, or regex) the item we want from HTML, JSON, or plain strings

  • Custom “ensure_future” and “as_completed” APIs provide an easy workflow
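
For example, a custom item pipeline might look like the sketch below. The Pipeline base class lives in ant_nest.pipelines next to the built-in ItemFieldReplacePipeline used later; the process hook shown here is an assumption modeled on that built-in, so check the source for the exact method to override.

from ant_nest.pipelines import Pipeline


class TitleStripPipeline(Pipeline):
    """Hypothetical pipeline: strip whitespace around a dict item's title.

    The "process" hook is an assumption; per the feature above it may be
    defined as a plain method or as a coroutine ("in async or not").
    """

    def process(self, thing):
        if isinstance(thing, dict) and "title" in thing:
            thing["title"] = thing["title"].strip()
        return thing

A pipeline instance is then listed in an ant's item_pipelines, as GithubAnt does below with ItemFieldReplacePipeline.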

Install

pip install ant_nest

Usage

Create a demo project with the CLI:

>>> ant_nest -c examples

Then we have a project:

drwxr-xr-x   5 bruce  staff  160 Jun 30 18:24 ants
-rw-r--r--   1 bruce  staff  208 Jun 26 22:59 settings.py

Suppose we want to get hot repos from GitHub. Let's create “examples/ants/example2.py”:

from yarl import URL
from ant_nest.ant import Ant
from ant_nest.pipelines import ItemFieldReplacePipeline
from ant_nest.things import ItemExtractor


class GithubAnt(Ant):
    """Crawl trending repositories from github"""

    item_pipelines = [
        ItemFieldReplacePipeline(
            ("meta_content", "star", "fork"), excess_chars=("\r", "\n", "\t", "  ")
        )
    ]
    concurrent_limit = 1  # save the website's bandwidth, and yours!

    def __init__(self):
        super().__init__()
        self.item_extractor = ItemExtractor(dict)
        self.item_extractor.add_extractor(
            "title", lambda x: x.html_element.xpath("//h1/strong/a/text()")[0]
        )
        self.item_extractor.add_extractor(
            "author", lambda x: x.html_element.xpath("//h1/span/a/text()")[0]
        )
        self.item_extractor.add_extractor(
            "meta_content",
            lambda x: "".join(
                x.html_element.xpath(
                    '//div[@class="repository-content "]/div[2]//text()'
                )
            ),
        )
        self.item_extractor.add_extractor(
            "star",
            lambda x: x.html_element.xpath(
                '//a[@class="social-count js-social-count"]/text()'
            )[0],
        )
        self.item_extractor.add_extractor(
            "fork",
            lambda x: x.html_element.xpath('//a[@class="social-count"]/text()')[0],
        )
        self.item_extractor.add_extractor("origin_url", lambda x: str(x.url))

    async def crawl_repo(self, url):
        """Crawl information from one repo"""
        response = await self.request(url)
        # extract item from response
        item = self.item_extractor.extract(response)
        item["origin_url"] = response.url

        await self.collect(item)  # let item go through pipelines(be cleaned)
        self.logger.info("*" * 70 + "I got one hot repo!\n" + str(item))

    async def run(self):
        """App entrance, our play ground"""
        response = await self.request("https://github.com/explore")
        for url in response.html_element.xpath(
            "/html/body/div[4]/main/div[2]/div/div[2]/div[1]/article/div/div[1]/h1/a[2]/"
            "@href"
        ):
            # crawl many repos with our coroutines pool
            self.schedule_task(self.crawl_repo(response.url.join(URL(url))))
        self.logger.info("Waiting...")

Then we can list all the ants we defined (in “examples”):

>>> ant_nest -l
ants.example2.GithubAnt

Run it! (without debug log):

>>> ant_nest -a ants.example2.GithubAnt
INFO:GithubAnt:Opening
INFO:GithubAnt:Waiting...
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'NLP-progress', 'author': 'sebastianruder', 'meta_content': 'Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.', 'star': '3,743', 'fork': '327', 'origin_url': URL('https://github.com/sebastianruder/NLP-progress')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'material-dashboard', 'author': 'creativetimofficial', 'meta_content': 'Material Dashboard - Open Source Bootstrap 4 Material Design Adminhttps://demos.creative-tim.com/materi…', 'star': '6,032', 'fork': '187', 'origin_url': URL('https://github.com/creativetimofficial/material-dashboard')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'mkcert', 'author': 'FiloSottile', 'meta_content': "A simple zero-config tool to make locally-trusted development certificates with any names you'd like.", 'star': '2,311', 'fork': '60', 'origin_url': URL('https://github.com/FiloSottile/mkcert')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'pure-bash-bible', 'author': 'dylanaraps', 'meta_content': '📖 A collection of pure bash alternatives to external processes.', 'star': '6,385', 'fork': '210', 'origin_url': URL('https://github.com/dylanaraps/pure-bash-bible')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'flutter', 'author': 'flutter', 'meta_content': 'Flutter makes it easy and fast to build beautiful mobile apps.https://flutter.io', 'star': '30,579', 'fork': '1,337', 'origin_url': URL('https://github.com/flutter/flutter')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'Java-Interview', 'author': 'crossoverJie', 'meta_content': '👨\u200d🎓 Java related : basic, concurrent, algorithm https://crossoverjie.top/categories/J…', 'star': '4,687', 'fork': '409', 'origin_url': URL('https://github.com/crossoverJie/Java-Interview')}
INFO:GithubAnt:Closed
INFO:GithubAnt:Get 7 Request in total
INFO:GithubAnt:Get 7 Response in total
INFO:GithubAnt:Get 6 dict in total
INFO:GithubAnt:Run GithubAnt in 18.157656 seconds

So, it's easy to configure an ant through class attributes:

class Ant(abc.ABC):
    response_pipelines: typing.List[Pipeline] = []
    request_pipelines: typing.List[Pipeline] = []
    item_pipelines: typing.List[Pipeline] = []
    request_cls = Request
    response_cls = Response
    request_timeout = 60
    request_retries = 3
    request_retry_delay = 5
    request_proxies: typing.List[typing.Union[str, URL]] = []
    request_max_redirects = 10
    request_allow_redirects = True
    response_in_stream = False
    connection_limit = 10  # see "TCPConnector" in "aiohttp"
    connection_limit_per_host = 0
    concurrent_limit = 100
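
For example, a subclass only needs to override the attributes it cares about (a minimal sketch; “PoliteAnt” and its values are illustrative):

from ant_nest.ant import Ant


class PoliteAnt(Ant):
    """Hypothetical ant that trades speed for politeness."""

    request_timeout = 30   # fail faster than the default 60 seconds
    request_retries = 1
    concurrent_limit = 10  # at most 10 coroutines running at once
    connection_limit = 5   # at most 5 TCP connections

    async def run(self):
        await self.request("https://example.com")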

And you can override some of this config for a single request:

async def request(
    self,
    url: typing.Union[str, URL],
    method: str = aiohttp.hdrs.METH_GET,
    params: typing.Optional[dict] = None,
    headers: typing.Optional[dict] = None,
    cookies: typing.Optional[dict] = None,
    data: typing.Optional[
        typing.Union[typing.AnyStr, typing.Dict, typing.IO]
    ] = None,
    proxy: typing.Optional[typing.Union[str, URL]] = None,
    timeout: typing.Optional[float] = None,
    retries: typing.Optional[int] = None,
    response_in_stream: typing.Optional[bool] = None,
) -> Response:
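
For example, one call can override the class-level timeout and retry behavior (a minimal sketch based on the signature above; the values are illustrative):

# inside an Ant coroutine such as "run"
response = await self.request(
    "https://github.com/explore",
    headers={"User-Agent": "ant-nest-example"},
    timeout=30,  # override "request_timeout" for this request only
    retries=0,   # do not retry this request
)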

About Item

We use a dict to store each item in the examples, but there are actually many ways to define an item: a dict, a plain class, an attrs class, a data class, or an ORM class. It depends on your needs and your choice.
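
For example, the GithubAnt above could collect items into a data class instead of a dict (a minimal sketch, assuming ItemExtractor accepts a data class the same way it accepts dict; the fields shown are illustrative):

import dataclasses

from ant_nest.things import ItemExtractor


@dataclasses.dataclass
class Repo:
    # hypothetical item class; one field per extractor
    title: str = ""
    author: str = ""


extractor = ItemExtractor(Repo)
extractor.add_extractor(
    "title", lambda x: x.html_element.xpath("//h1/strong/a/text()")[0]
)
extractor.add_extractor(
    "author", lambda x: x.html_element.xpath("//h1/span/a/text()")[0]
)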

Examples

You can find more examples in “./examples”.

Defects

  • Complex exception handling

One coroutine's exception will break the await chain, especially in a loop, unless we handle it by hand, e.g.:

for cor in self.as_completed((self.crawl(url) for url in self.urls)):
    try:
        await cor
    except Exception:  # many exceptions may be raised along an await chain
        pass

But we can use “self.as_completed_with_async” now, e.g.:

async for result in self.as_completed_with_async(
    (self.crawl(url) for url in self.urls), raise_exception=False
):
    # exceptions raised in "self.crawl(url)" will be skipped and logged automatically
    self.handle(result)

  • High memory usage

It's a “feature” that asyncio eats a lot of memory, especially with highly concurrent IO. We can simply set a concurrency limit (“connection_limit” or “concurrent_limit”), but it's tricky to balance performance against the limit.

Coding style

Follow “Flake8”, format with “Black”, type check with “MyPy”; see the Makefile for more detail.

Todo

[*] Log system
[*] Nest item extractor
[ ] Docs

