
A simple and clear web crawler framework built on Python 3.6+ with asyncio

Project description


Overview

AntNest is a simple, clear, and fast web crawler framework built on Python 3.6+ and powered by asyncio. Its core is only 600+ lines of code, thanks to powerful libraries like aiohttp and lxml.

Features

  • Useful HTTP client out of the box

  • Things (requests, responses, and items) can pass through pipelines, async or not (see the sketch after this list)

  • Item extractor: it's easy to define and extract (by xpath, jpath, or regex) the item we want from HTML, JSON, or plain strings

  • Custom “ensure_future” and “as_completed” APIs provide an easy workflow
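
For example, a custom item pipeline might look like the sketch below. The Pipeline base class lives in ant_nest.pipelines next to the built-in ItemFieldReplacePipeline used later; the process hook shown here is an assumption modeled on that built-in, so check the source for the exact method to override.

from ant_nest.pipelines import Pipeline


class TitleStripPipeline(Pipeline):
    """Hypothetical pipeline: strip whitespace around a dict item's title.

    The "process" hook is an assumption; per the feature above it may be
    defined as a plain method or as a coroutine ("in async or not").
    """

    def process(self, thing):
        if isinstance(thing, dict) and "title" in thing:
            thing["title"] = thing["title"].strip()
        return thing

A pipeline instance is then listed in an ant's item_pipelines, as GithubAnt does below with ItemFieldReplacePipeline.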

Install

pip install ant_nest

Usage

Create a demo project with the CLI:

>>> ant_nest -c examples

Then we have a project:

drwxr-xr-x   5 bruce  staff  160 Jun 30 18:24 ants
-rw-r--r--   1 bruce  staff  208 Jun 26 22:59 settings.py

Suppose we want to get hot repos from GitHub. Let's create “examples/ants/example2.py”:

from yarl import URL
from ant_nest.ant import Ant
from ant_nest.pipelines import ItemFieldReplacePipeline
from ant_nest.things import ItemExtractor


class GithubAnt(Ant):
    """Crawl trending repositories from github"""

    item_pipelines = [
        ItemFieldReplacePipeline(
            ("meta_content", "star", "fork"), excess_chars=("\r", "\n", "\t", "  ")
        )
    ]
    concurrent_limit = 1  # save the website's bandwidth, and yours!

    def __init__(self):
        super().__init__()
        self.item_extractor = ItemExtractor(dict)
        self.item_extractor.add_extractor(
            "title", lambda x: x.html_element.xpath("//h1/strong/a/text()")[0]
        )
        self.item_extractor.add_extractor(
            "author", lambda x: x.html_element.xpath("//h1/span/a/text()")[0]
        )
        self.item_extractor.add_extractor(
            "meta_content",
            lambda x: "".join(
                x.html_element.xpath(
                    '//div[@class="repository-content "]/div[2]//text()'
                )
            ),
        )
        self.item_extractor.add_extractor(
            "star",
            lambda x: x.html_element.xpath(
                '//a[@class="social-count js-social-count"]/text()'
            )[0],
        )
        self.item_extractor.add_extractor(
            "fork",
            lambda x: x.html_element.xpath('//a[@class="social-count"]/text()')[0],
        )
        self.item_extractor.add_extractor("origin_url", lambda x: str(x.url))

    async def crawl_repo(self, url):
        """Crawl information from one repo"""
        response = await self.request(url)
        # extract item from response
        item = self.item_extractor.extract(response)
        item["origin_url"] = response.url

        await self.collect(item)  # let item go through pipelines(be cleaned)
        self.logger.info("*" * 70 + "I got one hot repo!\n" + str(item))

    async def run(self):
        """App entrance, our play ground"""
        response = await self.request("https://github.com/explore")
        for url in response.html_element.xpath(
            "/html/body/div[4]/main/div[2]/div/div[2]/div[1]/article/div/div[1]/h1/a[2]/"
            "@href"
        ):
            # crawl many repos with our coroutines pool
            self.schedule_task(self.crawl_repo(response.url.join(URL(url))))
        self.logger.info("Waiting...")

Then we can list all the ants we defined (in “examples”):

>>> ant_nest -l
ants.example2.GithubAnt

Run it! (without debug log):

>>> ant_nest -a ants.example2.GithubAnt
INFO:GithubAnt:Opening
INFO:GithubAnt:Waiting...
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'NLP-progress', 'author': 'sebastianruder', 'meta_content': 'Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.', 'star': '3,743', 'fork': '327', 'origin_url': URL('https://github.com/sebastianruder/NLP-progress')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'material-dashboard', 'author': 'creativetimofficial', 'meta_content': 'Material Dashboard - Open Source Bootstrap 4 Material Design Adminhttps://demos.creative-tim.com/materi…', 'star': '6,032', 'fork': '187', 'origin_url': URL('https://github.com/creativetimofficial/material-dashboard')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'mkcert', 'author': 'FiloSottile', 'meta_content': "A simple zero-config tool to make locally-trusted development certificates with any names you'd like.", 'star': '2,311', 'fork': '60', 'origin_url': URL('https://github.com/FiloSottile/mkcert')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'pure-bash-bible', 'author': 'dylanaraps', 'meta_content': '📖 A collection of pure bash alternatives to external processes.', 'star': '6,385', 'fork': '210', 'origin_url': URL('https://github.com/dylanaraps/pure-bash-bible')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'flutter', 'author': 'flutter', 'meta_content': 'Flutter makes it easy and fast to build beautiful mobile apps.https://flutter.io', 'star': '30,579', 'fork': '1,337', 'origin_url': URL('https://github.com/flutter/flutter')}
INFO:GithubAnt:**********************************************************************I got one hot repo!
{'title': 'Java-Interview', 'author': 'crossoverJie', 'meta_content': '👨\u200d🎓 Java related : basic, concurrent, algorithm https://crossoverjie.top/categories/J…', 'star': '4,687', 'fork': '409', 'origin_url': URL('https://github.com/crossoverJie/Java-Interview')}
INFO:GithubAnt:Closed
INFO:GithubAnt:Get 7 Request in total
INFO:GithubAnt:Get 7 Response in total
INFO:GithubAnt:Get 6 dict in total
INFO:GithubAnt:Run GithubAnt in 18.157656 seconds

So, it's easy to configure an ant through class attributes:

class Ant(abc.ABC):
    response_pipelines: typing.List[Pipeline] = []
    request_pipelines: typing.List[Pipeline] = []
    item_pipelines: typing.List[Pipeline] = []
    request_cls = Request
    response_cls = Response
    request_timeout = 60
    request_retries = 3
    request_retry_delay = 5
    request_proxies: typing.List[typing.Union[str, URL]] = []
    request_max_redirects = 10
    request_allow_redirects = True
    response_in_stream = False
    connection_limit = 10  # see "TCPConnector" in "aiohttp"
    connection_limit_per_host = 0
    concurrent_limit = 100
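
For example, a subclass only needs to override the attributes it cares about (a minimal sketch; “PoliteAnt” and its values are illustrative):

from ant_nest.ant import Ant


class PoliteAnt(Ant):
    """Hypothetical ant that trades speed for politeness."""

    request_timeout = 30   # fail faster than the default 60 seconds
    request_retries = 1
    concurrent_limit = 10  # at most 10 coroutines running at once
    connection_limit = 5   # at most 5 TCP connections

    async def run(self):
        await self.request("https://example.com")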

And you can override some of this config for a single request:

async def request(
    self,
    url: typing.Union[str, URL],
    method: str = aiohttp.hdrs.METH_GET,
    params: typing.Optional[dict] = None,
    headers: typing.Optional[dict] = None,
    cookies: typing.Optional[dict] = None,
    data: typing.Optional[
        typing.Union[typing.AnyStr, typing.Dict, typing.IO]
    ] = None,
    proxy: typing.Optional[typing.Union[str, URL]] = None,
    timeout: typing.Optional[float] = None,
    retries: typing.Optional[int] = None,
    response_in_stream: typing.Optional[bool] = None,
) -> Response:
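
For example, one call can override the class-level timeout and retry behavior (a minimal sketch based on the signature above; the values are illustrative):

# inside an Ant coroutine such as "run"
response = await self.request(
    "https://github.com/explore",
    headers={"User-Agent": "ant-nest-example"},
    timeout=30,  # override "request_timeout" for this request only
    retries=0,   # do not retry this request
)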

About Item

We use a dict to store each item in the examples, but there are actually many ways to define an item: a dict, a plain class, an attrs class, a data class, or an ORM class. It depends on your needs and your choice.
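
For example, the GithubAnt above could collect items into a data class instead of a dict (a minimal sketch, assuming ItemExtractor accepts a data class the same way it accepts dict; the fields shown are illustrative):

import dataclasses

from ant_nest.things import ItemExtractor


@dataclasses.dataclass
class Repo:
    # hypothetical item class; one field per extractor
    title: str = ""
    author: str = ""


extractor = ItemExtractor(Repo)
extractor.add_extractor(
    "title", lambda x: x.html_element.xpath("//h1/strong/a/text()")[0]
)
extractor.add_extractor(
    "author", lambda x: x.html_element.xpath("//h1/span/a/text()")[0]
)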

Examples

You can find more examples in “./examples”.

Defects

  • Complex exception handling

One coroutine's exception will break the await chain, especially in a loop, unless we handle it by hand, e.g.:

for cor in self.as_completed((self.crawl(url) for url in self.urls)):
    try:
        await cor
    except Exception:  # many exceptions may be raised along an await chain
        pass

But we can use “self.as_completed_with_async” now, e.g.:

async for result in self.as_completed_with_async(
    (self.crawl(url) for url in self.urls), raise_exception=False
):
    # exceptions raised in "self.crawl(url)" will be skipped and logged automatically
    self.handle(result)

  • High memory usage

It's a “feature” that asyncio eats a lot of memory, especially with highly concurrent IO. We can simply set a concurrency limit (“connection_limit” or “concurrent_limit”), but it's tricky to balance performance against the limit.

Coding style

Follow “Flake8”, format with “Black”, type check with “MyPy”; see the Makefile for more detail.

Todo

[*] Log system
[*] Nest item extractor
[ ] Docs

