A simple and clear web crawler framework built on Python 3.6+ with async support

Overview

AntNest is a simple, clear and fast web crawler framework built on Python 3.6+, powered by asyncio.

As a Scrapy user, I think Scrapy provides many awesome features that AntNest should have too. Here are the main differences:

  • Scrapy uses callbacks to write crawl logic, while AntNest uses coroutines (see the sketch after this list)
  • Scrapy is stable and widely used, while AntNest is in early development
  • AntNest has only 600+ lines of core code so far, and it works
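
To make the first point concrete, here is a rough contrast. A Scrapy spider chains its crawl logic through callbacks, roughly like the sketch below (standard Scrapy API; the spider name, URLs and XPath expressions are made up for illustration):

import scrapy


class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://fake_bookstore.com']

    def parse(self, response):
        # each matched link is scheduled with a callback instead of being awaited
        for href in response.xpath('//a[@class="single_book"]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_book)

    def parse_book(self, response):
        yield {'name': response.xpath('//div/h1/text()').extract_first()}

With AntNest the same flow is written as plain coroutines with "async"/"await", as the Usage section below shows.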

Features

  • Things (request, response and item) can pass through pipelines (async or not)
  • Item and item extractor: it's easy to define an item and extract it (by XPath for now) with validation (by field type)
  • Custom "ensure_future" and "as_completed" methods provide a concurrency limit and collect completed coroutines (a plain-asyncio sketch of the idea follows this list)
  • A default coroutine concurrency limit reduces memory usage
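
AntNest's own "ensure_future" and "as_completed" appear in the Usage and Defects sections below. As a framework-free illustration of what "collection of completed coroutines" means, the same pattern is available in plain asyncio (the "fetch" coroutine and URLs here are made-up stand-ins, not part of AntNest):

import asyncio


async def fetch(url):
    await asyncio.sleep(0.1)  # stand-in for real network IO
    return url


async def main():
    urls = ['https://example.com/%d' % i for i in range(10)]
    # results are yielded as each coroutine finishes, not in submission order
    for future in asyncio.as_completed([fetch(url) for url in urls]):
        print(await future)


asyncio.get_event_loop().run_until_complete(main())

AntNest wraps this pattern and additionally enforces a concurrency limit, which plain "asyncio.as_completed" does not.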

Install

pip install ant_nest

Usage

Let's take a look. Create book.py first:

from ant_nest.ant import Ant
from ant_nest.things import Item, StringField, IntField, ItemExtractor
from ant_nest.pipelines import *

import settings  # the local settings.py module created below

# define an item structure we want to crawl
class BookItem(Item):
    name = StringField()
    author = StringField()
    content = StringField()
    origin_url = StringField()
    date = IntField(null=True)  # this field is optional


# define our ant
class BookAnt(Ant):
    # the things (request, response, item) pass through these pipelines in order;
    # a pipeline can change or drop them
    item_pipelines = [ItemValidatePipeline(),
                      ItemMysqlInsertPipeline(settings.MYSQL_HOST, settings.MYSQL_PORT, settings.MYSQL_USER,
                                              settings.MYSQL_PASSWORD, settings.MYSQL_DATABASE, 'book'),
                      ReportPipeline()]
    request_pipelines = [RequestDuplicateFilterPipeline(), RequestUserAgentPipeline(), ReportPipeline()]
    response_pipelines = [ResponseRetryPipeline(), ResponseFilterErrorPipeline(), ReportPipeline()]

    def __init__(self):
        super().__init__()
        # define an ItemExtractor to extract item fields from the response (HTML source) by XPath
        self.item_extractor = ItemExtractor(BookItem)
        self.item_extractor.add_xpath('name', '//div/h1/text()')
        self.item_extractor.add_xpath('author', '/html/body/div[1]/div[@class="author"]/text()')
        self.item_extractor.add_xpath('content', '/html/body/div[2]/div[2]/div[2]//text()',
                                      ItemExtractor.join_all)

    # crawl book information
    async def crawl_book(self, url):
        # send request and wait for response
        response = await self.request(url)
        # extract item from response
        item = self.item_extractor.extract(response)
        item.origin_url = str(response.url)  # or item['origin_url'] = str(response.url)
        # wait "collect" coroutine, it will let item pass through "item_pipelines"
        await self.collect(item)

    # the app entry point
    async def run(self):
        response = await self.request('https://fake_bookstore.com')
        # extract all book links by XPath ("html_element" is an HtmlElement object from lxml)
        urls = response.html_element.xpath('//a[@class="single_book"]/@href')
        # run the "crawl_book" coroutines concurrently
        for url in urls:
            # "self.ensure_future" works like asyncio's "ensure_future", but it also applies the concurrency limit
            self.ensure_future(self.crawl_book(url))

Create a settings.py:

import logging


logging.basicConfig(level=logging.DEBUG)
ANT_PACKAGES = ['book']
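
The BookAnt class above also reads MySQL connection values from this settings module (they are passed to ItemMysqlInsertPipeline). The values below are placeholders only; adjust them for your own database:

MYSQL_HOST = '127.0.0.1'  # placeholder values, not part of the original example
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''
MYSQL_DATABASE = 'bookstore'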

Then in a console:

$ ant_nest -a book.BookAnt

Defects

  • Complex exception handling

One coroutine's exception will break the await chain, especially in a loop, unless we handle it by hand, e.g.:

for cor in self.as_completed((self.crawl(url) for url in self.urls)):
    try:
        await cor
    except Exception:  # many kinds of exceptions may be raised along an await chain
        pass
  • High memory usage

It's a "feature" that asyncio eats a lot of memory, especially with highly concurrent IO. One simple solution is to set a concurrency limit (see the sketch below), but it's hard to find the right balance between performance and the limit.
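
As a rough sketch of such a limit (this is not how AntNest implements it, just an illustration with plain asyncio; the limit of 10 and the sleep call are arbitrary placeholders):

import asyncio

semaphore = asyncio.Semaphore(10)  # a small limit saves IO resources and memory but lowers throughput


async def crawl(url):
    async with semaphore:  # at most 10 crawls run their IO at the same time
        await asyncio.sleep(0.1)  # stand-in for a real request


async def main():
    await asyncio.gather(*(crawl('https://example.com/%d' % i) for i in range(1000)))


asyncio.get_event_loop().run_until_complete(main())

Picking the semaphore size is exactly the performance-versus-memory balance mentioned above.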

Todo

  • Fix memory leaks
  • Regular expression extractor support
  • JPath (JSON path) extractor support
  • Redis pipeline
