ant-nest

A simple and clear web crawler framework built on Python 3.6+ with asyncio

These details have not been verified by PyPI

Project links

Homepage

Project description

Overview

AntNest is a simple, clear and fast web crawler framework built on Python 3.6+, powered by asyncio.

As a Scrapy user, I think Scrapy provides many awesome features that AntNest should have too. These are the main differences:

Scrapy uses callbacks to structure code, while AntNest uses coroutines
Scrapy is stable and widely used, while AntNest is in early development
AntNest has only 600+ lines of core code now (thanks to powerful libraries such as aiohttp and lxml), and it works
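
To make the first difference concrete, here is a minimal sketch in plain asyncio. It does not use Scrapy or AntNest; the function names and the fake response body are assumptions made up for illustration.

```python
import asyncio


def fetch_with_callback(url, callback):
    """Callback style: the caller hands over a function that receives the result."""
    # Hypothetical stand-in for network IO: build a fake response body.
    callback(f"<html>{url}</html>")


async def fetch_with_coroutine(url):
    """Coroutine style: the caller simply awaits the result."""
    await asyncio.sleep(0)  # stand-in for non-blocking network IO
    return f"<html>{url}</html>"


def crawl_callback_style(url):
    results = []
    # The follow-up logic lives in a separate callable (here: results.append).
    fetch_with_callback(url, results.append)
    return results


async def crawl_coroutine_style(url):
    # The follow-up logic reads top to bottom inside one function.
    body = await fetch_with_coroutine(url)
    return [body]
```

Both styles produce the same result; with coroutines the request and the handling of its response stay in one function. On Python 3.7+ the coroutine version can be driven with `asyncio.run(crawl_coroutine_style(url))`.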

Features

Things (requests, responses and items) can pass through pipelines (async or not)
Item and item extractor: it is easy to define and extract (by xpath, jpath or regex) a validated (by field type) item
Custom “ensure_future” and “as_completed” methods provide a concurrency limit and collection of completed coroutines
A default coroutine concurrency limit reduces memory usage
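
The idea behind a concurrency limit combined with collecting completed coroutines can be sketched with the standard library alone. This is not AntNest's implementation; `limited`, `fetch`, `crawl_all` and the `state` counter are hypothetical names used only for illustration.

```python
import asyncio


async def limited(semaphore, coro):
    """Run coro only while holding one of the semaphore's slots."""
    async with semaphore:
        return await coro


async def fetch(url, state):
    # Hypothetical stand-in for a request; it tracks how many run at once.
    state["running"] += 1
    state["peak"] = max(state["peak"], state["running"])
    await asyncio.sleep(0.01)
    state["running"] -= 1
    return url.upper()


async def crawl_all(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}
    tasks = [asyncio.ensure_future(limited(semaphore, fetch(u, state)))
             for u in urls]
    results = []
    # as_completed yields awaitables in the order in which they finish.
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return sorted(results), state["peak"]
```

All tasks are scheduled up front, but at most `limit` of them are inside `fetch` at any moment; the rest wait on the semaphore.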

Install

pip install ant_nest

Usage

Let's take a look. Create book.py first:

from ant_nest import *

# define the item structure we want to crawl
class BookItem(Item):
    name = StringField()
    author = StringField(default='Li')
    content = StringField()
    origin_url = StringField()
    date = IntField(null=True)  # this field is optional


# define our ant
class BookAnt(Ant):
    request_retry_delay = 10
    request_allow_redirects = False
    # things (request, response, item) pass through the pipelines in order; pipelines can change or drop them
    item_pipelines = [ItemValidatePipeline(),
                      ItemMysqlInsertPipeline(settings.MYSQL_HOST, settings.MYSQL_PORT, settings.MYSQL_USER,
                                              settings.MYSQL_PASSWORD, settings.MYSQL_DATABASE, 'book'),
                      ReportPipeline()]
    request_pipelines = [RequestDuplicateFilterPipeline(), RequestUserAgentPipeline(), ReportPipeline()]
    response_pipelines = [ResponseFilterErrorPipeline(), ReportPipeline()]


    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # define an ItemExtractor to extract item fields from the response (HTML source) by regex or xpath
        self.item_extractor = ItemExtractor(BookItem)
        self.item_extractor.add_regex('name', r'name=(\w+);')
        self.item_extractor.add_xpath('author', '/html/body/div[1]/div[@class="author"]/text()')
        self.item_extractor.add_xpath('content', '/html/body/div[2]/div[2]/div[2]//text()',
                                      ItemExtractor.join_all)

    # crawl book information
    async def crawl_book(self, url):
        # send request and wait for response
        response = await self.request(url)
        # extract item from response
        item = self.item_extractor.extract(response)
        item.origin_url = str(response.url)  # or item['origin_url'] = str(response.url)
        # await the "collect" coroutine, which passes the item through "item_pipelines"
        await self.collect(item)

    # app entrance
    async def run(self):
        response = await self.request('https://fake_bookstore.com')
        # extract all book links by xpath ("html_element" is a HtmlElement object from lxml lib)
        urls = response.html_element.xpath('//a[@class="single_book"]/@href')
        # run "crawl_book" coroutines concurrently
        for url in urls:
            # "queen.schedule_coroutine" is a function like "ensure_future" in "asyncio",
            # but it provides some extra features
            queen.schedule_coroutine(self.crawl_book(url))
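
The pipeline behaviour described in the comments above (things pass through in order, and a pipeline can change or drop them) can be sketched without AntNest. The function names and the convention that returning None drops the thing are assumptions for illustration, not AntNest's actual API.

```python
def run_pipelines(thing, pipelines):
    """Pass thing through each pipeline in order; None means it was dropped."""
    for pipeline in pipelines:
        thing = pipeline(thing)
        if thing is None:
            return None
    return thing


def strip_name(item):
    # A pipeline may change the thing it receives.
    return {**item, "name": item["name"].strip()}


def drop_without_name(item):
    # A pipeline may drop the thing (here: by returning None).
    return item if item["name"] else None
```

For example, `run_pipelines({"name": " Ant "}, [strip_name, drop_without_name])` yields the cleaned item, while an item whose name is only whitespace is dropped.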

Create a settings.py:

import logging


logging.basicConfig(level=logging.DEBUG)
ANT_PACKAGES = ['book']

Then in a console:

$ ant_nest -a book.BookAnt

Defect

Complex exception handling

An exception in one coroutine will break the await chain, especially in a loop, unless we handle it by hand, e.g.:

for cor in self.as_completed((self.crawl(url) for url in self.urls)):
    try:
        await cor
    except Exception:  # many different exceptions may be raised in an await chain
        pass
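
The same pattern can be tried with the standard library's `asyncio.as_completed`. The sketch below is not AntNest code; `crawl` and `crawl_all` are hypothetical stand-ins, and it records failures instead of silently passing over them.

```python
import asyncio


async def crawl(url):
    # Hypothetical stand-in for a crawl coroutine; some URLs fail.
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(f"cannot crawl {url}")
    return url


async def crawl_all(urls):
    results, errors = [], []
    for finished in asyncio.as_completed([crawl(u) for u in urls]):
        try:
            results.append(await finished)
        except ValueError as exc:
            # Record the failure and keep consuming the remaining coroutines.
            errors.append(str(exc))
    return sorted(results), sorted(errors)
```

Because the `try` wraps each `await` individually, one failing URL does not stop the loop from collecting the others.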

High memory usage

It is a “feature” of asyncio that it uses a lot of memory, especially with highly concurrent IO. One simple solution is to set a concurrency limit, but it is hard to find the right balance between performance and the limit.
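
One way to bound both concurrency and the number of live coroutines is a fixed pool of workers reading from a queue. This is a generic asyncio sketch under that assumption, not how AntNest applies its limit; `worker`, `crawl_with_pool` and `pool_size` are hypothetical names.

```python
import asyncio


async def worker(queue, results):
    """Process URLs from the queue one at a time until cancelled."""
    while True:
        url = await queue.get()
        try:
            await asyncio.sleep(0)  # stand-in for a request
            results.append(url)
        finally:
            queue.task_done()


async def crawl_with_pool(urls, pool_size=3):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    # Only pool_size worker tasks exist, no matter how many URLs are queued.
    workers = [asyncio.ensure_future(worker(queue, results))
               for _ in range(pool_size)]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

A larger `pool_size` raises throughput and memory use together, which is the trade-off described above.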

Todo

Memory leaks?
Log system

Project details

Release history

Version	Release date
1.0.1	Jan 1, 2021
1.0.0	Nov 28, 2020
0.38.1	Apr 22, 2019
0.38.0	Apr 10, 2019
0.37.2	Apr 9, 2019
0.37.1	Sep 16, 2018
0.37.0	Jul 30, 2018
0.36.3	Jul 23, 2018
0.36.2	Jul 14, 2018
0.36.1	Jul 3, 2018
0.36	Jul 3, 2018
0.35	Jun 14, 2018
0.34.2	Apr 12, 2018
0.34.1	Mar 30, 2018
0.34.0	Mar 23, 2018
0.33.0	Mar 16, 2018
0.32.0	Mar 5, 2018
0.31.0	Feb 28, 2018
0.30.5	Feb 27, 2018
0.30.4	Feb 26, 2018
0.30.3	Feb 25, 2018
0.30.2	Feb 24, 2018
0.30.1	Feb 24, 2018
0.30.0	Feb 11, 2018
0.29.0	Feb 7, 2018
0.28.0	Feb 3, 2018
0.27.4	Jan 30, 2018
0.27.3	Jan 25, 2018
0.27.2	Jan 24, 2018
0.27.1	Jan 23, 2018
0.27.0	Jan 22, 2018
0.26.2	Jan 20, 2018
0.26.1	Jan 18, 2018
0.26	Jan 16, 2018
0.25.2	Jan 15, 2018
0.25.1	Jan 13, 2018
0.25.0	Jan 8, 2018
0.24.0	Jan 6, 2018
0.23.3	Jan 4, 2018
0.23.2	Jan 3, 2018 (this version)
0.23.1	Jan 2, 2018
0.23.0	Jan 1, 2018
0.22.2	Dec 28, 2017
0.22.1	Dec 25, 2017
0.22.0	Dec 24, 2017
0.21.0	Dec 20, 2017
0.20.1	Dec 12, 2017
0.20	Dec 9, 2017
0.19.2	Dec 7, 2017
0.19.1	Dec 7, 2017
0.19	Dec 7, 2017
0.18	Dec 7, 2017
0.17	Dec 5, 2017
0.16	Dec 4, 2017
0.15	Dec 2, 2017
0.14	Dec 1, 2017
0.13	Nov 30, 2017
0.12	Nov 29, 2017
0.11	Nov 27, 2017

Download files

Source Distribution

ant_nest-0.23.2.tar.gz (14.4 kB)

Uploaded Jan 3, 2018 Source

File details

Details for the file ant_nest-0.23.2.tar.gz.

File metadata

Download URL: ant_nest-0.23.2.tar.gz
Upload date: Jan 3, 2018
Size: 14.4 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for ant_nest-0.23.2.tar.gz
Algorithm	Hash digest
SHA256	`c10c7d2cfc290927268b8873a2716f98e4853de1a08432746ab833c415382b2f`
MD5	`249d48e7b6bd20b951e33e28912aed6e`
BLAKE2b-256	`da7f02089852d2cf8eac33d14253c223d87a0d24b4fb434a3a4ccfadba7604ad`


ant-nest 0.23.2
