A simple and clear Web Crawler framework build on python3.6+
Project description
Overview
AntNest is a simple, clear and fast Web Crawler framework build on python3.6+, powered by asyncio. It has only 600+ lines core code now(thanks powerful lib like aiohttp, lxml and other else).
Features
Useful http client out of box
Easy pipelines, in async or not
Easy item extractor, define data detail(by xpath, jpath or regex) and extract from html, json or strings.
Easy async work flow, build in async task pool
Install
pip install ant_nest
Usage
Create one demo project:
>>> ant_nest -c examples
Then we get a project:
drwxr-xr-x 5 bruce staff 160 Jun 30 18:24 ants -rw-r--r-- 1 bruce staff 208 Jun 26 22:59 settings.py
Presume we want to get hot repos from github, let`s create “examples/ants/example2.py”:
from yarl import URL from ant_nest.ant import Ant from ant_nest.pipelines import ItemFieldReplacePipeline from ant_nest.items import Extractor class GithubAnt(Ant): """Crawl trending repositories from github""" item_pipelines = [ ItemFieldReplacePipeline( ("meta_content", "star", "fork"), excess_chars=("\r", "\n", "\t", " ") ) ] def __init__(self): super().__init__() self.item_extractor = Extractor(dict) self.item_extractor.add_extractor( "title", lambda x: x.html_element.xpath("//h1/strong/a/text()")[0] ) self.item_extractor.add_extractor( "author", lambda x: x.html_element.xpath("//h1/span/a/text()")[0] ) self.item_extractor.add_extractor( "meta_content", lambda x: "".join( x.html_element.xpath( '//div[@class="repository-content "]/div[2]//text()' ) ), ) self.item_extractor.add_extractor( "star", lambda x: x.html_element.xpath( '//a[@class="social-count js-social-count"]/text()' )[0], ) self.item_extractor.add_extractor( "fork", lambda x: x.html_element.xpath('//a[@class="social-count"]/text()')[0], ) self.item_extractor.add_extractor("origin_url", lambda x: str(x.url)) async def crawl_repo(self, url): """Crawl information from one repo""" response = await self.request(url) # extract item from response item = self.item_extractor.extract(response) item["origin_url"] = response.url await self.collect(item) # let item go through pipelines(be cleaned) self.logger.info("*" * 70 + "I got one hot repo!\n" + str(item)) async def run(self): """App entrance, our play ground""" response = await self.request("https://github.com/explore") for url in response.html_element.xpath( "/html/body/div[4]/main/div[2]/div/div[2]/div[1]/article/div/div[1]/h1/a[2]/" "@href" ): # crawl many repos with our coroutines pool self.schedule_task(self.crawl_repo(response.url.join(URL(url)))) self.logger.info("Waiting...")
Then we can list all ants we defined (in “examples”)
>>> $ant_nest -l ants.example2.GithubAnt
Run it! (without debug log):
>>> ant_nest -a ants.example2.GithubAnt INFO:GithubAnt:Opening INFO:GithubAnt:Waiting... INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'app-ideas', 'author': 'florinpop17', 'meta_content': 'A Collection of application ideas which can be used to improve your coding skills.', 'star': '11.7k', 'fork': '500', 'origin_url': URL('https://github.com/florinpop17/app-ideas')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'Carbon', 'author': 'briannesbitt', 'meta_content': 'A simple PHP API extension for DateTime.https://carbon.nesbot.com/', 'star': '14k', 'fork': '249', 'origin_url': URL('https://github.com/briannesbitt/Carbon')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'org-roam', 'author': 'jethrokuan', 'meta_content': 'Rudimentary Roam replica with Org-modehttps://org-roam.readthedocs.io/en/la…', 'star': '261', 'fork': '27', 'origin_url': URL('https://github.com/jethrokuan/org-roam')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'joplin', 'author': 'laurent22', 'meta_content': 'Joplin - an open source note taking and to-do application with synchronization capabilities for Windows, macOS, Linux, Android and iOS. Forum: https://discourse.joplinapp.org/https://joplinapp.org', 'star': '13k', 'fork': '335', 'origin_url': URL('https://github.com/laurent22/joplin')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'snoop', 'author': 'snooppr', 'meta_content': 'Snoop — инструмент разведки на основе открытых данных', 'star': '281', 'fork': '9', 'origin_url': URL('https://github.com/snooppr/snoop')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': '1on1-questions', 'author': 'VGraupera', 'meta_content': 'Mega list of 1 on 1 meeting questions compiled from a variety to sources', 'star': '4k', 'fork': '93', 'origin_url': URL('https://github.com/VGraupera/1on1-questions')} INFO:GithubAnt:Get 8 Request in total with 8/60s rate INFO:GithubAnt:Get 7 Response in total with 7/60s rate INFO:GithubAnt:Get 6 dict in total with 6/60s rate INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'python-small-examples', 'author': 'jackzhenguo', 'meta_content': 'Python有趣的小例子一网打尽。Python基础、Python坑点、Python字符串和正则、Python绘图、Python日期和文件、Web开发、数据科学、机器学习、深度2.4k', 'fork': '102', 'origin_url': URL('https://github.com/jackzhenguo/python-small-examples')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'system-design-primer', 'author': 'donnemartin', 'meta_content': 'Learn how to design large-scale systems. Prep for the system design interview. Includes Anki flashcards.', 'star': '83.2k', 'fork': '4.4k', 'origin_url': URL('https://github.com/donnemartin/system-design-primer')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'awesome-scalability', 'author': 'binhnguyennus', 'meta_content': 'The Patterns of Scalable, Reliable, and Performant Large-Scale Systemshttp://awesome-scalability.com/', 'star': '24.5k', 'fork': '1.4k', 'origin_url': URL('https://github.com/binhnguyennus/awesome-scalability')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'gdb-frontend', 'author': 'rohanrhu', 'meta_content': '☕ GDBFrontend is an easy, flexible and extensionable gui debugger.https://oguzhaneroglu.com/projects/gd…', 'star': '716', 'fork': '14', 'origin_url': URL('https://github.com/rohanrhu/gdb-frontend')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'Complete-Python-3-Bootcamp', 'author': 'Pierian-Data', 'meta_content': 'Course Files for Complete Python 3 Bootcamp Course on Udemy', 'star': '8.1k', 'fork': '1.8k', 'origin_url': URL('https://github.com/Pierian-Data/Complete-Python-3-Bootcamp')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'leon', 'author': 'leon-ai', 'meta_content': '\U0001f9e0 Leon is your open-source personal assistant.https://getleon.ai', 'star': '6.3k', 'fork': '147', 'origin_url': URL('https://github.com/leon-ai/leon')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'esbuild', 'author': 'evanw', 'meta_content': 'An extremely fast JavaScript bundler and minifier', 'star': '2.3k', 'fork': '38', 'origin_url': URL('https://github.com/evanw/esbuild')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'wearable-microphone-jamming', 'author': 'y-x-c', 'meta_content': 'Repository for our paper Wearable Microphone Jamminghttp://sandlab.cs.uchicago.edu/jammer/', 'star': '138', 'fork': '10', 'origin_url': URL('https://github.com/y-x-c/wearable-microphone-jamming')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'efcore', 'author': 'dotnet', 'meta_content': 'EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.https://docs.microsoft.com/ef/core/', 'star': '8.7k', 'fork': '965', 'origin_url': URL('https://github.com/dotnet/efcore')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'playwright', 'author': 'microsoft', 'meta_content': 'Node library to automate Chromium, Firefox and WebKit with a single APIhttps://www.npmjs.com/package/playwright', 'star': '9k', 'fork': '92', 'origin_url': URL('https://github.com/microsoft/playwright')} INFO:GithubAnt:Get 18 Request in total with 10/60s rate INFO:GithubAnt:Get 17 Response in total with 10/60s rate INFO:GithubAnt:Get 16 dict in total with 10/60s rate INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'degoogle', 'author': 'tycrek', 'meta_content': 'A huge list of alternatives to Google products. Privacy tips, tricks, and links.https://degoogle.jmoore.dev', 'star': '2k', 'fork': '50', 'origin_url': URL('https://github.com/tycrek/degoogle')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'sherlock', 'author': 'sherlock-project', 'meta_content': '🔎 Hunt down social media accounts by username across social networkshttp://sherlock-project.github.io', 'star': '10.4k', 'fork': '207', 'origin_url': URL('https://github.com/sherlock-project/sherlock')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'the-art-of-command-line', 'author': 'jlevy', 'meta_content': 'Master the command line, in one page', 'star': '68.9k', 'fork': '2.2k', 'origin_url': URL('https://github.com/jlevy/the-art-of-command-line')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'freespeech', 'author': 'Merkie', 'meta_content': 'A free program designed to help the non-verbal.', 'star': '168', 'fork': '20', 'origin_url': URL('https://github.com/Merkie/freespeech')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'awesome-pentest', 'author': 'enaqx', 'meta_content': 'A collection of awesome penetration testing resources, tools and other shiny things', 'star': '11.4k', 'fork': '1k', 'origin_url': URL('https://github.com/enaqx/awesome-pentest')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'trax', 'author': 'google', 'meta_content': 'Trax — your path to advanced deep learning', 'star': '2.7k', 'fork': '90', 'origin_url': URL('https://github.com/google/trax')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'introtodeeplearning', 'author': 'aamini', 'meta_content': 'Lab Materials for MIT 6.S191: Introduction to Deep Learning', 'star': '1.6k', 'fork': '116', 'origin_url': URL('https://github.com/aamini/introtodeeplearning')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': 'CleanArchitecture', 'author': 'ardalis', 'meta_content': 'A starting point for Clean Architecture with ASP.NET Core', 'star': '3.8k', 'fork': '300', 'origin_url': URL('https://github.com/ardalis/CleanArchitecture')} INFO:GithubAnt:**********************************************************************I got one hot repo! {'title': '3y', 'author': 'ZhongFuCheng3y', 'meta_content': '📓从Java基础、JavaWeb基础到常用的框架再到面试题都有完整的教程,几乎涵盖了Java后端必备的知识点', 'star': '5.1k', 'fork': '285', 'origin_url': URL('https://github.com/ZhongFuCheng3y/3y')} INFO:GithubAnt:Closed INFO:GithubAnt:Get 26 Request in total INFO:GithubAnt:Get 26 Response in total INFO:GithubAnt:Get 25 dict in total INFO:GithubAnt:Run GithubAnt in 180.234251 seconds
About Item
We use dict to store one item in examples, actually it support many way: dict, normal class, atrrs’s class, data class and ORM class, it depend on your need and choice.
Examples
You can get some example in “./examples”
Defect
Complex exception handle
one coroutine’s exception will break await chain especially in a loop, unless we handle it by hand. eg:
for cor in self.as_completed((self.crawl(url) for url in self.urls)): try: await cor except Exception: # may raise many exception in a await chain pass
but we can use “self.as_completed_with_async” now, eg:
async fo result in self.as_completed_with_async( self.crawl(url) for url in self.urls, raise_exception=False): # exception in "self.crawl(url)" will be passed and logged automatic self.handle(result)
High memory usage
It`s a “feature” that asyncio eat large memory especially with high concurrent IO, we can set a concurrent limit(“connection_limit” or “concurrent_limit”) simply, but it`s complex to get the balance between performance and limit.
Coding style
Follow “Flake8”, Format by “Black”, typing check by “MyPy”, sea Makefile for more detail.
Todo
[*] Log system [*] Nest item extractor [ ] Docs
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file ant_nest-1.0.1.tar.gz
.
File metadata
- Download URL: ant_nest-1.0.1.tar.gz
- Upload date:
- Size: 22.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.4 CPython/3.8.2 Darwin/20.2.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0f1dfcf2168a5dc426e84cf5cebb63e24cf4e6125045924e2f7de82af19c5730 |
|
MD5 | a073799a1902133022fbcecae318ff2a |
|
BLAKE2b-256 | 3f6ed717d8ddd1d611d8667647fcd90a8dc17712c3302790ae2e9f9b709d42fe |
File details
Details for the file ant_nest-1.0.1-py3-none-any.whl
.
File metadata
- Download URL: ant_nest-1.0.1-py3-none-any.whl
- Upload date:
- Size: 19.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.4 CPython/3.8.2 Darwin/20.2.0
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 23828ffdf6410bc383022e0a35af0d0366d11bff9ff8d56c123f42bd09a0fbd8 |
|
MD5 | a0339532c6123e3ddd65d0e36b1a69c0 |
|
BLAKE2b-256 | 3de35b3ebf5d319acbb9810c47bc2b45dc66940d3d57979bb7101044b91330f6 |