Project description
Arakneed
A general-purpose concurrent crawler for any directed graph, designed to be easy to use.
It is a lightweight way to organize your crawler code, rather than a full spider library or framework.
Why this, when Scrapy etc. already exist?
Because those tools are not meant to crawl the pictures on my laptop, yet I want to crawl them the same way a spider crawls the web.
Arakneed can be used to traverse any directed graph: a directory on your computer to collect pictures, all of someone's GitHub Gists... anything that looks like a directed graph.
It can, of course, also crawl a website like a traditional spider.
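To make the directed-graph framing concrete, here is a minimal sketch in plain Python, not Arakneed's API, of how a directory tree maps onto vertices (files and folders) and edges (containment); the ~/Pictures path is just an assumed example:
from pathlib import Path

def edges(vertex: Path):
    # Outgoing edges of a vertex: a directory points at its entries.
    return list(vertex.iterdir()) if vertex.is_dir() else []

def collect_pictures(root: Path):
    # Walk the graph iteratively, keeping only the picture vertices.
    pictures, frontier = [], [root]
    while frontier:
        vertex = frontier.pop()
        if vertex.suffix in {'.jpg', '.png', '.svg'}:
            pictures.append(vertex)
        frontier.extend(edges(vertex))
    return pictures

print(collect_pictures(Path('~/Pictures').expanduser()))
Arakneed performs the same kind of walk, but concurrently and with the per-vertex handling expressed as tasks.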
Usage
Install it as a dependency:
pip install -U arakneed
Or clone the repository and work right in place inside crawler.py:
git clone https://github.com/arakneed/crawler.git
How does it work?
Any vertex spotted by the spider will be scheduled as a task. The only thing you need to do is to define how to handle the tasks.
import asyncio
from pathlib import Path
import re

import aiohttp

from arakneed import Crawler, Task


async def resolver(task: Task, response: aiohttp.ClientResponse):
    # A 'page' task yields a new 'image' task for every picture it links to.
    if task.type == 'page':
        html = await response.text()
        return [
            Task('image', match[1])
            for match in re.finditer(r'<img.+?src="(.+?)".*?>', html)
            if match[1].endswith(('.jpg', '.png', '.svg'))
        ]
    # An 'image' task is a leaf: save the bytes under ~/Downloads/gh-images.
    if task.type == 'image':
        image_path = Path('~/Downloads/gh-images', task.key.split('/')[-1]).expanduser()
        image_path.parent.mkdir(parents=True, exist_ok=True)
        image_path.write_bytes(await response.content.read())


asyncio.run(Crawler().run(Task('page', 'https://github.com'), resolver))
This code downloads every image it finds on the GitHub home page. It should give you a good idea of what the business code looks like.
Examples
- Crawl a website
- Collect pictures on a local drive
- Abstract Syntax Tree analyzer
MISC
- Be careful with cycles in the directed graph if you are customizing the scheduler/spider. To know when it can take a rest, the framework recursively checks, for every vertex, whether all of its corresponding vertices have been resolved :) (see the sketch after this list)
- This paradigm is not distributed. You can get a glimpse of it through the Redis-based vertex-resolution check, but since a task is locked as soon as it is resolved, you cannot resolve the same task on several machines simultaneously.
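As an illustration of why cycles matter for such a check (a sketch of the idea only, not Arakneed's internal implementation), consider a recursive "is this vertex and everything reachable from it resolved?" test; the seen set is what keeps it from looping forever on a cycle:
def fully_resolved(vertex, children, resolved, seen=None):
    # `children` maps a vertex to the vertices it points at; `resolved` is the
    # set of vertices whose tasks have finished. Without `seen`, a cycle such
    # as A -> B -> A would make this recursion never terminate.
    seen = set() if seen is None else seen
    if vertex in seen:
        return True  # already being checked further up the call stack
    seen.add(vertex)
    return vertex in resolved and all(
        fully_resolved(child, children, resolved, seen)
        for child in children.get(vertex, [])
    )

# A two-vertex cycle: the check still terminates thanks to `seen`.
print(fully_resolved('A', {'A': ['B'], 'B': ['A']}, {'A', 'B'}))  # True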
Development
A branch called dev is recommended for general development.
Useful commands:
- install poetry
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
source $HOME/.poetry/env
- install dependencies
poetry install
- run tests
poetry run pytest
- lint
poetry run flake8 --max-line-length=120 --statistics
File details
Details for the file arakneed-0.3.0.tar.gz.
File metadata
- Download URL: arakneed-0.3.0.tar.gz
- Upload date:
- Size: 8.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.9.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 60ceeb2686b9b88cc1cc043a4412692248b646417f47e93ddf26f215e76885d5
MD5 | 7c6467fed09dcc59f8a8a4d58575ae6d
BLAKE2b-256 | e92ba6ad7fcc37b894f03a0de427aa85552c28dcf3fc1a73ef334f9baa572c9f
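If you want to verify a downloaded file against these digests, a minimal sketch (assuming the sdist sits in the current directory under the file name shown above):
import hashlib
from pathlib import Path

# Compare the local file's SHA256 with the digest published in the table above.
digest = hashlib.sha256(Path('arakneed-0.3.0.tar.gz').read_bytes()).hexdigest()
print(digest == '60ceeb2686b9b88cc1cc043a4412692248b646417f47e93ddf26f215e76885d5')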
File details
Details for the file arakneed-0.3.0-py3-none-any.whl.
File metadata
- Download URL: arakneed-0.3.0-py3-none-any.whl
- Upload date:
- Size: 9.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.9.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | ec158fac378753e22a8e651378f258d1293ec79246f8a0ec91e97f653c1688af
MD5 | ce5e8f81df03ea924eaa7b9e24b648e3
BLAKE2b-256 | 6d40a6031816fa0589d9d25e41a28e887baf770b70c238d76c2eb8e94ccbac56