Arakneed

A general-purpose concurrent crawler for any directed graph, designed to be easy to use.

It is more a practical way to organize your crawler code than a spider library or framework.

Why this when there's Scrapy, etc.?

Because those tools are not meant to crawl the pictures on my laptop, and I want to crawl them the way a spider would.

Arakneed can traverse any directed graph: a directory on your computer to collect pictures, all of someone's GitHub Gists, or anything else that looks like a directed graph.

It can also crawl a website like a traditional spider.

Usage

Install as a dependency:

pip install -U arakneed

Or clone the repo and work in place, directly in the file crawler.py:

git clone https://github.com/arakneed/crawler.git

How does it work?

Any vertex spotted by the spider is scheduled as a task. All you need to do is define how to handle the tasks.

import asyncio
from pathlib import Path
import re

import aiohttp
from arakneed import Crawler, Task

# Matches <img> tags and captures the value of their src attribute.
IMG_SRC = re.compile(r'<img.+?src="(.+?)".*?>')
IMAGE_SUFFIXES = ('.jpg', '.png', '.svg')


async def resolver(task: Task, response: aiohttp.ClientResponse):

    if task.type == 'page':
        html = await response.text()

        # Every matching image found on the page becomes a new task.
        return [
            Task('image', match[1])
            for match in IMG_SRC.finditer(html)
            if match[1].endswith(IMAGE_SUFFIXES)
        ]

    if task.type == 'image':
        image_path = Path('~/Downloads/gh-images', task.key.split('/')[-1]).expanduser()
        # Create the target directory (and any missing parents) if needed.
        image_path.parent.mkdir(parents=True, exist_ok=True)
        image_path.write_bytes(await response.content.read())


asyncio.run(Crawler().run(Task('page', 'https://github.com'), resolver))

This code downloads every .jpg, .png, and .svg image it finds on the GitHub home page. It should give you an idea of what the business code looks like.
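To make the "every vertex becomes a task" idea concrete, here is a minimal sketch built only on the standard library. It is not Arakneed's implementation: the names `crawl`, `neighbours`, `run_demo`, and the in-memory `GRAPH` are made up for illustration, and the resolver here simply returns the vertices a vertex points to.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List

# Hypothetical in-memory directed graph, used only for illustration.
GRAPH: Dict[str, List[str]] = {
    'a': ['b', 'c'],
    'b': ['d'],
    'c': ['d'],
    'd': [],
}

Resolver = Callable[[str], Awaitable[Iterable[str]]]


async def crawl(start: str, resolver: Resolver, workers: int = 4) -> List[str]:
    """Resolve every vertex reachable from start, a few vertices at a time."""
    queue: asyncio.Queue = asyncio.Queue()
    seen = {start}           # vertices that have already been scheduled
    visited: List[str] = []  # vertices that have been resolved
    queue.put_nowait(start)

    async def worker() -> None:
        while True:
            vertex = await queue.get()
            try:
                for child in await resolver(vertex):
                    if child not in seen:  # schedule each vertex only once
                        seen.add(child)
                        queue.put_nowait(child)
                visited.append(vertex)
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # returns once every scheduled vertex is resolved
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return visited


async def neighbours(vertex: str) -> List[str]:
    await asyncio.sleep(0)  # stand-in for real I/O such as an HTTP request
    return GRAPH[vertex]


def run_demo() -> List[str]:
    return asyncio.run(crawl('a', neighbours))


if __name__ == '__main__':
    print(sorted(run_demo()))
```

The `seen` set is what keeps vertex `d` from being scheduled twice even though both `b` and `c` point to it.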

Examples

  • Crawl a website
  • Collect pictures on a local drive
  • Abstract Syntax Tree analyzer
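As a rough illustration of how a local directory can be treated as a directed graph, the sketch below walks a folder with the standard library only. It does not use Arakneed and is not taken from the linked examples; `Vertex`, `resolve`, and `collect_images` are hypothetical names, and the `(type, key)` tuple only loosely mirrors the `Task(type, key)` shape shown in the usage code.

```python
from pathlib import Path
from typing import List, Tuple

Vertex = Tuple[str, Path]  # (type, key): 'dir' or 'image', plus its path
IMAGE_SUFFIXES = {'.jpg', '.png', '.svg'}


def resolve(vertex: Vertex, found: List[Path]) -> List[Vertex]:
    """Handle one vertex and return the vertices it points to."""
    kind, path = vertex
    if kind == 'dir':
        children: List[Vertex] = []
        for child in sorted(path.iterdir()):
            if child.is_symlink():
                continue  # skip symlinks so the graph cannot loop back on itself
            if child.is_dir():
                children.append(('dir', child))
            elif child.suffix.lower() in IMAGE_SUFFIXES:
                children.append(('image', child))
        return children
    if kind == 'image':
        found.append(path)  # a real handler might copy or index the file
    return []


def collect_images(root: Path) -> List[Path]:
    """Resolve vertices one by one until none are pending."""
    found: List[Path] = []
    pending: List[Vertex] = [('dir', root)]
    while pending:
        pending.extend(resolve(pending.pop(), found))
    return found


if __name__ == '__main__':
    for image in collect_images(Path.home() / 'Pictures'):
        print(image)
```

Directories play the role of pages and image files play the role of leaf vertices, so the handler has the same two branches as the website resolver.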

MISC

  • Be careful with cycles in the directed graph if you are customizing the scheduler/spider. For every vertex, the framework recursively checks whether all of its corresponding vertices are resolved, so that it knows when it can take a rest :)
  • This paradigm is not distributed. The Redis-based check of resolved vertices gives you a taste of it, but a task is locked as soon as it is resolved, so you cannot resolve a task on several machines simultaneously.
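The warning about cycles can be illustrated with a small sketch of a "is everything below this vertex resolved?" check. This is an assumption about the general technique, not Arakneed's actual code: `fully_resolved`, `children`, and `resolved` are hypothetical names, and the check is written iteratively with a `seen` set rather than recursively, so that a cycle cannot make it loop forever.

```python
from typing import Dict, List, Set


def fully_resolved(start: str,
                   children: Dict[str, List[str]],
                   resolved: Set[str]) -> bool:
    """Return True if start and every vertex reachable from it are resolved."""
    seen = {start}   # guards against revisiting vertices on a cycle
    stack = [start]
    while stack:
        vertex = stack.pop()
        if vertex not in resolved:
            return False
        for child in children.get(vertex, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return True


if __name__ == '__main__':
    # 'a' and 'b' point at each other, which forms a cycle.
    graph = {'a': ['b'], 'b': ['a', 'c'], 'c': []}
    print(fully_resolved('a', graph, {'a', 'b', 'c'}))
    print(fully_resolved('a', graph, {'a', 'b'}))
```

Without the `seen` set, the `a` → `b` → `a` loop would keep pushing the same two vertices onto the stack and the check would never finish.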

Development

A branch named dev is recommended for day-to-day development.

Useful commands:

  • install poetry
    curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
    source $HOME/.poetry/env
  • install dependencies
    poetry install
  • run tests
    poetry run pytest
  • lint
    poetry run flake8 --max-line-length=120 --statistics
