
Multiprocess Crawler

Overview

The main aim of this project is to provide a simple tool for multiprocess crawling.
It can be used either as a library in a Python project or as a command-line tool.

Concept

This package was written to serve the general purpose of crawling various resources.
To achieve that, the Crawler interface needs to be adopted.
Just by implementing this interface, new Crawlers can be created and used with the CrawlMp manager.
A resource Crawler is then driven by a CrawlWorker.
Every crawler first enters an entry point (link) and extracts hits and links.
If specified, a pipeline of actions is executed on every hit. By default, hits are collected in SIMPLE_MODE only, which is also the fastest way to crawl.
If other metadata related to a hit is required, use MODE_EXTENDED.
Other workers can pick up and follow link(s) from the shared list asynchronously.
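
The exact shape of the Crawler interface is defined in the crawlMp sources; the standalone sketch below does not use the package's real API and only illustrates the concept: enter a link, collect hits, and expose further links for other workers to follow.

import os
import re


class DirectoryCrawlerSketch:
    """Conceptual stand-in for a crawlMp Crawler, not the package's real interface."""

    def __init__(self, entry_link, pattern=r"\.zip$"):
        self.pattern = re.compile(pattern)
        self.hits = []   # matching entries found at the entry point
        self.links = []  # sub-directories another worker could pick up later
        self.crawl(entry_link)

    def crawl(self, link):
        # Enter the entry point and extract hits and links in a single pass.
        for name in os.listdir(link):
            path = os.path.join(link, name)
            if os.path.isdir(path):
                self.links.append(path)
            elif self.pattern.search(name):
                self.hits.append(path)

A pool of workers would repeatedly pop links from a shared list, run such a crawler on each of them and append the newly discovered links back; that is roughly the division of work that CrawlMp and CrawlWorker provide.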

What is in the package

  • Crawler interface
  • Action interface
  • File system Crawler and Actions with search capabilities
  • Scripts providing easy access from the command line

Installation

Pip

python3 -m pip install crawlMp

Git

git clone https://github.com/domarm-comat/crawlMp.git
cd crawlMp
python3 setup.py install
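
On setups where invoking setup.py directly is deprecated, the cloned repository can also be installed with pip:

python3 -m pip install .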

Usage examples

Scripts

  • Show help:
    search_fs_mp --help
  • Search for .zip files:
    search_fs_mp \\.zip$
  • Get all .zip files in different directories:
    search_fs_mp \\.zip$ -l /home /usr/share
  • Show search summary:
    search_fs_mp \\.zip$ -l /home /usr/share -os

Python code (blocking)

from crawlMp.crawlMp import CrawlMp
from crawlMp.crawlers.crawler_fs import CrawlerSearchFs
from crawlMp.snippets.output import print_summary

manager = CrawlMp(CrawlerSearchFs, links=["/home"], num_proc=8, pattern=r"\.zip$")
manager.start()
print_summary(manager.results)
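
print_summary is only a convenience snippet; the results object can also be inspected directly. The hits attribute used below is an assumption about the results structure and should be checked against the crawlMp sources:

# Attribute name "hits" is assumed, not taken from the crawlMp documentation.
for hit in manager.results.hits:
    print(hit)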

Python code (callback)

from crawlMp.crawlMp import CrawlMp
from crawlMp.crawlers.crawler_fs import CrawlerSearchFs
from crawlMp.snippets.output import print_summary


def on_done(manager):
    print_summary(manager.results)


manager = CrawlMp(CrawlerSearchFs, links=["/home"], num_proc=8, pattern=r"\.zip$")
manager.start(on_done)

Python code (actions)

from crawlMp.crawlMp import CrawlMp
from crawlMp.actions.action_fs import Copy, Remove
from crawlMp.crawlers.crawler_fs import CrawlerSearchFs
from crawlMp.snippets.output import print_summary


def on_done(manager):
    print_summary(manager.results)


# Copy all found zip files and then remove them.
# This is just to demonstrate the usage of actions.
actions = (Copy(target_dir="/home/domarm/zip_files"), Remove())
manager = CrawlMp(CrawlerSearchFs, links=["/home"], num_proc=8, pattern=r"\.zip$", actions=actions)
manager.start(on_done)
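
Copy and Remove ship with the package; the Action interface also allows custom per-hit operations. The class below is only a conceptual, standalone sketch (it does not derive from crawlMp's real Action base class) illustrating the idea of a pipeline step that receives one hit, does its work and passes the hit on:

import os


class CollectSizeSketch:
    """Conceptual stand-in for a crawlMp Action, not the package's real interface."""

    def __init__(self):
        self.total_bytes = 0

    def __call__(self, hit):
        # Receive one hit (a file path), accumulate its size and return the hit
        # so the next action in the pipeline can process it as well.
        self.total_bytes += os.path.getsize(hit)
        return hit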

Code coverage

Run pytest and code coverage by executing the following commands:

coverage run -m pytest --rootdir ./crawlMp/tests/
coverage report > coverage.txt

