Scrapy for Request Queue

os-rq-scrapy

A framework for Scrapy that works with os-rq-pod and os-rq-hub to build a "broad crawl" system.

As you know, Scrapy is a very popular Python crawler framework. It is well suited to "focused crawls": starting from several URLs on specific sites, it fetches HTML, extracts and saves "structured data", and follows links that match given patterns to crawl recursively. But for large-scale, long-running crawling, especially "broad crawls", Scrapy alone is not enough. Basically, you have to decouple the whole crawling system into several subsystems: a high-performance, full-featured distributed fetcher, a task scheduler, an HTML extractor, a link database, data storage, proxies, and many auxiliary components. It becomes even more complex when the system must support multi-tenancy.

The os-rq-scrapy and os-rq-pod projects are basic components for building a "broad crawl" system. The core concepts are very simple: os-rq-pod is a multi-site request queue with an HTTP API for receiving requests. os-rq-scrapy is the fetcher; it gets requests from os-rq-pod and crawls multiple sites concurrently. os-rq-hub can also be used to connect multiple pod and Scrapy instances so that they work simultaneously.
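To make the division of labor concrete, here is a toy in-memory sketch of the idea: a queue that files incoming requests per site, and a fetcher loop that pulls from it while rotating across sites. The names ToyRequestQueue and drain are hypothetical and exist only for illustration; this is not the os-rq-pod HTTP API or the os-rq-scrapy implementation, and the per-site rotation is an assumption about what "multi-site" scheduling might look like.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class ToyRequestQueue:
    """In-memory stand-in for a multi-site request queue (hypothetical)."""

    def __init__(self):
        # One FIFO queue per site, keyed by host name.
        self._queues = OrderedDict()

    def push(self, url):
        """Receive a request and file it under its site."""
        host = urlsplit(url).netloc
        self._queues.setdefault(host, deque()).append(url)

    def pop(self):
        """Return the next URL, rotating across sites; None when empty."""
        if not self._queues:
            return None
        host, queue = next(iter(self._queues.items()))
        url = queue.popleft()
        if queue:
            # Move this site to the back so other sites get a turn.
            self._queues.move_to_end(host)
        else:
            del self._queues[host]
        return url


def drain(rq):
    """Toy fetcher loop: pull requests until the queue is empty."""
    order = []
    while True:
        url = rq.pop()
        if url is None:
            return order
        order.append(url)
```

In the real system the queue and the fetcher are separate processes that talk over HTTP; the sketch only shows why keeping one queue per site lets a single fetcher interleave many sites instead of exhausting one before touching the next.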

Requirements

  • Python 3.6+ (pypy3.6+)
  • Scrapy 2.0

Optional requirements:

  • ujson, for faster JSON handling

Install

pip install os-rq-scrapy

Usage

Command line

The rq-scrapy command enhances the basic scrapy command. When RQ_API is configured, the crawl subcommand runs in rq mode and gets its requests from rq.
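As a minimal sketch, assuming RQ_API is read like any other Scrapy setting and holds the base URL of the request queue's HTTP API, a project's settings.py might contain a line like the one below. The host, port, and path are placeholders, not values documented by this project.

```python
# settings.py (fragment)
# Assumption: RQ_API is picked up as a regular Scrapy setting.
# The URL below is a placeholder; point it at your own request queue.
RQ_API = "http://localhost:8000/api/"
```

With a setting like this in place, the crawl subcommand would be expected to run in rq mode; without it, rq-scrapy should behave like the plain scrapy command.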

Unit Tests

tox

License

MIT licensed.
