Run a Scrapy spider programmatically from a script or a Celery task - no project required.

Scrapyscript

Embed Scrapy jobs directly in your code

What is Scrapyscript?

Scrapyscript is a Python library you can use to run Scrapy spiders directly from your code. Scrapy is a great framework for scraping projects, but sometimes you don't need the whole framework; you just want to run a small spider from a script or a Celery task. That's where Scrapyscript comes in.

With Scrapyscript, you can:

  • wrap regular Scrapy Spiders in a Job
  • load the Job(s) in a Processor
  • call processor.run() to execute them

... returning all results when the last job completes.

Let's see an example.

import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        # self.url is supplied by Job(...) at runtime
        yield scrapy.Request(self.url)

    def parse(self, response):
        title = response.xpath("//title/text()").get()
        return {"title": title}

job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)
[{'title': 'Welcome to Python.org'}]

See the examples directory for more, including a complete Celery example.
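
For instance, a minimal Celery task could look like the sketch below. It reuses the PythonSpider defined above; the app name, broker URL, and task name here are illustrative assumptions, not part of the Scrapyscript API.

from celery import Celery
from scrapyscript import Job, Processor

# Hypothetical Celery app; the Redis broker URL is an assumption for this sketch
app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def crawl(url):
    # Processor runs the job on its own multiprocessing reactor,
    # so it can be called safely from inside a Celery worker
    job = Job(PythonSpider, url=url)
    return Processor(settings=None).run(job)

Calling crawl.delay("http://www.python.org") would then deliver the scraped dicts through your result backend.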

Install

pip install scrapyscript

Requirements

  • Linux or macOS
  • Python 3.8+
  • Scrapy 2.5+

API

Job (spider, *args, **kwargs)

A single request to run a spider. Optional *args and **kwargs are passed through to the spider constructor at runtime.

# url will be available as self.url inside MySpider at runtime
myjob = Job(MySpider, url='http://www.github.com')

Processor (settings=None)

Create a multiprocessing reactor for running spiders. Optionally provide a scrapy.settings.Settings object to configure the Scrapy runtime.

settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)

Processor.run(jobs)

Start the Scrapy engine, and execute one or more jobs. Blocks and returns consolidated results in a single list. jobs can be a single instance of Job, or a list.

results = processor.run(myjob)

or

results = processor.run([myjob1, myjob2, ...])

A word about Spider outputs

As per the Scrapy docs, a Spider must return an iterable of Request objects, dicts, or Item objects.

Requests will be consumed by Scrapy inside the Job. dict or scrapy.Item objects will be queued and output together when all spiders are finished.
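
As a sketch (the selectors and the five-link cap are illustrative, not prescribed by Scrapyscript), here is a spider that mixes the two: the Requests it yields are followed inside the Job, while every dict is collected and returned by processor.run().

import scrapy

class LinksSpider(scrapy.Spider):
    name = "links"

    def start_requests(self):
        # self.url is supplied by Job(...) at runtime
        yield scrapy.Request(self.url)

    def parse(self, response):
        # dicts are queued and returned by processor.run()
        yield {"url": response.url}
        # Requests are consumed by Scrapy inside the Job
        for href in response.css("a::attr(href)").getall()[:5]:
            yield response.follow(href, callback=self.parse_link)

    def parse_link(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}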

Due to the way billiard handles communication between processes, each dict or Item must be pickle-able using pickle protocol 0. It's generally best to output dict objects from your Spider.
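
A quick way to check whether an item will survive the trip between processes is to try the same serialization yourself; this is a standalone sketch, not part of the Scrapyscript API.

import pickle

item = {"title": "Welcome to Python.org"}
# Raises TypeError or pickle.PicklingError if the object cannot be
# serialized with protocol 0, i.e. cannot cross the process boundary
pickle.dumps(item, protocol=0)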

Contributing

Updates, additional features, or bug fixes are always welcome.

Setup

  • Install Poetry
  • git clone git@github.com:jschnurr/scrapyscript.git
  • poetry install

Tests

  • make test or make tox

Version History

See CHANGELOG.md

License

The MIT License (MIT). See LICENCE file for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapyscript-1.1.5.tar.gz (6.0 kB)

Built Distribution

scrapyscript-1.1.5-py3-none-any.whl (5.9 kB)

File details

Details for the file scrapyscript-1.1.5.tar.gz.

File metadata

  • Download URL: scrapyscript-1.1.5.tar.gz
  • Size: 6.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.12 CPython/3.8.12 Linux/5.11.0-1021-azure

File hashes

Hashes for scrapyscript-1.1.5.tar.gz

  • SHA256: 1fb8f9edc81d83a0115eca4c57aa5f8f007c96a62ce7ceb5c01872e38955747e
  • MD5: d1ee1eebea67153447df07fdd8abc874
  • BLAKE2b-256: da3e84e42ebbc54a31849a9a4e6482b44a3d6ac691b8081e3f69d8ac3852642d


File details

Details for the file scrapyscript-1.1.5-py3-none-any.whl.

File metadata

  • Download URL: scrapyscript-1.1.5-py3-none-any.whl
  • Size: 5.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.12 CPython/3.8.12 Linux/5.11.0-1021-azure

File hashes

Hashes for scrapyscript-1.1.5-py3-none-any.whl

  • SHA256: 066762a40a76997afc2fbd1e92f205e4f982080b4392b736453bfa9b645b1f4a
  • MD5: 18f2995700bb81e18b515498625eb668
  • BLAKE2b-256: 6a66759a87d85893904824bbb2066c97624a5dad11427aca4861507bcaa0c6fc

