Scrapy Dynamic Spiders

Dynamically generate spider subclasses. Run crawls sequentially with crochet. Do both.
Scrapy configures its Crawler objects using class variables associated with each Spider class. Most of these can be meaningfully changed using the Spider's constructor, or even with other Spider class methods. However, some class variables, most notably custom_settings, are accessed before Spider instantiation. Crawl rules are similarly accessed before spider instantiation, though a protected method does allow for rule re-compilation with somewhat unpredictable results. In certain cases, this can lead to the creation of large numbers of subclasses made solely to handle slightly different websites or use cases.
Dynamic Spiders offers a simple, extensible API for generating spider subclasses based on existing ones, as well as an API for running crawls in a synchronous environment (using Crochet).
Use Cases
- Mutate your pipeline without putting a million boolean flags in your code
- Rapidly generate large numbers of spider subclasses based on external data
- Run crawls from within a blocking event loop
- Run crawls tailored to user input
The APIs
What follows is a brief overview of the provided APIs.
SpiderClsFactory
class scrapy_dynamic_spiders.factories.SpiderClsFactory(custom_settings=None, settings_ow=False)
This is the basic SpiderClsFactory, which can handle Spider, XmlFeedSpider, CsvFeedSpider, and SitemapSpider subclasses. It allows for dynamic definition of `custom_settings`, which is the only class variable in these generic spider classes that takes effect before instantiation.
Attributes
custom_settings
- a custom_settings dictionary.

settings_ow
- determines how custom_settings are handled. If `True`, the provided settings overwrite any custom settings native to the template class. If `False`, the provided settings are merged with a copy of the native settings, with preference for the provided settings.
Methods
construct_spider(self, spidercls) -> type
- returns a subclass of the template `spidercls` with its `custom_settings` class variable customized based on the Factory's current attributes.
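For example, a subclass with tweaked settings might be generated like the minimal sketch below; `MySpider` and its import path are hypothetical placeholders for an existing spider in your project:

```python
from scrapy_dynamic_spiders.factories import SpiderClsFactory

from myproject.spiders import MySpider  # hypothetical existing spider

# Merge a download delay into whatever custom_settings MySpider already defines.
factory = SpiderClsFactory(
    custom_settings={'DOWNLOAD_DELAY': 2.0},
    settings_ow=False,  # merge with the template's settings rather than replace them
)

ThrottledSpider = factory.construct_spider(MySpider)  # a new subclass of MySpider
assert issubclass(ThrottledSpider, MySpider)
```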
Extension
You should subclass SpiderClsFactory if you wish to customize class variables besides `custom_settings`. This is done in the `construct_spider` method, which you will therefore need to override. There is one relevant private method which you may wish to use.
_construct_custom_settings(self, spidercls) -> dict
- returns a new `custom_settings` dictionary based on the existing `custom_settings` and the Factory's current attributes.
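A factory subclass might look roughly like the following sketch, which also sets `allowed_domains`; the subclass name and the attribute handling are illustrative assumptions, not part of the library:

```python
from scrapy_dynamic_spiders.factories import SpiderClsFactory


class DomainSpiderClsFactory(SpiderClsFactory):
    """Hypothetical factory that also customizes allowed_domains."""

    def __init__(self, allowed_domains=None, **kwargs):
        super().__init__(**kwargs)
        self.allowed_domains = allowed_domains or []

    def construct_spider(self, spidercls) -> type:
        # Reuse the documented private helper to build the merged settings dict.
        custom_settings = self._construct_custom_settings(spidercls)
        return type(
            spidercls.__name__ + 'Dynamic',
            (spidercls,),
            {
                'custom_settings': custom_settings,
                'allowed_domains': list(self.allowed_domains),
            },
        )
```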
CrawlSpiderClsFactory
Inherits from SpiderClsFactory.
class scrapy_dynamic_spiders.factories.CrawlSpiderClsFactory(custom_settings=None, settings_ow=False, extractor_configs=None, rule_configs=None, rule_ow=False)
This subclass handles rule construction for CrawlSpiders and CrawlSpider subclasses. It uses two lists of dictionaries, `extractor_configs` and `rule_configs`, to create rules dynamically. Each `extractor_config` dictionary provides kwargs to a LinkExtractor, whose corresponding Rule is given keyword arguments by the dictionary at the matching index position in `rule_configs`. If there are fewer `extractor_configs` than `rule_configs`, the last `extractor_config` is used for all extra `rule_configs`.
Attributes
extractor_configs
- a list of dictionaries. Each dictionary should contain kwargs for a LinkExtractor object.

rule_configs
- a list of dictionaries. Each dictionary should contain kwargs for a Rule object.

rule_ow
- if `True`, any existing rules are replaced by the dynamically generated rules. If `False`, new rules are appended to a copy of the existing rule list.
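Used together, the two config lists might look like the sketch below; `MyCrawlSpider`, the URL pattern, and the callback name are placeholders:

```python
from scrapy_dynamic_spiders.factories import CrawlSpiderClsFactory

from myproject.spiders import MyCrawlSpider  # hypothetical CrawlSpider subclass

factory = CrawlSpiderClsFactory(
    extractor_configs=[{'allow': r'/catalogue/.+'}],             # LinkExtractor kwargs
    rule_configs=[{'callback': 'parse_item', 'follow': True}],   # Rule kwargs
    rule_ow=True,  # replace any rules defined on the template class
)

CatalogueSpider = factory.construct_spider(MyCrawlSpider)
```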
Extension
Extending this subclass should be similar to extending SpiderClsFactory. There is one additional private method which you may wish to use.

_construct_rule_list(self, spidercls) -> List[Rule]
- returns a new list of Rule objects based on those in the template `spidercls` and the Factory's current attributes.
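As with the base class, an extension might override `construct_spider` and lean on the private helpers, roughly like this hypothetical sketch:

```python
from scrapy_dynamic_spiders.factories import CrawlSpiderClsFactory


class StartUrlCrawlFactory(CrawlSpiderClsFactory):
    """Hypothetical factory that also pins a start_urls list on the subclass."""

    def __init__(self, start_urls=None, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = start_urls or []

    def construct_spider(self, spidercls) -> type:
        return type(spidercls.__name__ + 'Dynamic', (spidercls,), {
            'custom_settings': self._construct_custom_settings(spidercls),
            'rules': self._construct_rule_list(spidercls),
            'start_urls': list(self.start_urls),
        })
```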
SpiderWrangler
class scrapy_dynamic_spiders.wranglers.SpiderWrangler(settings, spidercls=None, clsfactory=None, gen_spiders=True)
This class allows you to run scrapy crawls sequentially, within a synchronous context. It does this through a crochet-managed Twisted Reactor. Because of this, it executes crawls using a composited CrawlerRunner object, rather than a CrawlerProcess object. SpiderWranglers offer built-in support for generating spider subclasses using a SpiderClsFactory, though this behavior is optional.
Parameters
settings
- a Scrapy Settings object. Required to initialize the composited CrawlerRunner.
Attributes
gen_spiders
- If `True`, crawls made using the `start_crawl` method will use a spider subclass generated using the composited SpiderClsFactory. If `False`, crawls will use the Spider class referenced in the Wrangler's `spidercls` attribute. Attempting to crawl with `gen_spiders=True` without a `clsfactory` raises an error.

spidercls
- The Spider class which will be used for crawling or for generating subclasses for crawling.

clsfactory
- A SpiderClsFactory or subclass instance.
Methods
start_crawl(self, *args, **kwargs) -> Deferred
- Initiates a crawl which blocks in a synchronous context. Unlike Scrapy's Core API, this allows you to perform multiple crawls in sequence without having to manually manipulate your Reactor. `*args` and `**kwargs` are passed to the Spider's constructor. The Spider class used is either `spidercls` or a dynamically-generated subclass of it.
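A typical sequential-crawl session might look like the sketch below; the spider, its keyword arguments, and the settings values are placeholders:

```python
from scrapy.utils.project import get_project_settings

from scrapy_dynamic_spiders.factories import SpiderClsFactory
from scrapy_dynamic_spiders.wranglers import SpiderWrangler
from myproject.spiders import MySpider  # hypothetical template spider

settings = get_project_settings()
factory = SpiderClsFactory(custom_settings={'CLOSESPIDER_PAGECOUNT': 50})

wrangler = SpiderWrangler(
    settings,
    spidercls=MySpider,
    clsfactory=factory,
    gen_spiders=True,  # each crawl uses a freshly generated subclass
)

# Each call blocks until its crawl finishes, so the crawls run one after another.
wrangler.start_crawl(query='laptops')
wrangler.start_crawl(query='phones')
```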
Extension
Extension of SpiderWrangler is generally straightforward. However, modifying the behavior of `start_crawl` requires extra attention. `start_crawl` is decorated using crochet's `wait_for` decorator. To modify the timeout specified by this decorator, you will need to override the existing `start_crawl` method. To simplify this, the primary functionality of `start_crawl` is encapsulated in the private method `_start_crawl`. Changing the behavior of the decorator thus requires overriding `start_crawl` with a new instance of the `wait_for` decorator and calling `_start_crawl` in the body of the function.
_start_crawl(self, *args, **kwargs) -> Deferred
- Wraps the actual functionality of `start_crawl`. Return the value returned by this method, or a mutation of it, when overriding `start_crawl`.
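For instance, a subclass with a longer timeout might be sketched as follows; the 30-minute value is an arbitrary example, and the decorator usage mirrors crochet's documented `wait_for`:

```python
from crochet import wait_for

from scrapy_dynamic_spiders.wranglers import SpiderWrangler


class PatientWrangler(SpiderWrangler):
    """Hypothetical wrangler that waits up to 30 minutes per crawl."""

    @wait_for(timeout=1800)  # seconds; replaces the decorator on the parent method
    def start_crawl(self, *args, **kwargs):
        # Delegate to the private helper and return its Deferred unchanged.
        return self._start_crawl(*args, **kwargs)
```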
Tests
Run tests like so:

python -m unittest tests
Known Issues
- Running tests results in unclosed sockets. Currently investigating.
Changelog
v1.0.0a1
- Initial alpha release