Selenium Components for Scrapy & Gerapy

Gerapy Selenium

This package adds Selenium support to Scrapy; it is also used as a module in Gerapy.

Installation

pip3 install gerapy-selenium

Usage

You can use SeleniumRequest to specify a request that should be rendered with Selenium.

For example:

yield SeleniumRequest(detail_url, callback=self.parse_detail)
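
In context, a minimal spider sketch could look like the following (the start URL and CSS selectors are illustrative, borrowed from the example project shown later, and are not part of the package):

from scrapy import Spider

from gerapy_selenium import SeleniumRequest


class BookSpider(Spider):
    name = 'book'

    def start_requests(self):
        # illustrative start URL taken from the example project below
        yield SeleniumRequest('https://dynamic5.scrape.center/page/1', callback=self.parse)

    def parse(self, response):
        # response.text contains the HTML rendered by Selenium
        for href in response.css('.item .name a::attr(href)').getall():
            yield SeleniumRequest(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # selector is illustrative and depends on the target page
        yield {'name': response.css('.name::text').get()}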

You also need to enable SeleniumMiddleware in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_selenium.downloadermiddlewares.SeleniumMiddleware': 543,
}

Congratulations, you've now finished all of the required configuration.

If you run the spider again, Selenium will start and render every web page whose request you configured as a SeleniumRequest.

Settings

GerapySelenium provides some optional settings.

Concurrency

You can directly use Scrapy's settings to control the concurrency of Selenium, for example:

CONCURRENT_REQUESTS = 3

Pretend as Real Browser

Some websites detect whether a browser is controlled by WebDriver or running headless. GerapySelenium can disguise Chromium as a normal browser by injecting scripts. This is enabled by default.

You can disable it to speed things up if the website does not detect WebDriver:

GERAPY_SELENIUM_PRETEND = False

You can also use the pretend attribute of SeleniumRequest to override this configuration per request.
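
For example, a one-off request that skips the injected scripts (reusing the request from the Usage section) might look like this:

yield SeleniumRequest(detail_url, callback=self.parse_detail, pretend=False)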

Logging Level

By default, Selenium logs all debug messages, so GerapySelenium sets Selenium's logging level to WARNING.

If you want to see more logs from Selenium, you can change this setting:

import logging
GERAPY_SELENIUM_LOGGING_LEVEL = logging.DEBUG

Download Timeout

Selenium may take some time to render the required web page. You can change the timeout with this setting (default is 30s):

# selenium timeout
GERAPY_SELENIUM_DOWNLOAD_TIMEOUT = 30

Headless

By default, Selenium runs in headless mode. You can set it to False if needed (default is True):

GERAPY_SELENIUM_HEADLESS = False 

Window Size

You can also set the width and height of Selenium window:

GERAPY_SELENIUM_WINDOW_WIDTH = 1400
GERAPY_SELENIUM_WINDOW_HEIGHT = 700

The defaults are 1400 and 700.

SeleniumRequest

SeleniumRequest provides arguments that can override the global settings above.

  • url: request URL
  • callback: callback
  • wait_for: wait for some element to load before returning, also supports a dict
  • script: script to execute after the page loads
  • proxy: use a proxy for this request, e.g. http://x.x.x.x:x
  • sleep: time to sleep after the page has loaded, overrides GERAPY_SELENIUM_SLEEP
  • timeout: load timeout, overrides GERAPY_SELENIUM_DOWNLOAD_TIMEOUT
  • pretend: pretend to be a normal browser, overrides GERAPY_SELENIUM_PRETEND
  • screenshot: take a screenshot of the rendered page, overrides GERAPY_SELENIUM_SCREENSHOT

For example, you can configure SeleniumRequest as:

from gerapy_selenium import SeleniumRequest

def parse(self, response):
    yield SeleniumRequest(url,
        callback=self.parse_detail,
        wait_for='title',
        script='console.log(document)',
        sleep=2)

Then Selenium will:

  • wait for title to load
  • execute console.log(document) script
  • sleep for 2s
  • return the rendered web page content
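
The other arguments can be overridden per request in the same way. As a sketch, the following combines a per-request proxy with a longer timeout (the proxy address is a placeholder):

yield SeleniumRequest(url,
    callback=self.parse_detail,
    proxy='http://127.0.0.1:7890',  # placeholder proxy address
    timeout=60)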

Example

For more detail, please see the example project.

You can also run it directly with Docker:

docker run germey/gerapy-selenium-example

Output:

2020-07-13 01:49:13 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-07-13 01:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-07-13 01:49:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'CONCURRENT_REQUESTS': 3,
 'NEWSPIDER_MODULE': 'example.spiders',
 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
 'SPIDER_MODULES': ['example.spiders']}
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 83c276fb41754bd0
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'gerapy_selenium.downloadermiddlewares.SeleniumMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-13 01:49:13 [scrapy.core.engine] INFO: Spider opened
2020-07-13 01:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-13 01:49:13 [example.spiders.book] INFO: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:13 [gerapy.selenium] DEBUG: processing request <GET https://dynamic5.scrape.center/page/1>
2020-07-13 01:49:13 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:14 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:19 [gerapy.selenium] DEBUG: waiting for .item .name finished
2020-07-13 01:49:20 [gerapy.selenium] DEBUG: wait for .item .name finished
2020-07-13 01:49:20 [gerapy.selenium] DEBUG: close selenium
2020-07-13 01:49:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/page/1> (referer: None)
2020-07-13 01:49:20 [gerapy.selenium] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26898909>
2020-07-13 01:49:20 [gerapy.selenium] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26861389>
2020-07-13 01:49:20 [gerapy.selenium] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26855315>
2020-07-13 01:49:20 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:20 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/detail/26855315
2020-07-13 01:49:21 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/detail/26861389
2020-07-13 01:49:21 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/detail/26898909
2020-07-13 01:49:24 [gerapy.selenium] DEBUG: waiting for .item .name finished
2020-07-13 01:49:24 [gerapy.selenium] DEBUG: wait for .item .name finished
2020-07-13 01:49:24 [gerapy.selenium] DEBUG: close selenium
2020-07-13 01:49:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/detail/26861389> (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:24 [gerapy.selenium] DEBUG: processing request <GET https://dynamic5.scrape.center/page/2>
2020-07-13 01:49:24 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26861389>
{'name': '壁穴ヘブンホール',
 'score': '5.6',
 'tags': ['BL漫画', '小基漫', 'BL', '『又腐又基』', 'BLコミック']}
2020-07-13 01:49:25 [gerapy.selenium] DEBUG: waiting for .item .name finished
2020-07-13 01:49:25 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/page/2
2020-07-13 01:49:26 [gerapy.selenium] DEBUG: wait for .item .name finished
2020-07-13 01:49:26 [gerapy.selenium] DEBUG: close selenium
2020-07-13 01:49:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/detail/26855315> (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:26 [gerapy.selenium] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/27047626>
2020-07-13 01:49:26 [gerapy.selenium] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26855315>
{'name': '冒险小虎队', 'score': '9.4', 'tags': ['冒险小虎队', '童年', '冒险', '推理', '小时候读的']}
2020-07-13 01:49:26 [gerapy.selenium] DEBUG: waiting for .item .name finished
2020-07-13 01:49:26 [gerapy.selenium] DEBUG: crawling https://dynamic5.scrape.center/detail/27047626
2020-07-13 01:49:27 [gerapy.selenium] DEBUG: wait for .item .name finished
2020-07-13 01:49:27 [gerapy.selenium] DEBUG: close selenium
...

