Pyppeteer Components for Scrapy & Gerapy

Project description

Gerapy Pyppeteer

This package adds Pyppeteer support to Scrapy. It is also a module of Gerapy.

Installation

pip3 install gerapy-pyppeteer

Usage

You can use PyppeteerRequest to specify a request that should be rendered with Pyppeteer.

For example:

from gerapy_pyppeteer import PyppeteerRequest

yield PyppeteerRequest(detail_url, callback=self.parse_detail)
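For context, here is a minimal spider sketch built around that request. The spider name and selectors are illustrative assumptions, and the wait_for parameter is inferred from the "waiting for .item .name" lines in the example log output further below; check the example project for the exact spider.

from scrapy import Spider
from gerapy_pyppeteer import PyppeteerRequest


class BookSpider(Spider):
    # hypothetical spider for illustration
    name = 'book'

    def start_requests(self):
        # wait_for delays the response until the CSS selector appears,
        # matching the "waiting for .item .name" lines in the log below
        yield PyppeteerRequest('https://dynamic5.scrape.center/page/1',
                               callback=self.parse, wait_for='.item .name')

    def parse(self, response):
        # the rendered HTML arrives as a normal Scrapy response
        for href in response.css('.item .name a::attr(href)').getall():
            yield PyppeteerRequest(response.urljoin(href),
                                   callback=self.parse_detail,
                                   wait_for='.item .name')

    def parse_detail(self, response):
        yield {'name': response.css('.name::text').get()}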

You also need to enable PyppeteerMiddleware in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware': 543,
}

Other optional settings:

import logging

# pyppeteer logging level
GERAPY_PYPPETEER_LOGGING_LEVEL = logging.WARNING

# pyppeteer download timeout in seconds
GERAPY_PYPPETEER_DOWNLOAD_TIMEOUT = 30

# pyppeteer browser window
GERAPY_PYPPETEER_WINDOW_WIDTH = 1400
GERAPY_PYPPETEER_WINDOW_HEIGHT = 700

# pyppeteer settings
GERAPY_PYPPETEER_HEADLESS = True
GERAPY_PYPPETEER_DUMPIO = False
GERAPY_PYPPETEER_DEVTOOLS = False
GERAPY_PYPPETEER_EXECUTABLE_PATH = None
GERAPY_PYPPETEER_DISABLE_EXTENSIONS = True
GERAPY_PYPPETEER_HIDE_SCROLLBARS = True
GERAPY_PYPPETEER_MUTE_AUDIO = True
GERAPY_PYPPETEER_NO_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX = True
GERAPY_PYPPETEER_DISABLE_GPU = True
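The "set options {...}" DEBUG lines in the example output below show how these settings end up as pyppeteer launch options. Roughly, the mapping works like the following sketch, which approximates the behavior observable in the log rather than the middleware's actual source:

def build_launch_options(settings):
    # window size comes from the width/height settings
    args = ['--window-size=%d,%d' % (
        settings.getint('GERAPY_PYPPETEER_WINDOW_WIDTH', 1400),
        settings.getint('GERAPY_PYPPETEER_WINDOW_HEIGHT', 700))]
    # each boolean setting toggles one Chromium command-line flag
    flags = [
        ('GERAPY_PYPPETEER_DISABLE_EXTENSIONS', '--disable-extensions'),
        ('GERAPY_PYPPETEER_HIDE_SCROLLBARS', '--hide-scrollbars'),
        ('GERAPY_PYPPETEER_MUTE_AUDIO', '--mute-audio'),
        ('GERAPY_PYPPETEER_NO_SANDBOX', '--no-sandbox'),
        ('GERAPY_PYPPETEER_DISABLE_SETUID_SANDBOX', '--disable-setuid-sandbox'),
        ('GERAPY_PYPPETEER_DISABLE_GPU', '--disable-gpu'),
    ]
    args += [flag for name, flag in flags if settings.getbool(name, True)]
    # the resulting dict matches the "set options" DEBUG lines below
    return {
        'headless': settings.getbool('GERAPY_PYPPETEER_HEADLESS', True),
        'dumpio': settings.getbool('GERAPY_PYPPETEER_DUMPIO', False),
        'devtools': settings.getbool('GERAPY_PYPPETEER_DEVTOOLS', False),
        'args': args,
    }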

Example

For more detail, please see the example project.

You can also run the example directly with Docker:

docker run germey/gerapy-pyppeteer-example

Outputs:

2020-07-13 01:49:13 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2020-07-13 01:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-07-13 01:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2020-07-13 01:49:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'CONCURRENT_REQUESTS': 3,
 'NEWSPIDER_MODULE': 'example.spiders',
 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
 'SPIDER_MODULES': ['example.spiders']}
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 83c276fb41754bd0
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'gerapy_pyppeteer.downloadermiddlewares.PyppeteerMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-13 01:49:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-13 01:49:13 [scrapy.core.engine] INFO: Spider opened
2020-07-13 01:49:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-13 01:49:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-13 01:49:13 [example.spiders.book] INFO: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/page/1>
2020-07-13 01:49:13 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:14 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/1
2020-07-13 01:49:19 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/page/1> (referer: None)
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26898909>
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26861389>
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/26855315>
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:20 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26855315
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26861389
2020-07-13 01:49:21 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/26898909
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/detail/26861389> (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/page/2>
2020-07-13 01:49:24 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26861389>
{'name': '壁穴ヘブンホール',
 'score': '5.6',
 'tags': ['BL漫画', '小基漫', 'BL', '『又腐又基』', 'BLコミック']}
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:25 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/page/2
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: close pyppeteer
2020-07-13 01:49:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dynamic5.scrape.center/detail/26855315> (referer: https://dynamic5.scrape.center/page/1)
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: processing request <GET https://dynamic5.scrape.center/detail/27047626>
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: set options {'headless': True, 'dumpio': False, 'devtools': False, 'args': ['--window-size=1400,700', '--disable-extensions', '--hide-scrollbars', '--mute-audio', '--no-sandbox', '--disable-setuid-sandbox', '--disable-gpu']}
2020-07-13 01:49:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://dynamic5.scrape.center/detail/26855315>
{'name': '冒险小虎队', 'score': '9.4', 'tags': ['冒险小虎队', '童年', '冒险', '推理', '小时候读的']}
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: waiting for .item .name finished
2020-07-13 01:49:26 [gerapy.pyppeteer] DEBUG: crawling https://dynamic5.scrape.center/detail/27047626
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: wait for .item .name finished
2020-07-13 01:49:27 [gerapy.pyppeteer] DEBUG: close pyppeteer
...

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gerapy-pyppeteer-0.0.4.tar.gz (8.8 kB)

Uploaded: Source

Built Distribution

gerapy_pyppeteer-0.0.4-py2.py3-none-any.whl (8.2 kB)

Uploaded: Python 2, Python 3

File details

Details for the file gerapy-pyppeteer-0.0.4.tar.gz.

File metadata

  • Download URL: gerapy-pyppeteer-0.0.4.tar.gz
  • Upload date:
  • Size: 8.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.3

File hashes

Hashes for gerapy-pyppeteer-0.0.4.tar.gz

  • SHA256: a63d9db256ffd7daa1c7da4ed288b25d3607b784e9891fa188a9df7222471e8f
  • MD5: cbc409985564fe51ee1b93a0fd07bb23
  • BLAKE2b-256: 80d8c9eedb1ff7dacc5fcc36e3d2c1983fdedea8f7eca1c1b081597a6704e8e3

See more details on using hashes here.
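To check a downloaded file against the SHA256 digest listed above, the standard library is enough (the filename assumes the sdist sits in the current directory):

import hashlib

# print the SHA256 of the downloaded sdist and compare it with the list above
with open('gerapy-pyppeteer-0.0.4.tar.gz', 'rb') as f:
    print(hashlib.sha256(f.read()).hexdigest())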


File details

Details for the file gerapy_pyppeteer-0.0.4-py2.py3-none-any.whl.

File metadata

  • Download URL: gerapy_pyppeteer-0.0.4-py2.py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.3

File hashes

Hashes for gerapy_pyppeteer-0.0.4-py2.py3-none-any.whl

  • SHA256: d49797e15bd3c59030a8ddcd3e44e4d0ca870bb3f6bf60d2fcfb9e371a65370d
  • MD5: 7599ec1dd88293a7050a82d1a7ee1d71
  • BLAKE2b-256: df1751e96633abd45acd551b707a7c64b22d9921e98572b2c5782a8214fa0c9e

See more details on using hashes here.

