Pipeline to Download PDF or Save page as PDF for scrapy item

Project description

Installation

Install scrapy-save-as-pdf using pip:

pip install scrapy-save-as-pdf

Configuration

  1. (Optional) If you want to use WEBDRIVER_HUB_URL, you can use Docker to set one up like this:
docker run -d -p 4444:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome:4.0.0-alpha-7-20201119

Then the WEBDRIVER_HUB_URL value is http://docker_host_ip:4444/wd/hub. We often debug on the local host, so we use http://127.0.0.1:4444/wd/hub; a quick smoke test for the hub is sketched below.
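To confirm the hub is reachable before pointing the pipeline at it, here is a minimal smoke test (my own sketch using the plain Selenium client, not part of the plugin):

# Smoke test: open a page through the remote hub and print its title.
# Requires the `selenium` package; uses the Selenium 3-era Remote API.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

driver = webdriver.Remote(
    command_executor="http://127.0.0.1:4444/wd/hub",
    desired_capabilities=DesiredCapabilities.CHROME,
)
try:
    driver.get("http://example.com")
    print(driver.title)  # expect "Example Domain"
finally:
    driver.quit()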

  2. Add the following to the settings.py of your Scrapy project:
PROXY = ""  # optional proxy for the browser; leave empty to disable
CHROME_DRIVER_PATH = '/snap/bin/chromium.chromedriver'  # path to a local chromedriver binary
PDF_SAVE_PATH = "./pdfs"  # directory where the PDF files are written
PDF_SAVE_AS_PDF = False  # toggle the "save page as PDF" behaviour
PDF_DOWNLOAD_TIMEOUT = 60  # seconds to wait for a PDF download
PDF_PRINT_OPTIONS = {  # options passed to Chrome's print-to-PDF
    'landscape': False,
    'displayHeaderFooter': False,
    'printBackground': True,
    'preferCSSPageSize': True,
}
WEBDRIVER_HUB_URL = 'http://127.0.0.1:4444/wd/hub'  # remote Selenium hub (optional)

If both WEBDRIVER_HUB_URL and CHROME_DRIVER_PATH are set, WEBDRIVER_HUB_URL takes precedence, as in the sketch below.
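A hedged sketch of that precedence rule (illustrative only, not the plugin's actual code):

# A remote hub wins over a local chromedriver when both are configured.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def make_driver(settings):
    hub_url = settings.get("WEBDRIVER_HUB_URL")
    if hub_url:  # remote hub takes priority when both are set
        return webdriver.Remote(
            command_executor=hub_url,
            desired_capabilities=DesiredCapabilities.CHROME,
        )
    return webdriver.Chrome(executable_path=settings.get("CHROME_DRIVER_PATH"))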

  3. Enable the pipeline by adding it to ITEM_PIPELINES in your settings.py file and choosing a priority:
ITEM_PIPELINES = {
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': -1,
}

This pipeline should run before your persistence pipeline (for example, one that saves to a database) and after your preprocessing pipeline.

In the demo Scrapy project, I put the SaveToQiniuPipeline after this plugin to persist the PDFs to the cloud.
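A possible arrangement (PreprocessPipeline and the myproject paths are hypothetical; lower numbers run earlier):

ITEM_PIPELINES = {
    'myproject.pipelines.PreprocessPipeline': 100,          # hypothetical preprocessing step
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': 200,  # downloads or renders the PDF
    'myproject.pipelines.SaveToQiniuPipeline': 300,         # persistence runs last
}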

Usage

Set the pdf_url and/or url field in your yielded items:

import scrapy

class MySpider(scrapy.Spider):
    start_urls = [
        "http://example.com",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        yield {
            "url": "http://example.com/cate1/page1.html",
            "pdf_url": "http://example.com/cate1/page1.pdf",
        }
        yield {
            "url": "http://example.com/cate1/page2.html",
            "pdf_url": "http://example.com/cate1/page2.pdf",
        }

The pdf_url field will be populated with the location of the downloaded PDF file. If pdf_url already had a value, that value is moved to the origin_pdf_url field; you can handle both in your next pipeline.
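As an illustration, a minimal follow-up pipeline might look like this (ArchivePdfPipeline is my own sketch, assuming the populated pdf_url is a local file path):

import shutil
from pathlib import Path

class ArchivePdfPipeline:
    """Hypothetical downstream pipeline: copies each downloaded PDF to ./archive."""

    def open_spider(self, spider):
        self.archive_dir = Path("./archive")
        self.archive_dir.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        local_path = item.get("pdf_url")           # now a local file path
        original_url = item.get("origin_pdf_url")  # the old value, if any
        if original_url:
            spider.logger.debug("original pdf url was %s", original_url)
        if local_path:
            shutil.copy(local_path, self.archive_dir)
        return item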

Getting help

Please use GitHub Issues.

Contributing

PRs are always welcome.

Changes

0.1.0 (2020-12-25)

Initial release

Download files

Download the file for your platform.

Source Distribution

scrapy-save-as-pdf-0.2.1.tar.gz (4.3 kB)


Built Distribution

scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl (4.4 kB)


File details

Details for the file scrapy-save-as-pdf-0.2.1.tar.gz.

File metadata

  • Download URL: scrapy-save-as-pdf-0.2.1.tar.gz
  • Size: 4.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5

File hashes

Hashes for scrapy-save-as-pdf-0.2.1.tar.gz
Algorithm    Hash digest
SHA256       f82b5749638847854744334d2e2458f8906a0f55411d03295208dfbaa0e45b82
MD5          2d5d4d8d232674210f710f84e1ad5f81
BLAKE2b-256  2efd50076cee9f90dadad9a1394039ddcfb01f33fdcd375c88ba06922439ba80
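To verify a downloaded archive against the digests above, a small Python check (assuming the file sits in the current directory):

import hashlib

def sha256_of(path):
    """Stream the file so large archives don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "f82b5749638847854744334d2e2458f8906a0f55411d03295208dfbaa0e45b82"
assert sha256_of("scrapy-save-as-pdf-0.2.1.tar.gz") == expected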


File details

Details for the file scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl.

File metadata

  • Download URL: scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl
  • Size: 4.4 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5

File hashes

Hashes for scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl
Algorithm    Hash digest
SHA256       520bf291d0fb2d7f5a874ec78c91ad40b766ca88b14c57d503eb290744cc7655
MD5          f4bf9a22ac292e1503eb46bb701f48ff
BLAKE2b-256  46f2f22008a74342410ffb71075ae7260020a84b27626f659f9e29b7c77e2d14

