
Download PDF or Save page as PDF


A PDF download function for Scrapy.

Installation

Install scrapy-save-as-pdf using pip::

$ pip install scrapy-save-as-pdf

Configuration

  1. Add the following settings to the settings.py of your Scrapy project:
PROXY = ""
CHROME_DRIVER_PATH = '/snap/bin/chromium.chromedriver'
PDF_SAVE_PATH = "./pdfs"
PDF_SAVE_AS_PDF = False
PDF_DOWNLOAD_TIMEOUT = 60
PDF_PRINT_OPTIONS = {
    'landscape': False,
    'displayHeaderFooter': False,
    'printBackground': True,
    'preferCSSPageSize': True,
}
  2. Enable the pipeline by adding it to ITEM_PIPELINES in your settings.py file:
ITEM_PIPELINES = {
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': -1,
}

The pipeline's order value should place it before your persistence pipelines (such as one that saves to a database) and after your preprocessing pipelines.
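The ordering rule above can be sketched as follows. Scrapy runs item pipelines from the lowest order value to the highest. The two `myproject.pipelines.*` entries, the specific order values, and the `pipeline_order` helper are hypothetical and exist only for illustration; only the `SaveAsPdfPipeline` path comes from this project.

```python
# Sketch of a settings.py fragment. Scrapy passes items through pipelines
# in ascending order of these values (lower runs first).
ITEM_PIPELINES = {
    # Hypothetical preprocessing pipeline: runs first.
    "myproject.pipelines.CleanItemPipeline": 100,
    # The PDF pipeline from this package: runs in the middle.
    "scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline": 200,
    # Hypothetical persistence pipeline: runs last, after pdf_url is updated.
    "myproject.pipelines.DatabasePipeline": 300,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by order value (illustrative helper)."""
    return sorted(pipelines, key=pipelines.get)


if __name__ == "__main__":
    for path in pipeline_order(ITEM_PIPELINES):
        print(path)
```

With this layout, the persistence pipeline sees items only after the PDF pipeline has processed them.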

Usage

Set the pdf_url and/or url field in the items you yield:

import scrapy

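class MySpider(scrapy.Spider):
    # Scrapy requires every spider to have a name; "my_spider" is a placeholder.
    name = "my_spider"
    start_urls = [
        "http://example.com",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # Each item carries the page URL and/or a direct PDF URL for the pipeline.
        yield {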
            "url": "http://example.com/cate1/page1.html",
            "pdf_url": "http://example.com/cate1/page1.pdf",
        }
        yield {
            "url": "http://example.com/cate1/page2.html",
            "pdf_url": "http://example.com/cate1/page2.pdf",
        }

The pipeline populates the pdf_url field with the location of the downloaded PDF file. If pdf_url already held a value, that value is moved to the origin_pdf_url field. You can handle both fields in your next pipeline.
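A follow-up pipeline might read both fields along these lines. This is a minimal sketch, assuming items are plain dicts as in the spider above; the class name `RecordPdfPipeline` and the `pdf_saved` field are hypothetical and not part of this package. Only the `process_item(item, spider)` method shape is Scrapy's standard pipeline interface.

```python
import logging

logger = logging.getLogger(__name__)


class RecordPdfPipeline:
    """Hypothetical pipeline placed after SaveAsPdfPipeline in ITEM_PIPELINES."""

    def process_item(self, item, spider):
        # Per the description above, pdf_url should now hold the saved file location.
        local_path = item.get("pdf_url")
        # The original remote PDF URL, if the item had one before processing.
        remote_url = item.get("origin_pdf_url")

        if local_path:
            source = remote_url or item.get("url")
            logger.info("Saved %s to %s", source, local_path)

        # Hypothetical flag for later pipelines; not set by scrapy-save-as-pdf.
        item["pdf_saved"] = bool(local_path)
        return item
```

Registering such a class with a higher order value than SaveAsPdfPipeline would make it run afterwards.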

Getting help

Please open an issue on GitHub.

Contributing

PRs are always welcome.

Changes

0.1.0 (2020-12-25)

Initial release
