Pipeline to Download PDF or Save page as PDF for scrapy item
Installation
Install scrapy-save-as-pdf using pip:

pip install scrapy-save-as-pdf
Configuration
- (Optional) If you want to use WEBDRIVER_HUB_URL, you can use docker to set one up like this:

docker run -d -p 4444:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome:4.0.0-alpha-7-20201119

The WEBDRIVER_HUB_URL value is then http://docker_host_ip:4444/wd/hub. Since we often debug on the local host, we use http://127.0.0.1:4444/wd/hub.
- Add the following to the settings.py of your Scrapy project:
PROXY = ""
CHROME_DRIVER_PATH = '/snap/bin/chromium.chromedriver'
PDF_SAVE_PATH = "./pdfs"
PDF_SAVE_AS_PDF = False
PDF_DOWNLOAD_TIMEOUT = 60
PDF_PRINT_OPTIONS = {
    'landscape': False,
    'displayHeaderFooter': False,
    'printBackground': True,
    'preferCSSPageSize': True,
}
WEBDRIVER_HUB_URL = 'http://127.0.0.1:4444/wd/hub'
If both WEBDRIVER_HUB_URL and CHROME_DRIVER_PATH are set, WEBDRIVER_HUB_URL is used.
- Enable the pipeline by adding it to ITEM_PIPELINES in your settings.py file and setting its priority:
ITEM_PIPELINES = {
'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': -1,
}
This pipeline should run before your persistence pipeline (for example, one that saves to a database) and after your preprocessing pipeline. In the demo Scrapy project, I put the SaveToQiniuPipeline after this plugin to persist the PDF to the cloud.
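The ordering described above can be sketched as follows; MyPreprocessPipeline and the priority numbers are illustrative assumptions, not part of this package (Scrapy runs pipelines with lower numbers first):

```python
ITEM_PIPELINES = {
    # hypothetical preprocessing pipeline: runs first
    "myproject.pipelines.MyPreprocessPipeline": 100,
    # downloads the PDF or renders the page as PDF
    "scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline": 200,
    # persistence pipeline (from the demo project): runs last
    "myproject.pipelines.SaveToQiniuPipeline": 300,
}
```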
Usage
Set the pdf_url and/or url field in your yielded item:
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        "http://example.com",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        yield {
            "url": "http://example.com/cate1/page1.html",
            "pdf_url": "http://example.com/cate1/page1.pdf",
        }
        yield {
            "url": "http://example.com/cate1/page2.html",
            "pdf_url": "http://example.com/cate1/page2.pdf",
        }
The pdf_url field will be populated with the location of the downloaded PDF file. If pdf_url already had a value, that value is moved to the origin_pdf_url field; you can handle both fields in your next pipeline.
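A minimal sketch of a downstream pipeline consuming those two fields; PersistPdfPipeline and the stored_pdf key are hypothetical names for illustration, not part of this package:

```python
class PersistPdfPipeline:
    """Hypothetical follow-up pipeline that records where the PDF ended up."""

    def process_item(self, item, spider):
        local_path = item.get("pdf_url")           # local file path written by SaveAsPdfPipeline
        original_url = item.get("origin_pdf_url")  # original remote URL, if one existed
        if local_path:
            # e.g. upload local_path to cloud storage here, then record both values
            item["stored_pdf"] = {"local": local_path, "source": original_url}
        return item
```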
Getting help
Please use GitHub issues.
Contributing
PRs are always welcome.
Changes
0.1.0 (2020-12-25)
Initial release