Pipeline to download a PDF or save the page as a PDF for a Scrapy item
Installation
Install scrapy-save-as-pdf using pip:
pip install scrapy-save-as-pdf
Configuration
- (Optional) If you want to use WEBDRIVER_HUB_URL, you can use docker to set one up like this:
docker run -d -p 4444:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome:4.0.0-alpha-7-20201119
The WEBDRIVER_HUB_URL value is then http://docker_host_ip:4444/wd/hub. Since we often debug on the local host, we use http://127.0.0.1:4444/wd/hub.
- Add the following to the settings.py of your Scrapy project:
PROXY = ""
CHROME_DRIVER_PATH = '/snap/bin/chromium.chromedriver'
PDF_SAVE_PATH = "./pdfs"
PDF_SAVE_AS_PDF = False
PDF_DOWNLOAD_TIMEOUT = 60
PDF_PRINT_OPTIONS = {
    'landscape': False,
    'displayHeaderFooter': False,
    'printBackground': True,
    'preferCSSPageSize': True,
}
WEBDRIVER_HUB_URL = 'http://127.0.0.1:4444/wd/hub'
If both WEBDRIVER_HUB_URL and CHROME_DRIVER_PATH are set, WEBDRIVER_HUB_URL is used.
- Enable the pipeline by adding it to ITEM_PIPELINES in your settings.py file and setting its priority:
ITEM_PIPELINES = {
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': -1,
}
This pipeline should run before your persistence pipeline (for example, one that saves to a database) and after your preprocessing pipeline. In the demo Scrapy project, I put the SaveToQiniuPipeline after this plugin to persist the PDF to the cloud.
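For example, a priority layout like the one below keeps that ordering. The preprocess and persistence pipelines named here are hypothetical placeholders for your own classes; only SaveAsPdfPipeline comes from this package:
ITEM_PIPELINES = {
    # hypothetical: cleans and normalizes item fields before the PDF step
    'myproject.pipelines.PreprocessPipeline': 100,
    # from this package: downloads pdf_url or renders the page to PDF
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': 300,
    # hypothetical: saves the item (including the local PDF path) to storage
    'myproject.pipelines.SaveToDatabasePipeline': 800,
}
Lower numbers run earlier, so preprocessing happens first, then the PDF is fetched, then the item is persisted.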
Usage
Set the pdf_url and/or url field in your yielded items:
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"  # a spider needs a name; pick any unique one
    start_urls = [
        "http://example.com",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        yield {
            "url": "http://example.com/cate1/page1.html",
            "pdf_url": "http://example.com/cate1/page1.pdf",
        }
        yield {
            "url": "http://example.com/cate1/page2.html",
            "pdf_url": "http://example.com/cate1/page2.pdf",
        }
The pdf_url field will be populated with the location of the downloaded PDF file. If pdf_url already had a value, that value is moved to the origin_pdf_url field, and you can handle both fields in your next pipeline.
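A minimal sketch of such a follow-up pipeline, assuming the dict items from the spider above (the class name and logging are illustrative, not part of this package):
class PersistPdfPipeline:
    def process_item(self, item, spider):
        local_path = item.get("pdf_url")            # local file path written by SaveAsPdfPipeline
        original_url = item.get("origin_pdf_url")   # original remote PDF URL, if pdf_url had one
        if local_path:
            spider.logger.info("PDF for %s stored at %s", original_url or item.get("url"), local_path)
            # ... upload local_path or record it in your database here ...
        return item
Register it in ITEM_PIPELINES with a higher priority number than SaveAsPdfPipeline so it runs afterwards.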
Getting help
Please use GitHub issues.
Contributing
PRs are always welcome.
Changes
0.1.0 (2020-12-25)
Initial release
Download files
Download the file for your platform.
Source Distribution: scrapy-save-as-pdf-0.2.1.tar.gz
Built Distribution: scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl
File details
Details for the file scrapy-save-as-pdf-0.2.1.tar.gz.
File metadata
- Download URL: scrapy-save-as-pdf-0.2.1.tar.gz
- Upload date:
- Size: 4.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | f82b5749638847854744334d2e2458f8906a0f55411d03295208dfbaa0e45b82
MD5 | 2d5d4d8d232674210f710f84e1ad5f81
BLAKE2b-256 | 2efd50076cee9f90dadad9a1394039ddcfb01f33fdcd375c88ba06922439ba80
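If you want to verify a downloaded file against the SHA256 digest above, a small check like this is enough (the filename and digest are copied from this listing):
import hashlib

expected = "f82b5749638847854744334d2e2458f8906a0f55411d03295208dfbaa0e45b82"
with open("scrapy-save-as-pdf-0.2.1.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()
print("OK" if actual == expected else "hash mismatch")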
File details
Details for the file scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl.
File metadata
- Download URL: scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl
- Upload date:
- Size: 4.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 520bf291d0fb2d7f5a874ec78c91ad40b766ca88b14c57d503eb290744cc7655
MD5 | f4bf9a22ac292e1503eb46bb701f48ff
BLAKE2b-256 | 46f2f22008a74342410ffb71075ae7260020a84b27626f659f9e29b7c77e2d14