Pipeline to Download PDF or Save page as PDF for scrapy item
Installation
Install scrapy-save-as-pdf using pip:
pip install scrapy-save-as-pdf
Configuration
- (Optional) If you want to use WEBDRIVER_HUB_URL, you can use docker to set one up like this:
docker run -d -p 4444:4444 -v /dev/shm:/dev/shm selenium/standalone-chrome:4.0.0-alpha-7-20201119
The WEBDRIVER_HUB_URL value is then http://docker_host_ip:4444/wd/hub; since we often debug on the local host, we use http://127.0.0.1:4444/wd/hub.
- Add the following settings to the settings.py of your Scrapy project:
PROXY = ""
CHROME_DRIVER_PATH = '/snap/bin/chromium.chromedriver'
PDF_SAVE_PATH = "./pdfs"
PDF_SAVE_AS_PDF = False
PDF_DOWNLOAD_TIMEOUT = 60
PDF_PRINT_OPTIONS = {
    'landscape': False,
    'displayHeaderFooter': False,
    'printBackground': True,
    'preferCSSPageSize': True,
}
WEBDRIVER_HUB_URL = 'http://127.0.0.1:4444/wd/hub'
If both WEBDRIVER_HUB_URL and CHROME_DRIVER_PATH are set, WEBDRIVER_HUB_URL takes precedence.
- Enable the pipeline by adding it to ITEM_PIPELINES in your settings.py file and setting its priority:
ITEM_PIPELINES = {
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': -1,
}
This pipeline should run before your persistence pipeline (such as one that saves to a database) and after your preprocessing pipeline. In the demo Scrapy project, I put the SaveToQiniuPipeline after this plugin to persist the PDFs to the cloud.
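That ordering can be sketched with Scrapy's pipeline priorities, where lower numbers run first. MyPreprocessPipeline and MyPersistPipeline below are hypothetical names for your own pipelines, not part of this package:

```python
# settings.py: lower priority numbers run earlier in the chain.
ITEM_PIPELINES = {
    'myproject.pipelines.MyPreprocessPipeline': 100,    # hypothetical: cleans/normalizes items
    'scrapy_save_as_pdf.pipelines.SaveAsPdfPipeline': 200,
    'myproject.pipelines.MyPersistPipeline': 300,       # hypothetical: saves items to storage
}
```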
Usage
Set the pdf_url and/or url field in your yielded items:
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        "http://example.com",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse)

    def parse(self, response):
        yield {
            "url": "http://example.com/cate1/page1.html",
            "pdf_url": "http://example.com/cate1/page1.pdf",
        }
        yield {
            "url": "http://example.com/cate1/page2.html",
            "pdf_url": "http://example.com/cate1/page2.pdf",
        }
The pdf_url field will be populated with the location of the downloaded PDF file. If pdf_url already had a value, that value is moved to the origin_pdf_url field; you can handle both fields in your next pipeline.
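For illustration, a downstream persistence pipeline could read those two fields like this. MyPersistPipeline and the "persisted" key are hypothetical, a minimal sketch rather than anything shipped with this package:

```python
class MyPersistPipeline:
    """Hypothetical downstream pipeline consuming the fields
    populated by SaveAsPdfPipeline."""

    def process_item(self, item, spider):
        local_path = item.get("pdf_url")            # local file written by SaveAsPdfPipeline
        original_url = item.get("origin_pdf_url")   # original remote URL, if there was one
        if local_path:
            # Persist the file somewhere (database, object storage, ...).
            # Here we only record what would be uploaded.
            item["persisted"] = f"uploaded {local_path} (source: {original_url})"
        return item
```

A pipeline like this would sit after SaveAsPdfPipeline in ITEM_PIPELINES so it always sees the rewritten fields.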
Getting help
Please use GitHub issues.
Contributing
PRs are always welcome.
Changes
0.1.0 (2020-12-25)
Initial release
File details
Details for the file scrapy-save-as-pdf-0.2.1.tar.gz.
File metadata
- Download URL: scrapy-save-as-pdf-0.2.1.tar.gz
- Upload date:
- Size: 4.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f82b5749638847854744334d2e2458f8906a0f55411d03295208dfbaa0e45b82 |
| MD5 | 2d5d4d8d232674210f710f84e1ad5f81 |
| BLAKE2b-256 | 2efd50076cee9f90dadad9a1394039ddcfb01f33fdcd375c88ba06922439ba80 |
File details
Details for the file scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl.
File metadata
- Download URL: scrapy_save_as_pdf-0.2.1-py2.py3-none-any.whl
- Upload date:
- Size: 4.4 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.6.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.51.0 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 520bf291d0fb2d7f5a874ec78c91ad40b766ca88b14c57d503eb290744cc7655 |
| MD5 | f4bf9a22ac292e1503eb46bb701f48ff |
| BLAKE2b-256 | 46f2f22008a74342410ffb71075ae7260020a84b27626f659f9e29b7c77e2d14 |