A tool to scrape websites and generate PDFs from sitemap URLs.
MDPDF Web Scraper
A Python package for scraping websites and generating PDFs from their content.
What it does
This package reads a website's sitemap, fetches the HTML content of each URL listed in it, converts that content to Markdown, and generates one PDF file per URL. It uses Python's concurrent.futures module to process URLs concurrently rather than one at a time, which shortens the total run time on sitemaps with many pages.
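The first and last parts of that pipeline (reading URLs out of a sitemap and handing them to a pool of workers) can be sketched with the standard library alone. This is a minimal illustration, not the package's own code: `extract_urls`, `process_all`, the `max_workers` default, and the sample sitemap are all assumptions made for the example, and the worker is a stand-in that only returns a string instead of fetching and converting a page.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Hypothetical helper: return every <loc> URL in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def process_all(urls, worker, max_workers=4):
    """Hypothetical helper: run worker(url) on a thread pool.

    pool.map returns results in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))


# Inline sample so the sketch runs without network access.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

urls = extract_urls(sample)
# Stand-in worker: a real one would fetch the page and write a PDF.
results = process_all(urls, lambda u: f"would convert {u}")
```

A thread pool suits this kind of job because most of the time per page is spent waiting on the network, so several downloads can overlap.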
How to use
Installation
You can install the package using pip:
pip install mdpdf-scraper
Usage
To use the package, import the WebToPDF_Scraper function and pass the URL of the website's sitemap as an argument:
from mdpdfscraper import WebToPDF_Scraper
sitemap_url = "https://example.com/sitemap.xml"
pdf_folder = "pdfs"  # folder where the generated PDF files will be saved
WebToPDF_Scraper(sitemap_url)
This will scrape the website's sitemap, extract the HTML content from each URL, convert it to Markdown, and generate a PDF file for each URL. The PDF files will be saved in a directory named "pdfs".
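To make the HTML-to-Markdown step more concrete, here is a deliberately small sketch. It does not show how the package performs the conversion, which the description above does not specify. Instead of beautifulsoup4, it uses the standard library's `html.parser` as a stand-in so that it runs without extra dependencies. The `MiniMarkdown` class and `html_to_markdown` function are hypothetical names, and only headings (`h1` to `h3`) and paragraphs are handled.

```python
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Hypothetical, tiny converter: headings and paragraphs only."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished Markdown blocks
        self.prefix = None  # Markdown prefix of the block being read, or None
        self.buffer = []    # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.prefix, self.buffer = self.HEADINGS[tag], []
        elif tag == "p":
            self.prefix, self.buffer = "", []

    def handle_data(self, data):
        # Text outside a heading or paragraph is ignored.
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.prefix is not None and (tag in self.HEADINGS or tag == "p"):
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(self.prefix + text)
            self.prefix = None


def html_to_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


page = "<html><body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>"
markdown = html_to_markdown(page)
```

Inline tags such as `<b>` are dropped here and only their text is kept; a fuller converter would map them to Markdown emphasis.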
Configuration
You can change the directory where the PDF files are saved by modifying the pdf_folder variable.
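How each page is named inside that folder is not described here. As one possible scheme, the sketch below maps a URL to a file path under `pdf_folder`. The `pdf_path_for` helper and its naming rule are assumptions made for illustration, not the package's documented behaviour.

```python
from pathlib import Path
from urllib.parse import urlparse


def pdf_path_for(url, pdf_folder="pdfs"):
    """Hypothetical helper: map a page URL to a PDF path inside pdf_folder."""
    parsed = urlparse(url)
    # Combine host and path; fall back to "index" if both are empty.
    raw = (parsed.netloc + parsed.path).strip("/") or "index"
    # Replace every character that is not filename-safe with an underscore.
    safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in raw)
    return Path(pdf_folder) / f"{safe}.pdf"


path = pdf_path_for("https://example.com/docs/intro", "output")
```

Before writing files, the target folder can be created with `Path(pdf_folder).mkdir(parents=True, exist_ok=True)`, which does nothing if the folder already exists.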
Requirements
The package requires the following dependencies:
requests
beautifulsoup4
fpdf2
License
This package is licensed under the MIT License.
Hashes for MDPDF_scraper-0.1.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 46a50175de235a61f7e9c10623c84d9557d097baab8a3042401a662048b74104
MD5 | 508078dcfaa585bdc07acfe7d3dce578
BLAKE2b-256 | 9c6fbab8a23698d8553c7f68ed80986d9e65ea3d0f36bf7f3dca1a96b63119a0