A tool to scrape websites and generate PDFs from sitemap URLs.
MDPDF Web Scraper
A Python package for scraping websites and generating PDFs from their content.
What it does
This package reads a website's sitemap, fetches the HTML content of each URL listed in it, converts that content to Markdown, and generates one PDF file per URL. It uses Python's concurrent.futures module to process URLs concurrently, which speeds up runs on sitemaps with many URLs.
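The first and last parts of that pipeline (reading URLs out of a sitemap and processing them concurrently) can be sketched with the standard library alone. This is a minimal illustration, not the package's actual implementation: the names `extract_urls` and `process_all`, the sample sitemap, and the choice of a thread pool are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> List[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def process_all(urls: List[str], worker: Callable[[str], object],
                max_workers: int = 4) -> list:
    """Apply `worker` to every URL using a thread pool.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))


# Hypothetical sitemap content, used here instead of a network request.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>
""".strip()

if __name__ == "__main__":
    urls = extract_urls(SAMPLE_SITEMAP)
    # A stand-in worker; a real one would fetch the page and write a PDF.
    print(process_all(urls, lambda url: f"processed {url}"))
```

A thread pool suits this kind of work because each URL is mostly spent waiting on network I/O, so threads can overlap those waits.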
How to use
Installation
You can install the package using pip:
pip install MDPDF-scraper
Usage
To use the package, import the WebToPDF_Scraper class, create an instance with the URL of the website's sitemap and the output folder, and call its scrape() method:
from MDPDF_scraper.mdpdfscraper import WebToPDF_Scraper

if __name__ == "__main__":
    sitemap_url = "https://www.example.com/sitemap.xml"
    pdf_folder = "pdfs"
    scraper = WebToPDF_Scraper(sitemap_url, pdf_folder)
    scraper.scrape()
This will scrape the website's sitemap, extract the HTML content from each URL, convert it to Markdown, and generate a PDF file for each URL. The PDF files will be saved in a directory named "pdfs".
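To give a feel for the HTML-to-Markdown step, here is a deliberately tiny sketch built on the standard library's `html.parser`. It is not how the package performs the conversion (the package lists beautifulsoup4 as a dependency); the class `MarkdownSketch` and the function `html_to_markdown` are hypothetical names, and only `h1` to `h3` headings and paragraphs are handled.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Convert a very small subset of HTML (h1-h3 and p) to Markdown."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self._prefix = None  # Markdown prefix of the open block, or None
        self._text = []      # text fragments collected for the open block

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._prefix, self._text = self.HEADINGS[tag], []
        elif tag == "p":
            self._prefix, self._text = "", []

    def handle_data(self, data):
        # Text outside a heading or paragraph is ignored in this sketch.
        if self._prefix is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prefix is not None and (tag in self.HEADINGS or tag == "p"):
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._text).split())
            if text:
                self.blocks.append(self._prefix + text)
            self._prefix = None


def html_to_markdown(html: str) -> str:
    """Return Markdown for the headings and paragraphs found in `html`."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>"
    print(html_to_markdown(page))
```

Inline tags such as `<b>` are dropped here while their text is kept; a fuller converter would also map links, lists, and emphasis.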
Configuration
To change the directory where the PDF files are saved, pass a different pdf_folder value to WebToPDF_Scraper.
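The source does not say how individual PDF files are named inside that folder. As an illustration only, one plausible scheme derives a file name from each URL; the helper `url_to_pdf_path` below is hypothetical and is not part of the package's API.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_pdf_path(url: str, pdf_folder: str = "pdfs") -> Path:
    """Build an output path for `url` inside `pdf_folder` (assumed scheme)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Replace every run of non-alphanumeric characters with an underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_") or "index"
    return Path(pdf_folder) / f"{slug}.pdf"


if __name__ == "__main__":
    print(url_to_pdf_path("https://www.example.com/docs/intro", "output"))
```

Including the host name in the file name keeps pages from different sites apart if several sitemaps are written to the same folder.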
Requirements
The package requires the following dependencies:
requests
beautifulsoup4
fpdf2
License
This package is licensed under the MIT License.
Hashan Wickramasinghe InferQ hashan@inferencequotient.com
Hashes for MDPDF_scraper-0.1.5.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | af94a9602aa7b9195f43b7f1f06cd394886da64d0084c3b78a993a2e22443020
MD5 | 6523c87ae08d6f350f648dec059b2234
BLAKE2b-256 | e31cf1c2979be88b7dd1b00c2356f030e9497ae7fe4d6e1a38a01e9913662b45