A tool to scrape websites and generate PDFs from sitemap URLs.
MDPDF Web Scraper
A Python package for scraping websites and generating PDFs from their content.
What it does
This package reads a website's sitemap, fetches the HTML content of each listed URL, converts it to Markdown, and generates one PDF file per URL. It uses Python's concurrent.futures module to process URLs concurrently, which can shorten the total run time on sites with many pages.
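The package's internals are not shown in this description, so the following is only a minimal sketch of how concurrent.futures could fan out per-URL work. It assumes a thread pool (the package may use a different executor), and the names process_all and convert are hypothetical, not part of the MDPDF-scraper API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(urls, convert, max_workers=8):
    """Run convert(url) for every URL in a thread pool.

    Returns (results, errors): two dicts keyed by URL.
    `convert` is a hypothetical per-URL worker, e.g. fetch + convert + write PDF.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per URL and remember which future belongs to which URL.
        futures = {pool.submit(convert, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing page should not stop the remaining pages.
                errors[url] = exc
    return results, errors
```

Threads suit this kind of workload because most of the time per URL is spent waiting on the network rather than computing.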
How to use
Installation
You can install the package using pip:
pip install MDPDF-scraper
Usage
To use the package, import the WebToPDF_Scraper class, create an instance with the URL of the website's sitemap and the output folder, and call its scrape() method:
from MDPDF_scraper.mdpdfscraper import WebToPDF_Scraper

if __name__ == "__main__":
    sitemap_url = "https://www.example.com/sitemap.xml"
    pdf_folder = "pdfs"
    scraper = WebToPDF_Scraper(sitemap_url, pdf_folder)
    scraper.scrape()
Running this script fetches the sitemap and produces one PDF file per URL, following the steps described above. The PDF files are saved in the folder named by pdf_folder, here "pdfs".
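To illustrate the first step, here is a rough sketch of how page URLs can be pulled out of a sitemap document using only the standard library. It assumes the sitemap follows the sitemaps.org 0.9 schema; extract_urls is a hypothetical helper and is not necessarily how WebToPDF_Scraper does it.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]
```

For a sitemap index file, the returned entries would be the URLs of nested sitemaps rather than pages, so a real scraper would need to handle that case separately.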
Configuration
You can change the directory where the PDF files are saved by modifying the pdf_folder variable.
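This description does not say whether the package creates the output directory itself. If you want one folder per site, a small sketch like the following can create it up front; make_pdf_folder is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def make_pdf_folder(base_dir, name):
    """Create base_dir/name if needed and return its path as a string."""
    folder = Path(base_dir) / name
    # parents=True creates missing parent folders; exist_ok=True makes reruns safe.
    folder.mkdir(parents=True, exist_ok=True)
    return str(folder)


# Example (assumed usage): pass the result as the second argument.
# pdf_folder = make_pdf_folder("pdfs", "example-site")
# scraper = WebToPDF_Scraper(sitemap_url, pdf_folder)
```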
Requirements
The package requires the following dependencies:
requests
beautifulsoup4
fpdf2
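pip normally installs these dependencies along with the package. If you want to confirm they are importable in your environment, a quick check could look like the sketch below. Note that beautifulsoup4 is imported as bs4 and fpdf2 as fpdf; missing_dependencies is a hypothetical helper.

```python
from importlib.util import find_spec

# Distribution name on PyPI -> module name used in import statements.
REQUIRED = {"requests": "requests", "beautifulsoup4": "bs4", "fpdf2": "fpdf"}


def missing_dependencies(required=REQUIRED):
    """Return the distribution names whose modules cannot be found."""
    return [dist for dist, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```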
License
This package is licensed under the MIT License.
Hashan Wickramasinghe InferQ hashan@inferencequotient.com