
A tool to scrape websites and generate PDFs from sitemap URLs.


MDPDF Web Scraper

A Python package for scraping websites and generating PDFs from their content.

What it does

This package reads a website's sitemap, fetches the HTML content of each listed URL, converts it to Markdown, and generates one PDF file per URL. It uses Python's concurrent.futures module to process URLs concurrently rather than one at a time.
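The pipeline described above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual implementation: the helper names extract_urls and process_all, the sample sitemap, and the choice of a thread pool are all assumptions made for the example, and the worker is a stand-in for the real fetch, convert, and write steps.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap used only for this illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def process_all(urls, worker, max_workers=4):
    """Apply worker to each URL in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))


if __name__ == "__main__":
    urls = extract_urls(SAMPLE_SITEMAP)
    # Stand-in worker: a real one would fetch the page and write a PDF.
    print(process_all(urls, lambda url: f"processed {url}"))
```

A thread pool suits this kind of work because each URL spends most of its time waiting on the network, so several pages can be in flight at once.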

How to use

Installation

You can install the package using pip:

pip install MDPDF-scraper

Usage

To use the package, import the WebToPDF_Scraper class, create an instance with the URL of the website's sitemap and an output folder, and call its scrape() method:

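from MDPDF_scraper.mdpdfscraper import WebToPDF_Scraper

if __name__ == "__main__":
    # URL of the sitemap that lists the pages to convert
    sitemap_url = "https://www.example.com/sitemap.xml"
    # Directory where the generated PDF files are saved
    pdf_folder = "pdfs"
    scraper = WebToPDF_Scraper(sitemap_url, pdf_folder)
    # Process every URL in the sitemap and generate a PDF for each
    scraper.scrape()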

Running this script generates one PDF file for each URL in the sitemap and saves the files in a directory named "pdfs".

Configuration

To save the PDF files in a different directory, change the value of pdf_folder that is passed to WebToPDF_Scraper.
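To make the role of pdf_folder concrete, here is a small sketch of how a page URL could be mapped to a PDF path inside that folder. The function pdf_path_for and its naming scheme are hypothetical; the package's real file names may differ.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def pdf_path_for(url: str, pdf_folder: str) -> Path:
    """Map a page URL to a PDF path inside pdf_folder (hypothetical naming scheme)."""
    parsed = urlparse(url)
    # Combine host and path, then drop leading and trailing slashes.
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Replace every run of characters that is not filename-safe with "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw) or "index"
    return Path(pdf_folder) / f"{safe}.pdf"


if __name__ == "__main__":
    print(pdf_path_for("https://www.example.com/docs/intro", "pdfs"))
```

Changing the second argument is all it takes to redirect the output, which mirrors how pdf_folder is used in the usage example.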

Requirements

The package requires the following dependencies:

  • requests
  • beautifulsoup4
  • fpdf2
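Two of these packages are imported under a name that differs from the name used with pip: beautifulsoup4 is imported as bs4, and fpdf2 as fpdf. The following sketch checks whether each dependency can be imported in the current environment; the helper missing_dependencies is a hypothetical name introduced for this example and is not part of the package.

```python
from importlib.util import find_spec

# Name used with pip -> top-level module name used in an import statement.
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "fpdf2": "fpdf",
}


def missing_dependencies(required=REQUIRED):
    """Return the pip names of required packages that cannot be imported."""
    return [dist for dist, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Installing the package with pip normally pulls in these dependencies automatically, so a check like this is mainly useful when debugging a broken environment.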

License

This package is licensed under the MIT License.

Hashan Wickramasinghe InferQ hashan@inferencequotient.com

Download files

Source Distribution

MDPDF-scraper-0.1.5.1.tar.gz (377.4 kB)

Built Distribution

MDPDF_scraper-0.1.5.1-py3-none-any.whl (379.2 kB, Python 3)

