Project description

fetcharoo


A Python library for downloading PDF files from webpages with support for recursive link following, PDF merging, and security hardening.

Features

  • Download PDF files from a specified webpage
  • Recursive crawling with configurable depth (up to 5 levels)
  • Merge downloaded PDFs into a single file or save separately
  • Security hardening: Domain restriction, path traversal protection, rate limiting
  • Configurable timeouts and request delays
  • Simple, easy-to-use Python API

Requirements

  • Python 3.10 or higher
  • Dependencies: requests, pymupdf, beautifulsoup4

Installation

Using pip

pip install fetcharoo

From GitHub (latest)

pip install git+https://github.com/MALathon/fetcharoo.git

Using Poetry

poetry add fetcharoo

From source

git clone https://github.com/MALathon/fetcharoo.git
cd fetcharoo
poetry install

Quick Start

from fetcharoo import download_pdfs_from_webpage

# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='merge',
    write_dir='output'
)
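
With recursion_depth=1, links on the starting page are followed one level deep before PDFs are collected, and mode='merge' combines everything found into a single PDF under the write_dir directory.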

Usage

Basic Usage

from fetcharoo import download_pdfs_from_webpage

# Download and save PDFs as separate files
download_pdfs_from_webpage(
    url='https://example.com/documents',
    recursion_depth=0,  # Only search the specified page
    mode='separate',
    write_dir='downloads'
)

With Security Options

from fetcharoo import download_pdfs_from_webpage

# Restrict crawling to specific domains
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com', 'docs.example.com'},
    request_delay=1.0,  # 1 second between requests
    timeout=60  # 60 second timeout
)

Finding PDFs Without Downloading

from fetcharoo import find_pdfs_from_webpage

# Just get the list of PDF URLs
pdf_urls = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1
)

for url in pdf_urls:
    print(url)

Processing PDFs Separately

from fetcharoo import find_pdfs_from_webpage, process_pdfs

# Find PDFs first
pdf_urls = find_pdfs_from_webpage('https://example.com')

# Then process them
if pdf_urls:
    success = process_pdfs(
        pdf_links=pdf_urls,
        write_dir='output',
        mode='separate'
    )

API Reference

download_pdfs_from_webpage()

Main function to find and download PDFs from a webpage.

Parameter        Type   Default      Description
url              str    (required)   The webpage URL to search
recursion_depth  int    0            How many levels of links to follow (max 5)
mode             str    'separate'   'merge' or 'separate'
write_dir        str    'output'     Output directory for PDFs
allowed_domains  set    None         Restrict crawling to these domains
request_delay    float  0.5          Seconds between requests
timeout          int    30           Request timeout in seconds

find_pdfs_from_webpage()

Find PDF URLs without downloading them. Takes the same url and recursion_depth parameters as download_pdfs_from_webpage and returns the list of discovered PDF URLs (see the example above).

process_pdfs()

Download and save a list of PDF URLs. Takes pdf_links, write_dir, and mode, and returns a success flag (see the example above).

Utility Functions

  • merge_pdfs() - Merge multiple PDF documents
  • is_valid_url() - Validate URL format and scheme
  • is_safe_domain() - Check if domain is allowed
  • sanitize_filename() - Prevent path traversal attacks
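
A minimal sketch of the filename and merging helpers in use. The exact signatures here are assumptions inferred from the descriptions above, not documented API:

from fetcharoo import sanitize_filename, merge_pdfs

# Assumed signature: sanitize_filename(name) -> str. Strips path
# components so a traversal attempt yields a bare filename.
safe_name = sanitize_filename('../../etc/report.pdf')
print(safe_name)  # expected: 'report.pdf' (assumption)

# Assumed signature: merge_pdfs(paths, output_path). Combines the
# listed PDF files into a single document.
merge_pdfs(['a.pdf', 'b.pdf'], 'combined.pdf')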

Security Features

fetcharoo includes several security measures:

  • Domain restriction: Limit recursive crawling to specified domains (SSRF protection)
  • Path traversal protection: Sanitizes filenames to prevent directory escape
  • Rate limiting: Configurable delays between requests
  • Timeout handling: Prevents hanging on slow servers
  • URL validation: Only allows http/https schemes
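
A quick sketch of the validation helpers implied by this list. The return values and signatures are assumptions based on the feature descriptions, not confirmed behavior:

from fetcharoo import is_valid_url, is_safe_domain

# Assumed: is_valid_url(url) -> bool, accepting only http/https schemes.
is_valid_url('https://example.com/doc.pdf')  # expected True
is_valid_url('file:///etc/passwd')           # expected False

# Assumed: is_safe_domain(url, allowed_domains) -> bool, mirroring the
# allowed_domains parameter of download_pdfs_from_webpage.
allowed = {'example.com', 'docs.example.com'}
is_safe_domain('https://docs.example.com/a.pdf', allowed)  # expected True
is_safe_domain('https://evil.example.net/a.pdf', allowed)  # expected False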

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Developed by Mark A. Lifson, Ph.D.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fetcharoo-0.1.0.tar.gz (7.3 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fetcharoo-0.1.0-py3-none-any.whl (9.5 kB)

Uploaded Python 3

File details

Details for the file fetcharoo-0.1.0.tar.gz.

File metadata

  • Download URL: fetcharoo-0.1.0.tar.gz
  • Upload date:
  • Size: 7.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.12.12 Linux/6.11.0-1018-azure

File hashes

Hashes for fetcharoo-0.1.0.tar.gz:

  • SHA256: af53b5170672fb5a330ef2b2d6d4b7316a832e7f07bb7b645b089ec6f89a43ae
  • MD5: 7ea4323f77b8782a2ce987a9aaf7aaae
  • BLAKE2b-256: 15a08cc824a3c9e60cb78843c67e2e7636ffa8906ee517ebd59b7da5bfc22653


File details

Details for the file fetcharoo-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: fetcharoo-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.12.12 Linux/6.11.0-1018-azure

File hashes

Hashes for fetcharoo-0.1.0-py3-none-any.whl:

  • SHA256: 133953963100157622b30ef2fc3a755116ab2037a82599a18e7c4271d7f59c10
  • MD5: aa3a7ffca8d777ce9b9dc91f9d9b6aa1
  • BLAKE2b-256: 0467d5942660182dadeac5429d0e566029c3c2f9afdcedc5ca95aeb8533ba2bf

