Project description

fetcharoo


A Python library for downloading PDF files from webpages with support for recursive link following, PDF merging, and security hardening.

Features

  • Download PDF files from a specified webpage
  • Recursive crawling with configurable depth (up to 5 levels)
  • Merge downloaded PDFs into a single file or save separately
  • Security hardening: Domain restriction, path traversal protection, rate limiting
  • Configurable timeouts and request delays
  • Simple, easy-to-use Python API

Requirements

  • Python 3.10 or higher
  • Dependencies: requests, pymupdf, beautifulsoup4

Installation

Using pip

pip install fetcharoo

From GitHub (latest)

pip install git+https://github.com/MALathon/fetcharoo.git

Using Poetry

poetry add fetcharoo

From source

git clone https://github.com/MALathon/fetcharoo.git
cd fetcharoo
poetry install

Quick Start

from fetcharoo import download_pdfs_from_webpage

# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='merge',
    write_dir='output'
)
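
With recursion_depth=1, links on the starting page are followed one level deep before PDFs are collected, and mode='merge' combines everything found into a single PDF under the write_dir directory.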

Usage

Basic Usage

from fetcharoo import download_pdfs_from_webpage

# Download and save PDFs as separate files
download_pdfs_from_webpage(
    url='https://example.com/documents',
    recursion_depth=0,  # Only search the specified page
    mode='separate',
    write_dir='downloads'
)

With Security Options

from fetcharoo import download_pdfs_from_webpage

# Restrict crawling to specific domains
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com', 'docs.example.com'},
    request_delay=1.0,  # 1 second between requests
    timeout=60  # 60 second timeout
)

Finding PDFs Without Downloading

from fetcharoo import find_pdfs_from_webpage

# Just get the list of PDF URLs
pdf_urls = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1
)

for url in pdf_urls:
    print(url)

Processing PDFs Separately

from fetcharoo import find_pdfs_from_webpage, process_pdfs

# Find PDFs first
pdf_urls = find_pdfs_from_webpage('https://example.com')

# Then process them
if pdf_urls:
    success = process_pdfs(
        pdf_links=pdf_urls,
        write_dir='output',
        mode='separate'
    )

API Reference

download_pdfs_from_webpage()

Main function to find and download PDFs from a webpage.

Parameter        Type   Default      Description
url              str    (required)   The webpage URL to search
recursion_depth  int    0            How many levels of links to follow (max 5)
mode             str    'separate'   'merge' or 'separate'
write_dir        str    'output'     Output directory for PDFs
allowed_domains  set    None         Restrict crawling to these domains
request_delay    float  0.5          Seconds between requests
timeout          int    30           Request timeout in seconds

find_pdfs_from_webpage()

Find PDF URLs without downloading them. Takes the same url and recursion_depth parameters as download_pdfs_from_webpage and returns the list of discovered PDF URLs (see the example above).

process_pdfs()

Download and save a list of PDF URLs. Takes pdf_links, write_dir, and mode, and returns a success flag (see the example above).

Utility Functions

  • merge_pdfs() - Merge multiple PDF documents
  • is_valid_url() - Validate URL format and scheme
  • is_safe_domain() - Check if domain is allowed
  • sanitize_filename() - Prevent path traversal attacks
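
A minimal sketch of the filename and merging helpers in use. The exact signatures here are assumptions inferred from the descriptions above, not documented API:

from fetcharoo import sanitize_filename, merge_pdfs

# Assumed signature: sanitize_filename(name) -> str. Strips path
# components so a traversal attempt yields a bare filename.
safe_name = sanitize_filename('../../etc/report.pdf')
print(safe_name)  # expected: 'report.pdf' (assumption)

# Assumed signature: merge_pdfs(paths, output_path). Combines the
# listed PDF files into a single document.
merge_pdfs(['a.pdf', 'b.pdf'], 'combined.pdf')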

Security Features

fetcharoo includes several security measures:

  • Domain restriction: Limit recursive crawling to specified domains (SSRF protection)
  • Path traversal protection: Sanitizes filenames to prevent directory escape
  • Rate limiting: Configurable delays between requests
  • Timeout handling: Prevents hanging on slow servers
  • URL validation: Only allows http/https schemes
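
A quick sketch of the validation helpers implied by this list. The return values and signatures are assumptions based on the feature descriptions, not confirmed behavior:

from fetcharoo import is_valid_url, is_safe_domain

# Assumed: is_valid_url(url) -> bool, accepting only http/https schemes.
is_valid_url('https://example.com/doc.pdf')  # expected True
is_valid_url('file:///etc/passwd')           # expected False

# Assumed: is_safe_domain(url, allowed_domains) -> bool, mirroring the
# allowed_domains parameter of download_pdfs_from_webpage.
allowed = {'example.com', 'docs.example.com'}
is_safe_domain('https://docs.example.com/a.pdf', allowed)  # expected True
is_safe_domain('https://evil.example.net/a.pdf', allowed)  # expected False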

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes with tests
  4. Submit a pull request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Developed by Mark A. Lifson, Ph.D.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fetcharoo-0.1.0.tar.gz (7.3 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fetcharoo-0.1.0-py3-none-any.whl (9.5 kB)

Uploaded Python 3

File details

Details for the file fetcharoo-0.1.0.tar.gz.

File metadata

  • Download URL: fetcharoo-0.1.0.tar.gz
  • Upload date:
  • Size: 7.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.12.12 Linux/6.11.0-1018-azure

File hashes

Hashes for fetcharoo-0.1.0.tar.gz:

  • SHA256: af53b5170672fb5a330ef2b2d6d4b7316a832e7f07bb7b645b089ec6f89a43ae
  • MD5: 7ea4323f77b8782a2ce987a9aaf7aaae
  • BLAKE2b-256: 15a08cc824a3c9e60cb78843c67e2e7636ffa8906ee517ebd59b7da5bfc22653


File details

Details for the file fetcharoo-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: fetcharoo-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 9.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.12.12 Linux/6.11.0-1018-azure

File hashes

Hashes for fetcharoo-0.1.0-py3-none-any.whl:

  • SHA256: 133953963100157622b30ef2fc3a755116ab2037a82599a18e7c4271d7f59c10
  • MD5: aa3a7ffca8d777ce9b9dc91f9d9b6aa1
  • BLAKE2b-256: 0467d5942660182dadeac5429d0e566029c3c2f9afdcedc5ca95aeb8533ba2bf

