fetcharoo

A Python library for downloading PDF files from webpages, with support for recursive link following, PDF merging, and security hardening.
Features
- Download PDF files from a specified webpage
- Recursive crawling with configurable depth (up to 5 levels)
- Merge downloaded PDFs into a single file or save separately
- Security hardening: Domain restriction, path traversal protection, rate limiting
- Configurable timeouts and request delays
- Simple, easy-to-use Python API
Requirements
- Python 3.10 or higher
- Dependencies: requests, pymupdf, beautifulsoup4
Installation
Using pip
```bash
pip install fetcharoo
```
From GitHub (latest)
```bash
pip install git+https://github.com/MALathon/fetcharoo.git
```
Using Poetry
```bash
poetry add fetcharoo
```
From source
```bash
git clone https://github.com/MALathon/fetcharoo.git
cd fetcharoo
poetry install
```
Quick Start
```python
from fetcharoo import download_pdfs_from_webpage

# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='merge',
    write_dir='output'
)
```
Usage
Basic Usage
```python
from fetcharoo import download_pdfs_from_webpage

# Download and save PDFs as separate files
download_pdfs_from_webpage(
    url='https://example.com/documents',
    recursion_depth=0,  # Only search the specified page
    mode='separate',
    write_dir='downloads'
)
```
With Security Options
```python
from fetcharoo import download_pdfs_from_webpage

# Restrict crawling to specific domains
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com', 'docs.example.com'},
    request_delay=1.0,  # 1 second between requests
    timeout=60          # 60 second timeout
)
```
Finding PDFs Without Downloading
```python
from fetcharoo import find_pdfs_from_webpage

# Just get the list of PDF URLs
pdf_urls = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1
)
for url in pdf_urls:
    print(url)
```
Processing PDFs Separately
```python
from fetcharoo import find_pdfs_from_webpage, process_pdfs

# Find PDFs first
pdf_urls = find_pdfs_from_webpage('https://example.com')

# Then process them
if pdf_urls:
    success = process_pdfs(
        pdf_links=pdf_urls,
        write_dir='output',
        mode='separate'
    )
```
API Reference
download_pdfs_from_webpage()
Main function to find and download PDFs from a webpage.
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | The webpage URL to search |
| recursion_depth | int | 0 | How many levels of links to follow (max 5) |
| mode | str | 'separate' | 'merge' or 'separate' |
| write_dir | str | 'output' | Output directory for PDFs |
| allowed_domains | set | None | Restrict crawling to these domains |
| request_delay | float | 0.5 | Seconds between requests |
| timeout | int | 30 | Request timeout in seconds |
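Only url is required; every other parameter falls back to the defaults listed above. A minimal call (with an illustrative URL) looks like this:

```python
from fetcharoo import download_pdfs_from_webpage

# Defaults apply: recursion_depth=0, mode='separate', write_dir='output'
download_pdfs_from_webpage(url='https://example.com/reports')
```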
find_pdfs_from_webpage()
Find PDF URLs without downloading.
process_pdfs()
Download and save a list of PDF URLs.
Utility Functions
- merge_pdfs() - Merge multiple PDF documents
- is_valid_url() - Validate URL format and scheme
- is_safe_domain() - Check if domain is allowed
- sanitize_filename() - Prevent path traversal attacks
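The exact signatures of these helpers aren't documented on this page, so the sketch below is an assumption: it supposes is_valid_url takes a URL string and is_safe_domain takes a URL plus a set of allowed domains, each returning a bool, and uses them to pre-filter candidate links before downloading.

```python
from fetcharoo import is_valid_url, is_safe_domain

# Hypothetical pre-filtering step; the real signatures may differ.
candidates = [
    'https://docs.example.com/guide.pdf',
    'ftp://example.com/file.pdf',         # not http/https
    'https://evil.example.net/file.pdf',  # domain not allowed
]
allowed = {'example.com', 'docs.example.com'}
safe = [u for u in candidates
        if is_valid_url(u) and is_safe_domain(u, allowed)]
```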
Security Features
fetcharoo includes several security measures:
- Domain restriction: Limit recursive crawling to specified domains (SSRF protection)
- Path traversal protection: Sanitizes filenames to prevent directory escape (see the sketch after this list)
- Rate limiting: Configurable delays between requests
- Timeout handling: Prevents hanging on slow servers
- URL validation: Only allows http/https schemes
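To make the path traversal point concrete, here is a minimal, standalone sketch of the technique (not fetcharoo's actual sanitize_filename implementation): it reduces a proposed filename to its base name, so input like ../../etc/passwd cannot escape the output directory.

```python
import os

def sanitize_filename_sketch(name: str) -> str:
    """Illustrative only: keep just the base name so relative
    components like '..' cannot escape the output directory."""
    base = os.path.basename(name.replace('\\', '/'))
    return base or 'download.pdf'

assert sanitize_filename_sketch('../../etc/passwd') == 'passwd'
assert sanitize_filename_sketch('report.pdf') == 'report.pdf'
```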
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
Developed by Mark A. Lifson, Ph.D.