fetcharoo
A Python library for downloading PDF files from webpages with support for recursive link following, PDF merging, and security hardening.
Features
- Download PDF files from a specified webpage
- Recursive crawling with configurable depth (up to 5 levels)
- Merge downloaded PDFs into a single file or save separately
- Command-line interface for quick downloads
- robots.txt compliance for ethical web crawling
- Custom User-Agent support
- Dry-run mode to preview downloads
- Progress bars with tqdm integration
- PDF filtering by filename pattern, URL pattern, and file size
- Security hardening: domain restriction, path traversal protection, rate limiting
- Configurable timeouts and request delays
Requirements
- Python 3.10 or higher
- Dependencies: `requests`, `pymupdf`, `beautifulsoup4`, `tqdm`
Installation
Using pip
pip install fetcharoo
From GitHub (latest)
pip install git+https://github.com/MALathon/fetcharoo.git
Using Poetry
poetry add fetcharoo
From source
git clone https://github.com/MALathon/fetcharoo.git
cd fetcharoo
poetry install
Command-Line Interface
fetcharoo includes a CLI for quick PDF downloads:
# Download PDFs from a webpage
fetcharoo https://example.com
# Download with recursion and merge into one file
fetcharoo https://example.com -d 2 -m
# List PDFs without downloading (dry run)
fetcharoo https://example.com --dry-run
# Download with custom options
fetcharoo https://example.com -o my_pdfs --delay 1.0 --progress
# Filter PDFs by pattern
fetcharoo https://example.com --include "report*.pdf" --exclude "*draft*"
CLI Options
| Option | Description |
|---|---|
| `-o, --output DIR` | Output directory (default: `output`) |
| `-d, --depth N` | Recursion depth (default: 0) |
| `-m, --merge` | Merge all PDFs into a single file |
| `--dry-run` | List PDFs without downloading |
| `--delay SECONDS` | Delay between requests (default: 0.5) |
| `--timeout SECONDS` | Request timeout (default: 30) |
| `--user-agent STRING` | Custom User-Agent string |
| `--respect-robots` | Respect robots.txt rules |
| `--progress` | Show progress bars |
| `--include PATTERN` | Include only PDFs matching the pattern |
| `--exclude PATTERN` | Exclude PDFs matching the pattern |
| `--min-size BYTES` | Minimum PDF size in bytes |
| `--max-size BYTES` | Maximum PDF size in bytes |
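These options compose; for example, a polite recursive crawl that honors robots.txt and skips very small or very large files (the URL is a placeholder):
# Crawl two levels deep, respect robots.txt, and keep PDFs between 10 KB and 50 MB
fetcharoo https://example.com -d 2 --respect-robots --min-size 10000 --max-size 50000000 --progress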
Quick Start
from fetcharoo import download_pdfs_from_webpage
# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
url='https://example.com',
recursion_depth=1,
mode='merge',
write_dir='output'
)
Usage
Basic Usage
from fetcharoo import download_pdfs_from_webpage
# Download and save PDFs as separate files
download_pdfs_from_webpage(
url='https://example.com/documents',
recursion_depth=0, # Only search the specified page
mode='separate',
write_dir='downloads'
)
With robots.txt Compliance
from fetcharoo import download_pdfs_from_webpage
# Respect robots.txt rules
download_pdfs_from_webpage(
url='https://example.com',
recursion_depth=2,
mode='merge',
write_dir='output',
respect_robots=True,
user_agent='MyBot/1.0'
)
Dry-Run Mode
from fetcharoo import download_pdfs_from_webpage
# Preview what would be downloaded
result = download_pdfs_from_webpage(
url='https://example.com',
recursion_depth=1,
dry_run=True
)
print(f"Found {result['count']} PDFs:")
for url in result['urls']:
print(f" - {url}")
With Progress Bars
from fetcharoo import download_pdfs_from_webpage
# Show progress during download
download_pdfs_from_webpage(
url='https://example.com',
recursion_depth=2,
write_dir='output',
show_progress=True
)
PDF Filtering
from fetcharoo import download_pdfs_from_webpage, FilterConfig
# Filter by filename patterns and size
filter_config = FilterConfig(
filename_include=['report*.pdf', 'annual*.pdf'],
filename_exclude=['*draft*', '*temp*'],
min_size=10000, # 10KB minimum
max_size=50000000 # 50MB maximum
)
download_pdfs_from_webpage(
url='https://example.com',
recursion_depth=1,
write_dir='output',
filter_config=filter_config
)
With Security Options
from fetcharoo import download_pdfs_from_webpage
# Restrict crawling to specific domains
download_pdfs_from_webpage(
url='https://example.com',
recursion_depth=2,
mode='merge',
write_dir='output',
allowed_domains={'example.com', 'docs.example.com'},
request_delay=1.0, # 1 second between requests
timeout=60 # 60 second timeout
)
Finding PDFs Without Downloading
from fetcharoo import find_pdfs_from_webpage
# Just get the list of PDF URLs
pdf_urls = find_pdfs_from_webpage(
url='https://example.com',
recursion_depth=1
)
for url in pdf_urls:
print(url)
Custom User-Agent
from fetcharoo import download_pdfs_from_webpage, set_default_user_agent
# Set a global default User-Agent
set_default_user_agent('MyCompanyBot/1.0 (contact@example.com)')
# Or use per-request User-Agent
download_pdfs_from_webpage(
url='https://example.com',
user_agent='SpecificBot/2.0'
)
API Reference
download_pdfs_from_webpage()
Main function to find and download PDFs from a webpage.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | str | required | The webpage URL to search |
| `recursion_depth` | int | 0 | How many levels of links to follow (max 5) |
| `mode` | str | 'separate' | 'merge' or 'separate' |
| `write_dir` | str | 'output' | Output directory for PDFs |
| `allowed_domains` | set | None | Restrict crawling to these domains |
| `request_delay` | float | 0.5 | Seconds between requests |
| `timeout` | int | 30 | Request timeout in seconds |
| `respect_robots` | bool | False | Whether to respect robots.txt |
| `user_agent` | str | None | Custom User-Agent (uses default if None) |
| `dry_run` | bool | False | Preview URLs without downloading |
| `show_progress` | bool | False | Show progress bars |
| `filter_config` | FilterConfig | None | PDF filtering configuration |
find_pdfs_from_webpage()
Find PDF URLs without downloading.
process_pdfs()
Download and save a list of PDF URLs.
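You can pair the two lower-level functions to apply your own selection logic before downloading. This is a minimal sketch: the exact signature of process_pdfs is an assumption here (a list of URLs plus the same mode and write_dir options as download_pdfs_from_webpage):
from fetcharoo import find_pdfs_from_webpage, process_pdfs
# Collect candidate PDF URLs without downloading anything
pdf_urls = find_pdfs_from_webpage(url='https://example.com', recursion_depth=1)
# Apply custom selection logic, then download the survivors
# (assumed signature: process_pdfs(urls, mode=..., write_dir=...))
annual_reports = [u for u in pdf_urls if 'annual' in u]
process_pdfs(annual_reports, mode='separate', write_dir='downloads')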
FilterConfig
Configuration for PDF filtering:
from fetcharoo import FilterConfig
config = FilterConfig(
filename_include=['*.pdf'], # Patterns to include
filename_exclude=['*draft*'], # Patterns to exclude
url_include=['*/reports/*'], # URL patterns to include
url_exclude=['*/temp/*'], # URL patterns to exclude
min_size=1000, # Minimum size in bytes
max_size=100000000 # Maximum size in bytes
)
Utility Functions
- `merge_pdfs()` - Merge multiple PDF documents
- `is_valid_url()` - Validate URL format and scheme
- `is_safe_domain()` - Check if a domain is allowed
- `sanitize_filename()` - Prevent path traversal attacks
- `check_robots_txt()` - Check robots.txt permissions
- `set_default_user_agent()` - Set the default User-Agent
- `get_default_user_agent()` - Get the current default User-Agent
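A brief sketch of the validation helpers; the single-argument signatures shown here are assumptions based on the descriptions above:
from fetcharoo import is_valid_url, sanitize_filename
# Assumed: is_valid_url(url) returns a bool (http/https schemes only)
print(is_valid_url('ftp://example.com/file.pdf'))  # expected: False
# Assumed: sanitize_filename(name) returns a cleaned filename
# with any path traversal components removed
print(sanitize_filename('../../etc/report.pdf'))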
Security Features
fetcharoo includes several security measures:
- Domain restriction: Limit recursive crawling to specified domains (SSRF protection)
- Path traversal protection: Sanitizes filenames to prevent directory escape
- Rate limiting: Configurable delays between requests
- Timeout handling: Prevents hanging on slow servers
- URL validation: Only allows http/https schemes
- robots.txt compliance: Optional respect for crawling rules
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
Developed by Mark A. Lifson, Ph.D.