fetcharoo

A Python library for downloading PDF files from webpages with support for recursive link following, PDF merging, and security hardening.

Features

Download PDF files from a specified webpage
Recursive crawling with configurable depth (up to 5 levels)
Merge downloaded PDFs into a single file or save separately
Command-line interface for quick downloads
Optional robots.txt compliance for ethical web crawling
Custom User-Agent support
Dry-run mode to preview downloads
Progress bars with tqdm integration
PDF filtering by filename, URL patterns, and size
Security hardening: domain restriction, path traversal protection, and rate limiting
Configurable timeouts and request delays

Requirements

Python 3.10 or higher
Dependencies: requests, pymupdf, beautifulsoup4, tqdm

Installation

Using pip

pip install fetcharoo

From GitHub (latest)

pip install git+https://github.com/MALathon/fetcharoo.git

Using Poetry

poetry add fetcharoo

From source

git clone https://github.com/MALathon/fetcharoo.git
cd fetcharoo
poetry install

Command-Line Interface

fetcharoo includes a CLI for quick PDF downloads:

# Download PDFs from a webpage
fetcharoo https://example.com

# Download with recursion and merge into one file
fetcharoo https://example.com -d 2 -m

# List PDFs without downloading (dry run)
fetcharoo https://example.com --dry-run

# Download with custom options
fetcharoo https://example.com -o my_pdfs --delay 1.0 --progress

# Filter PDFs by pattern
fetcharoo https://example.com --include "report*.pdf" --exclude "*draft*"

CLI Options

Option	Description
`-o, --output DIR`	Output directory (default: output)
`-d, --depth N`	Recursion depth (default: 0)
`-m, --merge`	Merge all PDFs into a single file
`--dry-run`	List PDFs without downloading
`--delay SECONDS`	Delay between requests (default: 0.5)
`--timeout SECONDS`	Request timeout (default: 30)
`--user-agent STRING`	Custom User-Agent string
`--respect-robots`	Respect robots.txt rules
`--progress`	Show progress bars
`--include PATTERN`	Include PDFs matching pattern
`--exclude PATTERN`	Exclude PDFs matching pattern
`--min-size BYTES`	Minimum PDF size in bytes
`--max-size BYTES`	Maximum PDF size in bytes
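
When the CLI is driven from a script, the options above can be assembled into an argument list for `subprocess.run`. The helper below is a hypothetical convenience, not part of fetcharoo. It uses only flags listed in the table and assumes the `fetcharoo` executable is on `PATH`.

```python
from __future__ import annotations


def build_fetcharoo_command(
    url: str,
    output: str | None = None,
    depth: int | None = None,
    merge: bool = False,
    dry_run: bool = False,
    delay: float | None = None,
    include: str | None = None,
    exclude: str | None = None,
) -> list[str]:
    """Build an argv list for the fetcharoo CLI (hypothetical helper)."""
    cmd = ["fetcharoo", url]
    if output is not None:
        cmd += ["-o", output]
    if depth is not None:
        cmd += ["-d", str(depth)]
    if merge:
        cmd.append("-m")
    if dry_run:
        cmd.append("--dry-run")
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if include is not None:
        cmd += ["--include", include]
    if exclude is not None:
        cmd += ["--exclude", exclude]
    return cmd


command = build_fetcharoo_command(
    "https://example.com", depth=2, merge=True, delay=1.0
)
print(command)
# To actually run it: subprocess.run(command, check=True)
```

Passing a list rather than a shell string means patterns such as `report*.pdf` reach the CLI unchanged instead of being expanded by the shell.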

Quick Start

from fetcharoo import download_pdfs_from_webpage

# Download PDFs from a webpage and merge them into a single file
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    mode='merge',
    write_dir='output'
)

Usage

Basic Usage

from fetcharoo import download_pdfs_from_webpage

# Download and save PDFs as separate files
download_pdfs_from_webpage(
    url='https://example.com/documents',
    recursion_depth=0,  # Only search the specified page
    mode='separate',
    write_dir='downloads'
)

With robots.txt Compliance

from fetcharoo import download_pdfs_from_webpage

# Respect robots.txt rules
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    respect_robots=True,
    user_agent='MyBot/1.0'
)
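
To make concrete what a robots.txt check involves, here is a minimal sketch using the standard library's `urllib.robotparser`. It is not fetcharoo's implementation. The rules are sample data parsed from a string, so no network request is made.

```python
from urllib.robotparser import RobotFileParser

# Sample rules (made up for illustration): everything under /private/ is off limits.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("MyBot/1.0", "https://example.com/docs/report.pdf")
blocked = parser.can_fetch("MyBot/1.0", "https://example.com/private/secret.pdf")
print(allowed, blocked)
```

A crawler that respects robots.txt would skip any URL for which `can_fetch` returns `False`.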

Dry-Run Mode

from fetcharoo import download_pdfs_from_webpage

# Preview what would be downloaded
result = download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    dry_run=True
)

print(f"Found {result['count']} PDFs:")
for url in result['urls']:
    print(f"  - {url}")
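
Because the dry-run result shown above is a dict with `count` and `urls` keys, it is easy to summarize before committing to a download. The sketch below groups URLs by host. `sample_result` is made-up data standing in for the real return value.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up stand-in for the dict returned when dry_run=True
sample_result = {
    "count": 3,
    "urls": [
        "https://example.com/a.pdf",
        "https://example.com/reports/b.pdf",
        "https://docs.example.com/c.pdf",
    ],
}

# Count how many PDFs each host would serve
pdfs_per_host = Counter(urlparse(u).netloc for u in sample_result["urls"])
for host, n in pdfs_per_host.most_common():
    print(f"{host}: {n} PDF(s)")
```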

With Progress Bars

from fetcharoo import download_pdfs_from_webpage

# Show progress during download
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    write_dir='output',
    show_progress=True
)

PDF Filtering

from fetcharoo import download_pdfs_from_webpage, FilterConfig

# Filter by filename patterns and size
filter_config = FilterConfig(
    filename_include=['report*.pdf', 'annual*.pdf'],
    filename_exclude=['*draft*', '*temp*'],
    min_size=10000,  # 10KB minimum
    max_size=50000000  # 50MB maximum
)

download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1,
    write_dir='output',
    filter_config=filter_config
)

With Security Options

from fetcharoo import download_pdfs_from_webpage

# Restrict crawling to specific domains
download_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=2,
    mode='merge',
    write_dir='output',
    allowed_domains={'example.com', 'docs.example.com'},
    request_delay=1.0,  # 1 second between requests
    timeout=60  # 60 second timeout
)
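
The idea behind `allowed_domains` can be sketched with the standard library. The function `is_allowed` below is hypothetical and is not fetcharoo's `is_safe_domain()`. It assumes exact host matching, so a subdomain must be listed explicitly; whether fetcharoo treats subdomains the same way is not stated here.

```python
from __future__ import annotations

from urllib.parse import urlparse


def is_allowed(url: str, allowed_domains: set[str]) -> bool:
    """Return True if url uses http(s) and its host is in allowed_domains."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    return host in allowed_domains  # exact match only (assumption)


allowed = {"example.com", "docs.example.com"}
for candidate in [
    "https://docs.example.com/a.pdf",
    "https://example.com.evil.test/a.pdf",
    "file:///etc/passwd",
]:
    print(candidate, is_allowed(candidate, allowed))
```

Comparing the parsed hostname, rather than searching for the domain as a substring of the URL, is what keeps look-alike hosts such as `example.com.evil.test` out.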

Finding PDFs Without Downloading

from fetcharoo import find_pdfs_from_webpage

# Just get the list of PDF URLs
pdf_urls = find_pdfs_from_webpage(
    url='https://example.com',
    recursion_depth=1
)

for url in pdf_urls:
    print(url)
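
For intuition about what finding PDF links on a page involves, here is a self-contained sketch. fetcharoo lists beautifulsoup4 as a dependency; this example uses the standard library's `html.parser` instead so it runs without extra packages. It assumes a PDF link is an `<a href>` whose path ends in `.pdf`, which may differ from fetcharoo's own detection logic. `PdfLinkCollector` and `SAMPLE_HTML` are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL
        absolute = urljoin(self.base_url, href)
        # Check the path only, so query strings do not hide the extension
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_urls.append(absolute)


SAMPLE_HTML = """
<a href="/files/report.pdf">Report</a>
<a href="about.html">About</a>
<a href="https://cdn.example.com/Guide.PDF?v=2">Guide</a>
"""

collector = PdfLinkCollector("https://example.com/docs/")
collector.feed(SAMPLE_HTML)
print(collector.pdf_urls)
```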

Custom User-Agent

from fetcharoo import download_pdfs_from_webpage, set_default_user_agent

# Set a global default User-Agent
set_default_user_agent('MyCompanyBot/1.0 (contact@example.com)')

# Or use per-request User-Agent
download_pdfs_from_webpage(
    url='https://example.com',
    user_agent='SpecificBot/2.0'
)

API Reference

`download_pdfs_from_webpage()`

Main function to find and download PDFs from a webpage.

Parameter	Type	Default	Description
`url`	str	required	The webpage URL to search
`recursion_depth`	int	0	How many levels of links to follow (max 5)
`mode`	str	'separate'	'merge' or 'separate'
`write_dir`	str	'output'	Output directory for PDFs
`allowed_domains`	set	None	Restrict crawling to these domains
`request_delay`	float	0.5	Seconds between requests
`timeout`	int	30	Request timeout in seconds
`respect_robots`	bool	False	Whether to respect robots.txt
`user_agent`	str	None	Custom User-Agent (uses default if None)
`dry_run`	bool	False	Preview URLs without downloading
`show_progress`	bool	False	Show progress bars
`filter_config`	FilterConfig	None	PDF filtering configuration
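
With this many keyword arguments, it can help to keep a reusable bundle of settings. The `CrawlOptions` dataclass below is a hypothetical convenience, not part of fetcharoo. Its defaults are copied from the table above, and it rejects a depth above 5 or an unknown mode up front. How fetcharoo itself reacts to out-of-range values is not documented here.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any


@dataclass
class CrawlOptions:
    """Hypothetical bundle of keyword arguments, using the documented defaults."""

    recursion_depth: int = 0
    mode: str = "separate"
    write_dir: str = "output"
    allowed_domains: set[str] | None = None
    request_delay: float = 0.5
    timeout: int = 30
    respect_robots: bool = False
    user_agent: str | None = None
    dry_run: bool = False
    show_progress: bool = False
    filter_config: Any = None  # a FilterConfig instance in real use

    def __post_init__(self) -> None:
        # The table documents a maximum depth of 5 and two valid modes.
        if not 0 <= self.recursion_depth <= 5:
            raise ValueError("recursion_depth must be between 0 and 5")
        if self.mode not in ("merge", "separate"):
            raise ValueError("mode must be 'merge' or 'separate'")

    def as_kwargs(self) -> dict[str, Any]:
        """Return a shallow dict of field values suitable for ** unpacking."""
        return {f.name: getattr(self, f.name) for f in fields(self)}


options = CrawlOptions(recursion_depth=2, mode="merge")
kwargs = options.as_kwargs()
print(kwargs)
# Intended use: download_pdfs_from_webpage(url="https://example.com", **kwargs)
```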

`find_pdfs_from_webpage()`

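Find PDF URLs without downloading them. As in the example above, it takes `url` and `recursion_depth` and returns an iterable of URL strings.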

`process_pdfs()`

Download and save a list of PDF URLs.

`FilterConfig`

Configuration for PDF filtering:

from fetcharoo import FilterConfig

config = FilterConfig(
    filename_include=['*.pdf'],      # Patterns to include
    filename_exclude=['*draft*'],    # Patterns to exclude
    url_include=['*/reports/*'],     # URL patterns to include
    url_exclude=['*/temp/*'],        # URL patterns to exclude
    min_size=1000,                   # Minimum size in bytes
    max_size=100000000               # Maximum size in bytes
)
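
The patterns above look like shell-style globs. Assuming that is how they behave, the sketch below shows one plausible way include patterns, exclude patterns, and size limits could combine. `passes_filter` is a hypothetical function, and the precedence it uses (must match an include if any are given, must match no exclude) is an assumption rather than documented fetcharoo behavior.

```python
import posixpath
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def passes_filter(
    url,
    size,
    *,
    filename_include=(),
    filename_exclude=(),
    min_size=None,
    max_size=None,
):
    """Return True if a PDF at url with the given byte size passes all checks."""
    filename = posixpath.basename(urlparse(url).path)
    # If include patterns are given, the filename must match at least one
    if filename_include and not any(fnmatchcase(filename, p) for p in filename_include):
        return False
    # A match on any exclude pattern rejects the file
    if any(fnmatchcase(filename, p) for p in filename_exclude):
        return False
    if min_size is not None and size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    return True


rules = dict(
    filename_include=["report*.pdf"],
    filename_exclude=["*draft*"],
    min_size=10_000,
)
print(passes_filter("https://example.com/reports/report_2024.pdf", 20_000, **rules))
print(passes_filter("https://example.com/reports/report_draft.pdf", 20_000, **rules))
```

`fnmatchcase` is used instead of `fnmatch` so matching is case-sensitive on every platform.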

Utility Functions

merge_pdfs() - Merge multiple PDF documents
is_valid_url() - Validate URL format and scheme
is_safe_domain() - Check if domain is allowed
sanitize_filename() - Prevent path traversal attacks
check_robots_txt() - Check robots.txt permissions
set_default_user_agent() - Set default User-Agent
get_default_user_agent() - Get current default User-Agent
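
To illustrate what filename sanitizing guards against, here is a minimal sketch. `safe_filename` is a hypothetical function and is not the library's `sanitize_filename()`, whose exact rules are not shown here. It keeps only the last path component and replaces characters outside a conservative set.

```python
import re
from pathlib import PureWindowsPath


def safe_filename(name: str, fallback: str = "download.pdf") -> str:
    """Reduce an untrusted name to a bare filename with conservative characters."""
    # Keep only the last path component; PureWindowsPath treats both
    # '/' and '\\' as separators and drops drive prefixes.
    base = PureWindowsPath(name).name
    # Replace anything outside a conservative whitelist
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    # Empty names and names made only of dots fall back to a default
    if not base.strip("."):
        return fallback
    return base


for raw in ["../../etc/passwd", "..\\..\\evil.pdf", "..", "annual report.pdf"]:
    print(repr(raw), "->", safe_filename(raw))
```

The sanitized name can then be joined to the output directory without risk of escaping it.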

Security Features

fetcharoo includes several security measures:

Domain restriction: Limit recursive crawling to specified domains (SSRF protection)
Path traversal protection: Sanitizes filenames to prevent directory escape
Rate limiting: Configurable delays between requests
Timeout handling: Prevents hanging on slow servers
URL validation: Only allows http/https schemes
robots.txt compliance: Optional respect for crawling rules
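
As an illustration of the rate-limiting item, the sketch below enforces a minimum gap between consecutive requests. `Throttle` is a hypothetical class, not part of fetcharoo, and fetcharoo's own delay logic may differ. The clock and sleep functions are injectable so the behavior can be tested without waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


throttle = Throttle(delay=0.05)
for url in ["https://example.com/a.pdf", "https://example.com/b.pdf"]:
    throttle.wait()  # pauses before the second iteration
    print("would fetch", url)
```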

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes with tests
Submit a pull request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Developed by Mark A. Lifson, Ph.D.

Release history

0.2.0 (Dec 13, 2025), this version
0.1.0 (Dec 13, 2025)
0.0.1 (May 7, 2023)
