PhantomFetch
PhantomFetch is a high-performance, agentic web scraping library for Python. It seamlessly combines the speed of curl-cffi with the capabilities of Playwright, offering a unified API for all your data extraction needs.
Why PhantomFetch?
Most web scraping forces a choice between speed (httpx, requests) and browser capabilities (Playwright, Selenium). PhantomFetch gives you both with a unified interface:
| Feature | PhantomFetch | requests/httpx | Playwright/Selenium |
|---|---|---|---|
| Speed | ⚡ Fast (curl-cffi) | ⚡ Fast | 🐌 Slow |
| JavaScript Support | ✅ Yes (Playwright) | ❌ No | ✅ Yes |
| Anti-Detection | ✅ Built-in | ❌ No | ⚠️ Manual |
| Smart Caching | ✅ Configurable | ❌ No | ❌ No |
| Proxy Rotation | ✅ Automatic | ⚠️ Manual | ⚠️ Manual |
| Async-First | ✅ Yes | ⚠️ Partial | ✅ Yes |
| Unified API | ✅ One interface | N/A | N/A |
| OpenTelemetry | ✅ Built-in | ❌ No | ❌ No |
Key Benefits:
- 🎯 Start Fast, Scale Smart: Use curl for quick requests, switch to browser when needed
- 🧠 Intelligent: Automatic retry logic, exponential backoff, fingerprint rotation
- 🚀 Production-Ready: Built-in observability, caching, and error handling
- 🛠️ Developer-Friendly: Intuitive API, comprehensive type hints, rich documentation
Features
- 🚀 Unified API: Switch between `curl` (fast, lightweight) and `browser` (JavaScript-capable) engines with a single parameter (see the sketch after this list)
- 🧠 Smart Caching: Configurable caching strategies (`all`, `resources`, `conservative`) to speed up development and save bandwidth
- 🤖 Agentic Actions: Define browser interactions (click, scroll, input, wait) declaratively
- 🛡️ Anti-Detection: Built-in support for proxy rotation and fingerprinting protection (via `curl-cffi`)
- ⚡ Async First: Built on `asyncio` for high concurrency
- 🔄 Smart Retries: Configurable retry logic with exponential backoff
- 🍪 Cookie Management: Automatic cookie handling across engines
- 📊 Observability: OpenTelemetry integration out of the box
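Here is a minimal sketch of what the unified API looks like in practice, based on the Quick Start examples below (it assumes the curl engine is the default and that `browser_engine="cdp"` is configured as in the later examples):

```python
import asyncio
from phantomfetch import Fetcher

async def main():
    async with Fetcher(browser_engine="cdp") as f:
        # Default engine: curl (fast, no JavaScript)
        api = await f.fetch("https://httpbin.org/get")
        print(api.json())

        # Same session, switched to the browser engine for JS-heavy pages
        page = await f.fetch("https://example.com", engine="browser")
        print(page.text[:200])

asyncio.run(main())
```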
Installation
```bash
pip install phantomfetch

# or with uv (recommended)
uv pip install phantomfetch
```
After installation, install Playwright browsers:
```bash
playwright install chromium
```
Quick Start
Basic Fetch (Curl Engine)
```python
import asyncio
from phantomfetch import Fetcher

async def main():
    async with Fetcher() as f:
        response = await f.fetch("https://httpbin.org/get")
        print(response.json())

if __name__ == "__main__":
    asyncio.run(main())
```
Browser Fetch with Caching
Use the resources strategy to cache static assets (images, CSS, scripts) while keeping the main page fresh.
```python
from phantomfetch import Fetcher, FileSystemCache

async def main():
    # Cache sub-resources to speed up subsequent fetches
    cache = FileSystemCache(strategy="resources")

    async with Fetcher(browser_engine="cdp", cache=cache) as f:
        # First run: downloads everything
        resp = await f.fetch("https://example.com", engine="browser")

        # Second run: uses cached resources, only fetches main HTML
        resp = await f.fetch("https://example.com", engine="browser")
        print(resp.text)
```
Browser Actions
Perform interactions like clicking, scrolling, and taking screenshots:
```python
from phantomfetch import Fetcher

actions = [
    {"action": "wait", "selector": "#search-input"},
    {"action": "input", "selector": "#search-input", "value": "phantomfetch"},
    {"action": "click", "selector": "#search-button"},
    {"action": "wait_for_load"},
    {"action": "screenshot", "value": "search_results.png"},
]

async with Fetcher(browser_engine="cdp") as f:
    resp = await f.fetch("https://example.com", actions=actions, engine="browser")
    # Screenshot saved to search_results.png
```
Advanced: Retry Configuration
Fine-tune retry behavior per request:
```python
from phantomfetch import Fetcher

async with Fetcher() as f:
    # Custom retry logic for flaky endpoints
    resp = await f.fetch(
        "https://api.example.com/data",
        max_retries=5,  # Override default retries
        timeout=60.0,   # Longer timeout for slow APIs
    )
```
Cookie Handling
Pass cookies to any engine and retrieve them from the response:
```python
from phantomfetch import Fetcher, Cookie

async with Fetcher() as f:
    # Set cookies
    resp = await f.fetch(
        "https://httpbin.org/cookies",
        cookies={"session_id": "secret_token"},
    )
    print(resp.json())

    # Get cookies (including from redirects)
    resp = await f.fetch("https://httpbin.org/cookies/set/foo/bar")
    for cookie in resp.cookies:
        print(f"{cookie.name}: {cookie.value}")
```
Configuration
Caching Strategies
- `all`: Caches everything, including the main document. Good for offline development
- `resources` (default): Caches sub-resources (images, styles, scripts) but fetches the main document fresh. Best for scraping dynamic sites
- `conservative`: Caches only heavy static assets like images and fonts
Example:
```python
from phantomfetch import FileSystemCache, Fetcher

cache = FileSystemCache(
    cache_dir=".cache",
    strategy="resources",
)

async with Fetcher(cache=cache) as f:
    # Resources will be cached automatically
    resp = await f.fetch("https://example.com", engine="browser")
```
Proxy Rotation
Multiple proxy strategies available:
```python
from phantomfetch import Fetcher, Proxy, ProxyPool

# 1. Define typed proxies
proxies = [
    Proxy(
        url="http://user:pass@residential-us.com:8080",
        location="US",
        vendor="BrightData",
        proxy_type="residential",
        weight=10,
    ),
    Proxy(
        url="http://user:pass@datacenter-de.com:8080",
        location="DE",
        vendor="OxyLabs",
        proxy_type="datacenter",
        weight=1,
    ),
]

# 2. Create a smart pool
pool = ProxyPool(proxies, strategy="geo_match")

async with Fetcher(proxies=pool) as f:
    # Uses US proxy from pool (geo-match)
    await f.fetch("https://google.com", location="US")

    # Uses any available proxy (fallback)
    await f.fetch("https://example.com")

    # 3. Explicit override (bypass pool)
    # Useful for debugging or specific routing needs
    await f.fetch(
        "https://httpbin.org/ip",
        proxy="http://user:pass@specific-proxy:8080",
    )
```
Observability (OpenTelemetry)
PhantomFetch is fully instrumented with OpenTelemetry:
```python
from phantomfetch import Fetcher
from phantomfetch.telemetry import configure_telemetry

# Setup OTel with a custom service name
configure_telemetry(service_name="my-scraper")

async with Fetcher() as f:
    await f.fetch("https://example.com")
    # Spans automatically created and exported
```
Or use standard OpenTelemetry environment variables:
```bash
export OTEL_SERVICE_NAME="my-scraper"
export OTEL_TRACES_EXPORTER="console"
python my_scraper.py
```
Troubleshooting
Playwright Installation Issues
If you encounter browser-related errors:
```bash
# Install all browsers
playwright install

# Or just chromium (recommended)
playwright install chromium

# See available install options
playwright install --help
```
SSL Certificate Errors
SSL verification is handled by the underlying engines (curl-cffi and Playwright), and certificates are validated by default. Only relax verification in development/testing:

```python
from phantomfetch import Fetcher

# Note: only use this in development!
async with Fetcher() as f:
    # SSL verification is handled by curl-cffi and Playwright;
    # for the curl engine, certificates are validated by default
    resp = await f.fetch("https://self-signed.badssl.com/")
```
Memory Issues with Caching
If the cache grows too large:

```python
from phantomfetch import FileSystemCache

cache = FileSystemCache(cache_dir=".cache")

# Manually clear expired entries
cache.clear_expired()

# Or just delete the cache directory
import shutil
shutil.rmtree(".cache", ignore_errors=True)
```
Browser Engine Not Working
Common issues:
- Playwright not installed: Run `playwright install chromium`
- Marimo notebook issues: Browser engines may not work in some notebook environments
- Port conflicts: CDP uses random ports, but firewall rules might block them
Debug with:
```python
import logging
from phantomfetch import Fetcher

# Enable verbose logging
logging.basicConfig(level=logging.DEBUG)

async with Fetcher(browser_engine="cdp") as f:
    resp = await f.fetch("https://example.com", engine="browser")
```
Rate Limiting / 429 Errors
Use retry configuration and delays:
```python
import asyncio
from phantomfetch import Fetcher

async with Fetcher(max_retries=5) as f:
    for url in urls:
        resp = await f.fetch(url)
        await asyncio.sleep(1)  # Be nice to servers
```
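For larger crawls, a common `asyncio` pattern is to cap concurrency with a semaphore on top of the retry settings. This is a minimal sketch that only uses the `Fetcher` calls shown above; the concurrency limit and delay are illustrative values:

```python
import asyncio
from phantomfetch import Fetcher

async def crawl(urls, max_concurrency=5, delay=1.0):
    # Bound the number of in-flight requests
    sem = asyncio.Semaphore(max_concurrency)

    async with Fetcher(max_retries=5) as f:

        async def fetch_one(url):
            async with sem:
                resp = await f.fetch(url)
                await asyncio.sleep(delay)  # stay polite under load
                return resp

        return await asyncio.gather(*(fetch_one(u) for u in urls))
```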
Scrapeless Session Recording
When using Scrapeless's CDP endpoint for session recording, PhantomFetch automatically reuses existing browser windows:
```python
from phantomfetch import Fetcher

async with Fetcher(
    browser_engine="cdp",
    browser_engine_config={
        "cdp_endpoint": "wss://YOUR_SESSION.scrapeless.com/chrome/cdp"
        # use_existing_page=True (default) ensures recording compatibility
    },
) as f:
    # Uses existing window - Scrapeless records this! ✓
    resp = await f.fetch("https://example.com", engine="browser")
```
Why this matters: Scrapeless can only record a single window. By default (use_existing_page=True), PhantomFetch detects and reuses the existing browser page in your Scrapeless session instead of creating new windows.
To disable (not recommended for recording): Set use_existing_page=False in browser_engine_config.
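For reference, disabling it follows the same configuration shown above (again, not recommended if you rely on the recording):

```python
async with Fetcher(
    browser_engine="cdp",
    browser_engine_config={
        "cdp_endpoint": "wss://YOUR_SESSION.scrapeless.com/chrome/cdp",
        "use_existing_page": False,  # opens new windows; Scrapeless will not record them
    },
) as f:
    resp = await f.fetch("https://example.com", engine="browser")
```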
See examples/scrapeless_cdp_recording.py for a complete example.
Next Steps
Ready to dive deeper? Here's what to explore:
- Examples - See retry configuration and advanced patterns
- CHANGELOG - See what's new
- Contributing Guide - Help improve PhantomFetch
Community & Support
- 🐛 Found a bug? Open an issue
- 💡 Have a feature idea? Request a feature
- ❓ Questions? Start a discussion
- 📖 Documentation issues? Improve the docs
Contributing
We love contributions! PhantomFetch is built by developers, for developers. Whether you're:
- 🐛 Fixing bugs
- ✨ Adding features
- 📝 Improving documentation
- 🧪 Writing tests
Check out our Contributing Guide to get started!
Quick Start for Contributors
```bash
# Clone and set up
git clone https://github.com/iristech-systems/PhantomFetch.git
cd PhantomFetch
uv sync
uv run pre-commit install

# Run tests
uv run pytest

# Make changes and commit
git checkout -b feature/amazing-feature
# ... make changes ...
uv run pre-commit run --all-files
git commit -m "feat: add amazing feature"
```
License
MIT License - see LICENSE for details.
Acknowledgments
Built on the shoulders of giants:
- curl-cffi - Amazing curl bindings with anti-detection
- Playwright - Best-in-class browser automation
- msgspec - Fast serialization
- OpenTelemetry - Observability standard
Special thanks to all contributors who help make PhantomFetch better!
Made with ❤️ for the web scraping community