
PhantomFetch

Python 3.13+ · License: MIT · Code style: ruff

PhantomFetch is a high-performance, agentic web scraping library for Python. It seamlessly combines the speed of curl-cffi with the capabilities of Playwright, offering a unified API for all your data extraction needs.

Why PhantomFetch?

Most web scraping tools force a choice between speed (httpx, requests) and browser capabilities (Playwright, Selenium). PhantomFetch gives you both behind a unified interface:

Feature            | PhantomFetch          | requests/httpx | Playwright/Selenium
-------------------|-----------------------|----------------|--------------------
Speed              | ⚡ Fast (curl-cffi)   | ⚡ Fast        | 🐌 Slow
JavaScript Support | ✅ Yes (Playwright)   | ❌ No          | ✅ Yes
Anti-Detection     | ✅ Built-in           | ❌ No          | ⚠️ Manual
Smart Caching      | ✅ Configurable       | ❌ No          | ❌ No
Proxy Rotation     | ✅ Automatic          | ⚠️ Manual      | ⚠️ Manual
Async-First        | ✅ Yes                | ⚠️ Partial     | ✅ Yes
Unified API        | ✅ One interface      | N/A            | N/A
OpenTelemetry      | ✅ Built-in           | ❌ No          | ❌ No

Key Benefits:

  • 🎯 Start Fast, Scale Smart: Use curl for quick requests, switch to browser when needed
  • 🧠 Intelligent: Automatic retry logic, exponential backoff, fingerprint rotation
  • 🚀 Production-Ready: Built-in observability, caching, and error handling
  • 🛠️ Developer-Friendly: Intuitive API, comprehensive type hints, rich documentation

Features

  • 🚀 Unified API: Switch between curl (fast, lightweight) and browser (JavaScript-capable) engines with a single parameter
  • 🧠 Smart Caching: Configurable caching strategies (all, resources, conservative) to speed up development and save bandwidth
  • 🤖 Agentic Actions: Define browser interactions (click, scroll, input, wait) declaratively
  • 🛡️ Anti-Detection: Built-in support for proxy rotation and fingerprinting protection (via curl-cffi)
  • ⚡ Async-First: Built on asyncio for high concurrency
  • 🔄 Smart Retries: Configurable retry logic with exponential backoff
  • 🍪 Cookie Management: Automatic cookie handling across engines
  • 📊 Observability: OpenTelemetry integration out of the box

Installation

pip install phantomfetch
# or with uv (recommended)
uv pip install phantomfetch

After installation, install Playwright browsers:

playwright install chromium

Quick Start

Basic Fetch (Curl Engine)

import asyncio
from phantomfetch import Fetcher

async def main():
    async with Fetcher() as f:
        response = await f.fetch("https://httpbin.org/get")
        print(response.json())

if __name__ == "__main__":
    asyncio.run(main())

Browser Fetch with Caching

Use the resources strategy to cache static assets (images, CSS, scripts) while keeping the main page fresh.

import asyncio
from phantomfetch import Fetcher, FileSystemCache

async def main():
    # Cache sub-resources to speed up subsequent fetches
    cache = FileSystemCache(strategy="resources")

    async with Fetcher(browser_engine="cdp", cache=cache) as f:
        # First run: downloads everything
        resp = await f.fetch("https://example.com", engine="browser")

        # Second run: uses cached resources, only fetches main HTML
        resp = await f.fetch("https://example.com", engine="browser")
        print(resp.text)

asyncio.run(main())

Browser Actions

Perform interactions like clicking, scrolling, and taking screenshots:

from phantomfetch import Fetcher

actions = [
    {"action": "wait", "selector": "#search-input"},
    {"action": "input", "selector": "#search-input", "value": "phantomfetch"},
    {"action": "click", "selector": "#search-button"},
    {"action": "wait_for_load"},
    {"action": "screenshot", "value": "search_results.png"}
]

async with Fetcher(browser_engine="cdp") as f:
    resp = await f.fetch("https://example.com", actions=actions, engine="browser")
    # Screenshot saved to search_results.png

Advanced: Retry Configuration

Fine-tune retry behavior per request:

from phantomfetch import Fetcher

async with Fetcher() as f:
    # Custom retry logic for flaky endpoints
    resp = await f.fetch(
        "https://api.example.com/data",
        max_retries=5,  # Override default retries
        timeout=60.0,   # Longer timeout for slow APIs
    )
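Under the hood, a `max_retries` setting like this typically drives an exponential-backoff loop: each failed attempt doubles the wait before the next try, plus a little jitter to avoid thundering herds. As a rough mental model only — this is an illustrative stdlib sketch, not PhantomFetch's internals — the pattern looks like:

```python
import asyncio
import random

async def fetch_with_backoff(fetch, url, max_retries=5, base_delay=0.5):
    """Retry an async fetch callable with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: propagate the last error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) plus jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)

# Demo with a flaky stub that fails twice, then succeeds
async def demo():
    calls = {"n": 0}

    async def flaky(url):
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    return await fetch_with_backoff(flaky, "https://example.com", base_delay=0.01)
```

In PhantomFetch itself you only set `max_retries`; the loop above is just what that knob buys you conceptually.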

Cookie Handling

Pass cookies to any engine and retrieve them from the response:

from phantomfetch import Fetcher

async with Fetcher() as f:
    # Set cookies
    resp = await f.fetch(
        "https://httpbin.org/cookies",
        cookies={"session_id": "secret_token"}
    )
    print(resp.json())

    # Get cookies (including from redirects)
    resp = await f.fetch("https://httpbin.org/cookies/set/foo/bar")
    for cookie in resp.cookies:
        print(f"{cookie.name}: {cookie.value}")

Configuration

Caching Strategies

  • all: Caches everything, including the main document. Good for offline development
  • resources (Default): Caches sub-resources (images, styles, scripts) but fetches the main document fresh. Best for scraping dynamic sites
  • conservative: Caches only heavy static assets like images and fonts

Example:

from phantomfetch import FileSystemCache, Fetcher

cache = FileSystemCache(
    cache_dir=".cache",
    strategy="resources"
)

async with Fetcher(cache=cache) as f:
    # Resources will be cached automatically
    resp = await f.fetch("https://example.com", engine="browser")
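Conceptually, each strategy is just a predicate over resource types: given a strategy name and the type of a fetched resource, decide whether to cache it. A minimal stand-in (illustrative only — the resource-type names and mapping are assumptions, not PhantomFetch's internals):

```python
# Which resource types each strategy caches (illustrative mapping only)
STRATEGY_RULES = {
    "all": None,  # None means: cache everything, including the document
    "resources": {"image", "stylesheet", "script", "font"},
    "conservative": {"image", "font"},
}

def should_cache(strategy: str, resource_type: str) -> bool:
    """Decide whether a resource of the given type is cacheable."""
    allowed = STRATEGY_RULES[strategy]
    return allowed is None or resource_type in allowed
```

This is why `resources` is a good default for dynamic sites: scripts and styles are served from cache, while `should_cache("resources", "document")` is false, so the main HTML is always fetched fresh.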

Proxy Rotation

Multiple proxy strategies available:

from phantomfetch import Fetcher, Proxy, ProxyPool

# 1. Define Typed Proxies
proxies = [
    Proxy(
        url="http://user:pass@residential-us.com:8080", 
        location="US", 
        vendor="BrightData",
        proxy_type="residential",
        weight=10
    ),
    Proxy(
        url="http://user:pass@datacenter-de.com:8080", 
        location="DE", 
        vendor="OxyLabs",
        proxy_type="datacenter",
        weight=1
    ),
]

# 2. Create a Smart Pool
pool = ProxyPool(proxies, strategy="geo_match")

async with Fetcher(proxies=pool) as f:
    # Uses US proxy from pool (geo-match)
    await f.fetch("https://google.com", location="US")

    # Uses any available proxy (fallback)
    await f.fetch("https://example.com")
    
    # 3. Explicit Override (Bypass Pool)
    # Useful for debugging or specific routing needs
    await f.fetch(
        "https://httpbin.org/ip", 
        proxy="http://user:pass@specific-proxy:8080"
    )
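A `geo_match` pool like the one above can be pictured as a two-step selection: filter the pool to proxies matching the request's target location, then fall back to the whole pool and pick by weight. This sketch is an illustrative stand-in (plain dicts instead of `Proxy` objects, and not the library's actual selection code):

```python
import random

def pick_proxy(proxies, location=None):
    """Geo-match first, then weighted-random fallback (illustrative only).

    `proxies` is a list of dicts with "url", "location", and "weight" keys.
    """
    # Prefer proxies whose location matches the request's target geo
    candidates = [p for p in proxies if location and p["location"] == location]
    if not candidates:
        candidates = proxies  # fallback: any proxy in the pool
    # Weighted random choice: higher weight means picked more often
    weights = [p["weight"] for p in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

pool = [
    {"url": "http://us-proxy:8080", "location": "US", "weight": 10},
    {"url": "http://de-proxy:8080", "location": "DE", "weight": 1},
]
```

With this model, a request tagged `location="US"` always lands on the US proxy, while untagged requests are distributed across the pool in proportion to weight.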

Observability (OpenTelemetry)

PhantomFetch is fully instrumented with OpenTelemetry:

from phantomfetch import Fetcher
from phantomfetch.telemetry import configure_telemetry

# Setup OTel with custom service name
configure_telemetry(service_name="my-scraper")

async with Fetcher() as f:
    await f.fetch("https://example.com")
    # Spans automatically created and exported

Or use standard OpenTelemetry environment variables:

export OTEL_SERVICE_NAME="my-scraper"
export OTEL_TRACES_EXPORTER="console"
python my_scraper.py

Troubleshooting

Playwright Installation Issues

If you encounter browser-related errors:

# Install all browsers
playwright install

# Or just chromium (recommended)
playwright install chromium

# Show installer options
playwright install --help

SSL Certificate Errors

SSL verification is handled by the underlying engines (curl-cffi for the curl engine, Playwright for the browser engine) and is enabled by default, so requests to hosts with invalid or self-signed certificates will fail:

async with Fetcher() as f:
    # Fails by default: the certificate is self-signed
    resp = await f.fetch("https://self-signed.badssl.com/")

If you need to relax verification for development or testing, use the engine-specific options documented for each engine — and never disable certificate checks in production.

Memory Issues with Caching

If cache grows too large:

from phantomfetch import FileSystemCache

cache = FileSystemCache(cache_dir=".cache")

# Manually clear expired entries
cache.clear_expired()

# Or just delete the cache directory
import shutil
shutil.rmtree(".cache", ignore_errors=True)

Browser Engine Not Working

Common issues:

  1. Playwright not installed: Run playwright install chromium
  2. Marimo notebook issues: Browser engines may not work in some notebook environments
  3. Port conflicts: CDP uses random ports, but firewall rules might block them

Debug with:

import logging

# Enable verbose logging before fetching
logging.basicConfig(level=logging.DEBUG)

async with Fetcher(browser_engine="cdp") as f:
    resp = await f.fetch("https://example.com", engine="browser")

Rate Limiting / 429 Errors

Use retry configuration and delays:

import asyncio
from phantomfetch import Fetcher

urls = ["https://example.com/page1", "https://example.com/page2"]

async with Fetcher(max_retries=5) as f:
    for url in urls:
        resp = await f.fetch(url)
        await asyncio.sleep(1)  # Be nice to servers
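If sequential fetching with a fixed delay is too slow, you can raise throughput while still capping load on the target by bounding concurrency with a semaphore. This is a generic asyncio pattern (shown with a stub in place of `f.fetch`, which you would swap in inside the `Fetcher` context):

```python
import asyncio

async def bounded_gather(fetch, urls, limit=5):
    """Fetch many URLs concurrently, with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def one(url):
        async with sem:  # blocks while `limit` fetches are already running
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(u) for u in urls))

# Demo with a stub fetch; replace fake_fetch with f.fetch in real use
async def demo():
    async def fake_fetch(url):
        await asyncio.sleep(0)
        return f"fetched {url}"

    return await bounded_gather(fake_fetch, ["a", "b", "c"], limit=2)
```

Tune `limit` to what the target server tolerates; combined with `max_retries`, this usually keeps you clear of 429 responses.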

Scrapeless Session Recording

When using Scrapeless's CDP endpoint for session recording, PhantomFetch automatically reuses existing browser windows:

async with Fetcher(
    browser_engine="cdp",
    browser_engine_config={
        "cdp_endpoint": "wss://YOUR_SESSION.scrapeless.com/chrome/cdp"
        # use_existing_page=True (default) ensures recording compatibility
    }
) as f:
    # Uses existing window - Scrapeless records this! ✓
    resp = await f.fetch("https://example.com", engine="browser")

Why this matters: Scrapeless can only record a single window. By default (use_existing_page=True), PhantomFetch detects and reuses the existing browser page in your Scrapeless session instead of creating new windows.

To disable (not recommended for recording): Set use_existing_page=False in browser_engine_config.

See examples/scrapeless_cdp_recording.py for a complete example.

Next Steps

Ready to dive deeper? Here's what to explore:

  1. Examples - See retry configuration and advanced patterns
  2. CHANGELOG - See what's new
  3. Contributing Guide - Help improve PhantomFetch

Community & Support

Contributing

We love contributions! PhantomFetch is built by developers, for developers. Whether you're:

  • 🐛 Fixing bugs
  • ✨ Adding features
  • 📝 Improving documentation
  • 🧪 Writing tests

Check out our Contributing Guide to get started!

Quick Start for Contributors

# Clone and setup
git clone https://github.com/iristech-systems/PhantomFetch.git
cd PhantomFetch
uv sync
uv run pre-commit install

# Run tests
uv run pytest

# Make changes and commit
git checkout -b feature/amazing-feature
# ... make changes ...
uv run pre-commit run --all-files
git commit -m "feat: add amazing feature"

License

MIT License - see LICENSE for details.

Acknowledgments

Built on the shoulders of giants: curl-cffi and Playwright.

Special thanks to all contributors who help make PhantomFetch better!


Made with ❤️ for the web scraping community

⭐ Star us on GitHub · 📦 Install from PyPI
